Python Data Analysis
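The drop_duplicates() method removes duplicate rows from a DataFrame. By default it compares all columns, keeps the first occurrence of each duplicated row, and drops the later ones. It returns a new DataFrame and leaves the original unchanged.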
import pandas as pd

df = pd.read_excel(r"D:\PythonTestFile\exam.xlsx")
print(df, '\n')
print(df.drop_duplicates())
Output:
       Name     Sex  Chinese  English  Math
0      Noah    male       90       50    66
1      Emma  female       56       56    55
2      Noah    male       90       50    66
3    Olivia  female       86       87    44
4      Liam    male       55       88    69
5    Sophia  female       90       66    96
6      Liam    male       55       88    69
7  Isabella  female       66       85    55

       Name     Sex  Chinese  English  Math
0      Noah    male       90       50    66
1      Emma  female       56       56    55
3    Olivia  female       86       87    44
4      Liam    male       55       88    69
5    Sophia  female       90       66    96
7  Isabella  female       66       85    55
Each row is compared as a whole: a row is dropped only when every value in it matches an earlier row.
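The Excel file above lives on the author's machine, so the following minimal sketch rebuilds the same table inline (an assumption made only to keep the example runnable). It uses duplicated() to show which rows drop_duplicates() treats as repeats when all columns are compared.

```python
import pandas as pd

# Same data as the exam.xlsx table shown above, built inline for reproducibility.
df = pd.DataFrame({
    "Name": ["Noah", "Emma", "Noah", "Olivia", "Liam", "Sophia", "Liam", "Isabella"],
    "Sex": ["male", "female", "male", "female", "male", "female", "male", "female"],
    "Chinese": [90, 56, 90, 86, 55, 90, 55, 66],
    "English": [50, 56, 50, 87, 88, 66, 88, 85],
    "Math": [66, 55, 66, 44, 69, 96, 69, 55],
})

# duplicated() marks a row True when an earlier row matches it in every column.
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False, False, False, True, False]

# drop_duplicates() keeps exactly the rows where the mask is False.
deduped = df.drop_duplicates()
print(deduped.equals(df[~mask]))  # True

# The original DataFrame is untouched; a new one is returned.
print(len(df), len(deduped))  # 8 6
```

Inspecting the mask first is a cheap way to check what will be removed before actually dropping anything.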
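To check only certain columns, pass the column name(s) to the subset parameter; rows that repeat a value already seen in those columns are dropped. To deduplicate on several columns, pass the names as a list.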
import pandas as pd

df = pd.read_excel(r"D:\PythonTestFile\exam.xlsx")
print(df, '\n')
print(df.drop_duplicates(subset='Chinese'), '\n')
print(df.drop_duplicates(subset=['Chinese', 'Math']))
Output:
       Name     Sex  Chinese  English  Math
0      Noah    male       90       50    66
1      Emma  female       56       56    55
2      Noah    male       90       50    66
3    Olivia  female       86       87    44
4      Liam    male       55       88    69
5    Sophia  female       90       66    96
6      Liam    male       55       88    69
7  Isabella  female       66       85    55

       Name     Sex  Chinese  English  Math
0      Noah    male       90       50    66
1      Emma  female       56       56    55
3    Olivia  female       86       87    44
4      Liam    male       55       88    69
7  Isabella  female       66       85    55

       Name     Sex  Chinese  English  Math
0      Noah    male       90       50    66
1      Emma  female       56       56    55
3    Olivia  female       86       87    44
4      Liam    male       55       88    69
5    Sophia  female       90       66    96
7  Isabella  female       66       85    55
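When deduplicating on several columns, the values in those columns are compared together as a group for each row; a row is dropped only when the whole group matches an earlier row.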
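As a small illustration of the grouping rule, the sketch below rebuilds the relevant columns of the table inline (an assumption, since the Excel file is not available here) and compares single-column and two-column deduplication. Row 5 is the interesting case: its Chinese score repeats row 0, but the pair (Chinese, Math) does not.

```python
import pandas as pd

# Only the columns needed for this comparison, copied from the table above.
df = pd.DataFrame({
    "Name": ["Noah", "Emma", "Noah", "Olivia", "Liam", "Sophia", "Liam", "Isabella"],
    "Chinese": [90, 56, 90, 86, 55, 90, 55, 66],
    "Math": [66, 55, 66, 44, 69, 96, 69, 55],
})

# One column: any row repeating an earlier Chinese score is dropped.
by_chinese = df.drop_duplicates(subset="Chinese")
print(by_chinese.index.tolist())  # [0, 1, 3, 4, 7]

# Two columns: a row is dropped only if the (Chinese, Math) pair repeats.
by_pair = df.drop_duplicates(subset=["Chinese", "Math"])
print(by_pair.index.tolist())  # [0, 1, 3, 4, 5, 7]

# Row 5 shares Chinese=90 with row 0, but (90, 96) differs from (90, 66).
print(df.loc[[0, 5], ["Chinese", "Math"]].values.tolist())  # [[90, 66], [90, 96]]
```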
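By default the first occurrence in each set of duplicate rows is kept. The keep parameter changes this, so that the last occurrence is kept or none are kept at all:
keep='first'  keep the first occurrence (default)
keep='last'   keep the last occurrence
keep=False    keep none of the duplicated rows (every row that has a duplicate is dropped)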
import pandas as pd

df = pd.read_excel(r"D:\PythonTestFile\exam.xlsx")
print('DataFrame:')
print(df, '\n')
print('Keep the first duplicate (default):')
print(df.drop_duplicates(subset='Math'), '\n')
print('Keep the last duplicate:')
print(df.drop_duplicates(subset='Math', keep='last'), '\n')
print('Keep no duplicates:')
print(df.drop_duplicates(subset='Math', keep=False))
Output:
DataFrame:
       Name     Sex  Chinese  English  Math
0      Noah    male       90       50    66
1      Emma  female       56       56    55
2      Noah    male       90       50    66
3    Olivia  female       86       87    44
4      Liam    male       55       88    69
5    Sophia  female       90       66    96
6      Liam    male       55       88    69
7  Isabella  female       66       85    55

Keep the first duplicate (default):
     Name     Sex  Chinese  English  Math
0    Noah    male       90       50    66
1    Emma  female       56       56    55
3  Olivia  female       86       87    44
4    Liam    male       55       88    69
5  Sophia  female       90       66    96

Keep the last duplicate:
       Name     Sex  Chinese  English  Math
2      Noah    male       90       50    66
3    Olivia  female       86       87    44
5    Sophia  female       90       66    96
6      Liam    male       55       88    69
7  Isabella  female       66       85    55

Keep no duplicates:
     Name     Sex  Chinese  English  Math
3  Olivia  female       86       87    44
5  Sophia  female       90       66    96
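The results above keep their original index labels, which leaves gaps such as 2, 3, 5, 6, 7. The sketch below repeats the three keep settings on an inline copy of the Name and Math columns (an assumption, in place of the Excel file) and additionally shows ignore_index=True, a parameter of drop_duplicates() available in pandas 1.0 and later, which renumbers the result from 0.

```python
import pandas as pd

# Name and Math columns copied from the table above.
df = pd.DataFrame({
    "Name": ["Noah", "Emma", "Noah", "Olivia", "Liam", "Sophia", "Liam", "Isabella"],
    "Math": [66, 55, 66, 44, 69, 96, 69, 55],
})

first = df.drop_duplicates(subset="Math")               # keep='first' is the default
last = df.drop_duplicates(subset="Math", keep="last")
neither = df.drop_duplicates(subset="Math", keep=False)

print(first.index.tolist())    # [0, 1, 3, 4, 5]
print(last.index.tolist())     # [2, 3, 5, 6, 7]
print(neither.index.tolist())  # [3, 5]

# ignore_index=True (pandas >= 1.0) relabels the surviving rows 0..n-1.
renumbered = df.drop_duplicates(subset="Math", keep="last", ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3, 4]
print(renumbered["Name"].tolist())  # ['Noah', 'Olivia', 'Sophia', 'Liam', 'Isabella']
```

Whether to renumber depends on the use case: the original labels are useful for tracing rows back to the source data, while a fresh index is more convenient for positional access.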
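To look up the unique values of a single column, the Series method unique() can also be used. It returns an array of the distinct values in order of first appearance and does not remove any rows from the DataFrame. For example: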
import pandas as pd

df = pd.read_excel(r"D:\PythonTestFile\exam.xlsx")
print(df, '\n')
print(df['Chinese'].unique())  # get the unique values of the Chinese column
Output:
       Name     Sex  Chinese  English  Math
0      Noah    male       90       50    66
1      Emma  female       56       56    55
2      Noah    male       90       50    66
3    Olivia  female       86       87    44
4      Liam    male       55       88    69
5    Sophia  female       90       66    96
6      Liam    male       55       88    69
7  Isabella  female       66       85    55

[90 56 86 55 66]
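To make the difference between the two approaches concrete, the sketch below applies unique() and drop_duplicates() to the same column, rebuilt inline as a Series (an assumption, in place of the Excel file). It also uses nunique(), which simply counts the distinct values.

```python
import pandas as pd

# The Chinese column from the table above.
chinese = pd.Series([90, 56, 90, 86, 55, 90, 55, 66], name="Chinese")

# unique() returns a NumPy array of distinct values; index labels are lost.
values = chinese.unique()
print(type(values).__name__, values.tolist())  # ndarray [90, 56, 86, 55, 66]

# drop_duplicates() returns a Series and keeps the original index labels.
kept = chinese.drop_duplicates()
print(kept.index.tolist())  # [0, 1, 3, 4, 7]

# nunique() gives only the count of distinct values.
print(chinese.nunique())  # 5
```

In short, unique() suits a quick look at which values occur, while drop_duplicates() is the better fit when the surviving rows or their labels are needed afterwards.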