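1 Problem description
Suppose we have two DataFrames. The code below comes from "transbigdata notes: official documentation case 1 (taxi GPS data processing)" on CSDN blog, where tbd is the transbigdata package and data and sz are loaded beforehand.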
data = tbd.clean_outofshape(data, sz, col=['Lng', 'Lat'], accuracy=500)
data
data2 = tbd.clean_taxi_status(data, col=['VehicleNum', 'Time', 'OpenStatus'])
data2
We want to find the index labels that are in data but not in data2, i.e. the rows that the cleaning step removed.
2 Method 1: index.difference
data.index
#RangeIndex(start=0, stop=543138, step=1)
data2.index
'''
Index([452072, 444077, 444078, 444075, 444079, 444073, 444074, 444076, 452073,
446704,
...
64415, 64402, 64413, 64411, 64405, 64390, 64406, 64393, 64391,
64396],
dtype='int64', length=542224)
'''
diff_index = data.index.difference(data2.index)
diff_index
'''
Index([ 710, 807, 844, 1372, 1564, 1684, 1690, 1753, 2842,
4150,
...
532055, 533757, 534219, 540261, 540471, 540481, 541260, 541263, 541889,
542487],
dtype='int64', length=914)
'''
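To make the behaviour easy to check on something small, here is a minimal sketch with toy DataFrames (the values are made up and stand in for data and data2). It assumes, as in the case above, that data2 keeps the original index labels of data. Note that difference returns the labels sorted, even when data2 is shuffled.

```python
import pandas as pd

# Toy stand-ins for data and data2 (hypothetical values, not the taxi GPS data)
data = pd.DataFrame({"VehicleNum": [1, 1, 2, 2, 3]}, index=[0, 1, 2, 3, 4])

# Drop rows 1 and 3, then shuffle, to mimic a cleaning step that reorders rows
data2 = data.drop(index=[1, 3]).sample(frac=1, random_state=0)

# Labels present in data.index but absent from data2.index
diff_index = data.index.difference(data2.index)
print(list(diff_index))  # [1, 3]

# The removed rows themselves
dropped_rows = data.loc[diff_index]

# Equivalent boolean-mask form, which keeps the order of data.index
mask_index = data.index[~data.index.isin(data2.index)]
```

The isin form is handy when the original row order matters or when the mask is reused to filter data directly.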
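3 Method 2: using merge
This is more flexible, because the on parameter specifies which columns to merge on. If on is not set, merge uses all columns that the two DataFrames have in common (not the index; to merge on the index, pass left_index=True and right_index=True).
For an explanation of the merge parameters, see: "pandas notes: merge operations" on CSDN blog.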
import pandas as pd
merged = pd.merge(data, data2, how='outer', indicator=True)
merged
merged[merged['_merge']=='left_only'].index
'''
Index([ 710, 807, 844, 1372, 1564, 1684, 1690, 1753, 2842,
4150,
...
532055, 533757, 534219, 540261, 540471, 540481, 541260, 541263, 541889,
542487],
dtype='int64', length=914)
'''
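One caveat with the merge approach: when merging on columns, pandas ignores the original indexes and gives the result a new integer index, so the index of the left_only rows is a position in merged rather than a label of data. A minimal sketch of one way around this, using toy data and a hypothetical helper column named orig_index, is to move the index into a column before merging:

```python
import pandas as pd

# Toy stand-ins for data and data2 (hypothetical values), with non-default labels
data = pd.DataFrame(
    {"VehicleNum": [1, 1, 2, 2, 3], "OpenStatus": [0, 1, 0, 1, 0]},
    index=[10, 11, 12, 13, 14],
)
data2 = data.drop(index=[11, 13])

# Keep the original labels in a regular column so they survive the merge
left = data.reset_index().rename(columns={"index": "orig_index"})

# A left merge keeps every row of data; indicator=True adds the _merge column
merged = left.merge(
    data2, how="left", on=["VehicleNum", "OpenStatus"], indicator=True
)

left_only = merged[merged["_merge"] == "left_only"]
print(list(left_only.index))          # positions in merged: [1, 3]
print(list(left_only["orig_index"]))  # original labels of data: [11, 13]
```

This sketch assumes the rows of data2 are unique on the key columns. With duplicated keys, a merge can match or repeat rows differently from a label-based comparison, in which case index.difference is the safer choice.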