Easy data reconciliation in python

Charlie
3 min readJan 13, 2020

In this post, I would like to share all the common methods in python for data reconciliation. Consider scenarios like below: 1. Finding symmetric differences, 2. Combining data with overlaps. All the code below are for illustration purposes only, but the fundamentals can be scaled to more complex datasets.

import pandas as pdimport numpy as np# turn jupyter notebook warnings offimport warningswarnings.filterwarnings("ignore")

Working example one: Two dataframes with reconciliation column key containing non-duplicated elements

df1 = pd.DataFrame({'tradeid':range(5),'profit':range(1000,2000,200)})df2 = pd.DataFrame({'tradeid':range(2,7,1), 'stock':   ['APL','MST','JNJ','TSL','BAB']})display(df1)display(df2)

Solution one: Since ‘tradeid’ is always unique, drop_duplicates() is the best way to spot the “exclusive-or” differences between two dataframes​

pd.concat([df1,df2],axis=0).drop_duplicates(subset='tradeid',keep=False)

Solution two: Since ‘tradeid’ is always unique, we can also use a set function to find our differences

diff = set(df1.tradeid).symmetric_difference(set(df2.tradeid))​

Then use .isin() method to select the rows from a list. Also python set is not searchable by hash values; thus converting to list

pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])

Solution three: Similar to solution two, this time we are finding symmetric differences using np.array​

​diff = np.setxor1d(np.array(df1.tradeid), np.array(df2.tradeid))​pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])

Working example two: Two dataframes with reconciliation column key containing duplicated elements

df1 = pd.DataFrame({'tradeid':[0,2,2,3,4],'profit':range(1000,2000,200)})df2 = pd.DataFrame({'tradeid':[2,3,4,5,5], 'stock':['APL','MST','JNJ','TSL','BAB']})display(df1)display(df2)

Solution: Use the pandas merge function with the indicator flag

df3 = df1.merge(df2, on='tradeid', how='outer', indicator=True)display(df3)df3 = df3.loc[df3["_merge"] != 'both', :]# cleanup to make df3 our final resultdel df3["_merge"]df3

Working example three: Finding symmetric differences for 2 lists containing duplicates

list1 = [0,1,1,2,3]
list2 = [0,1,3,4,4]​

Solution is embedded using functional programming

list3 = list(filter(lambda x:x not in list2, list1))
list4 = list(filter(lambda x:x not in list1, list2))list3 + list4​

Out: [2, 4, 4]

p.s. list comprehension does not work here somehow [list1.remove(x) for x in list1 if x in list2]

Working example four: Combining data with overlaps

d1 = pd.DataFrame({'tradeid':[0,1,2],'profit':[1000,np.nan,2000],'stock':['APL','JNJ',np.nan]})d2 = pd.DataFrame({'tradeid':[3,1,2],'profit':[1000,1500,np.nan],'stock':['APL',np.nan,np.nan]})display(d1)display(d2)​

Subsequent dataset will combine and overwrite the original

oned2.combine_first(d1)

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

No responses yet

Write a response