Easy data reconciliation in python

3 min readJan 13, 2020

In this post, I would like to share all the common methods in python for data reconciliation. Consider scenarios like below: 1. Finding symmetric differences, 2. Combining data with overlaps. All the code below are for illustration purposes only, but the fundamentals can be scaled to more complex datasets.

import pandas as pdimport numpy as np# turn jupyter notebook warnings offimport warningswarnings.filterwarnings("ignore")

Working example one: Two dataframes with reconciliation column key containing non-duplicated elements

df1 = pd.DataFrame({'tradeid':range(5),'profit':range(1000,2000,200)})df2 = pd.DataFrame({'tradeid':range(2,7,1), 'stock':   ['APL','MST','JNJ','TSL','BAB']})display(df1)display(df2)

Solution one: Since ‘tradeid’ is always unique, drop_duplicates() is the best way to spot the “exclusive-or” differences between two dataframes

pd.concat([df1,df2],axis=0).drop_duplicates(subset='tradeid',keep=False)

Solution two: Since ‘tradeid’ is always unique, we can also use a set function to find our differences

diff = set(df1.tradeid).symmetric_difference(set(df2.tradeid))

Then use .isin() method to select the rows from a list. Also python set is not searchable by hash values; thus converting to list

pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])

Solution three: Similar to solution two, this time we are finding symmetric differences using np.array

diff = np.setxor1d(np.array(df1.tradeid), np.array(df2.tradeid))pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])

Working example two: Two dataframes with reconciliation column key containing duplicated elements

df1 = pd.DataFrame({'tradeid':[0,2,2,3,4],'profit':range(1000,2000,200)})df2 = pd.DataFrame({'tradeid':[2,3,4,5,5], 'stock':['APL','MST','JNJ','TSL','BAB']})display(df1)display(df2)

Solution: Use the pandas merge function with the indicator flag

df3 = df1.merge(df2, on='tradeid', how='outer', indicator=True)display(df3)df3 = df3.loc[df3["_merge"] != 'both', :]# cleanup to make df3 our final resultdel df3["_merge"]df3

Working example three: Finding symmetric differences for 2 lists containing duplicates

list1 = [0,1,1,2,3]
list2 = [0,1,3,4,4]

Solution is embedded using functional programming

list3 = list(filter(lambda x:x not in list2, list1))
list4 = list(filter(lambda x:x not in list1, list2))list3 + list4

Out: [2, 4, 4]

p.s. list comprehension does not work here somehow [list1.remove(x) for x in list1 if x in list2]

Working example four: Combining data with overlaps

d1 = pd.DataFrame({'tradeid':[0,1,2],'profit':[1000,np.nan,2000],'stock':['APL','JNJ',np.nan]})d2 = pd.DataFrame({'tradeid':[3,1,2],'profit':[1000,1500,np.nan],'stock':['APL',np.nan,np.nan]})display(d1)display(d2)

Subsequent dataset will combine and overwrite the original

oned2.combine_first(d1)

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Written by Charlie

27 Followers

12 Following

https://www.linkedin.com/in/changyangxu/

No responses yet

Write a response

What are your thoughts?

Also publish to my profile

Recommended from Medium

How Has the Covid-19 Pandemic Affected Singapore: the Cost of Living and More

Lu Zhenna

How Has the Covid-19 Pandemic Affected Singapore: the Cost of Living and More

How the Covid-19 pandemic has affected the cost of living, unemployment rate, and population structure of Singapore using open datasets.

Sep 25, 2024

Optimizing Storage Using Snowflake Copy Command

Sumit Gangwar

Optimizing Storage Using Snowflake Copy Command

Purge=True Option in Copy Command

Oct 26, 2024

Lists

Predictive Modeling w/ Python

20 stories1856 saves

Coding & Development

11 stories1033 saves

Practical Guides to Machine Learning

10 stories2225 saves

ChatGPT prompts

51 stories2643 saves

20 Cutting-Edge Statistical Techniques Every Data Scientist Should Master in 2025

The Data Beast

20 Cutting-Edge Statistical Techniques Every Data Scientist Should Master in 2025

In today’s fast-paced data world, traditional methods are evolving rapidly. In 2025, the fusion of classical statistics, AI, and modern…

6d ago

20 Essential Git Commands Every Data Engineer / Data Scientist Should Know

Mayurkumar Surani

20 Essential Git Commands Every Data Engineer / Data Scientist Should Know

As an Data Engineer, version control is crucial for managing data pipelines, Infrastructure as Code (IaC), and ETL scripts. Here’s your…

Oct 31, 2024

ETL Pipeline with AWS Glue and PySpark: A Hands-on PoC

Jeeva-AWSLabsJourney

ETL Pipeline with AWS Glue and PySpark: A Hands-on PoC

1. Setup AWS Glue Resources

Sep 23, 2024

Alternatives to Pandas for Data Wrangling in Python

Bragadeesh Sundararajan

Alternatives to Pandas for Data Wrangling in Python

Pandas has long been the go-to library for data wrangling and manipulation in Python. Its user-friendly interface and powerful DataFrame…

Oct 23, 2024

See more recommendations

Help
Status
About
Careers
Press
Blog
Privacy
Terms
Text to speech
Teams