Chris Moffitt: Updated: Using Pandas To Create an ExcelDiff
source link: https://www.tuicool.com/articles/hit/imaiim2
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Introduction
Several years ago, I wrotean article about using pandas to creating a diff of two excel files. Over the years, the pandas API has changed and the diff script no longer works with the latest pandas releases. Through the magic of search engines, people are still discovering the article and are asking for help in getting it to work with more recent versions of pandas. Since pandas is closing in on a 1.0 release, I think this is a good time to get an updated version out there.
Get Started
I encourage you to read the theearlier article in order to learn more about the goals and potential limitations of this approach. This article would not have been possible without all the comentors that provided fixes and updates to the code. To be perfectly honest, there are a couple of tricky operations in this code that I probably would not have figured out if not for the kindess of the people that read this blog.
I am going to go through this fairly quickly. If there is interest, I may cover a couple of these concepts in a future post. Feel free to follow along in the notebook . The data files are on github as well.
The first step is to import pandas and read in the files:
import pandas as pd # Read in the two files but call the data old and new and create columns to track old = pd.read_excel('sample-address-1.xlsx', 'Sheet1', na_values=['NA']) new = pd.read_excel('sample-address-2.xlsx', 'Sheet1', na_values=['NA']) old['version'] = "old" new['version'] = "new"
Here’s what the top of the new
dataframe looks like:
Our diff process is looking for three types of changes:
- What are the new accounts?
- What are the removed/dropped accounts?
- What accounts have changed information?
Since everything is keyed on the account number
we can use python sets to make the process simpler to understand. This change from the original approach makes the overall code a little easier to understand in my opinion. I have not tested this on large datasets to evaluate the performance.
old_accts_all = set(old['account number']) new_accts_all = set(new['account number']) dropped_accts = old_accts_all - new_accts_all added_accts = new_accts_all - old_accts_all
Next we join all the data together and get a clean list of unique data and keep all changed rows by using drop_duplicates
all_data = pd.concat([old,new],ignore_index=True) changes = all_data.drop_duplicates(subset=["account number", "name", "street", "city","state", "postal code"], keep='last')
Next, we need to figure out which account numbers have duplicate entries. A duplicate account number is an indication that they have changed values in a field that we need to flag. We can use the duplicated
function to get a list of all those account numbers and filter out just those duplicated accounts:
dupe_accts = changes[changes['account number'].duplicated() == True]['account number'].tolist() dupes = changes[changes["account number"].isin(dupe_accts)]
Here is what the dupes look like now:
account number name street city state postal code version 24 595932 Kuhic, Eichmann and West 4059 Tobias Inlet New Rylanfurt Illinois 89271 old 30 558879 Watsica Group 95616 Enos Grove Suite 139 West Atlas Iowa 47419 old 96 880043 Beatty Inc 3641 Schaefer Isle Suite 171 North Gardnertown Wyoming 64318 old 123 595932 Kuhic, Eichmann and West 4059 Tobias St New Rylanfurt Illinois 89271 new 129 558879 Watsica Group 829 Big street Smithtown Ohio 47919 new 195 880043 Beatty Inc 3641 Schaefer Isle Suite 171 North Gardnertown Wyoming 64918 new Now we break out the old and new data, remove the unnecesary version column and set the account number
as the index. These steps set up the data for the final comparison.
# Pull out the old and new data into separate dataframes change_new = dupes[(dupes["version"] == "new")] change_old = dupes[(dupes["version"] == "old")] # Drop the temp columns - we don't need them now change_new = change_new.drop(['version'], axis=1) change_old = change_old.drop(['version'], axis=1) # Index on the account numbers change_new.set_index('account number', inplace=True) change_old.set_index('account number', inplace=True) # Combine all the changes together df_all_changes = pd.concat([change_old, change_new], axis='columns', keys=['old', 'new'], join='outer')
Now we have a dataframe that looks like this:
old new name street city state postal code name street city state postal code account number 595932 Kuhic, Eichmann and West 4059 Tobias Inlet New Rylanfurt Illinois 89271 Kuhic, Eichmann and West 4059 Tobias St New Rylanfurt Illinois 89271 558879 Watsica Group 95616 Enos Grove Suite 139 West Atlas Iowa 47419 Watsica Group 829 Big street Smithtown Ohio 47919 880043 Beatty Inc 3641 Schaefer Isle Suite 171 North Gardnertown Wyoming 64318 Beatty Inc 3641 Schaefer Isle Suite 171 North Gardnertown Wyoming 64918Before we do our final big combine, we need to define a function that will show us what has changed from column to column:
# Define the diff function to show the changes in each field def report_diff(x): return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
We now use the swaplevel
function to get the old and new columns next to each other:
df_all_changes = df_all_changes.swaplevel(axis='columns')[change_new.columns[0:]]
Which makes this change to the dataframe:
name street city state postal code old new old new old new old new old new account number 595932 Kuhic, Eichmann and West Kuhic, Eichmann and West 4059 Tobias Inlet 4059 Tobias St New Rylanfurt New Rylanfurt Illinois Illinois 89271 89271 558879 Watsica Group Watsica Group 95616 Enos Grove Suite 139 829 Big street West Atlas Smithtown Iowa Ohio 47419 47919 880043 Beatty Inc Beatty Inc 3641 Schaefer Isle Suite 171 3641 Schaefer Isle Suite 171 North Gardnertown North Gardnertown Wyoming Wyoming 64318 64918 The final tricky command is to use a groupby
on the columns then apply, our custom report_diff
function to compare the two corresponding columns to each other.
df_changed = df_all_changes.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1)) df_changed = df_changed.reset_index()
Which yields a nice summary of the changed columns:
account number city name postal code state street 0 595932 New Rylanfurt Kuhic, Eichmann and West 89271 Illinois 4059 Tobias Inlet —-> 4059 Tobias St 1 558879 West Atlas —-> Smithtown Watsica Group 47419 —-> 47919 Iowa —-> Ohio 95616 Enos Grove Suite 139 —-> 829 Big street 2 880043 North Gardnertown Beatty Inc 64318 —-> 64918 Wyoming 3641 Schaefer Isle Suite 171The final analysis step is to figure out what has been removed and added:
df_removed = changes[changes["account number"].isin(dropped_accts)] df_added = changes[changes["account number"].isin(added_accts)]
We can output everything to an Excel file with a separate tab for changes, additions and removals:
output_columns = ["account number", "name", "street", "city", "state", "postal code"] writer = pd.ExcelWriter("my-diff.xlsx") df_changed.to_excel(writer,"changed", index=False, columns=output_columns) df_removed.to_excel(writer,"removed",index=False, columns=output_columns) df_added.to_excel(writer,"added",index=False, columns=output_columns) writer.save()
Here is what it looks like:
Conclusion
Thanks again to those that commented on the original article and provided ideas on how to improve. I hope some may find this useful. As always, if you find a great use for this script, let me know in the comments.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK