Debugging Joins with Pandas
Pandas is a wonderful open source data analysis library in Python. Here’s one of many examples of how I use it in my day-to-day.
Assumptions
- The data you are working with can fit in memory. Processing a data set that is too large to fit in memory requires some tricks to processing it with Pandas or using other frameworks. I made a note to myself to create a post for handling such cases.
Setup
I generated some dummy data using Mockaroo.
Here’s a preview of the first DataFrame. A primary key, id; a first name, and a last name.
Command
names.head()
Output
id first_name last_name
0 1 Horatio Lamborn
1 2 Beret Francecione
2 3 Gherardo Bulgen
3 4 Gregg Willgoose
4 5 Felicity Affron
And here’s a preview of the second DataFrame. A primary key, id, and their corresponding email address.
Command
emails.head()
Output
id email
0 1 abernardez0@springer.com
1 2 ktrueman1@icq.com
2 3 hposer2@opensource.org
3 4 jwillmer3@unblog.fr
4 5 glempke4@google.ru
Joining the data
We know our customers names but for our marketing campaign we also need their email addresses. As in SQL and with many other data analysis frameworks, Pandas supports left, outer, inner, and cross joins. When debugging joins, I often use an outer join.
Command
names.merge(emails, on="id", how="outer", indicator=True)
Output
id first_name last_name email _merge
0 1 Horatio Lamborn abernardez0@springer.com both
1 2 Beret Francecione ktrueman1@icq.com both
2 3 Gherardo Bulgen hposer2@opensource.org both
3 4 Gregg Willgoose jwillmer3@unblog.fr both
4 5 Felicity Affron glempke4@google.ru both
.. ... ... ... ... ...
498 499 Olivia Casellas NaN left_only
499 500 Joyce Hibbart NaN left_only
500 598 NaN NaN asodeaudg@t-online.de right_only
501 599 NaN NaN mlicciardidh@cbsnews.com right_only
502 600 NaN NaN egiberddi@wikimedia.org right_only
Adding the indicator=True
argument adds another field to the DataFrame, _merge
. If you pass a string, that will be the resulting field name. This field provides information on the source of each row. E.g., If the value is both
, the merge key was found in both DataFrames.
I can now easily determine for which customers I am missing their email address and vice versa for which email addresses I am missing name information.
Find customers without an email address
Command
df[df['_merge'] == 'left_only']
Output
id first_name last_name email _merge
484 485 Janeen Steiner NaN left_only
485 486 Elton Holston NaN left_only
486 487 Stefano McParlin NaN left_only
487 488 Cleveland Guitt NaN left_only
488 489 Errick O Mullen NaN left_only
489 490 Linn Thompson NaN left_only
490 491 Murray Mibourne NaN left_only
491 492 Saw Kryszkiecicz NaN left_only
492 493 Yoshiko Tixall NaN left_only
493 494 Humphrey MacArd NaN left_only
494 495 Eartha Rash NaN left_only
495 496 Rora Sherringham NaN left_only
496 497 Whitaker Waith NaN left_only
497 498 Tobye Fonzo NaN left_only
498 499 Olivia Casellas NaN left_only
499 500 Joyce Hibbart NaN left_only
Find email addresses missing name information
Command
df[df['_merge'] == 'right_only']
Output
id first_name last_name email _merge
500 598 NaN NaN asodeaudg@t-online.de right_only
501 599 NaN NaN mlicciardidh@cbsnews.com right_only
502 600 NaN NaN egiberddi@wikimedia.org right_only
Closing thoughts
This is just a fraction of what Pandas is capable of. For a deeper dive, I recommend referencing their great documentation.