Apr 13, 2022

Debugging Joins with Pandas

Pandas is a wonderful open source data analysis library in Python. Here’s one of many examples of how I use it in my day-to-day.

Assumptions

The data you are working with can fit in memory. Processing a data set that is too large to fit in memory requires some tricks to processing it with Pandas or using other frameworks. I made a note to myself to create a post for handling such cases.

Setup

I generated some dummy data using Mockaroo.

Here’s a preview of the first DataFrame. A primary key, id; a first name, and a last name.

Command

names.head()

Output

   id first_name    last_name
0   1    Horatio      Lamborn
1   2      Beret  Francecione
2   3   Gherardo       Bulgen
3   4      Gregg    Willgoose
4   5   Felicity       Affron

And here’s a preview of the second DataFrame. A primary key, id, and their corresponding email address.

Command

emails.head()

Output

   id                     email
0   1  abernardez0@springer.com
1   2         ktrueman1@icq.com
2   3    hposer2@opensource.org
3   4       jwillmer3@unblog.fr
4   5        glempke4@google.ru

Joining the data

We know our customers names but for our marketing campaign we also need their email addresses. As in SQL and with many other data analysis frameworks, Pandas supports left, outer, inner, and cross joins. When debugging joins, I often use an outer join.

Command

names.merge(emails, on="id", how="outer", indicator=True)

Output

      id first_name    last_name                     email      _merge
0      1    Horatio      Lamborn  abernardez0@springer.com        both
1      2      Beret  Francecione         ktrueman1@icq.com        both
2      3   Gherardo       Bulgen    hposer2@opensource.org        both
3      4      Gregg    Willgoose       jwillmer3@unblog.fr        both
4      5   Felicity       Affron        glempke4@google.ru        both
..   ...        ...          ...                       ...         ...
498  499     Olivia     Casellas                       NaN   left_only
499  500      Joyce      Hibbart                       NaN   left_only
500  598        NaN          NaN     asodeaudg@t-online.de  right_only
501  599        NaN          NaN  mlicciardidh@cbsnews.com  right_only
502  600        NaN          NaN   egiberddi@wikimedia.org  right_only

Adding the indicator=True argument adds another field to the DataFrame, _merge. If you pass a string, that will be the resulting field name. This field provides information on the source of each row. E.g., If the value is both, the merge key was found in both DataFrames.

I can now easily determine for which customers I am missing their email address and vice versa for which email addresses I am missing name information.

Find customers without an email address

Command

df[df['_merge'] == 'left_only']

Output

      id first_name     last_name email     _merge
484  485     Janeen       Steiner   NaN  left_only
485  486      Elton       Holston   NaN  left_only
486  487    Stefano      McParlin   NaN  left_only
487  488  Cleveland         Guitt   NaN  left_only
488  489     Errick      O Mullen   NaN  left_only
489  490       Linn      Thompson   NaN  left_only
490  491     Murray      Mibourne   NaN  left_only
491  492        Saw  Kryszkiecicz   NaN  left_only
492  493    Yoshiko        Tixall   NaN  left_only
493  494   Humphrey        MacArd   NaN  left_only
494  495     Eartha          Rash   NaN  left_only
495  496       Rora   Sherringham   NaN  left_only
496  497   Whitaker         Waith   NaN  left_only
497  498      Tobye         Fonzo   NaN  left_only
498  499     Olivia      Casellas   NaN  left_only
499  500      Joyce       Hibbart   NaN  left_only

Find email addresses missing name information

Command

df[df['_merge'] == 'right_only']

Output

      id first_name last_name                     email      _merge
500  598        NaN       NaN     asodeaudg@t-online.de  right_only
501  599        NaN       NaN  mlicciardidh@cbsnews.com  right_only
502  600        NaN       NaN   egiberddi@wikimedia.org  right_only

Closing thoughts

This is just a fraction of what Pandas is capable of. For a deeper dive, I recommend referencing their great documentation.