Useful Awk Commands

Awk is a great tool for investigating data quality issues. Here are some commands which I frequently run.

Assumptions

  • The sample file contains a header row and 1000 records.
  • The fields are pipe-delimited.

Previewing the data

Let’s examine the first five rows.

Command
head -5 sample_data.txt
Output
id|first_name|last_name|email|gender|ip_address
1|Nerita|Vedishchev|nvedishchev0@techcrunch.com|Female|226.208.76.169
2|Elmer|Atty|eatty1@mapquest.com|Male|70.77.234.160
3|Emiline|Setter|esetter2@google.de|Female|100.71.154.94
4|Erroll|Follis|efollis3@woothemes.com|Male|46.170.240.200

Number of fields on each line

Command
awk -F\| '{print NF}' sample_data.txt | sort | uniq -c
Output
1001 6

If there were any records that had more fields than we expect, we could add a predicate to our Awk command to focus exclusively on those. For example, to examine the first five records that don’t have 6 fields we’d run:

awk -F\| 'NF!=6{print}' sample_data.txt | head -5

Focus on a specific line

It’s worth noting that Awk’s line numbers start from 1, not 0. Since the first line of the file is the header, NR==100 returns the record with id 99.

Command
awk 'NR==100' sample_data.txt
Output
99|Maiga|Denney|mdenney2q@omniture.com|Female|147.216.143.69
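The same NR counter extends naturally to ranges of lines. A quick sketch using a throwaway numbered file (the file name here is made up, not part of the sample data):

```shell
# Generate ten numbered lines as a stand-in input file.
seq 10 > lines.txt
# NR is Awk's 1-based record counter, so this prints lines 4 through 6.
awk 'NR>=4 && NR<=6' lines.txt
# prints 4, 5 and 6 on separate lines
```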

Extract field names and their corresponding positions

Command
head -1 sample_data.txt | awk -F\| '{ for (i = 1; i <= NF; ++i) print i, $i; exit }'
Output
1 id
2 first_name
3 last_name
4 email
5 gender
6 ip_address

This can be useful when you are working with a file with many fields and need to construct an external table in your database:

Command
head -1 sample_data.txt | awk -F\| '{ for (i=1; i<=NF; ++i) print $i" "; exit}' | tr '[a-z]' '[A-Z]' | sed 's/$/STRING,/'
Output
ID STRING,
FIRST_NAME STRING,
LAST_NAME STRING,
EMAIL STRING,
GENDER STRING,
IP_ADDRESS STRING,

Most of the work is done for your DDL query; you just need to remove the comma from the last field.
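That trailing comma can also be stripped within the pipeline itself, since sed’s $ address targets the last line only. A sketch using a hypothetical one-line header file standing in for sample_data.txt:

```shell
# A made-up header row standing in for sample_data.txt.
printf 'id|first_name|email\n' > header.txt
# Same pipeline as above, plus a final sed that deletes the comma on the last line only.
head -1 header.txt | awk -F\| '{ for (i=1; i<=NF; ++i) print $i" "; exit }' \
  | tr '[a-z]' '[A-Z]' | sed 's/$/STRING,/' | sed '$ s/,$//'
```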

Finding the position of a specific field

Command
head -1 sample_data.txt | awk -F\| '{ for (i=1; i<=NF; ++i) {if ($i == "email") print i, $i} }'
Output
4 email
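Once you can find a field’s position, you can feed it back into Awk rather than hardcoding $4. A sketch (the two-row file below is hypothetical, not the real sample data):

```shell
# Hypothetical miniature of the sample file.
printf 'id|first_name|last_name|email|gender|ip_address\n' >  mini.txt
printf '1|Nerita|Vedishchev|nv@techcrunch.com|Female|226.208.76.169\n' >> mini.txt
# Look up the 1-based position of the "email" column from the header.
col=$(head -1 mini.txt | awk -F\| '{ for (i=1; i<=NF; ++i) if ($i == "email") print i }')
# Pass the position in with -v and print that field for the data rows.
awk -F\| -v c="$col" 'NR>1 {print $c}' mini.txt
# prints nv@techcrunch.com
```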

Extracting output for a particular field

Let’s preview some of the records from the email field. From the command above we know this is the fourth field.

Command
awk -F\| '{print $4}' sample_data.txt | sort | uniq -c | sort -nr | head -5
Output
      1 zmccookc9@cyberchimps.com
      1 zbromage33@pcworld.com
      1 zboissieuxi2@163.com
      1 yspatonigl@gnu.org
      1 ynesbyp@twitter.com

All of the emails are unique; however, if there were any duplicates we would have detected them with the above command.
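If duplicates did exist, uniq -d surfaces only the repeated values directly. A sketch against a tiny hypothetical file containing one duplicated email:

```shell
# Build a tiny stand-in file (hypothetical rows, not the real sample data).
printf 'id|first_name|last_name|email|gender|ip_address\n' > dupes.txt
printf '1|A|B|a@example.com|Female|1.1.1.1\n'  >> dupes.txt
printf '2|C|D|a@example.com|Male|2.2.2.2\n'    >> dupes.txt
printf '3|E|F|b@example.com|Female|3.3.3.3\n'  >> dupes.txt
# NR>1 skips the header; uniq -d prints only values that appear more than once.
awk -F\| 'NR>1 {print $4}' dupes.txt | sort | uniq -d
# prints a@example.com
```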

Extracting record(s) where a condition is met

To obtain records for the first five females we’d run the following command.

Command
awk -F\| '$5=="Female" {print}' sample_data.txt | head -5
Output
1|Nerita|Vedishchev|nvedishchev0@techcrunch.com|Female|226.208.76.169
3|Emiline|Setter|esetter2@google.de|Female|100.71.154.94
6|Annabelle|Housaman|ahousaman5@nyu.edu|Female|233.188.201.46
7|Wilmette|Creddon|wcreddon6@ocn.ne.jp|Female|253.149.230.188
8|Chrystal|Tivnan|ctivnan7@geocities.jp|Female|57.200.157.156
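Predicates can also be combined with && (and || for alternatives). A sketch on a hypothetical three-row file, matching females whose email ends in .de:

```shell
# Hypothetical miniature of the sample file.
printf 'id|first_name|last_name|email|gender|ip_address\n'    >  people.txt
printf '1|Nerita|Vedishchev|n@x.com|Female|226.208.76.169\n'  >> people.txt
printf '2|Elmer|Atty|e@x.com|Male|70.77.234.160\n'            >> people.txt
printf '3|Emiline|Setter|s@google.de|Female|100.71.154.94\n'  >> people.txt
# Both conditions must hold: gender is Female AND the email field ends in .de
awk -F\| '$5=="Female" && $4 ~ /\.de$/' people.txt
# prints 3|Emiline|Setter|s@google.de|Female|100.71.154.94
```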

Closing thoughts

This is just a preview of what Awk is capable of. I hope I’ve conveyed the value of learning this tool. For a deeper dive, I recommend referencing the documentation.



2022-04-12 16:27 -0400