Useful Awk Commands

Awk is a great tool for investigating data quality issues. Here are some commands which I frequently run.

Assumptions

  • The sample file contains a header row and 1000 records.
  • The fields are pipe-delimited.

Previewing the data

Let’s examine the first five rows.

Command
head -5 sample_data.txt
Output
id|first_name|last_name|email|gender|ip_address
1|Nerita|Vedishchev|nvedishchev0@techcrunch.com|Female|226.208.76.169
2|Elmer|Atty|eatty1@mapquest.com|Male|70.77.234.160
3|Emiline|Setter|esetter2@google.de|Female|100.71.154.94
4|Erroll|Follis|efollis3@woothemes.com|Male|46.170.240.200

Number of fields on each line

Command
awk -F\| '{print NF}' sample_data.txt | sort | uniq -c
Output
1001 6

If there were any records that had more fields than we expect, we could add a predicate to our Awk command to focus exclusively on those. For example, to examine the first five records that don’t have 6 fields we’d run:

awk -F\| 'NF!=6{print}' sample_data.txt | head -5

Focus on a specific line

It’s worth noting that Awk’s line numbers start from 1, not 0. Since the first line of the file is the header, NR==100 returns the record with id 99.

Command
awk 'NR==100' sample_data.txt
Output
99|Maiga|Denney|mdenney2q@omniture.com|Female|147.216.143.69
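The same NR counter extends naturally to ranges of lines. A quick sketch using a throwaway numbered file (the file name here is made up, not part of the sample data):

```shell
# Generate ten numbered lines as a stand-in input file.
seq 10 > lines.txt
# NR is Awk's 1-based record counter, so this prints lines 4 through 6.
awk 'NR>=4 && NR<=6' lines.txt
# prints 4, 5 and 6 on separate lines
```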

Extract field names and their corresponding positions

Command
head -1 sample_data.txt | awk -F\| '{ for (i = 1; i <= NF; ++i) print i, $i; exit }'
Output
1 id
2 first_name
3 last_name
4 email
5 gender
6 ip_address

This can be useful when you are working with a file with many fields and need to construct an external table in your database:

Command
head -1 sample_data.txt | awk -F\| '{ for (i=1; i<=NF; ++i) print $i" "; exit}' | tr '[a-z]' '[A-Z]' | sed 's/$/STRING,/'
Output
ID STRING,
FIRST_NAME STRING,
LAST_NAME STRING,
EMAIL STRING,
GENDER STRING,
IP_ADDRESS STRING,

Most of the work is done for your DDL query; you just need to remove the comma from the last field.
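That trailing comma can also be stripped within the pipeline itself, since sed’s $ address targets the last line only. A sketch using a hypothetical one-line header file standing in for sample_data.txt:

```shell
# A made-up header row standing in for sample_data.txt.
printf 'id|first_name|email\n' > header.txt
# Same pipeline as above, plus a final sed that deletes the comma on the last line only.
head -1 header.txt | awk -F\| '{ for (i=1; i<=NF; ++i) print $i" "; exit }' \
  | tr '[a-z]' '[A-Z]' | sed 's/$/STRING,/' | sed '$ s/,$//'
```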

Finding the position of a specific field

Command
head -1 sample_data.txt | awk -F\| '{ for (i=1; i<=NF; ++i) {if ($i == "email") print i, $i} }'
Output
4 email
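Once you can find a field’s position, you can feed it back into Awk rather than hardcoding $4. A sketch (the two-row file below is hypothetical, not the real sample data):

```shell
# Hypothetical miniature of the sample file.
printf 'id|first_name|last_name|email|gender|ip_address\n' >  mini.txt
printf '1|Nerita|Vedishchev|nv@techcrunch.com|Female|226.208.76.169\n' >> mini.txt
# Look up the 1-based position of the "email" column from the header.
col=$(head -1 mini.txt | awk -F\| '{ for (i=1; i<=NF; ++i) if ($i == "email") print i }')
# Pass the position in with -v and print that field for the data rows.
awk -F\| -v c="$col" 'NR>1 {print $c}' mini.txt
# prints nv@techcrunch.com
```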

Extracting output for a particular field

Let’s preview some of the records from the email field. From the command above we know this is the fourth field.

Command
awk -F\| '{print $4}' sample_data.txt | sort | uniq -c | sort -nr | head -5
Output
      1 zmccookc9@cyberchimps.com
      1 zbromage33@pcworld.com
      1 zboissieuxi2@163.com
      1 yspatonigl@gnu.org
      1 ynesbyp@twitter.com

All of the emails are unique; however, if there were any duplicates we would have detected them with the above command.
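If duplicates did exist, uniq -d surfaces only the repeated values directly. A sketch against a tiny hypothetical file containing one duplicated email:

```shell
# Build a tiny stand-in file (hypothetical rows, not the real sample data).
printf 'id|first_name|last_name|email|gender|ip_address\n' > dupes.txt
printf '1|A|B|a@example.com|Female|1.1.1.1\n'  >> dupes.txt
printf '2|C|D|a@example.com|Male|2.2.2.2\n'    >> dupes.txt
printf '3|E|F|b@example.com|Female|3.3.3.3\n'  >> dupes.txt
# NR>1 skips the header; uniq -d prints only values that appear more than once.
awk -F\| 'NR>1 {print $4}' dupes.txt | sort | uniq -d
# prints a@example.com
```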

Extracting record(s) where a condition is met

To obtain records for the first five females we’d run the following command.

Command
awk -F\| '$5=="Female" {print}' sample_data.txt | head -5
Output
1|Nerita|Vedishchev|nvedishchev0@techcrunch.com|Female|226.208.76.169
3|Emiline|Setter|esetter2@google.de|Female|100.71.154.94
6|Annabelle|Housaman|ahousaman5@nyu.edu|Female|233.188.201.46
7|Wilmette|Creddon|wcreddon6@ocn.ne.jp|Female|253.149.230.188
8|Chrystal|Tivnan|ctivnan7@geocities.jp|Female|57.200.157.156
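Predicates can also be combined with && (and || for alternatives). A sketch on a hypothetical three-row file, matching females whose email ends in .de:

```shell
# Hypothetical miniature of the sample file.
printf 'id|first_name|last_name|email|gender|ip_address\n'    >  people.txt
printf '1|Nerita|Vedishchev|n@x.com|Female|226.208.76.169\n'  >> people.txt
printf '2|Elmer|Atty|e@x.com|Male|70.77.234.160\n'            >> people.txt
printf '3|Emiline|Setter|s@google.de|Female|100.71.154.94\n'  >> people.txt
# Both conditions must hold: gender is Female AND the email field ends in .de
awk -F\| '$5=="Female" && $4 ~ /\.de$/' people.txt
# prints 3|Emiline|Setter|s@google.de|Female|100.71.154.94
```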

Closing thoughts

This is just a preview of what Awk is capable of. I hope I’ve conveyed the value of learning this tool. For a deeper dive, I recommend referencing the documentation.



2022-04-12 16:27 -0400