Small Data: Processing CSV In Unix

Oct 10, 2012

Unix provides a wealth of great tools for performing simple analysis of small datasets. awk in particular is useful for crunching CSV files.

Why use Unix tools and not Excel? The commands you create can be saved as a script and applied to updated files, multiple files at once, combined with curl to process a file from the web.

Here’s a few examples using a fictional data file, emails.csv, which contains data about a number of email campaigns sent to customers.

Get Started

First, let’s inspect the file format by printing the first few rows.

$ head -5 emails.csv
ReportRunOn,MemberID,StartDate,EndDate
10/10/2012 13:13:10,1038277,10/09/2012 00:00:00,10/10/2012 23:59:59

UserName,UserID,JobID,EmailName,EmailSubject,PickupTime,BCC,CcEmail,NormalSends,NormalSendsBCC,NormalSendsCC,TriggeredSends,TriggeredSendsBCC,TriggeredSendsCC,NormalSends_1,NormalSendsBCC_1,NormalSendsCC_1,TriggeredSends_1,TriggeredSendsBCC_1,TriggeredSendsCC_1
Ben Godfrey,ben.godfrey,1839429,Example Email,Give us your feedback!,05/10/2012 15:22:55,,,1,0,0,0,0,0,470,0,0,0,0,0

This file has a metadata section at the top. Let’s chop that off.

$ tail +4 emails.csv > emails-no-meta.csv

Introducing `awk`

The file has a lot of columns. We can use awk to extract only the columns we’re interested in.

As an intermediate result, let’s just print out the first column.

$ awk -F, '{ print $1 }' emails-no-meta.csv | head
UserName
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Other User
Other User
Other User
Other User

awk’s -F argument sets the field separator. We’re using the comma character. Use -F\\t for tabs. The default separator is the space character.

We provide a program to awk on the command line. awk programs operate on each line, with BEGIN and END sections to do pre- and post-processing on the whole file.

This simple program contains a block which is applied to all lines and prints $1, which is awk’s syntax for column 1.

We can go a step futher and add a pattern match to our program so we print the first column of lines matching “Ben Godfrey.”

$ awk -F, '/Ben Godfrey/ { print $1 }' emails-no-meta.csv
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey

This is functionally equivalent to using grep and piping the result to awk.

$ grep "Ben Godfrey" emails-no-meta.csv | awk -F, '{ print $1 }'
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey

The awk-only solution involves less overhead - only one pass through the file and only program is used - but grep is such a powerful tool and familiar to so many that it is often used for the filtering part of a pipeline. Combining simple tools to create powerful results is a key part of the Unix philosophy.

More Columns

We can also use awk to construct a new CSV file with a subset of columns.

$ awk -F, '/Ben Godfrey/ { print $1","$4","$6","$9 }' emails-no-meta.csv
Ben Godfrey,Example Email,05/10/2012 15:22:55,1
Ben Godfrey,Example Email,05/10/2012 17:06:59,1
Ben Godfrey,Example Email,08/10/2012 10:36:45,2
Ben Godfrey,Example Email,08/10/2012 11:32:56,469
Ben Godfrey,Example Email,10/10/2012 12:02:53,4232

This is a powerful technique, because we can easily rerun that command on an updated file. This beats performing the same manual steps in Excel for every new file. We can also run the awk program again many files (potentially sourced from different locations with find or curl).

As a final trick, let’s total up how many emails I’ve sent.

$ awk -F, '/Ben Godfrey/ { total = total + $9 } END { print total }' emails-no-meta.csv
4705

We add the values from column 9 to the total variable as we proceed through the file (as long as the line matches the expression /Ben Godfrey/). Finally, we use an END block to print the value of total.

Vim

Most Unix tools, including awk, can operate on either a file or stdin and stdout. Within Vim, you can filter the contents using an external program.

If we open emails.csv in Vim, we can issue the following command to reduce the list of columns in the current buffer without modifying the original file.

:%! awk -F, '{ print $1","$4","$6","$9 }'

%! is a Vim command to filter the whole file through an external command. You can also filter a specified range. We could filter every line except the header.

:2,$! awk -F, '{ print $1","$4","$6","$9 }'

Also, you can select some lines in visual mode then filter those.

:'<,'>! awk -F, '{ print $1","$4","$6","$9 }'

Small Data: Processing CSV In Unix

Get Started

Introducing awk

More Columns

Vim

Introducing `awk`