Small Data: Processing CSV In Unix
Unix provides a wealth of great tools for performing simple analysis of small datasets. awk
in particular is useful for crunching CSV files.
Why use Unix tools and not Excel? The commands you create can be saved as a script and applied to updated files, multiple files at once, combined with curl to process a file from the web.
Here’s a few examples using a fictional data file, emails.csv
, which contains data about a number of email campaigns sent to customers.
Get Started
First, let’s inspect the file format by printing the first few rows.
$ head -5 emails.csv
ReportRunOn,MemberID,StartDate,EndDate
10/10/2012 13:13:10,1038277,10/09/2012 00:00:00,10/10/2012 23:59:59
UserName,UserID,JobID,EmailName,EmailSubject,PickupTime,BCC,CcEmail,NormalSends,NormalSendsBCC,NormalSendsCC,TriggeredSends,TriggeredSendsBCC,TriggeredSendsCC,NormalSends_1,NormalSendsBCC_1,NormalSendsCC_1,TriggeredSends_1,TriggeredSendsBCC_1,TriggeredSendsCC_1
Ben Godfrey,ben.godfrey,1839429,Example Email,Give us your feedback!,05/10/2012 15:22:55,,,1,0,0,0,0,0,470,0,0,0,0,0
This file has a metadata section at the top. Let’s chop that off.
$ tail +4 emails.csv > emails-no-meta.csv
Introducing awk
The file has a lot of columns. We can use awk
to extract only the columns
we’re interested in.
As an intermediate result, let’s just print out the first column.
$ awk -F, '{ print $1 }' emails-no-meta.csv | head
UserName
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Other User
Other User
Other User
Other User
awk
’s -F
argument sets the field separator. We’re using the comma character.
Use -F\\t
for tabs. The default separator is the space character.
We provide a program to awk
on the command line. awk
programs operate on
each line, with BEGIN
and END
sections to do pre- and post-processing on the
whole file.
This simple program contains a block which is applied to all lines and prints
$1
, which is awk
’s syntax for column 1.
We can go a step futher and add a pattern match to our program so we print the first column of lines matching “Ben Godfrey.”
$ awk -F, '/Ben Godfrey/ { print $1 }' emails-no-meta.csv
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
This is functionally equivalent to using grep
and piping the result to awk
.
$ grep "Ben Godfrey" emails-no-meta.csv | awk -F, '{ print $1 }'
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
Ben Godfrey
The awk
-only solution involves less overhead - only one pass through the file
and only program is used - but grep
is such a powerful tool and familiar to so
many that it is often used for the filtering part of a pipeline. Combining
simple tools to create powerful results is a key part of the Unix philosophy.
More Columns
We can also use awk
to construct a new CSV file with a subset of columns.
$ awk -F, '/Ben Godfrey/ { print $1","$4","$6","$9 }' emails-no-meta.csv
Ben Godfrey,Example Email,05/10/2012 15:22:55,1
Ben Godfrey,Example Email,05/10/2012 17:06:59,1
Ben Godfrey,Example Email,08/10/2012 10:36:45,2
Ben Godfrey,Example Email,08/10/2012 11:32:56,469
Ben Godfrey,Example Email,10/10/2012 12:02:53,4232
This is a powerful technique, because we can easily rerun that command on an
updated file. This beats performing the same manual steps in Excel for every new
file. We can also run the awk
program again many files (potentially sourced
from different locations with find
or curl
).
As a final trick, let’s total up how many emails I’ve sent.
$ awk -F, '/Ben Godfrey/ { total = total + $9 } END { print total }' emails-no-meta.csv
4705
We add the values from column 9 to the total
variable as we proceed through
the file (as long as the line matches the expression /Ben Godfrey/
). Finally,
we use an END
block to print the value of total
.
Vim
Most Unix tools, including awk
, can operate on either a file or stdin
and
stdout
. Within Vim, you can filter the contents using an external program.
If we open emails.csv
in Vim, we can issue the following command to reduce the
list of columns in the current buffer without modifying the original file.
:%! awk -F, '{ print $1","$4","$6","$9 }'
%!
is a Vim command to filter the whole file through an external command. You
can also filter a specified range. We could filter every line except the
header.
:2,$! awk -F, '{ print $1","$4","$6","$9 }'
Also, you can select some lines in visual mode then filter those.
:'<,'>! awk -F, '{ print $1","$4","$6","$9 }'