Aggregate operations in shell

Feb 27, 2006

You might be familiar with aggregate clauses in SQL, you know, count(*) and friends. I was interested in performing aggregate functions on files of tab-separated data. I’d do an aggregate query in SQL, get a set of result rows, then I needed to total one column.

I output the data to TSV, using a mysql -e command and redirecting the output to a file. The data looks something like:

id      viewcount       comments
127067  341     44
127076  66      2
127077  158     6
127111  379     25
127112  83      3
127119  105     10
127131  47      0
127135  51      1
127137  133     17

I found that awk has the required mix of unix command line tool sensibilities and numeric functions. Here awk accumulates the 3rd value of each row and displays the total.

cat data.tsv | awk '{x+=$3} END {print x}'

awk is basically C-like and has a small core set of numeric functions built-in. To calculate mean, divide the accumulator result by the built-in variable NR, which is incremented each time a record (line) is processed and so contains the total line count by the time we reach the END block.