Aggregate operations in shell
You might be familiar with aggregate clauses in SQL, you know, count(*)
and friends. I was interested in performing aggregate functions on files of tab-separated data. I’d do an aggregate query in SQL, get a set of result rows, then I needed to total one column.
I output the data to TSV, using a mysql -e
command and redirecting the output to a file. The data looks something like:
id viewcount comments 127067 341 44 127076 66 2 127077 158 6 127111 379 25 127112 83 3 127119 105 10 127131 47 0 127135 51 1 127137 133 17
I found that awk
has the required mix of unix command line tool sensibilities and numeric functions. Here awk
accumulates the 3rd value of each row and displays the total.
cat data.tsv | awk '{x+=$3} END {print x}'
awk
is basically C-like and has a small core set of numeric functions built-in. To calculate mean, divide the accumulator result by the built-in variable NR, which is incremented each time a record (line) is processed and so contains the total line count by the time we reach the END block.