Adventures with awk and big data

It took me a long time to appreciate the brilliance of the way command line tools on Unix fit together, but one tool that will hasten that appreciation for anyone is awk. My first use of awk, some years ago, was to find the IDs of running processes so I could terminate them.

You might use `ps aux` to see the list of running processes.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   3780  2032 ?        Ss   Mar13   1:24 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Mar13   0:01 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Mar13   4:23 [ksoftirqd/0]
root         4  0.0  0.0      0     0 ?        S    Mar13   0:00 [kworker/0:0]
root         5  0.0  0.0      0     0 ?        S<   Mar13   0:00 [kworker/0:0H]
root         6  0.0  0.0      0     0 ?        S    Mar13   0:00 [kworker/u:0]
root         7  0.0  0.0      0     0 ?        S<   Mar13   0:00 [kworker/u:0H]
root         8  0.0  0.0      0     0 ?        S    Mar13   0:06 [migration/0]
root         9  0.0  0.0      0     0 ?        S    Mar13   0:00 [rcu_bh]

If you pipe the result to grep, you can isolate the processes with ‘kworker’ in the name.

$ ps aux | grep 'kworker'

root         4  0.0  0.0      0     0 ?        S    Mar13   0:00 [kworker/0:0]
root         5  0.0  0.0      0     0 ?        S<   Mar13   0:00 [kworker/0:0H]
root         6  0.0  0.0      0     0 ?        S    Mar13   0:00 [kworker/u:0]
root         7  0.0  0.0      0     0 ?        S<   Mar13   0:00 [kworker/u:0H]

Next you want the PID of each of them, but it is just one column among many, and kill needs the bare numbers, as in `kill 4` and so on. This is where awk can be magical.

$ ps aux | grep 'kworker' | awk '{ print $2 }'

4
5
6
7
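As a side note, awk can do the filtering itself, so the grep step is optional. The sketch below is a minimal illustration under a few assumptions: the `sample` variable is a hypothetical stand-in for real `ps aux` output, and COMMAND is assumed to be the 11th whitespace-separated field, as in the header shown earlier. Matching on that one column, rather than on the whole line, may also keep the filter from matching its own process on a live system.

```shell
# Hypothetical sample standing in for real `ps aux` output.
sample='USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         3  0.0  0.0      0     0 ?        S    Mar13   4:23 [ksoftirqd/0]
root         4  0.0  0.0      0     0 ?        S    Mar13   0:00 [kworker/0:0]
root         5  0.0  0.0      0     0 ?        S<   Mar13   0:00 [kworker/0:0H]'

# Print the PID (field 2) only for rows whose COMMAND (field 11) contains "kworker".
pids=$(printf '%s\n' "$sample" | awk '$11 ~ /kworker/ { print $2 }')
echo "$pids"
```

Against a live system the same awk program would read from `ps aux` instead of the sample variable.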

Now we have the PIDs we need, and we can pass them to kill using xargs. Some commands are finicky about receiving their arguments from xargs, in which case you could opt for a simple inline shell for loop instead.

$ ps aux | grep 'kworker' | awk '{ print $2 }' | xargs kill
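The for loop alternative might look something like the sketch below. It is a dry run: the `pids` variable is a hypothetical stand-in for the awk output, and `echo` prints each command instead of running it, so nothing is actually killed.

```shell
# Hypothetical PID list standing in for the output of the awk step.
pids='4
5
6
7'

# Unquoted $pids is split on whitespace, giving one loop iteration per PID.
# `echo` keeps this a dry run that only prints the commands.
commands=$(for pid in $pids; do echo "kill $pid"; done)
echo "$commands"
```

To use it for real, you would loop over `$(ps aux | grep 'kworker' | awk '{ print $2 }')` and drop the `echo`.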

Awk is killer at manipulating columnar text data, and it can also output the information in formats other than one item per line. The output record separator (ORS) is configurable.

$ ps aux | grep 'kworker' | awk '{ print $2 }' ORS=','
4,5,6,7,
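Setting ORS leaves a trailing comma and no final newline. If that matters, one possible approach (a sketch, not the only way) is to print the separator before every item except the first. The `pids` variable here is a hypothetical stand-in for the PID column, and `sep` is just an illustrative variable name.

```shell
# Hypothetical PID list standing in for the output of the awk step.
pids='4
5
6
7'

# sep is empty for the first record and "," afterwards, so no trailing comma appears.
# The END block prints the final newline.
joined=$(printf '%s\n' "$pids" | awk '{ printf "%s%s", sep, $1; sep = "," } END { print "" }')
echo "$joined"
```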

This whole post actually undersells awk: it’s a complete programming language unto itself. For example, we could sum the PIDs.

$ ps aux | grep 'kworker' | awk '{ sum += $2 } END { print sum }' 
22

We can also skip the header line. Say we wanted the sum of the PIDs of ALL running processes, for example:

$ ps aux | awk 'NR > 1 { sum += $2 } END { print sum }' 
345623

There we used NR (the number of records read so far, which is in effect the current row number) to pay attention only to rows after the header line. We can also select a single row by its number, such as the header itself:

$ ps aux | awk 'NR==1 { print $2 }'
PID
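Because NR keeps its final value inside an END block, it can also be used to count rows. The sketch below assumes a hypothetical, trimmed-down `sample` in the shape of `ps aux` output and reports how many data rows were seen along with the PID sum.

```shell
# Hypothetical, trimmed-down sample in the shape of `ps aux` output.
sample='USER PID COMMAND
root 1 /sbin/init
root 2 [kthreadd]
root 3 [ksoftirqd/0]'

# NR > 1 skips the header; in END, NR holds the total number of records read,
# so NR - 1 is the number of data rows.
summary=$(printf '%s\n' "$sample" | awk 'NR > 1 { sum += $2 } END { n = NR - 1; print n " processes, PID sum " sum }')
echo "$summary"
```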

I’m still adding tricks to my knowledge of awk, but combining it with sed, wc, redis-cli, and grep lends tremendous power in working with large datasets.
