Thursday, November 4, 2010

HOWTO: Split file into multiple files based on a column with awk

Sometimes you have to split a file into multiple files, based on the value of a certain column.

For example, consider the following data, from a file called "data.txt", which is filled with the output of "ps aux":

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  23824  1828 ?        Ss   Oct24   0:01 /sbin/init
root         2  0.0  0.0      0     0 ?        S    Oct24   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Oct24   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Oct24   0:53 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Oct24   0:00 [watchdog/0]
www-data   982  0.0  0.0 142728  5840 ?        S    Nov01   0:00 /usr/sbin/apache2 root       989  0.0  0.0  21268  5616 ?        Ss   Oct24   0:13 /sbin/mount.ntfs root      1000  0.0  0.0      0     0 ?        S    Oct24   0:00 [kjournald]
root      1003  0.0  0.0      0     0 ?        S    Oct24   0:00 [kjournald]
syslog    1027  0.0  0.0 192284  1336 ?        Sl   Oct24   0:01 rsyslogd -c4
106       1038  0.0  0.0  24644  2112 ?        Ss   Oct24   0:04 dbus-daemon root      1046  0.0  0.0  28804   924 ?        Ss   Oct24   0:00 kdm
root      1057  0.0  0.0  84776  3712 ?        Ss   Oct24   0:00 NetworkManager
avahi     1058  0.0  0.0  34060  1468 ?        S    Oct24   0:03 avahi-daemon: avahi     1059  0.0  0.0  33928   312 ?        Ss   Oct24   0:00 avahi-daemon: root      1065  0.0  0.0  57868  2004 ?        S    Oct24   0:00 /usr/sbin/modem
daemon    1220  0.0  0.0  18884   308 ?        Ss   Oct24   0:00 atd
root      1221  0.0  0.0  21076   852 ?        Ss   Oct24   0:00 cron
mysql     1283  0.0  0.2 178828 13420 ?        Ssl  Oct24   3:26 /usr/sbin/mysqld
proxy     1461  0.0  0.0   6552   784 ?        Ss   Oct24   0:00 /usr/bin/polipo root      1472  0.0  0.0  97040  1032 ?        Ss   Oct24   0:00 /usr/bin/svnserve 120       1480  0.0  0.3  51336 21180 ?        S    Oct24   1:01 /usr/sbin/tor
yossarian     1505  0.1  0.3 400928 23180 ?        Ssl  Oct24  29:06 /usr/bin/mpd 

Now suppose that you wanted to put each user's processes into a separate file. This example is a little contrived, but works just as well when you have a file with tons of entries with the first column as the key and you wish to split them into multiple files based on the key.

The first row is the header, and we can eliminate that with sed. The splitting part is the coll part is accomplished with awk's field variables:

awk '{print > $1".txt"}

All this says is print the line a text file named after the value of the first column, effectively splitting the original text file into multiple text files, each named after the column entry and with all rows that start with that column entry.

A quick ls shows us that exactly the files we wanted were created:

-rw-r--r--  1 yossarian yossarian      235 2010-11-04 19:30 yossarian.txt
-rw-r--r--  1 yossarian yossarian       92 2010-11-04 19:30 www-data.txt
-rw-r--r--  1 yossarian yossarian       78 2010-11-04 19:30 syslog.txt
-rw-r--r--  1 yossarian yossarian     1178 2010-11-04 19:30 root.txt
-rw-r--r--  1 yossarian yossarian      135 2010-11-04 19:30 proxy.txt
-rw-r--r--  1 yossarian yossarian       82 2010-11-04 19:30 mysql.txt
-rw-r--r--  1 yossarian yossarian       69 2010-11-04 19:30 daemon.txt
-rw-r--r--  1 yossarian yossarian      206 2010-11-04 19:30 avahi.txt
-rw-r--r--  1 yossarian yossarian       79 2010-11-04 19:30 120.txt
-rw-r--r--  1 yossarian yossarian       93 2010-11-04 19:30 106.txt

Combining the two commands into one gives us:

sed 1d data.txt | awk '{print > $1".txt"}'

That's some catch indeed. Enjoy your one-liner text processing.

No comments:

Post a Comment