December 11, 2010

How to remove a column from a file

I often need to remove a column from a character delimited file so I can do some additional processing, such as counting up values. I'm going to use a file with the following contents as an example.


mike@shiner $ cat name_age_city.txt
Mike|35|New York
Jason|28|Dallas
Fox|32|Washington
Mike|28|Los Angeles
Ann|23|Denver
Dana|32|Louisville
Mike|32|Atlantic City
Ryan|30|Austin

We can use the cut command to extract the column with the age information.

mike@shiner $ cat name_age_city.txt | cut -d '|' -f 2
35
28
32
28
23
32
32
30

The -d option tells cut that the delimiter for the columns is the '|' character and the -f option specifies the second column. Let's say we want to know how many people have the same age. We can use the following command to enumerate these values for us.

mike@shiner $ cat name_age_city.txt | cut -d '|' -f 2 | sort | uniq -c
      1 23
      2 28
      1 30
      3 32
      1 35


If we need to extract multiple columns, cut gives us this functionality as well.

mike@shiner $ cat name_age_city.txt | cut -d '|' -f 1,3
Mike|New York
Jason|Dallas
Fox|Washington
Mike|Los Angeles
Ann|Denver
Dana|Louisville
Mike|Atlantic City
Ryan|Austin

Unfortunately, cut won't allow us to reorder the columns. We get the same results if we try "cut -d '|' -f 3,1". On a side note, we can use awk to give us functionality similar to cut. One advantage awk has over cut is the ability to reorder the columns.

mike@shiner $ cat name_age_city.txt | awk -F '|' '{print $3"|"$1}'
New York|Mike
Dallas|Jason
Washington|Fox
Los Angeles|Mike
Denver|Ann
Louisville|Dana
Atlantic City|Mike
Austin|Ryan


The awk syntax requires us to put the '|' character in the print output to get the same behavior as cut.

No comments:

Post a Comment