Monday, October 01, 2012

My one-liners for extracting lines using sed, perl or awk


While extracting data from large numbers of files, I have compiled a few commands over the years that really help a lot.

Most of the time, we use head, tail and grep. However, these commands are best at extracting a block from the start or end of a file, or lines that match a keyword. For more complex extraction, we can use sed, perl or awk instead.

Using myfile as an example:

myserver:/tmp/:>head -10 myfile
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
#
# bos61D src/bos/usr/sbin/netstart/hosts 1.2
#
# Licensed Materials - Property of IBM
#
# COPYRIGHT International Business Machines Corp. 1985,1989
# All Rights Reserved
#


myserver:/tmp/:>tail -2 myfile
10.1.1.123     host1
10.2.1.124     host2


myserver:/tmp/:>grep host1 myfile
10.1.1.123     host1


For something more complicated, such as extracting the 2nd line PLUS the 5th to 7th lines, I find it tough to code using the above commands.
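To show what I mean, here is a rough sketch of how that looks with head and tail alone. The file built with printf is only a stand-in for myfile, and the variable name `sample` is made up for this example.

```shell
# Throwaway 10-line file standing in for myfile (hypothetical sample data).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$sample"

# 2nd line: keep the first 2 lines, then the last 1 of those.
head -n 2 "$sample" | tail -n 1

# 5th to 7th lines: keep the first 7 lines, then the last 3 of those.
head -n 7 "$sample" | tail -n 3
```

It works, but every range needs its own pipeline and a bit of arithmetic, which gets clumsy fast.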

h2. sed, perl or awk?

Do note that sed will traverse the entire file by default, so if you have a very large file, this might take some time.

Say we want to extract the 2nd line. We can use sed or awk:

myserver:/tmp/:>sed 2p myfile
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
# This is an automatically generated prolog.
#
# bos61D src/bos/usr/sbin/netstart/hosts 1.2
...
...


myserver:/tmp/:>awk 'NR==2' myfile
# This is an automatically generated prolog.






If you have tried it out, you will see that with sed the 2nd line is indeed extracted, but the rest of the file is printed out as well! That is because sed prints every line by default, and the 'p' command prints line 2 a second time. Use the -n option to suppress the default output.

myserver:/tmp/:>sed -n 2p myfile
# This is an automatically generated prolog.


Alternatively, you can 'delete' whatever you don't want by using the '!d' command, which deletes every line that does not match the address.

myserver:/tmp/:>sed '2!d' myfile
# This is an automatically generated prolog.


I prefer not to use this method, as I have difficulty converting the line to use variables. Do give me suggestions or advice if you think otherwise. I don't claim to be an expert in writing scripts. :)

IMPORTANT: Note that the single quotes are required. Otherwise, in a shell with history expansion, '!d' will be replaced by the last command you executed that starts with the letter 'd'.
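On the variables question, one possible approach is to mix the quoting styles. This is a sketch and not the only way to do it. The names `sample` and `n` are hypothetical, and the printf file is just a stand-in for myfile.

```shell
# Throwaway 5-line file standing in for myfile (hypothetical sample data).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'line %s\n' 1 2 3 4 5 > "$sample"

n=2   # hypothetical variable holding the wanted line number

# Double quotes let the shell expand ${n}; the braces keep the "p" out of the name.
sed -n "${n}p" "$sample"

# For the '!d' form, expand ${n} in double quotes but keep "!d" in single
# quotes, so history expansion (in shells that have it) cannot touch it.
sed "${n}"'!d' "$sample"

# awk takes the value through -v, so the whole program stays in single quotes.
awk -v n="$n" 'NR == n' "$sample"
```

All three commands print the same single line. I find the awk form the easiest to read, since the shell variable never has to appear inside the program text.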



If you want one and only one line from the file, you can get awk to exit after printing that line. Otherwise, awk will traverse the whole file.

myserver:/tmp/:>awk 'NR==6 {print; exit}' myfile
# Licensed Materials - Property of IBM
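sed can stop early in much the same way with its 'q' (quit) command. The sketch below uses a made-up sample file in place of myfile. I have only relied on the basic 'p', 'q' and 'd' commands, but do check the behaviour on your own platform's sed.

```shell
# Throwaway 10-line file standing in for myfile (hypothetical sample data).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$sample"

# Print line 6, then quit so the remaining lines are never read.
sed -n '6{p;q;}' "$sample"

# Same result without -n: quitting at line 6 prints it on the way out,
# and every earlier line is deleted before it can be printed.
sed '6q;d' "$sample"
```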


To extract lines 5 to 7 using sed or awk:

myserver:/tmp/:>sed -n 5,7p myfile
#
# Licensed Materials - Property of IBM
#


myserver:/tmp/:>awk 'NR==5,NR==7' myfile
#
# Licensed Materials - Property of IBM
#
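Putting the single-line and range forms together gives the "2nd line PLUS 5th to 7th line" extraction mentioned at the start. Here is a minimal sketch on a made-up sample file (the name `sample` is hypothetical). It also stops reading at line 7, which is optional but helps on large files.

```shell
# Throwaway 10-line file standing in for myfile (hypothetical sample data).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$sample"

# sed: print line 2, print lines 5 through 7, then quit at line 7.
sed -n -e '2p' -e '5,7p' -e '7q' "$sample"

# awk: the same selection as a condition on NR, exiting after line 7.
awk 'NR == 2 || (NR >= 5 && NR <= 7); NR == 7 { exit }' "$sample"
```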


Here's another trick that I picked up from Mr Google. If you want to extract every 5th line of a file, starting from the top of the file, perl or awk does the job easily.


myserver:/tmp/:>perl -ne 'print unless (0 != $. % 5)' myfile
#
#
# IBM_PROLOG_END_TAG
#
# Licensed Materials - Property of IBM
#  /etc/hosts
#
#

...
...


myserver:/tmp/:>awk '0 == NR % 5'   myfile
#
#
# IBM_PROLOG_END_TAG
#
# Licensed Materials - Property of IBM
#  /etc/hosts
#
#
...
...


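Tip: If you don't want to start counting from the top of the file, add an offset to NR. For example, awk '0 == (NR + 1) % 5' prints lines 4, 9, 14 and so on, while awk '1 == NR % 5' starts from line 1 and prints lines 1, 6, 11 and so on.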
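If working out the offset by hand gets confusing, one option is to pass the first line and the step to awk as variables. This is only a sketch. The names `sample`, `first` and `step` are made up for the example, and the printf file stands in for myfile.

```shell
# Throwaway 12-line file standing in for myfile (hypothetical sample data).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

first=3   # hypothetical: first line to print
step=5    # hypothetical: print every 5th line after that

# Print line "first", then every "step"-th line after it (here: 3 and 8).
# The NR >= first test skips the lines before the starting point.
awk -v first="$first" -v step="$step" \
    'NR >= first && (NR - first) % step == 0' "$sample"

# The fixed-offset form from the tip: lines 1, 6 and 11.
awk '1 == NR % 5' "$sample"
```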

That's all, folks.
