Examples with awk: A short introduction

ArticleCategory:[Article Category]

UNIX Basics

AuthorImage:[Author Photo]

TranslationInfo:[Author Name]

original in es Javier Palacios Bermejo es to en Javier Palacios Bermejo

AboutTheAuthor:[About the author]

Abstract:[Abstract]

ArticleIllustration:[Article head Illustration]

ArticleBody:[Article Body]

The idea of writing this text came to me after reading a couple of articles published in _LF_ and written by Guido Socher. One of them, about find and related commands, showed me that I was not the only one who still used the command line instead of pretty GUIs that never let you know how things are really done (the way Windows took years ago). The other article was about regular expressions which, although only briefly mentioned here, you need to know to get the most out of awk and of some other commands (mainly sed and grep) that I wanted to talk about when writing this article.

The key question is whether this command is really useful. The answer is yes. It can be useful for a normal user, depending on the type of work, but as an administration tool commands like this are invaluable. Just take a look at /var/yp/Makefile or at the initialization scripts of any system to see it.

Introduction to `awk`

My first news about it is old enough to be forgotten. A colleague needed to work with some really big outputs from a small Cray and was looking for ways to classify them. The manual page for awk on the Cray was really small, but he said that awk looked very good for the task, although he did not manage to get it working.
A long time later it came into my life again, through a casual comment (another place, another colleague) from someone who used it to extract the first column from a table:
awk '{print $1}' file
Easy, isn't it? The same simple task takes a good deal more programming in C or in any other compiled or interpreted language.
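To try the one-liner without having a table at hand, here is a minimal sketch. The three-line table is invented for illustration; any whitespace-separated text would behave the same way.

```shell
# Hypothetical sample table (not from the article).
table='alpha 10 x
beta 20 y
gamma 30 z'

# $1 is the first field of each record (line); $NF would be the last one.
first_column=$(printf '%s\n' "$table" | awk '{ print $1 }')
printf '%s\n' "$first_column"
```

Changing $1 to $2 or $NF selects another column without touching anything else.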

Once we have learned the lesson of extracting a column, we can do things like renaming files (although nothing too elaborate) using sequences such as
ls -1 pattern | awk '{print "mv "$1" "$1".new"}' | sh

And more. Adding sed or grep makes the previous example more powerful.

Renaming within the name
ls -1 *old* | awk '{print "mv "$1" "$1}' | sed s/old/new/2 | sh
(although in some cases it will fail, as with file_old_and_old)
Removing only files (it can be done using rm alone, but what if rm is aliased to 'rm -r'?)
ls -l * | grep -v drwx | awk '{print "rm "$9}' | sh
(again, it could fail with strange names or access permissions)
Removing only directories
ls -l | grep '^d' | awk '{print "rm -r "$9}' | sh
(I think this works in every case, and we can also do it with ls -p | grep /$ | ...)
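Since these pipelines end in sh, it may be worth looking at the generated commands before running them. The following is only a sketch of that habit: the scratch directory and the two file names are made up, and the pipeline is the renaming one shown above.

```shell
# Work in a scratch directory so nothing real is touched (hypothetical files).
workdir=$(mktemp -d)
touch "$workdir/report_old.txt" "$workdir/notes_old.txt"

# Step 1: generate the commands and look at them; nothing is executed yet.
commands=$(cd "$workdir" && ls -1 *old* | awk '{ print "mv " $1 " " $1 }' | sed 's/old/new/2')
printf '%s\n' "$commands"

# Step 2: once the list looks right, hand it to the shell.
(cd "$workdir" && printf '%s\n' "$commands" | sh)
renamed=$(ls -1 "$workdir")
printf '%s\n' "$renamed"
```

The first printf shows lines such as "mv notes_old.txt notes_new.txt", which is exactly what sh receives in the second step.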

When, for example, the same calculations are repeated with different initial parameters and we want to select some output files for additional processing, these tools help more than a little (actually, they are the only help I know of).

Actually, although we will use that name, awk is not the kind of thing that is usually called a command, instruction, etc., in the same way that gcc is not. awk is a programming language, with a syntax close to C in many respects, whose interpreter is invoked with the command awk.

As for the syntax of the command itself, this says it all:

# gawk --help
Usage: gawk [POSIX or GNU style options] -f progfile [--] file ...
        gawk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options:
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs
        -v var=val              --assign=var=val
        -m[fr] val
        -W compat               --compat
        -W copyleft             --copyleft
        -W copyright            --copyright
        -W help                 --help
        -W lint                 --lint
        -W lint-old             --lint-old
        -W posix                --posix
        -W re-interval          --re-interval
        -W source=program-text  --source=program-text
        -W traditional          --traditional
        -W usage                --usage
        -W version              --version

Report bugs to bug-gnu-utils@prep.ai.mit.edu,
with a Cc: to arnold@gnu.ai.mit.edu

Just note that instead of single-quoting (') the programs on the command line, we can write them into a file and call it with the option -f, and that by defining variables on the command line with -v var=val we can add some versatility to the programs we write.
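A small sketch of both options together may help. The variable name col and the temporary program file are my own choices for the example; only -f and -v come from the help text above.

```shell
# Store a tiny awk program in a file (the name comes from mktemp).
progfile=$(mktemp)
cat > "$progfile" <<'EOF'
# Print the column whose number is given in the variable "col".
{ print $col }
EOF

# The same program file, driven by different values of -v col=...
second=$(printf 'a b c\nd e f\n' | awk -v col=2 -f "$progfile")
third=$(printf 'a b c\nd e f\n' | awk -v col=3 -f "$progfile")
printf '%s\n%s\n' "$second" "$third"
rm -f "$progfile"
```

The program never changes; only the value passed on the command line does, which is the versatility mentioned above.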

Awk is, roughly speaking, a language oriented to handling tables, in the sense of information that can be grouped into fields and records, in the way of the more traditional databases, with the advantage that the record definition (and the field definition too) is extremely flexible.

But awk is more powerful. It is designed to work with one-line records, but that restriction can be relaxed. To go deeper into some of its aspects, we are going to look at some illustrative (and real) examples.
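As an illustration of how the one-line restriction can be relaxed, the sketch below uses the standard RS and FS variables: an empty RS makes blank lines separate the records, and a newline FS makes each line a field. The little address book is invented for the example.

```shell
# Hypothetical data: records separated by a blank line, one field per line.
address_book='Ana Ruiz
Madrid

Luis Gil
Sevilla'

# RS = "" selects paragraph mode; FS = "\n" turns every line into a field.
cities=$(printf '%s\n' "$address_book" |
  awk 'BEGIN { RS = ""; FS = "\n" } { print $2 " <- " $1 }')
printf '%s\n' "$cities"
```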

Printing tables in a slightly prettier way
Maybe you have had to print some ASCII table obtained from somewhere, for example the associations between hostnames, ethernet addresses and IP numbers. When those tables are really big, reading them becomes difficult, and we begin to miss how easy it is to read a table printed with LaTeX or, at least, with a better format. If the table is simple (and/or we know awk well), it's not too difficult, although it can get boring:

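BEGIN {
  # preamble and opening of the table, one item per output line
  print "LaTeX preamble"
  print "\\begin{tabular}"
  print "{|c|c|...|c|}"
  }

# one table row per input record
{ printf "%s & ", $1
  printf "%s & ", $2
  .
  .
  .
  printf "%s \\\\ ", $n
  print "\\hline"
  }

END {
  print "\\end{tabular}"
  print "\\end{document}"
  }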

Certainly, this is not a generic program, but we're just beginning ...
(The doubled \ are necessary because \ is the escape character inside awk strings)
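A more generic variant can be sketched with a loop over NF, the number of fields in the current record. This is only an assumption about how one might generalize the program above: it emits the tabular environment alone (no document preamble), takes the column count from the first record, and uses invented host data.

```shell
# Hypothetical host table (names, ethernet and IP numbers are made up).
table='host1 00:11:22:33:44:55 192.168.1.1
host2 66:77:88:99:aa:bb 192.168.1.2'

latex=$(printf '%s\n' "$table" | awk '
NR == 1 {
  # Build the column spec from the number of fields in the first record.
  spec = "|"
  for (i = 1; i <= NF; i++) spec = spec "c|"
  print "\\begin{tabular}{" spec "}"
  print "\\hline"
}
{
  # Join all the fields of the record with " & ".
  row = $1
  for (i = 2; i <= NF; i++) row = row " & " $i
  print row " \\\\ \\hline"
}
END { print "\\end{tabular}" }')
printf '%s\n' "$latex"
```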

Slicing output files
SIMBAD is a database of astronomical objects that, among other things, gives their positions on the sky. Once in the past I needed to perform searches to draw charts around some objects. The interface allowed saving the results in text files, and I had two options: create one file for each object, or feed it the whole input list, getting a single, big output log file with the query results. As I chose the second approach, I used awk to slice it. Obviously, I needed to take advantage of some characteristics of the output.

Each request produces a header line with a format like
====> name : nlines <====
The first field lets us know when a new object begins, and the fourth how many entries the object contains (although that datum is not strictly necessary)
The character used in the output lists to separate columns was '|'. That fact required two additional lines of code to send to the output only the fields I was interested in.

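( $1 == "====>" ) {
  NomObj = $2                 # object name, used as the output file name
  TotObj = $4                 # number of entries that follow the header
  if ( TotObj > 0 ) {
    FS = "|"                  # the entries are separated by '|'
    for ( cont=0 ; cont<TotObj ; cont++ ) {
        if ( (getline) <= 0 ) break       # stop at end of input
        print $2 $4 $5 $3  >> NomObj
        }
    close(NomObj)             # do not run out of open files
    FS = " "                  # back to the default for the next header
    }
  }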

NOTE: Actually, the object name was not returned, and it was slightly more complicated, but this is meant to be an illustrative example.
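To see the slicing at work without access to SIMBAD, here is a self-contained sketch. The log contents, the object names and the choice of printing only the second and fourth columns are all invented; the header format is the one described above.

```shell
# Scratch directory for the per-object files; the log below is made up.
outdir=$(mktemp -d)
log='====> M31 : 2 <====
1|00 42 44|+41 16 09|Galaxy|3.4
2|00 42 40|+41 15 00|Star|12.1
====> M33 : 1 <====
1|01 33 50|+30 39 36|Galaxy|5.7'

printf '%s\n' "$log" | awk -v dir="$outdir" '
$1 == "====>" {
  file = dir "/" $2          # one output file per object
  total = $4                 # number of entries announced by the header
  FS = "|"                   # takes effect for the records read next
  for (n = 0; n < total; n++) {
    if ((getline) <= 0) break
    print $2 ";" $4 > file
  }
  close(file)
  FS = " "                   # restore the default before the next header
}'
m31=$(cat "$outdir/M31")
m33=$(cat "$outdir/M33")
printf '%s\n%s\n' "$m31" "$m33"
```

The assignment to FS is placed before getline because a new field separator only applies to records read after the assignment.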

Playing with the mail spool

BEGIN {
  BEGIN_MSG  = "From"
  BEGIN_BDY  = "Precedence:"
  MAIN_KEY   = "Subject:"
  VALIDATION = "[MONTH REPORT]"
 
  HEAD = "NO"; BODY = "NO"; PRINT="NO"
  OUT_FILE = "Month_Reports"
  }
 
  {
 
  if ( $1 == BEGIN_MSG ) {
    HEAD = "YES"; BODY = "NO"; PRINT="NO"
    }
 
  if ( $1 == MAIN_KEY ) {
    # VALIDATION contains a blank, so it spans the fields $2 and $3
    if ( ($2 " " $3) == VALIDATION ) {
      PRINT = "YES"
      $1 = ""; $2 = ""; $3 = ""
      print "\n\n"$0"\n" > OUT_FILE
      }
    }
 
  if ( $1 == BEGIN_BDY ) {
    getline
    if ( $0 == "" ) {
      HEAD = "NO"; BODY = "YES"
    } else {
      HEAD = "NO"; BODY = "NO"; PRINT="NO"
      }
    }
 
  if ( BODY == "YES" && PRINT == "YES" ) {
    print $0 >> OUT_FILE
    }
  }

Maybe we are the administrators of a mailing list. Maybe, from time to time, some special messages are submitted to the list (for example, monthly reports) with some specific format (a subject like '[MONTH REPORT] month , dept'). And, suddenly, at the end of the year we decide to put all these messages together, setting the others aside.
This task can be done by processing the mail spool with the awk program above.

Writing each report to an individual file means three extra lines of code, and writing each department's reports to individual files means only a few extra characters.
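As a rough sketch of the per-department idea, assuming the subject format described above and that the department is the last word of the subject, the output file name can be built from $NF. The sample subjects and the "Month_Reports." prefix are made up, and only subject lines are processed here, not a whole mail spool.

```shell
# Hypothetical subject lines; a real spool has headers and bodies around them.
outdir=$(mktemp -d)
subjects='Subject: [MONTH REPORT] March , sales
Subject: hello there
Subject: [MONTH REPORT] March , support'

printf '%s\n' "$subjects" | awk -v dir="$outdir" '
$1 == "Subject:" && ($2 " " $3) == "[MONTH REPORT]" {
  out = dir "/Month_Reports." $NF     # the department is the last field
  print $4 >> out                     # here we only record the month
  close(out)
}'
files=$(ls -1 "$outdir")
printf '%s\n' "$files"
```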

NOTE: This example assumes that the mail spool is structured as I think it is. Actually I don't know the real format, but this program works on my installation (again, in some strange cases, it could fail).

Programs like these take only 5 minutes of thinking and 5 more of writing (or more than 20 minutes without thinking, using trial and error, which is the more fun way).
If there is a less time-consuming way, I want to know.

I've used awk for many other tasks (such as automatic generation of web pages with information from simple databases) and I know enough about programming to be sure that a lot of things can be done, even many that I've never thought about.
Just let your imagination fly.

A problem (and a solution)

The only problem with awk is its need for perfectly tabular data, without any holes: it cannot work with the very common fixed-width columns. If we generate the awk input ourselves, this is not a problem: choosing something really strange to mark the fields and defining it with the FS variable should be enough. But if we only have the input, this becomes a real problem, because some fields could contain the field-separator character (and, as FS is usually white space, the presence of names can be uncomfortable). Look, for example, at the table
1234 HD 13324 22:40:54 ....
1235 HD12223 22:43:12 ....
That cannot be handled with awk. Entries like these are sometimes necessary, and they are very common, because the way data is typed is not very homogeneous. But even in this case, if we have only one such column, not everything is lost (if anybody knows how to deal with more than one column in the general case, just tell me). Once I needed to deal with such a table, quite close to the one described above. The second column was a name, with a variable number of white spaces. And, as usual, I needed to sort on a later column. Some trials using sort +/-n.m showed the same problem with the embedded spaces. And, suddenly, I realized that the column I wanted to sort on was the last one, that awk knows how many fields the current record contains, and that accessing the last one was enough (sometimes $9, sometimes $11, but always $NF). A couple of trials took me where I wanted to arrive:
{ printf "%s", $NF
  $NF = ""
  printf " %s\n", $0 }
And we obtain an output equal to the input, but with the last column moved to first place, and we can sort without any problem. Obviously, this method is easily applied to the third field from the end, or to the one following that control field that always has the same value because it was the key for our subtable, extracted from a bigger database...
Just let your imagination fly again.
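The trick can be tried on a small table like the one above. In this sketch the final numeric column is my own addition, since the original rows are elided after the time, and sort -n stands in for whatever sort options the real data would need.

```shell
# Hypothetical rows: the name (HD 13324 / HD12223) has a variable number of
# blanks, so the field count differs, but the sort key is always the last field.
table='1234 HD 13324 22:40:54 7.5
1235 HD12223 22:43:12 6.1'

# Move the last field to the front, then sort numerically on it.
sorted=$(printf '%s\n' "$table" |
  awk '{ key = $NF; $NF = ""; print key " " $0 }' | sort -n)
printf '%s\n' "$sorted"
```

Assigning to $NF makes awk rebuild the record with single blanks between fields, so the original spacing is not preserved.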

Working over matched lines

Deeper `awk`

Conclusions

Certainly, awk might not be as powerful as many other tools designed with similar goals. But it has the big advantage that, in a really short time, it allows you to write programs that, although maybe one-shot ones, are fully tailored to your needs, which in many cases are very simple.
A clear example, not directly involving awk, is substituting a string within a text file: with really elementary notions of sed we can do it on any unix system in any thinkable circumstance, because we don't even need a text editor. Not even vi. On the other hand, system files such as /etc/passwd and many others are very easily processed with awk, without involving anything else.

And of course awk is not the best. There are several scripting languages with much greater capabilities, which share the characteristic of being interpreted, and that is what allows ridiculously short development times for projects with no great ambitions beyond getting the job done. But awk still has the advantage of always being available on any installation, however minimal it may be.
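For instance, a colon-separated file in the style of /etc/passwd needs nothing more than -F. The two entries below are invented so that the sketch does not depend on the accounts of any particular machine; on a real system the input would be /etc/passwd itself.

```shell
# Hypothetical passwd-style entries: name:password:uid:gid:comment:home:shell
passwd_sample='root:x:0:0:root:/root:/bin/sh
alice:x:1000:1000:Alice:/home/alice:/bin/bash'

# -F: sets the field separator; field 1 is the login name, field 7 the shell.
shells=$(printf '%s\n' "$passwd_sample" | awk -F: '{ print $1 " uses " $7 }')
printf '%s\n' "$shells"
```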

Additional information

This kind of very basic command is not usually well documented, but you can always find something by looking around.

awk syntax is not the same in every *nix, but there is a way to learn how it is on our particular system:
man awk
It could not be otherwise: O'Reilly has published a book on it, Sed & Awk (Nutshell handbook) by Dale Dougherty.
Looking on Amazon, we find more titles, such as Effective Awk Programming: A User's Guide, oriented to gawk, and half a dozen more.

Usually, all books on unix mention this command, but only some of them treat it in some detail and give useful information. The best we can do is browse any book we see, because you never know where useful information can be found.
