1. File Operations and Filtering

Most command-line work is done on files. In this section we will show you how to view and filter file content, how to extract the information you need from files with a single command, and how to sort a file's content with ease.

1.1. cat, tail, head, tee: File-Printing Commands

These commands have almost the same syntax: command_name [option(s)] [file(s)], and may be used in a pipe. All of them are used to print part of a file according to certain criteria.

The cat utility concatenates files and prints the results to the standard output, which is usually the screen of your computer. This is one of the most widely used commands. For example you can use:

# cat /var/log/mail/info

to print the content of a mailer daemon log file to the standard output[23]. The cat command has a very useful option (-n) which allows you to print the line numbers.
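If no mail log is at hand, you can try this on any small throwaway file (the file name below is just an example):

```shell
# Create a small sample file to experiment with.
printf 'first line\nsecond line\n' > sample.txt

# cat -n prefixes every output line with its line number.
cat -n sample.txt
```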

Some files, such as daemon log files (if the daemons are running), are usually huge in size[24], and printing them completely on the screen is not very useful. Generally speaking you only need to see a few lines of the file. You can use the tail command to do so. The following command will print, by default, the last 10 lines of the /var/log/mail/info file:

# tail /var/log/mail/info

Files such as logs change constantly because the daemon associated with the log keeps adding records of actions and events to it. To watch these changes interactively, you can take advantage of the -f option:

# tail -f /var/log/mail/info

In this case all changes to the /var/log/mail/info file are printed on screen immediately. Using the tail command with the -f option is very helpful when you want to know how your system behaves. For example, by looking through the /var/log/messages log file you can keep up with system messages and the activity of various daemons.

If you use tail with more than one file, it will print the name of each file on a line by itself before printing its contents. It also works with the -f option and is a valuable aid to see how different parts of the system interact.
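A quick sketch with two invented files shows the per-file headers:

```shell
# Two small sample files (the names are arbitrary).
printf 'a1\na2\na3\n' > log_a.txt
printf 'b1\nb2\nb3\n' > log_b.txt

# With several files, tail prints a "==> name <==" header before each one.
tail -n1 log_a.txt log_b.txt
```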

You can use the -n option to display the last n lines of a file. For example, to display the last 2 lines, you would issue:

# tail -n2 /var/log/mail/info

Just as with other commands, you can use several options at the same time. For example, using both -n2 and -f, you start with the last two lines of the file and keep seeing new lines as they are written to the log file.

The head command is similar to tail, but it prints the first lines of a file. The following command will print, by default, the first 10 lines of the /var/log/mail/info file:

# head /var/log/mail/info

As with tail you can use the -n option to specify the number of lines to be printed. For example, to print the first two, issue:

# head -n2 /var/log/mail/info

You can also use these commands together. For example, if you wish to display only lines 9 and 10 of a file, let the head command select the first 10 lines and pass them through a pipe to the tail command:

# head /var/log/mail/info | tail -n2

The tail command then selects the last 2 of those lines and prints them to the screen. In the same way you can select the 20th line counting from the end of the file:

# tail -n20 /var/log/mail/info | head -n1

In this example we tell tail to select the file's last 20 lines and pass them through a pipe to head. Then the head command prints to the screen the first line of the data obtained.
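The line arithmetic is easy to verify on a numbered throwaway file (seq generates the numbers 1 to 20, one per line):

```shell
# Build a 20-line file in which each line is its own number.
seq 1 20 > lines.txt

# Lines 9 and 10: head keeps the first 10 lines, tail keeps the last 2 of those.
head lines.txt | tail -n2
```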

Let's suppose we want to print the result of the last example on the screen and save it to the results.txt file at the same time. The tee utility can help us. Its syntax is:

tee [option(s)] [file]

Now we can change the previous command this way:

# tail -n20 /var/log/mail/info | head -n1 | tee results.txt

Let's take yet another example. We want to select the last 20 lines and save them to results.txt, but print only the first of the 20 selected lines on the screen. Then we type:

# tail -n20 /var/log/mail/info | tee results.txt | head -n1

The tee command has a useful option (-a) which enables you to append data to an existing file.
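A minimal sketch of -a (the file name is invented):

```shell
# Write a first line, then append a second one with tee -a.
echo 'first run' | tee results_demo.txt > /dev/null
echo 'second run' | tee -a results_demo.txt > /dev/null

# The file now holds both lines; without -a the second tee
# would have overwritten the first line.
cat results_demo.txt
```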

In the next section we will see how we can use the grep command as a filter to separate Postfix messages from other messages coming from different services.

1.2. grep: Locating Strings in Files

Neither the name nor its origin (the ed command “g/re/p”: “Global Regular Expression Print”) is very intuitive, but what it does and how it is used are simple: grep looks for a pattern, given as an argument, in one or more files. Its syntax is

grep [options] <pattern> [one or more file(s)]

If several files are mentioned, their names will precede each matching line displayed in the result. You can use the -h option to prevent displaying these names or you can use the -l option to get nothing but the matching file names. The pattern is a regular expression, even though most of the time it consists of a simple word. The most frequently used options are the following:

  • -i: make a case insensitive search (i.e. ignore differences between lower and uppercase);

  • -v: invert the search, i.e. display the lines which do not match the pattern;

  • -n: display the line number for each line found;

  • -w: tells grep that the pattern should match a whole word.
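These options can be tried on a small invented file before pointing grep at real logs:

```shell
# Three sample lines to search through (invented contents).
printf 'Postfix started\npostfix stopped\nsendmail idle\n' > services.txt

# -i: case-insensitive, so both "Postfix" and "postfix" match.
grep -i postfix services.txt

# -v with -n: the numbered lines which do NOT contain "postfix".
grep -vn postfix services.txt
```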

So let's go back to analyzing the mailer daemon's log file. We want to find all lines in the /var/log/mail/info file which contain the postfix pattern, so we type this command:

# grep postfix /var/log/mail/info

If we want to find all lines which do NOT contain the postfix pattern, we would use the -v option:

# grep -v postfix /var/log/mail/info

The grep command can be used in a pipe.

Let's suppose we want to find all messages about successfully sent mails. In this case we have to filter the lines which were added to the log file by the mailer daemon (they contain the postfix pattern) and which report a successful delivery (status=sent)[25]:

# grep postfix /var/log/mail/info | grep status=sent

In this case grep is used twice. That is allowed, but not very elegant. A similar result can be achieved with the fgrep utility, which is simply shorthand for grep -F (search for fixed strings). Note that the -f option used below selects lines matching any of the listed patterns; here the result coincides with the double grep because every status=sent line in this log also contains postfix. First we need to create a file containing the patterns, written one per line. Such a file can be created this way (we use patterns.txt as the file name):

# echo -e 'status=sent\npostfix' >./patterns.txt

Check the result with the cat command. \n is an escape sequence which means “new line”.

Then we run the next command, where we use the patterns.txt file and the fgrep utility instead of calling grep twice:

# fgrep -f ./patterns.txt /var/log/mail/info
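The mechanism can be sketched on a miniature, invented stand-in for the mail log:

```shell
# A tiny invented log and a patterns file, one fixed string per line.
printf 'postfix: status=sent\npostfix: status=bounced\ncron: job done\n' > mini.log
printf 'status=sent\nstatus=bounced\n' > pat.txt

# fgrep -f prints every line containing at least one of the patterns.
fgrep -f pat.txt mini.log
```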

The file ./patterns.txt may contain as many patterns as you wish. For example, to select messages about successfully sent mails to peter@mandriva.com, it would be enough to add this e-mail address into our ./patterns.txt file by running this command:

# echo 'peter@mandriva.com' >>./patterns.txt

Clearly, you can combine grep with tail and head. If we want to find the message about the second-to-last e-mail sent to peter@mandriva.com, we type:

# fgrep -f ./patterns.txt /var/log/mail/info | tail -n2 | head -n1

Here we apply the filter described above and pipe the result to the tail and head commands, which select the second-to-last line from the data.

1.3. egrep: Filtering with Regular Expressions

With fixed patterns we can only get so far. How would we find all e-mails sent to each and every employee of “ABC Company”? Listing all their addresses would not be an easy task, since we might miss someone or end up digging through the log file by hand.

As with fgrep, grep has a shortcut for grep -E: egrep. It takes extended regular expressions instead of fixed patterns, providing us with a more powerful interface to “grep” text.

Besides what we mentioned in Section 3, “Shell Globbing Patterns” while talking about globbing patterns, here are some additional regular expressions:

  • [:alnum:] (all letters plus all digits), [:alpha:] (all uppercase and lowercase letters) and [:digit:] (all digits) can be used instead of defining the classes of characters yourself. They have an additional bonus: they include internationalized characters and respect the localization of the system.

  • [:print:] represents all characters which can be printed on screen.

  • [:lower:] and [:upper:] represent all lowercase and uppercase letters.

There are more classes available and you can see all of them in egrep(1). The above are the most commonly used ones.

A regular expression may be followed by one of several repetition operators:

?

The preceding item is optional, that is: it is matched zero times or once, but not more than once.

*

The preceding item will be matched zero or more times.

+

The preceding item will be matched one or more times.

{n}

The preceding item is matched exactly n times.

{n,}

The preceding item is matched n or more times.

{n,m}

The preceding item is matched at least n times, but not more than m times.
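Two of these operators can be sketched on an invented word list:

```shell
# Three spelling variants to match against.
printf 'color\ncolour\ncolouur\n' > words.txt

# "u?" makes the "u" optional: matches "color" and "colour", not "colouur".
egrep '^colou?r$' words.txt

# "u{1,2}" allows one or two "u"s: matches "colour" and "colouur", not "color".
egrep '^colou{1,2}r$' words.txt
```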

If you put a regular expression inside parentheses you can recover its match later. Let's say you specified the [:alpha:]+ expression, which might represent a word. If you want to detect words which occur twice, put the expression inside parentheses and reuse the match with \1 for the first group. You can have up to 9 of these “memories”.

$ echo -e "abc def\nabc abc def\nabc1 abc1\nabcdef\nabcdabcd\nabcdef abcef" > testfile
$ egrep "([[:alpha:]]+) \1" testfile
abc abc def
$
Note

The [ and ] characters are part of the class name, so we have to include them to use that class of characters. The first [ opens the bracket expression, the second is part of the name of the class, and then there are the corresponding closing ] characters.

The only line returned is the one which exclusively matched two groups of letters separated by a space. No other group matched the regular expression.

You can also use the | character to match either the expression to its left or the one to its right; it is an operator which joins those expressions. Using the same testfile created above, you can try looking for lines which contain only double words, or double words with digits:

$ egrep "([[:alpha:]]+) \1|([[:alpha:][:digit:]]+) \2" testfile
abc abc def
abc1 abc1
$

Note that for the second parenthesized group we had to use \2, otherwise it would not match what we wanted. In this particular case, a more efficient expression would be:

$ egrep "([[:alnum:]]+) \1" testfile
abc abc def
abc1 abc1
$

Finally, to match certain characters literally you have to “escape” them by preceding them with a backslash. Those characters are ?, +, {, |, (, ) and of course \ itself. To match them you have to write \?, \+, \{, \|, \(, \) and \\.
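The difference escaping makes can be seen on two invented lines:

```shell
# One line with a literal "+", one without.
printf 'a+b\naab\n' > expr.txt

# Unescaped, "+" is a repetition operator: "a+b" means one or more "a",
# then "b", so only "aab" matches.
egrep 'a+b' expr.txt

# Escaped, "\+" matches the literal plus sign, so only "a+b" matches.
egrep 'a\+b' expr.txt
```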

This simple trick might help to prevent you typing repeated words in “your your” text.

Regular expressions in all tools follow these, or very similar, rules. Taking some time to understand these rules will help a lot with other tools such as sed, which, among other things, lets you manipulate text by transforming it with regular expressions as rules.

1.4. wc: Counting Elements in Files

The wc command (Word Count) is used to count the number of lines, words and characters in files. It can also compute the length of the longest line. Its syntax is:

wc [option(s)] [file(s)]

The following options are useful:

  • -l: print the number of lines;

  • -w: print the number of words;

  • -m: print the total number of characters;

  • -c: print the number of bytes;

  • -L: print the length of the longest line in the text.

The wc command prints the number of lines, words and bytes by default. Here are some usage examples:
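The default output is easy to check on a small invented file:

```shell
# Two lines, five words.
printf 'one two three\nfour five\n' > count.txt

# Default output: line count, word count, byte count, file name.
wc count.txt
```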

If we want to find the number of user accounts on our system, we can type:

$ wc -l /etc/passwd 

If we want to know the number of CPUs in our system, we write:

$ grep "model name" /proc/cpuinfo | wc -l

In the previous section we obtained a list of messages about successfully sent mails to the e-mail addresses listed in our ./patterns.txt file. If we want to know how many messages it contains, we can pipe our filter's results to the wc command:

# fgrep -f ./patterns.txt /var/log/mail/info | wc -l

1.5. sort: Sorting File Content

Here is the syntax of this powerful sorting utility[26]:

sort [option(s)] [file(s)]

Let's consider sorting part of the /etc/passwd file. As you can see, this file is not sorted:

$ cat /etc/passwd

If we want to sort it by the login field, we type:

$ sort /etc/passwd

The sort command sorts data in ascending order starting by the first field (in our case, the login field) by default. To sort data in descending order, use the -r option:

$ sort -r /etc/passwd

Every user has his or her own UID written in the /etc/passwd file. The following command sorts a file in ascending order using the UID field:

$ sort /etc/passwd -t":" -k3 -n

Here we use the following sort options:

  • -t":": tells sort that the field separator is the ":" symbol;

  • -k3: means that sorting must be done on the third column;

  • -n: says that the sort is to occur on numerical data, not alphabetical.

The same can be done in reverse:

$ sort /etc/passwd -t":" -k3 -n -r

Note that sort has two other important options:

  • -u: perform a strict ordering: duplicate sort fields are discarded;

  • -f: ignore case (treat lowercase characters the same way as uppercase ones).
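These options can be tried on a few invented passwd-style lines:

```shell
# Three passwd-style lines (invented users), in no particular order.
printf 'carol:x:510:\nalice:x:502:\nbob:x:510:\n' > users.txt

# Numeric sort on the third colon-separated field (the UID).
sort -t':' -k3 -n users.txt

# -u keeps only one line per sort key: the duplicate UID 510 collapses.
sort -t':' -k3 -n -u users.txt
```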

Finally, if we want to find the user with the highest UID we can use this command:

$ sort /etc/passwd -t":" -k3 -n | tail -n1

where we sort the /etc/passwd file in ascending order on the UID column and pipe the result to the tail command, which prints the last value of the sorted list, i.e. the user with the highest UID.
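The same pipeline can be sketched on invented data, where bob has the highest UID:

```shell
# Invented passwd-style data.
printf 'alice:x:502:\nbob:x:1500:\ncarol:x:510:\n' > users2.txt

# Numeric ascending sort on the UID field; tail keeps the last (highest) line.
sort users2.txt -t':' -k3 -n | tail -n1
```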



[23] Some examples in this section are based on real work with server log files (services, daemons, etc.). Make sure syslogd (which handles the logging of daemon activity) and the corresponding daemon (in our case Postfix) are running, and that you are working as root. Of course, you can always apply our examples to other files.

[24] For example, the /var/log/mail/info file contains information about all sent mails, messages about fetching mail by users with the POP protocol, etc.

[25] Although it is possible to filter just by the status pattern please stay with us as we want to show you a new command with this example.

[26] We will only discuss sort briefly here. Whole books can be written about its features.