A Short Intro to AWK

AWK is a great scripting language when you have to work with textual data and it is a shame that most people have not heard of it.

Before we begin, we need to get through the obligatory “my history with {insert thing}” part. So here we go …

I have two “anecdotes” of using AWK. The first one was when, as a working student, I got handed a 5GB server log file and was told that there had been some problem last week. What to do? No editor could open the file because of its size, and the logs weren’t structured, so I needed some way to explore the data. That is where AWK came into play. Using some patterns that I knew should be there, I was able to extract all the information needed to find that nasty bug.

The second time that AWK came truly handy was when I was migrating my bookmarks. I know that is not something people usually do and even when they need to, usually you can just export them to a format that any bookmarking software supports. But I have had the privilege of using software that did not provide this functionality, so in the end I used AWK to migrate all bookmarks from the text files used by the software to a more general format.

What I mean to say with the stories is that AWK saved my ass.

As you can tell, I use AWK mostly for one-liners and do not build complex programs. And this intro is built in the same way. The idea is to get a feel for the language, not to learn all its features. Please refer to the references below; they are a treasure trove of information and you will learn much more there than here. We are going to look at a couple of separate examples and build some one-liners, and hopefully by the end of the article you will be on your way to integrating AWK into your toolkit.

Warning: There are multiple implementations of AWK, and the language itself has been significantly extended. For the POSIX-compliant version check https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html. I am using gawk (GNU awk), which comes by default on most if not all Linux distributions. For the list of extensions added on top of POSIX awk see https://www.gnu.org/software/gawk/manual/html_node/POSIX_002fGNU.html

Example

Let us say we have the following file:

cat > emp.txt << EOF
Alice 20 40
Bob   13 35
Ivy   9  20
John  8  42
EOF

We have three columns: the employee’s name, their hourly rate, and the hours they work in a week.

Open the command line and paste the following:

awk ' $3 < 40 { print $1 } ' emp.txt

You should get the following output:

Bob
Ivy

Most of you with some programming experience will have an intuitive understanding of what is going on just from looking at the code and the output.

$3 < 40 seems to check if the weekly hours are under 40. We know that the third column in our file is the weekly working hours, so we can assume that $3 references it.

{ print $1 } looks like it’s printing the first column of our file. Well, only the users who had less than 40 weekly working hours are printed, so it seems that this statement is only executed if $3 < 40 is true.

Congratulations! You have just executed your first AWK program.

Structure

The basic structure of an AWK program is one or more pattern-action statements, where either the pattern or the action (but not both) may be omitted.

pattern {action} # the action runs only for records matching the pattern
pattern          # print each matching record (the default action)
        {action} # the action runs for every record

In the example above the $3 < 40 would be the pattern and { print $1 } would be the action.
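Using emp.txt from above, here is a quick sketch of the two reduced forms:

```shell
# Pattern only: the default action prints each matching record in full
awk ' $3 < 40 ' emp.txt
# -> Bob   13 35
# -> Ivy   9  20

# Action only: the action runs for every record
awk ' { print $1 } ' emp.txt
# -> Alice, Bob, Ivy, John (one name per line)
```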

Fields and Records

To understand how AWK works we will first need to talk about fields and records.

AWK splits the stream of data into records that are then run through the program. Normally, records are lines of text separated by newline characters. In the example above, every line of our input (e.g. Alice 20 40) is run against our program (i.e. $3 < 40 { print $1 }).

I know what you are thinking: what if you are working with a file that has just a massive single entry and you would like to split by some other character? Worry not, because there is a variable called RS (Record Separator) where you can choose what the separator is. As per the POSIX specification, we can split our textual data stream by newlines, single characters or paragraphs. This behavior is rather limiting, so gawk adds the ability for RS to be a regular expression.
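As a sketch, suppose the same employee data arrived on a single line with entries separated by semicolons (a hypothetical format); setting RS makes each entry its own record:

```shell
# RS=';' splits the input into records at every semicolon
printf 'Alice 20 40;Bob 13 35;Ivy 9 20' | awk -v RS=';' ' { print $1 } '
# -> Alice
# -> Bob
# -> Ivy
```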

Each record is further subdivided into fields. By default, the field separator is any run of whitespace characters. Each of these fields can then be addressed through $n, with n being the number of the field. There is, however, one special field, $0, that refers to the entire current record.

Just as with records, there is an FS (Field Separator) variable where you can specify how to split fields (it can also be a regular expression).

Here is a practical example: print all the group names from /etc/group.

awk -v FS=":" ' { print $1 } ' /etc/group

Or maybe we would like to know the members of a specific group:

awk -v FS=":" ' $1 == "docker" { print $4 } ' /etc/group

Built-in Variables

AWK provides several built-in variables. Here we will take a look at the ones I consider most important:

Variable  Description
ARGV      An array of command-line arguments, excluding options and the program argument, indexed from zero to ARGC-1.
ARGC      The number of elements in the ARGV array.
ENVIRON   An array holding the environment variables.
FILENAME  The pathname of the current input file.
FNR       The record number in the current input file.
FS        The input field separator; a space by default.
NF        The number of fields in the current record.
NR        The number of records read so far.
OFS       The print statement’s output field separator; a space by default.
ORS       The print statement’s output record separator; a newline by default.
RS        The input record separator; a newline by default.

Let’s try one out. Why don’t we insert a line number into our test file? To do this we can use the NR variable.

awk '  { print NR $0 } ' emp.txt

The output will be

1Alice 20 40
2Bob   13 35
3Ivy   9  20
4John  8  42

It seems that the line numbers are added without any space. To fix this, we can use another built-in variable, OFS, to insert the output field separator:

awk '  { print NR OFS $0 } ' emp.txt

Output:

1 Alice 20 40
2 Bob   13 35
3 Ivy   9  20
4 John  8  42

This looks much better.

Patterns

Patterns, as we have seen above, control the execution of actions.

Expressions/Regular Expressions

Most patterns are expressions or regular expressions, and multiple expressions can be combined with && and || into compound patterns. For example (a hypothetical one-liner combining two conditions on emp.txt):

awk ' $3 < 40 && $2 > 10 { print $1 } ' emp.txt

This command returns only Bob: he is the only employee who works under 40 hours a week and has an hourly rate above 10.

Special patterns BEGIN and END

The BEGIN and END are special patterns that do not process any data from the input stream. BEGIN is executed at the beginning of the program and is normally used for initialization tasks. END is executed after all input has been read and is primarily used for clean up or to produce summary reports.

Let’s look at an example. We want to know who ran the script, and calculate how much we need to pay our employees this week:

awk ' BEGIN { print "Current user running the script is " ENVIRON["USER"]; pay = 0 }
      { pay = pay + $2 * $3 }
      END { print "Currently our employees are earning " pay " per week" } ' emp.txt

And we will get:

Current user running the script is {your username}
Currently our employees are earning 1771 per week

Range Patterns

A range pattern matches every record from an occurrence of the first pattern up to and including the next occurrence of the second.
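Here is a sketch on emp.txt: the range starts at the record matching /Bob/ and ends at the next record matching /Ivy/, printing both endpoints and everything in between.

```shell
# Range pattern with the default action (print the record)
awk ' /Bob/,/Ivy/ ' emp.txt
# -> Bob   13 35
# -> Ivy   9  20
```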

Actions

There are too many possible actions to list them all; an action is simply a sequence of statements, such as print and printf, variable assignments, control-flow constructs (if, while, for) and function calls.
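As an illustration, an action can combine control flow and formatted output; this sketch prints the weekly pay of every employee in emp.txt earning at least 400:

```shell
# if/printf inside an action: compute weekly pay (rate * hours) and filter on it
awk ' { if ($2 * $3 >= 400) printf "%s earns %d per week\n", $1, $2 * $3 } ' emp.txt
# -> Alice earns 800 per week
# -> Bob earns 455 per week
```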

Built-in Functions

You are definitely going to use them, so check the specification: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

The one I use most often is gsub, which substitutes all occurrences of a pattern. For example, let’s replace 20 with 22 in our emp.txt file.

awk ' { gsub(/20/, "22"); print } ' emp.txt

The output should be

Alice 22 40
Bob   13 35
Ivy   9  22
John  8  42

Thus all occurrences of 20 were replaced with 22.
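gsub also accepts a third argument that restricts the substitution to a single field. One caveat worth knowing (not covered above): when AWK modifies a field, it rebuilds the record using OFS, so the spacing of a changed line collapses to single spaces.

```shell
# Replace 20 with 22 only in the third field (weekly hours),
# leaving Alice's hourly rate of 20 untouched
awk ' { gsub(/20/, "22", $3); print } ' emp.txt
```

Only Ivy’s record is actually rewritten, and it comes out as "Ivy 9 22" because of the OFS rebuild.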

Further Reading

The POSIX awk specification: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
The gawk extensions on top of POSIX awk: https://www.gnu.org/software/gawk/manual/html_node/POSIX_002fGNU.html

Conclusion

This is everything you need, in my opinion, to get started with AWK. In the end, you could just use grep and sed and get the same things done. But AWK is fun.