Short Intro to AWK
AWK is a great scripting language when you have to work with textual data and it is a shame that most people have not heard of it.
Before we begin, we will need to have the preliminary “my history with {insert thing}” part. So here we go …
I have two “anecdotes” of using AWK. The first one was when, as a working student, I was handed a 5GB server log file and told that there had been some problem last week. What to do? No editor could open the file because of its size, and the logs weren’t structured, so I needed some way to explore the data. That is where AWK came into play. Using some patterns that I knew should be there, I was able to extract all the information needed to find that nasty bug.
The second time that AWK truly came in handy was when I was migrating my bookmarks. I know that is not something people usually do, and even when they need to, usually you can just export them to a format that any bookmarking software supports. But I have had the privilege of using software that did not provide this functionality, so in the end I used AWK to migrate all bookmarks from the text files used by the software to a more general format.
What I mean to say with the stories is that AWK saved my ass.
As you can tell, I use AWK more for one-liners and do not build complex programs. And this intro is built the same way. The idea is to get a feel for the language, not to learn all its features. Please refer to the references below; they are a treasure trove of information and you will learn much more there than here. We are going to look at a couple of separate examples, build some one-liners, and hopefully by the end of the article you will be on your way to integrating AWK into your toolkit.
Warning: There are multiple implementations of AWK and the language itself has been significantly extended. For the POSIX-compliant version check https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html. I am using gawk (GNU awk), which is the default on most, if not all, Linux distributions. For the list of extensions added on top of POSIX awk see https://www.gnu.org/software/gawk/manual/html_node/POSIX_002fGNU.html
Example
Let us say we have the following file:
cat > emp.txt<< EOF
Alice 20 40
Bob 13 35
Ivy 9 20
John 8 42
EOF
We have three columns: the employee’s name, their hourly rate and the hours they work in a week.
Open the command line and paste the following:
awk ' $3 < 40 { print $1 } ' emp.txt
You should get the following output:
Bob
Ivy
Most of you who have any programming experience will have an intuitive understanding of what is going on just from looking at the code and output.
$3 < 40 seems to check if the weekly hours are under 40. We know that the third column in our file is the weekly working hours, so we can assume that $3 references it.
{ print $1 } looks like it’s printing the first column of our file. Well, only the users who had less than 40 weekly working hours are printed, so it seems that this statement is only executed if $3 < 40 is true.
Congratulations! You have just executed your first AWK program.
Structure
The basic structure of an AWK program is one or more pattern-action statements, where either the pattern or the action may be omitted.
pattern {action} # action triggered only when pattern matches
pattern # print the record that was matched (default behaviour)
{action} # every record will match and the action will be triggered.
In the example above the $3 < 40 would be the pattern and { print $1 } would be the action.
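To see the other two forms in isolation, here is a small sketch against emp.txt (the 40-hour threshold is just for illustration):

```shell
# Pattern only: the default action prints the whole matching record
awk '$3 >= 40' emp.txt

# Action only: no pattern, so it runs for every record
# (here, printing the number of fields)
awk '{ print NF }' emp.txt
```

The first command prints the full lines for Alice and John; the second prints 3 four times, since every record has three fields.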
Fields and Records
To understand how AWK works we will first need to talk about fields and records.
AWK splits the stream of data into records that are then run through the program. Normally, records are text lines separated by a newline character. As per the example above, you can imagine that every line of our input (e.g. Alice 20 40) is run against our program (e.g. $3 < 40 { print $1 }).
I know what you are thinking: what if you are working with a file that has just a massive single entry and you would like to split by some other character? Worry not, because there is a variable called RS (Record Separator) where you can choose what the separator is. As per the POSIX specification, we can split our textual data stream by newlines, single characters or paragraphs. This behavior is rather limiting, so gawk adds the ability for RS to be a regular expression.
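As a quick sketch of RS in action, here is a colon-separated stream (the input string is made up for the example) split into one record per item:

```shell
# Treat ":" as the record separator; NR counts the resulting records
printf 'a:b:c' | awk -v RS=":" '{ print NR, $0 }'
```

This prints three records: 1 a, 2 b and 3 c.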
Each record is further subdivided into fields. By default, fields are separated by whitespace characters. Each of these fields can then be addressed through $n, with n being the number of the field. There is, however, one special field, $0, that refers to the whole current record without any field splitting.
And just as with records, there is an FS (Field Separator) variable where you can specify how records are split into fields (it can also be a regular expression).
Here is a practical example: print all the groups from /etc/group.
awk -v FS=":" ' { print $1 } ' /etc/group
Or maybe we would like to know the members of a specific group:
awk -v FS=":" ' $1 == "docker" { print $4 } ' /etc/group
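Setting FS with -v works everywhere, but awk also offers the -F shorthand for the same thing:

```shell
# Equivalent to -v FS=":"
awk -F: ' { print $1 } ' /etc/group
```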
Built-in Variables
AWK provides several built-in variables. Here we will take a look at the ones that are, in my opinion, the most important:
| Variable | Description |
|---|---|
| ARGV | An array of command line arguments, excluding options and the program argument, numbered from zero to ARGC-1. |
| ARGC | The number of elements in the ARGV array. |
| ENVIRON | An array representing the value of the environment. |
| FILENAME | Pathname of the current input file. |
| FNR | Record number in the current input file. |
| FS | Field separator. |
| NF | Number of fields in the current record. |
| NR | Record number so far, across all input files. |
| OFS | The print statement's output field separator; a space by default. |
| ORS | The print statement's output record separator; a newline by default. |
| RS | Record separator. |
Let’s try one out. Why don’t we add line numbers to our test file? To do this we can use the NR variable.
awk ' { print NR $0 } ' emp.txt
The output will be
1Alice 20 40
2Bob 13 35
3Ivy 9 20
4John 8 42
It seems that the line numbers are added without any space. Well, to fix this, we can use another built-in variable, OFS, to insert the output field separator:
awk ' { print NR OFS $0 } ' emp.txt
Output:
1 Alice 20 40
2 Bob 13 35
3 Ivy 9 20
4 John 8 42
This looks much better.
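OFS only takes effect when awk rebuilds the record, so a common trick (sketched here) is the no-op assignment $1 = $1 to force a rebuild, for instance to turn emp.txt into CSV:

```shell
# Assigning to a field forces the record to be rebuilt with OFS
awk -v OFS="," ' { $1 = $1; print } ' emp.txt
```

This prints Alice,20,40 and so on.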
Patterns
Patterns, as we have seen above, control the execution of actions.
Expressions/Regular Expressions
Most patterns are expressions, regular expressions, or compounds of several expressions. Let’s look at an example:
- Find every user that starts with “B” and works exactly 35 hours
awk ' $1 ~ /^B/ && $3 == 35 { print $1 } ' emp.txt
The command above will return only Bob
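Patterns compose with the usual boolean operators (&&, || and !), so you can mix comparisons freely; the thresholds below are arbitrary:

```shell
# Anyone working 40 or more hours, or earning more than 15 per hour
awk ' $3 >= 40 || $2 > 15 { print $1 } ' emp.txt
```

This matches Alice and John.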
Special patterns BEGIN and END
The BEGIN and END are special patterns that do not process any data from the input stream. BEGIN is executed at the beginning of the program and is normally used for initialization tasks. END is executed after all input has been read and is primarily used for clean up or to produce summary reports.
Let’s look at an example. We want to know who is running the script and to calculate how much we need to pay our workers this week:
awk '
BEGIN { print "Current user running the script is " ENVIRON["USER"]; pay = 0 }
{ pay = pay + $2 * $3 }
END { print "Currently our employees are earning " pay " per week" }
' emp.txt
And we will get:
Current user running the script is {your username}
Currently our employees are earning 1771 per week
Range Patterns
A range pattern matches every record from an occurrence of the first pattern through the next occurrence of the second, inclusive.
- Get every user between Alice and Ivy:
awk ' /Alice/, /Ivy/ { print $1 } ' emp.txt
And it will return:
Alice
Bob
Ivy
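Range patterns are not limited to regular expressions; any pair of patterns works, for example line numbers via NR:

```shell
# Records two through three, inclusive
awk ' NR == 2, NR == 3 { print $1 } ' emp.txt
```

This returns Bob and Ivy.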
Actions
There are too many actions to list them all, but here is a small set of what I find useful:
- print - prints all arguments that are given.
- if (expression) statement; else statement;
- exit - exits the program.
- for (expression; expression; expression) statement
- for (variable in array) statement
- while (expression) statement
- do statement while (expression)
- break
- continue
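Here is a small sketch combining a few of these, flagging overtime (the 40-hour threshold is just an assumption for the example):

```shell
# Print anyone working more than 40 hours, and by how much
awk ' { if ($3 > 40) print $1, "works", $3 - 40, "hours of overtime" } ' emp.txt
```

For our file this reports only John.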
Built-in Functions
You are definitely going to use them, so check the specification: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
The one I use most often is gsub, which substitutes all occurrences of a pattern. For example, let’s replace 20 with 22 in our emp.txt file.
awk ' { gsub(/20/, "22"); print } ' emp.txt
The output should be
Alice 22 40
Bob 13 35
Ivy 9 22
John 8 42
Thus all values of 20 were replaced with 22.
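A few other built-ins worth knowing are toupper, length and substr; for example:

```shell
# Upper-case each name and print its length
awk ' { print toupper($1), length($1) } ' emp.txt
```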
Further Reading
- The POSIX awk specification, everything you need to know: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
- The original AWK book (my God there is a second edition, I need to check it): https://awk.dev/
- Classic Shell Scripting has a fun section about AWK “Enough AWK to be dangerous”: https://www.oreilly.com/library/view/classic-shell-scripting/0596005954/
Conclusion
This is everything you need, in my opinion, to get started with AWK. In the end, you can often get the same things done with just grep and sed. But AWK is fun.