# Awk: The Power and Promise of a 40-Year-Old Language
By Andy Oram, 19 May, 2021
Languages don't enjoy long lives. Very few people still code with
the legacies of the 1970s: ML, Pascal, Scheme, Smalltalk. (The C
language is still widely used but in significantly updated versions.)
Bucking that trend, the 1977 Unix utility Awk can boast of a loyal
band of users and seems poised to continue far into the future. In
this article, I’ll explain what makes Awk special and keeps it
relevant.
# A Descriptive Language
Awk runs on inputs and a script. The inputs can be files, but the
command is often used as part of a pipeline, taking input from the
previous command's output:
```
ls | awk '/SAMPLES_[1-9][0-9]/ { ++counter }'
```
The long quoted text in the above command is the script, which can be
included on the command line or read from files. Each script
comprises a set of conditions and actions. The condition is often a
regular expression enclosed by slashes. The action appears as one or
more statements between braces. If the condition matches a part of
the input, the action is executed. Here is my trivial, one-line
script:
```
/SAMPLES_[1-9][0-9]/ { ++counter }
```
The script looks for lines that contain strings such as SAMPLES_19 or
SAMPLES_20 and increments a counter once for each matching line. Of
course, a real script would use the counter in further calculations.
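As a minimal sketch of one way the counter might be used, the following adds an END action that prints the total after all input has been read. The file names passed to printf are made up for illustration and simply stand in for the output of ls.

```shell
# Hypothetical file names standing in for the output of ls.
count=$(printf '%s\n' SAMPLES_19 SAMPLES_20 SAMPLES_7 notes.txt |
  awk '/SAMPLES_[1-9][0-9]/ { ++counter }   # runs once per matching line
       END { print counter + 0 }')          # "+ 0" prints 0 instead of an empty string when nothing matched
echo "$count"
```

SAMPLES_7 is not counted because the regular expression requires two digits after the underscore.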
This is basically how Awk operates: evaluate a condition, then take
action when it matches. The script runs in what David Kerns, in an
email exchange with me, called an implied loop. In his review of
this article, Arnold Robbins, maintainer of the GNU version of Awk
(Gawk), calls the programs data-driven.
I see Awk as more of a declarative language than a procedural one.
You describe what you want to happen and the conditions under which
it happens, instead of specifying a series of sequential statements.
Awk certainly executes statements in sequence and offers control flow statements (if, while), so it can serve quite well as a procedural
language. Nelson H. F. Beebe, in his review of this article,
mentioned writing a program with 23,981 lines of actions in just 12
patterns.
But overall, sequences of statements execute within a framework of
declaring the conditions under which these things should happen. The
concept of a declarative language has been around almost since the
beginning of high-level programming languages and can be found in the
popular notion of a promise, invented by Mark Burgess.
http://markburgess.org/promises.html
Awk documentation usually calls the condition a "pattern" because
regular expressions are so often used as conditions. Janis
Papanagnou, in his review of this article, explained that he has
recommended the word "condition" instead. I realized that this word
choice matches my own view of Awk at a high level as a descriptive
language. Aleksey Cheusov, in email, said that Awk programs can be
viewed as finite state machines, which declare how to move from one
state to another.
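The state-machine view can be sketched roughly as follows, assuming a made-up input in which the markers START and STOP delimit the lines of interest. The variable name state and the marker strings are illustrative assumptions, not anything prescribed by Awk.

```shell
# Made-up input: only the lines between START and STOP should be printed.
body=$(printf '%s\n' header START alpha beta STOP trailer |
  awk '/^START$/ { state = "inside"; next }    # transition: outside -> inside
       /^STOP$/  { state = "outside"; next }   # transition: inside -> outside
       state == "inside" { print }')           # act only while in the inside state
echo "$body"
```

Each condition declares when a transition or an action applies; the implied loop over the input does the rest.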
Neil Ormos, in an email exchange with me, offered an interesting
perspective on when to use Awk:
"I'd put Awk in a special category of general-purpose programming
languages that are especially well adapted for: (1) personal
computing; and (2) programmer-time-efficient prototype development,
where the prototype artifact can evolve advantageously into a
production-worthy tool with a little incremental effort."
Awk also maintains a delicate balance between being a line-oriented
utility like grep and a full programming language. Normally, Awk
just applies your script to each line of input, like grep, acting on
what matches your condition.
Furthermore, Awk is focused on lines divided into fields that are
separated by white space or by any character or regular expression
you choose. All behavior is subject to customizations—as Ed Morton
suggested in his review, we should speak more generally of "records"
instead of "lines"—but traditionally Awk is used on files where each
line consists of a regular set of fields. It has proven very useful
for parsing log files, for instance.
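To illustrate field handling, here is a small sketch that tallies successful requests per client in a made-up log and then uses -F to split a colon-separated line. The log layout (client address, method, path, status code) is an assumption chosen for the example.

```shell
# Made-up log lines: client address, HTTP method, path, status code.
log='10.0.0.1 GET /index.html 200
10.0.0.2 GET /missing 404
10.0.0.3 GET /index.html 200
10.0.0.1 POST /login 200'

# $1 is the first field and $4 the fourth. The condition selects lines whose
# status is 200, and the action tallies them per client in an associative array.
hits=$(printf '%s\n' "$log" |
  awk '$4 == 200 { n[$1]++ }
       END { for (ip in n) print ip, n[ip] }' | sort)
echo "$hits"

# -F changes the field separator, here to a colon; $NF is the last field.
last_field=$(printf 'root:x:0:0:root:/root:/bin/sh\n' | awk -F: '{ print $NF }')
echo "$last_field"
```

The output is piped through sort because the order in which a for-in loop visits array elements is not specified.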
In 1988, Brian Kernighan, one of Awk's original authors, put a set of
bug fixes and major new features into a version released under the
name Nawk (although he wanted it to replace the original Awk), and
the standard version has not changed much since then.
# It's Not Just About the Language
Languages are part of a larger environment that often plays more of a
role in the choice of language than its actual features. For
instance, many people use Python because so many important libraries
have been written for that language. Other people use a language for
legacy reasons: they have an existing application to maintain or work
in an organization that has historically depended on a language.
Many of the people who responded to my outreach for this article
focused their appreciation of Awk on factors other than language
features. Besides being deeply embedded in many Unix scripts, Awk's
presence is guaranteed on every Unix-style system, including
GNU/Linux, BSD, and macOS. The utility's suitability for widespread
use is bolstered by its ability to accomplish complex tasks without
requiring the installation of outside libraries or packages. The
language's behavior is also guaranteed in a POSIX standard, which
turns out to be surprisingly important to a lot of users. However,
many variants have added non-standard features. Gawk and mawk are in
common use.
https://pubs.opengroup.org/onlinepubs/009696799/utilities/awk.html
Among people who use Awk on large projects, it's a critical part of
their toolkit because it's fast. Michael May and Glaudiston Gomes da
Silva told me that they had ported some Java data processing programs
to Awk with more than ten-fold reductions in CPU and RAM consumption.
One researcher clocked Awk on 25 TB of data with impressive results.
Another advised Awk’s use for some tasks, along with other classic
Unix tools, instead of Hadoop. And one of the most active sites in
data science, Analytics Vidhya, published an article praising Awk.
<https://livefreeordichotomize.com/2019/06/04/using_awk_and_r_to_parse_25tb/>
<https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html>
<https://medium.com/analytics-vidhya/use-awk-to-save-time-and-money-in-data-science-eb4ea0b7523f>
Cheusov, in correspondence with me, provided more evidence of Awk's
speed:
"When I worked in computational linguistics, we often parsed
gigabytes of text. Programs written in GNU Awk and mawk were much
faster than equivalent programs written in Ruby, Python and Perl.
Because AWK is so simple, its interpreter can be optimized much
more easily than for much more complex languages."
Awk is fast because it has stayed simple and avoided features that
are considered necessities in other languages. It concentrates on
what it can do well. Several correspondents told me that they
appreciated being able to do what they wanted without downloading
large modules as they would do for other languages.
Computer science professor Tim Menzies, in his article "Why Gawk?",
cited the simplicity and regularity of Awk syntax, which allows it to
be learned quickly and to ward off overly complex code. Other
correspondents also cited the GNU Awk debugger as a boon for Awk
development.
https://web.archive.org/web/20150929033218/http://awk.info/?whygawk
https://www.gnu.org/software/gawk/manual/html_node/Debugger.html
Last but not least, we shouldn't ignore the importance of good
documentation. Awk documentation is easy to find on the web. The
manual for Gawk, written by the software's maintainer, Arnold D.
Robbins, is particularly helpful. For example, the Gawk manual
carefully distinguishes Gawk extensions from standard features, so
that you can avoid the extensions if you want to conform to the
standard. I have noticed that GNU tools in general have good
manuals, perhaps because Richard M. Stallman and his collaborators
have always assigned a high value to documentation.
https://www.gnu.org/software/gawk/
# Expansion Without Bloat
The classic Awk, as created by Alfred Aho, Peter J. Weinberger, and
Brian Kernighan (who drew on their initials to create the name of the
utility), was informal. It didn't make users declare variables but
simply assumed the variables' values to be zero or null the first
time they were used. Data types were implied. This kind of casual
scripting was common in the 1970s, and anything more formal would
have undermined the tool's appeal.
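A short sketch of that informality: in the script below, neither variable is declared or initialized. The variable names total and list are arbitrary choices for the example.

```shell
summary=$(printf '%s\n' 3 4 5 |
  awk '{ total += $1                 # total starts out as 0 when used as a number
         list = list "[" $1 "]" }    # list starts out as "" when used as a string
       END { print total, list }')
echo "$summary"
```

The same unset variable behaves as zero in arithmetic and as an empty string in concatenation, which is what lets one-liners stay so short.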
Every language evolves, usually by incorporating popular features
from other languages. The trick is to avoid throwing in features of
little value that degrade the language by making it hard to use, slow
to compile or run, etc. In this regard, Awk has done well. It has
resisted modernization in the form of data declarations and objects.
Because Awk is very different from general-purpose languages, it
doesn't have space for callbacks, polymorphism, and other fads that
have become central to application design in many languages. But
some variants of Awk added functionality of real value while
maintaining Awk's sleek performance and small footprint. Gawk, like
many GNU utilities, has upgraded aggressively.
Many dedicated Awk users don't strive for large programs or make use
of extended features. Some love Awk for one-liners like the one I
showed earlier. Like most Unix and GNU/Linux users, these casual
adherents of Awk prefer bigger languages such as Perl (yes, still!)
and Python for large tasks. Others, however, write large Awk
programs with the help of its newer additions.
Here are some of the features postdating the original 1977 release
that users tell me are most useful. I focus on the features that
allow Awk programs to grow large and allow programmers to reuse and
share code.
Two features were added fairly early to standard Awk:
multi-dimensional arrays and user-defined functions. Recent
computing algorithms, especially in data science, depend heavily on
matrices and higher-dimensional arrays called tensors. So the
addition of multi-dimensional arrays to Awk prepared it for modern
data processing. User-defined functions provided Awk with a whole
new level of reusability. You can call complex code from different
statements, and share your functions with colleagues.
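Both features can be sketched together in a few lines. The input data (region, year, amount) and the function name label are invented for illustration.

```shell
# Made-up sales records: region, year, amount.
totals=$(printf '%s\n' 'north 2020 10' 'north 2021 15' 'south 2020 7' 'north 2020 5' |
  awk 'function label(region, year) {          # user-defined function
           return region "/" year
       }
       { sales[$1, $2] += $3 }                 # two subscripts form one array key
       END {
           if (("north", 2020) in sales)       # membership test on a two-part key
               print label("north", 2020), sales["north", 2020]
           print label("south", 2020), sales["south", 2020]
       }')
echo "$totals"
```

Standard Awk stores the two subscripts as a single string key joined by the built-in SUBSEP variable, so these are simulated rather than true nested arrays.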
The other features promoting reuse and large programs are extensions in Gawk:
## Namespaces
Once Awk offered user-defined functions, this Gawk extension allowed
even more sharing and growth. As in C++ or Java, namespaces in Awk
prevent clashes between function names or other symbols defined in
different source files or libraries.
## BEGINFILE and ENDFILE
Awk provides BEGIN and END actions to let you do initial processing
(before all files are read) and terminal processing (after all files
are read). Gawk extends this with BEGINFILE and ENDFILE, which let
you specify actions to take before reading or after processing each
file in a set of multiple files.
## Two-way pipelines
These streamline the operation of coprocesses, which allow you to
delegate operations to a separate program and get results back. This
form of multiprocessing has been around in other languages for quite
a while, most notably in Go. The original form of Awk allowed
coprocesses, but only through the cumbersome use of temporary files.
## Network programming
This capability takes multiprocessing past the local system, using
classic internet sockets to communicate with programs on remote
hosts. The remote programs could be coded in any language, not just
Awk.
## Arbitrary-precision arithmetic
Like multi-dimensional arrays, this feature appeals to scientists who
need to go beyond the limitations of conventional integers and
floating-point numbers, constrained by microprocessor design.
## Plugins/extensions
These allow intrepid programmers to extend Gawk without messing
around in the core code.
# Recent Examples of Awk in Action
An article in LWN.net discusses the continued appeal of Awk along
with some recent large projects that use it. Other projects
mentioned by people I corresponded with include:
https://lwn.net/Articles/820829/
* Validation of a sports schedule, for example, ensuring that a team
doesn't have two games at the same time, that a coach isn't
coaching two teams at the same time, and that a team isn't playing
at a time when it isn't available. This program checks about 25
different constraints per team, on average.
* Converting SQL data across web sites from one schema to another by
way of exported/imported CSVs.
* A literate programming tool. The concept of "literate programming"
was invented by Donald Knuth, in some ways the grandfather of
modern programming. Hints of the idea appear in modern commenting
systems such as Javadoc.
* An IRC client and bot.
* Extracting bibliographies from technical journal articles.
* runawk, a wrapper for Awk.
* Components of the pkgsrc framework for building packages on
Unix-like systems, including pkg_summary-utils.
https://github.com/arnoldrobbins/texiwebjr
http://literateprogramming.com/knuthweb.pdf
https://github.com/cheusov/runawk
http://www.pkgsrc.org/
https://github.com/cheusov/pkg_summary-utils
In conclusion, Awk can do much more than the simple line-by-line text processing that is usually considered its forte. The discussions and
examples in this article show that the language still has a place in
the 21st century.
From: <https://www.fosslife.org/awk-power-and-promise-40-year-old-language>