[SOLVED] AWK: State-Aware Pattern Matching
Forum rules
Topics in this forum are automatically closed 6 months after creation.
I'm hoping there are some AWK gurus here.
For beginners: an AWK script essentially breaks down as:
/pattern1/ { operation1 }
/pattern2/ { operation2 }
...
Then the input is processed by reading the first record (a text line by default), comparing it with the pattern definitions, and running the operations for all patterns that find a match in the record. Then the next record is read, until end-of-file.
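To make that processing model concrete, here is a minimal sketch (the function name show_rules and the sample input are made up for illustration). A record that matches both patterns triggers both rules, in script order:

```shell
# Hypothetical helper: two rules, both of which can fire on the same record.
show_rules() {
  awk '
    /foo/ { print "rule 1 saw: " $0 }
    /bar/ { print "rule 2 saw: " $0 }
  '
}

# "foobar" matches both patterns, "baz" matches neither.
printf 'foo\nfoobar\nbaz\n' | show_rules
```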
What I want to do is have a state variable so that I can control which patterns are tested according to what section of an input file is currently being read. I know how to do that and make the operations conditional, but that means duplicating the state awareness through every operation:
/pattern1/ { if (state == "A") { operation1a } else if (state == "B") { operation1b } ... }
/pattern2/ { if (state == "A") { operation2a } else if (state == "B") { operation2b } ... }
...
...but every pattern required in every state then needs to be specified for every state.
What would be more efficient is to create a switch on testing the patterns in the first place:
if state=A {
/pattern1A/ { operation1A }
/pattern2A/ { operation2A }
...
}
if state=B {
/pattern1B/ { operation1B }
/pattern2B/ { operation2B }
...
}
...
...that way the patterns defined for State B need not be the same as for State A, and only patterns relevant to (say) State B need to be specified in the State B section.
The trouble is, AWK does not process the script in that way. It is procedural within the { operation } section, but outside that it looks at every pattern definition and applies each of them to the input record. "if" would not be valid syntax outside an operation.
Any ideas?
Last edited by LockBot on Wed Dec 28, 2022 7:16 am, edited 3 times in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.
Currently: Linux Mint 21.2 Cinnamon 64-bit 5.8.4, AMD Ryzen5 + Geforce GT 710
Previously: LM20.3 LM20.2 LM20.1, LM20, LM20β, LM18.2
Re: AWK: State Aware Pattern Matching
That is not quite correct, and in this case relevantly so. Certainly the /re/ pattern is the most often used, but an AWK statement is more generally
Code: Select all
pattern { action }
and pattern can be something other than a regex match as well; see man awk. Here this is to say you can do basically what you yourself suggest:
Code: Select all
$ cat foo.awk
/^section / {
section = $2;
}
section == 0 {
if ($0 ~ /foo/) print "0: foo";
if ($0 ~ /bar/) print "0: bar";
}
section == 1 {
if ($0 ~ /foo/) print "1: foo";
if ($0 ~ /baz/) print "1: baz";
}
gawk knows about switch/case, for one, but this is a minimal illustration only. This delimits sections by literal lines "section ?", i.e.,
Code: Select all
rene@hp8k:~$ cat foo.txt
section 0
foo
bar
baz
section 1
foo
bar
baz
section 2
foo
bar
baz
rene@hp8k:~$ awk -f foo.awk foo.txt
0: foo
0: bar
1: foo
1: baz
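As a further illustration of the point that a pattern can be any expression, here is a small sketch (the function name pattern_kinds and the particular rules are invented for this example, not taken from foo.awk):

```shell
# Hypothetical helper showing three non-regex kinds of pattern.
pattern_kinds() {
  awk '
    # comparison on a built-in variable
    NR == 1 { print "first record: " $0 }
    # string comparison on a field
    $1 == "section" { print "header for section " $2 }
    # compound expression mixing a function call and a negated regex
    length($0) == 3 && !/^ba/ { print "3 chars, not ba*: " $0 }
  '
}

printf 'section 0\nfoo\nbar\n' | pattern_kinds
```

With that input the first record satisfies the first two rules, "foo" satisfies the third, and "bar" satisfies none.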
Re: AWK: State Aware Pattern Matching
I don't disagree with what you say, but what you've done is move the pattern matching inside the action section of a pattern matching the value of "section". I was aware of that possibility, I was hoping somebody had a way to escape the actual pattern matching.
One idea I have is to abort processing of any further pattern rules using next.
What about something like:
STATE==n && /pattern/ { action }
?
I'll try that out later.
As to using AWK in preference to anything else, I'm sorry but I'm old-school. It's *much* easier to use something I'm familiar with than something I'm not, and anyway AWK already handles a lot of the nitty-gritty of text file processing which would have to be implemented explicitly in a general-purpose language.
Re: AWK: State Aware Pattern Matching
I don't understand your comment; what I did was exactly what you yourself suggested in the part of your post starting with "What would be more efficient [ ... ]".
Anyways; need be off...
Re: AWK: State Aware Pattern Matching
Yeah, well maybe I didn't express myself as precisely as some might like. By "efficient" I meant in terms of concise and easily understood code rather than execution.
Anyway, I updated the previous post to indicate a line of enquiry I shall pursue next...
Re: AWK: State Aware Pattern Matching
Believe you may have misread my 'section' clauses as being the very same; note that the second uses e.g. /baz/ rather than /bar/, i.e., does as per that
"...that way the patterns defined for State B need not be the same as for State A, and only patterns relevant to (say) State B need to be specified in the State B section."
If that's not what you mean I'll give up.
Re: AWK: State Aware Pattern Matching
Why do you think I don't understand what you wrote?
This works (expanding slightly on your example):
Code: Select all
F:\Test>type foo2.awk
/^section / {
section = $2;
}
# Applies the following only to lines found after "section 0"
section == 0 && /foo/ { print "0: foo" }
section == 0 && /bar/ { print "0: bar" }
section == 0 { next }
# Applies the following only to lines found after "section 1"
section == 1 && /foo/ { print "1: foo" }
section == 1 && /baz/ { print "1: baz" }
section == 1 { next }
# Applies the following only to lines where the section switch is not "0" or "1"
{ print $0 }
Code: Select all
F:\Test>type foo.txt
section 0
foo
bar
baz
section 1
foo
bar
baz
section 2
foo
bar
baz
F:\Test>gawk -f foo2.awk foo.txt
0: foo
0: bar
1: foo
1: baz
section 2
foo
bar
baz
F:\Test>
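One caveat with comparing the state against 0: an uninitialised AWK variable compares equal to 0, so any lines appearing before the first "section" header would be handled as section 0. A possible guard, sketched here under the assumption that a sentinel value such as "none" never occurs as a real section name (the function name states is also made up):

```shell
# Hypothetical variant of foo2.awk with an explicit initial state.
states() {
  awk '
    BEGIN { section = "none" }   # assumed sentinel: no header seen yet
    /^section / { section = $2 }
    section == 0 && /foo/ { print "0: foo" }
    section == 0 && /bar/ { print "0: bar" }
    section == 0 { next }
    section == 1 && /foo/ { print "1: foo" }
    section == 1 && /baz/ { print "1: baz" }
    section == 1 { next }
    { print $0 }
  '
}

# The leading "foo" comes before any header and falls through to the default rule.
printf 'foo\nsection 0\nfoo\nbaz\nsection 1\nbaz\n' | states
```

Without the BEGIN rule, that leading "foo" would be reported as "0: foo".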
Re: [SOLVED] AWK: State-Aware Pattern Matching
I often see the text "foo" and "bar" used as random sample strings... but why isn't it "fu" and "bar"?
Re: [SOLVED] AWK: State-Aware Pattern Matching
A long list of
STATE==n && /pattern/ { action }
lines would be my chosen approach, unless I was really worried about performance. You could put some next actions in there to cut processing short, but that complicates the structure and makes mistakes while updating the code more likely. So I wouldn't do that unless necessary.
Re: [SOLVED] AWK: State-Aware Pattern Matching
I did (see above). I don't see that terminating a state section with a next complicates anything, unless you want to do some subsequent processing valid for all states.
I like Rene's approach, because it effectively vectors to the relevant state section (especially if using a case statement). I just don't much like having to make explicit match statements within the state processing.
Very good! Thanks.
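On the point about processing that is valid for all states: one way to keep it working alongside the section-ending next is to place the common rule ahead of the state sections. A rough sketch, with the counter seen and the function name count_then_dispatch invented for the example:

```shell
# Hypothetical layout: common rule first, then per-state rules each closed by next.
count_then_dispatch() {
  awk '
    /^section / { section = $2 }
    { seen++ }                                # common rule: runs for every record
    section == 0 && /foo/ { print "0: foo" }
    section == 0 { next }                     # ends section 0 handling
    section == 1 && /baz/ { print "1: baz" }
    section == 1 { next }
    END { print "records: " seen }            # END still runs after next
  '
}

printf 'section 0\nfoo\nsection 1\nbaz\n' | count_then_dispatch
```

A common rule placed after the state sections instead would never be reached for records in sections 0 and 1, which is the kind of mistake the next approach invites.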
Re: [SOLVED] AWK: State-Aware Pattern Matching
Getting close! I had an idea the "?" selection operator might be used instead of the "if", but Gawk baulked at having anything other than an expression on the right of the "?".
Here's a refinement: "$0 ~" is not required, because a regexp constant used on its own as an expression is implicitly matched against $0 (i.e. /foo/ means $0 ~ /foo/):
Code: Select all
F:\Test>type foo3.awk
/^section / {
section = $2;
}
{ switch (section) {
case 0:
if (/foo/) { print "0: foo" }
if (/bar/) { print "0: bar" }
break
case 1:
if (/foo/) { print "1: foo" }
if (/baz/) { print "1: baz" }
break
default:
print $0
}
}
F:\Test>type foo.txt
section 0
foo
bar
baz
section 1
foo
bar
baz
section 2
foo
bar
baz
F:\Test>gawk -f foo3.awk < foo.txt
0: foo
0: bar
1: foo
1: baz
section 2
foo
bar
baz
F:\Test>
NB: The switch-case structure is not available in all versions of AWK, and Gawk won't recognise it if in compatibility mode.
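For an AWK without switch, the same dispatch can be sketched with an ordinary if/else chain, which should run in any POSIX awk. This is only an illustration; the function name states_portable is made up:

```shell
# Hypothetical portable variant of foo3.awk using if/else instead of switch.
states_portable() {
  awk '
    /^section / { section = $2 }
    {
      if (section == 0) {
        if (/foo/) print "0: foo"
        if (/bar/) print "0: bar"
      } else if (section == 1) {
        if (/foo/) print "1: foo"
        if (/baz/) print "1: baz"
      } else {
        print $0      # any other section: pass the record through
      }
    }
  '
}

printf 'section 0\nfoo\nbaz\nsection 2\nbar\n' | states_portable
```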