[SOLVED] AWK: State-Aware Pattern Matching

Dark Owl · Post by **Dark Owl** » Tue Jun 28, 2022 4:13 am

I'm hoping there are some AWK gurus here.

For beginners: an AWK script essentially breaks down as:

/pattern1/ { operation1 }
/pattern2/ { operation2 }
...

Then the input is processed by reading the first record (a text line by default), comparing it with the pattern definitions, and running the operations for all patterns that find a match in the record. Then the next record is read, until end-of-file.
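A minimal sketch of that processing model, assuming a made-up three-line input (the variable name `out` and the sample strings are only for illustration). The second record matches both patterns, so both actions run for it:

```shell
# Illustrative only: the input lines and the variable "out" are invented.
out=$(printf 'foo\nfoobar\nbar\n' | awk '
/foo/ { print NR ": matched foo" }   # runs for every record containing foo
/bar/ { print NR ": matched bar" }   # runs for every record containing bar
')
echo "$out"
```

Record 2 ("foobar") should appear twice in the output, once per matching rule.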

What I want to do is have a state variable so that I can control which patterns are tested according to what section of an input file is currently being read. I know how to do that and make the operations conditional, but that means duplicating the state awareness through every operation:

/pattern1/ { if (state == "A") { operation1a } else if (state == "B") { operation1b } ... }
/pattern2/ { if (state == "A") { operation2a } else if (state == "B") { operation2b } ... }
...

...but then every pattern needed in any state has to be listed, and its action has to branch on every state.

What would be more efficient is to create a switch on testing the patterns in the first place:

if state=A {
/pattern1A/ { operation1A }
/pattern2A/ { operation2A }
...
}
if state=B {
/pattern1B/ { operation1B }
/pattern2B/ { operation2B }
...
}
...

...that way the patterns defined for State B need not be the same as for State A, and only patterns relevant to (say) State B need to be specified in the State B section.

The trouble is, AWK does not process the script in that way. It is procedural within the { operation } section, but outside that it looks at every pattern definition and applies each of them to the input record. "if" would not be valid syntax outside an operation.

Any ideas?

rene · Post by **rene** » Tue Jun 28, 2022 5:09 am

Dark Owl wrote: ⤴Tue Jun 28, 2022 4:13 am an AWK script essentially breaks down as:
/pattern1/ { operation1 }
/pattern2/ { operation2 }
...

That is not quite correct, and the difference matters here. The /re/ pattern is certainly the most commonly used, but an AWK rule is more generally

Code: Select all

pattern { action }

where pattern can be something other than a regex match; see man awk. In other words, you can do basically what you yourself suggest:

Code: Select all

$ cat foo.awk
/^section / {
	section = $2;
}
section == 0 {
	if ($0 ~ /foo/) print "0: foo";
	if ($0 ~ /bar/) print "0: bar";
}
section == 1 {
	if ($0 ~ /foo/) print "1: foo";
	if ($0 ~ /baz/) print "1: baz";
}

Note: this can be done more neatly (gawk has switch/case, for one), but this is a minimal illustration only.

This delimits sections by literal lines "section ?", i.e.,

Code: Select all

rene@hp8k:~$ cat foo.txt
section 0
foo
bar
baz
section 1
foo
bar
baz
section 2
foo
bar
baz
rene@hp8k:~$ awk -f foo.awk foo.txt
0: foo
0: bar
1: foo
1: baz

So, again, you can basically do as you suggest; awk is a fairly complete procedural language -- although I'd still advise you not to lose yourself too deeply in it, because something like Python would fairly soon be more convenient once things get involved.
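As a side note on patterns that are not regex matches, here is a minimal sketch of a few other pattern forms. The input lines, the threshold and the variable name `out` are made up for illustration; see man awk for the full list.

```shell
# Illustrative only: the input data and the variable "out" are invented.
out=$(printf 'alpha 1\nbeta 20\ngamma 3\n' | awk '
BEGIN   { print "start" }          # special pattern: runs before any input is read
$2 > 2  { print "big: " $1 }       # expression pattern: any true expression selects the record
NR == 1 { print "first: " $1 }     # comparison on a built-in variable
END     { print "records: " NR }   # special pattern: runs after the last record
')
echo "$out"
```

A state test such as section == 0 is just another expression pattern of the second kind.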

Dark Owl · Post by **Dark Owl** » Tue Jun 28, 2022 6:51 am

I don't disagree with what you say, but what you've done is move the pattern matching inside the action section of a pattern matching the value of "section". I was aware of that possibility; I was hoping somebody had a way to avoid the explicit pattern matching.

One idea I have is to abort processing of any further pattern rules using next.

What about something like:

STATE==n && /pattern/ { action }

?

I'll try that out later.

As to using AWK in preference to anything else, I'm sorry but I'm old-school. It's *much* easier to use something I'm familiar with than something I'm not, and anyway AWK already handles a lot of the nitty-gritty of text file processing which would have to be implemented explicitly in a general-purpose language.

rene · Post by **rene** » Tue Jun 28, 2022 6:55 am

I don't understand your comment; what I did was exactly what you yourself suggested in the part of your post starting with "What would be more efficient [ ... ]".

Anyways; need be off...

Dark Owl · Post by **Dark Owl** » Tue Jun 28, 2022 7:00 am

Yeah, well maybe I didn't express myself as precisely as some might like. By "efficient" I meant in terms of concise and easily understood code rather than execution.

Anyway, I updated the previous post to indicate a line of enquiry I shall pursue next...

rene · Post by **rene** » Tue Jun 28, 2022 7:33 am

Believe you may have misread my 'section' clauses as being the very same; note that the second uses e.g. /baz/ rather than /bar/, i.e., does as per that

... that way the patterns defined for State B need not be the same as for State A, and only patterns relevant to (say) State B need to be specified in the State B section.

If that's not what you mean I'll give up.

Dark Owl · Post by **Dark Owl** » Tue Jun 28, 2022 9:03 am

rene wrote: ⤴Tue Jun 28, 2022 7:33 am If that's not what you mean I'll give up.

Why do you think I don't understand what you wrote?

This works (expanding slightly on your example):

Code: Select all

F:\Test>type foo2.awk
/^section / {
        section = $2;
}

# Applies the following only to lines found after "section 0"
section == 0 && /foo/ { print "0: foo" }
section == 0 && /bar/ { print "0: bar" }
section == 0 { next }

# Applies the following only to lines found after "section 1"
section == 1 && /foo/ { print "1: foo" }
section == 1 && /baz/ { print "1: baz" }
section == 1 { next }

# Applies the following only to lines where the section switch is not "0" or "1"
{ print $0 }

Code: Select all

F:\Test>type foo.txt
section 0
foo
bar
baz
section 1
foo
bar
baz
section 2
foo
bar
baz

F:\Test>gawk -f foo2.awk foo.txt
0: foo
0: bar
1: foo
1: baz
section 2
foo
bar
baz

F:\Test>

It's just a question of which version is easiest to read.

Dark Owl · Post by **Dark Owl** » Thu Jun 30, 2022 5:38 pm

I often see the text "foo" and "bar" used as random sample strings... but why isn't it "fu" and "bar"?

Coggy · Post by **Coggy** » Fri Jul 01, 2022 3:52 am

A long list of STATE==n && /pattern/ { action } lines would be my chosen approach, unless I was really worried about performance.

You could put some next actions in there to cut processing short, but that complicates the structure and makes mistakes more likely when updating the code. So I wouldn't do that unless necessary.
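A rough sketch of that flat list, assuming the same "section N" marker lines used earlier in the thread (the shortened input and the variable name `out` are made up for illustration). Without next, the pass-through rule has to exclude the handled states itself:

```shell
# Illustrative only: shortened sample input; "out" is an invented variable name.
out=$(printf 'section 0\nfoo\nbar\nbaz\nsection 1\nfoo\nbar\nbaz\nsection 2\nfoo\n' | awk '
/^section / { section = $2 }

# Every rule names its own state, so no rule depends on an earlier "next".
section == 0 && /foo/ { print "0: foo" }
section == 0 && /bar/ { print "0: bar" }
section == 1 && /foo/ { print "1: foo" }
section == 1 && /baz/ { print "1: baz" }

# The pass-through rule must list the handled states explicitly.
section != 0 && section != 1 { print }
')
echo "$out"
```

The trade-off is that adding a new state means touching the last rule as well, whereas the next-based version only needs a new block.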

rene · Post by **rene** » Fri Jul 01, 2022 4:43 am

Dark Owl wrote: ⤴Thu Jun 30, 2022 5:38 pm I often see the text "foo" and "bar" used as random sample strings... but why isn't it "fu" and "bar"?

History; http://www.catb.org/jargon/html/F/foo.html

Dark Owl · Post by **Dark Owl** » Sat Jul 02, 2022 4:05 am

Coggy wrote: ⤴Fri Jul 01, 2022 3:52 am You could put some next actions in there to cut processing short

I did (see above). I don't see that terminating a state section with a next complicates anything, unless you want to do some subsequent processing valid for all states.

I like Rene's approach, because it effectively vectors to the relevant state section (especially if using a case statement). I just don't much like having to make explicit match statements within the state processing.

rene wrote: ⤴Fri Jul 01, 2022 4:43 am History; http://www.catb.org/jargon/html/F/foo.html

Very good! Thanks.

Dark Owl · Post by **Dark Owl** » Sat Jul 02, 2022 5:52 am

Here's a refinement: "$0 ~" is not required, because a bare /regexp/ used as an expression is shorthand for "$0 ~ /regexp/":

Code: Select all

F:\Test>type foo3.awk
/^section / {
        section = $2;
}

{  switch (section) {

   case 0:

      if (/foo/) { print "0: foo" }
      if (/bar/) { print "0: bar" }
      break

   case 1:

      if (/foo/) { print "1: foo" }
      if (/baz/) { print "1: baz" }
      break

   default:

      print $0

   }
}

F:\Test>type foo.txt
section 0
foo
bar
baz
section 1
foo
bar
baz
section 2
foo
bar
baz

F:\Test>gawk -f foo3.awk < foo.txt
0: foo
0: bar
1: foo
1: baz
section 2
foo
bar
baz

F:\Test>

Getting close! I had an idea the "?" selection operator might be used instead of the "if", but Gawk baulked at having anything other than an expression on the right of the ?.
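For what it's worth, a minimal sketch of where "?" does fit: both branches have to be expressions, so it can select a value, and the print statement stays outside it. The input, the variable `label` and the variable `out` are made up for illustration.

```shell
# Illustrative only: the input lines and the names "label" and "out" are invented.
out=$(printf 'foo\nbar\nbaz\n' | awk '
{
    # Pick a value with ?: (expressions only), then print as a separate statement.
    label = (/foo/) ? "foo" : ((/bar/) ? "bar" : "")
    if (label != "") print "0: " label
}
')
echo "$out"
```

Note this is not quite equivalent to the two separate if statements: a record matching both foo and bar would print only the first label.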

NB: The switch-case structure is not available in all versions of AWK, and Gawk won't recognise it if in compatibility mode.
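Where switch is unavailable, an if / else if chain gives the same dispatch. This is only a sketch under the same assumptions as foo3.awk (sections marked by "section N" lines), with a shortened input and a made-up variable name `out`:

```shell
# Illustrative only: shortened sample input; "out" is an invented variable name.
out=$(printf 'section 0\nfoo\nbar\nbaz\nsection 1\nfoo\nbar\nbaz\nsection 2\nfoo\n' | awk '
/^section / { section = $2 }

{
    if (section == 0) {            # stands in for "case 0:"
        if (/foo/) print "0: foo"
        if (/bar/) print "0: bar"
    } else if (section == 1) {     # stands in for "case 1:"
        if (/foo/) print "1: foo"
        if (/baz/) print "1: baz"
    } else {                       # stands in for "default:"
        print
    }
}
')
echo "$out"
```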

Linux Mint Forums
