[SOLVED] Extract text from a file.

Post by **Pecu1iar** » Sat Apr 23, 2022 11:30 pm

OK, so I've been looking for a few days, but I'm not skilled enough to find what I need or I don't understand what I'm reading. For someone who knows, I imagine it wouldn't be hard at all.

What I have is an M3U text file that contains a few thousand lines. I want to extract two pieces of information from all relevant lines.

I have attached a bit of the text file.

I need to extract tvg-name="whatever the name is" and group-title="whatever the title is", effectively removing everything else. When finished, I would like to be left with just the tvg-name in column A and the corresponding group-title in column B. I would like this all to be saved in a format that can be manipulated within a spreadsheet like Google Sheets or Excel.

I'm going to keep searching for some clue as to how to do this, but I know you all are a lot more skilled at this than I could ever be.

Thanks,

Post by **xenopeek** » Sun Apr 24, 2022 1:54 am

I often have to do things like that when extracting data from text files. Thanks for including an example file. This should work:
grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/'
That puts tvg-name value in first column and group-title value in the second column, separated by a tab character.

For readability on the terminal add | column -s $'\t' -t at the end of the command, which neatly puts it in text columns:

ABC (KOAT) Albuquerque NM    REGIONAL LOCALS
ABC (KVII) Amarillo TX       REGIONAL LOCALS
ABC (WOI) Ames IA            REGIONAL LOCALS
ABC (KAEF) Arcata CA         REGIONAL LOCALS
ABC (WLOS) Asheville NC      REGIONAL LOCALS
ABC (WSB) Atlanta GA         REGIONAL LOCALS
ABC (KAAL) Austin Minnesota  REGIONAL LOCALS
ABC (KERO) Bakersfield CA    REGIONAL LOCALS
ABC (KBMT) Beaumont TX       REGIONAL LOCALS
ABC (WLOX) Biloxi MS         REGIONAL LOCALS
ABC (WMAR) Blatimore MD      REGIONAL LOCALS

So anyway, what the command I gave does:

grep 'tvg-name=' M3U.txt this gets all the lines from the file that contain tvg-name=, discarding the other lines
| pipes the output into the next command
sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/' this searches for tvg-name="<something>" on the line and group-title="<something>" somewhere after it, skipping over everything before, between and after those, and then replaces the line with just the bits we want: the 1st and 2nd <something> that we captured.

sed is using regular expressions. If you've not used regular expressions before it can look like mumbo jumbo but it's quite handy for text processing to know a bit. I can explain the regular expression in the sed command further if you want.
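
To see the whole pipeline in action, here is a minimal sketch that runs it on a tiny sample. The playlist lines below are made up to resemble a typical IPTV M3U file (they are not the attachment from the first post), sample.m3u is just a placeholder name, and GNU sed is assumed for -r and \t.

```shell
# Build a small sample playlist (hypothetical content, not the original attachment).
cat > sample.m3u <<'EOF'
#EXTM3U
#EXTINF:-1 tvg-id="" tvg-name="ABC (KOAT) Albuquerque NM" tvg-logo="" group-title="REGIONAL LOCALS",ABC (KOAT) Albuquerque NM
http://example.invalid/stream/1
#EXTINF:-1 tvg-id="" tvg-name="ABC (KVII) Amarillo TX" tvg-logo="" group-title="REGIONAL LOCALS",ABC (KVII) Amarillo TX
http://example.invalid/stream/2
EOF

# Keep only the lines containing tvg-name=, then reduce each to "name<TAB>group".
result=$(grep 'tvg-name=' sample.m3u | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/')

printf '%s\n' "$result"
```

The #EXTM3U header and the URL lines never reach sed because grep drops them first.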

Post by **rene** » Sun Apr 24, 2022 2:03 am

So as to output directly to CSV, i.e., importable into a spreadsheet, I'd slightly tweak that to be

grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/"\1","\2"/'

Post by **xenopeek** » Sun Apr 24, 2022 2:15 am

rene wrote: ⤴Sun Apr 24, 2022 2:03 am So as to output directly to CSV, i.e., importable into a spreadsheet, I'd slightly tweak that to be

Spreadsheet programs usually can also handle TSV files. For CSV a slightly nicer tweak is to just move the brackets one character out:

grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name=("[^"]*").*group-title=("[^"]*").*/\1,\2/'

That captures the quotes.
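
For anyone who wants the CSV written straight to a file, this is a rough sketch of one way to do it. The file names sample.m3u and channels.csv, the header row, and the sample playlist line are all my own placeholders, not something from the attachment; GNU sed is assumed.

```shell
# Hypothetical one-entry playlist for illustration.
cat > sample.m3u <<'EOF'
#EXTINF:-1 tvg-id="" tvg-name="ABC (WSB) Atlanta GA" tvg-logo="" group-title="REGIONAL LOCALS",ABC (WSB) Atlanta GA
http://example.invalid/stream/1
EOF

# Write a header row first, then append one quoted row per channel.
echo '"tvg-name","group-title"' > channels.csv
grep 'tvg-name=' sample.m3u \
  | sed -r 's/.*tvg-name=("[^"]*").*group-title=("[^"]*").*/\1,\2/' >> channels.csv

cat channels.csv
```

Keeping the quotes around each value means a name that happens to contain a comma should still land in a single spreadsheet column.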

Post by **Pecu1iar** » Sun Apr 24, 2022 2:30 am

You all are awesome.

I've tested the original script and added a bit to output a new file. It imports into Google Sheets pretty easily.

"sed is using regular expressions. If you've not used regular expressions before it can look like mumbo jumbo but it's quite handy for text processing to know a bit. I can explain the regular expression in the sed command further if you want."

I'm always wanting to learn so if you have time to explain all the bits, I'll try to understand.

After looking at the file, I see that the tvg-name is actually also found after the group-title, preceded by a comma. I'm going to try to figure out how to extract and separate it that way just to see if I can.

Thank you all again.

Post by **rene** » Sun Apr 24, 2022 2:54 am

xenopeek wrote: ⤴Sun Apr 24, 2022 2:15 am Spreadsheet programs usually can also handle TSV files.

Wow, that's actually a thing... https://en.wikipedia.org/wiki/Tab-separated_values

Post by **xenopeek** » Sun Apr 24, 2022 3:05 am

The sed command is in the form s/regular-expression/replacement/. We can split it like this for readability:
s / .*tvg-name="([^"]*)".*group-title="([^"]*)".* / \1\t\2 /

The .*tvg-name="([^"]*)".*group-title="([^"]*)".* part is the regular expression. There are a few special characters in this:

. dot matches any single character. It is a wildcard.
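* star says "match zero or more of the preceding item", where the item can be a single character, a dot, or a character set. It is a repeat, and it is greedy: it matches as much as it can.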
( ) brackets are a capture group. What is matched between them is captured and can be referred back to in the replacement. In the replacement above \1 refers back to the first capture group, \2 to the second.
[ ] square brackets are a character set. It matches a single character from the set of characters between the brackets. If the first character is a ^ it inverts the character set—it matches any single character except those from the character set.
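
A quick sketch of why the capture groups use [^"]* rather than .* for the values. The input line here is a made-up, shortened example, and GNU sed is assumed.

```shell
line='tvg-name="ABC" group-title="LOCALS"'

# With .* inside the capture, the match runs on to the LAST double quote on the line.
greedy=$(printf '%s\n' "$line" | sed -r 's/.*tvg-name="(.*)".*/\1/')

# With [^"]* the match has to stop at the first double quote after the value starts.
exact=$(printf '%s\n' "$line" | sed -r 's/.*tvg-name="([^"]*)".*/\1/')

echo "greedy: $greedy"   # greedy: ABC" group-title="LOCALS
echo "exact:  $exact"    # exact:  ABC
```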

So what it does:

.* this matches everything from the start of the line until tvg-name= is found, effectively skipping over it because we don't care about that part.
tvg-name=" this literally matches that text
([^"]*) this captures the value of tvg-name, using a character set where we want to find any character that is not a "
" this is the closing " of the tvg-name value
.* this skips over the next part until group-title is found
group-title=" matches that text
([^"]*) this captures the value of group-title, using a character set where we want to find any character that is not a "
" this is the closing " of the group-title value
.* this skips over the rest of the line

And in the replacement \1\t\2 the \1 refers back to the tvg-name value we captured, \t stands for the tab character, \2 refers back to the group-title captured.
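
The same building blocks can be rearranged for the layout Pecu1iar mentioned, where the name appears again after the group-title, preceded by a comma. This is only a sketch: it assumes the comma directly follows the closing quote of group-title and that the name runs to the end of the line, and the sample line is made up. GNU sed is assumed.

```shell
line='#EXTINF:-1 tvg-id="" tvg-name="ABC (WOI) Ames IA" tvg-logo="" group-title="REGIONAL LOCALS",ABC (WOI) Ames IA'

# \1 is the group-title value, \2 is everything after the comma that follows it.
# The replacement swaps them so the name still ends up in the first column.
swapped=$(printf '%s\n' "$line" | sed -r 's/.*group-title="([^"]*)",(.*)/\2\t\1/')

printf '%s\n' "$swapped"
```

If some lines carry other attributes between group-title and the comma, the pattern would need adjusting.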

Post by **Pecu1iar** » Sun Apr 24, 2022 4:50 am

Thank you, xenopeek. I'm going to refer back to this and try to do some other things.

I appreciate that you take the time to help all of us who are not as gifted as you when it comes to things like this.

Post by **user6c57b8** » Tue May 24, 2022 4:33 pm

This thread inspired me to create a GitHub repository to help you (and me) become more skilled with bash/terminal/linux/mint/debian-based-linux, bash/terminal/cygwin, bash/terminal/wsl2 + perl (which, I think, comes with all modern Linux distributions):

https://github.com/user95f85f/readme-NOW

Linux Mint Forums
