[SOLVED] Extract text from a file.

Pecu1iar
Level 1
Posts: 39
Joined: Sun Nov 01, 2020 6:14 pm

Post by Pecu1iar »

OK, so I've been looking for a few days, but I'm not skilled enough to find what I need or I don't understand what I'm reading. For someone who knows, I imagine it wouldn't be hard at all.

What I have is an M3U text file that contains a few thousand lines. I want to extract two pieces of information from all relevant lines.

I have attached a bit of the text file.

I need to extract tvg-name="whatever the name is" and group-title="whatever the title is", effectively removing everything else. When finished, I would like to be left with just the tvg-name in column A and the corresponding group-title in column B. I would like this all to be saved in a format that can be manipulated within a spreadsheet like Google Sheets or Excel.

I'm going to keep searching for some clue as to how to do this, but I know you all are a lot more skilled at this than I could ever be.

Thanks,
Attachments
M3U.txt
(2.42 KiB) Downloaded 42 times
Last edited by LockBot on Wed Dec 28, 2022 7:16 am, edited 2 times in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.
xenopeek
Level 25
Posts: 29615
Joined: Wed Jul 06, 2011 3:58 am

Re: Extract text from a file.

Post by xenopeek »

I often have to do things like that when extracting data from text files. Thanks for including an example file. This should work:
grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/'
That puts tvg-name value in first column and group-title value in the second column, separated by a tab character.

For readability on the terminal add | column -s $'\t' -t at the end of the command, which neatly puts it in text columns:

Code:

ABC (KOAT) Albuquerque NM    REGIONAL LOCALS
ABC (KVII) Amarillo TX       REGIONAL LOCALS
ABC (WOI) Ames IA            REGIONAL LOCALS
ABC (KAEF) Arcata CA         REGIONAL LOCALS
ABC (WLOS) Asheville NC      REGIONAL LOCALS
ABC (WSB) Atlanta GA         REGIONAL LOCALS
ABC (KAAL) Austin Minnesota  REGIONAL LOCALS
ABC (KERO) Bakersfield CA    REGIONAL LOCALS
ABC (KBMT) Beaumont TX       REGIONAL LOCALS
ABC (WLOX) Biloxi MS         REGIONAL LOCALS
ABC (WMAR) Blatimore MD      REGIONAL LOCALS
So anyway, what the command I gave does:
  • grep 'tvg-name=' M3U.txt this gets all the lines from the file that contain tvg-name=, discarding the other lines
  • | pipes the output into the next command
  • sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/' this searches for tvg-name="<something>" on the line and group-title="<something>" somewhere after it (skipping over everything before, between and after those) and then replaces the line with just the bits we want: the 1st and 2nd <something> that we captured.
sed is using regular expressions. If you've not used regular expressions before it can look like mumbo jumbo but it's quite handy for text processing to know a bit. I can explain the regular expression in the sed command further if you want.
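
To try the command without the attachment, here is a minimal sketch. The two #EXTINF lines, the example.invalid URLs and the names extract_tsv, sample and out are made up for illustration (modelled on the output shown above), and it assumes GNU sed, which is what understands -r and \t in the replacement.

```shell
# Hypothetical sample data: the real M3U.txt attachment is not reproduced here.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#EXTM3U
#EXTINF:-1 tvg-id="" tvg-name="ABC (KOAT) Albuquerque NM" tvg-logo="" group-title="REGIONAL LOCALS",ABC (KOAT) Albuquerque NM
http://example.invalid/stream/1
#EXTINF:-1 tvg-id="" tvg-name="ABC (KVII) Amarillo TX" tvg-logo="" group-title="REGIONAL LOCALS",ABC (KVII) Amarillo TX
http://example.invalid/stream/2
EOF

# Same pipeline as in the post, wrapped in a function that takes the file name.
extract_tsv() {
    grep 'tvg-name=' "$1" |
        sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/'
}

# Redirect the result into a tab-separated file that a spreadsheet can import.
out="${sample}.tsv"
extract_tsv "$sample" > "$out"
cat "$out"
```

The header line and the URL lines contain no tvg-name=, so grep drops them and only the two channel lines reach sed.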
rene
Level 20
Posts: 12212
Joined: Sun Mar 27, 2016 6:58 pm

Re: Extract text from a file.

Post by rene »

So as to output directly to CSV, i.e., importable into a spreadsheet, I'd slightly tweak that to be

Code:

grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/"\1","\2"/'
xenopeek
Level 25
Posts: 29615
Joined: Wed Jul 06, 2011 3:58 am

Re: Extract text from a file.

Post by xenopeek »

rene wrote: Sun Apr 24, 2022 2:03 am So as to output directly to CSV, i.e., importable into a spreadsheet, I'd slightly tweak that to be
Spreadsheet programs usually can also handle TSV files. For CSV a slightly nicer tweak is to just move the brackets one character out :) That captures the quotes.

grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name=("[^"]*").*group-title=("[^"]*").*/\1,\2/'
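
A small sketch of that CSV variant writing straight to a file. The sample lines and the names extract_csv, sample and out are hypothetical, and GNU sed is assumed for -r. One made-up channel name contains a comma, to show that keeping the quotes keeps it in a single field.

```shell
# Hypothetical sample data, not taken from the real M3U.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#EXTM3U
#EXTINF:-1 tvg-name="News, Weather and Sport" group-title="LOCAL NEWS",News, Weather and Sport
http://example.invalid/stream/1
#EXTINF:-1 tvg-name="ABC (WSB) Atlanta GA" group-title="REGIONAL LOCALS",ABC (WSB) Atlanta GA
http://example.invalid/stream/2
EOF

# The capture groups include the surrounding double quotes, so each value
# arrives in the output already quoted and only the comma has to be added.
extract_csv() {
    grep 'tvg-name=' "$1" |
        sed -r 's/.*tvg-name=("[^"]*").*group-title=("[^"]*").*/\1,\2/'
}

out="${sample}.csv"
extract_csv "$sample" > "$out"
cat "$out"
```

Because the pattern [^"]* can never match a double quote, a captured value cannot contain one, so no extra CSV escaping is needed for these two fields.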
Pecu1iar
Level 1
Posts: 39
Joined: Sun Nov 01, 2020 6:14 pm

Re: Extract text from a file.

Post by Pecu1iar »

You all are awesome.

I've tested the original script and added a bit to output a new file. It imports into google sheets pretty easily.

"sed is using regular expressions. If you've not used regular expressions before it can look like mumbo jumbo but it's quite handy for text processing to know a bit. I can explain the regular expression in the sed command further if you want."

I'm always wanting to learn so if you have time to explain all the bits, I'll try to understand.

After looking at the file, I see that the tvg-name is actually found after the group title, preceded by a comma. I'm going to try to figure out how to extract and separate it that way just to see if I can.

Thank you all again.
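
One possible sketch of that comma-based approach, not the only way to do it. It assumes that group-title is the last attribute on the line and that the name follows directly after its closing quote and a comma; the sample lines and the names extract_after_comma and sample are made up for illustration, and GNU sed is assumed for -r and \t.

```shell
# Hypothetical sample data, not taken from the real M3U.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#EXTM3U
#EXTINF:-1 tvg-id="" tvg-name="ABC (KOAT) Albuquerque NM" group-title="REGIONAL LOCALS",ABC (KOAT) Albuquerque NM
http://example.invalid/stream/1
#EXTINF:-1 tvg-id="" tvg-name="ABC (WOI) Ames IA" group-title="REGIONAL LOCALS",ABC (WOI) Ames IA
http://example.invalid/stream/2
EOF

# \1 is the group-title value, \2 is everything after the comma that follows
# its closing quote. The replacement swaps them so the name is column one.
extract_after_comma() {
    grep 'group-title=' "$1" |
        sed -r 's/.*group-title="([^"]*)",(.*)/\2\t\1/'
}

extract_after_comma "$sample"
```

If another attribute sits between group-title and the comma, the pattern will not match and sed will pass the line through unchanged, so it is worth checking the output for leftover #EXTINF lines.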
rene
Level 20
Posts: 12212
Joined: Sun Mar 27, 2016 6:58 pm

Re: Extract text from a file.

Post by rene »

xenopeek wrote: Sun Apr 24, 2022 2:15 am Spreadsheet programs usually can also handle TSV files.
Wow, that's actually a thing... https://en.wikipedia.org/wiki/Tab-separated_values
xenopeek
Level 25
Posts: 29615
Joined: Wed Jul 06, 2011 3:58 am

Re: [SOLVED] Extract text from a file.

Post by xenopeek »

The sed command is in the form s/regular-expression/replacement/. We can split it like this for readability:
s / .*tvg-name="([^"]*)".*group-title="([^"]*)".* / \1\t\2 /

The .*tvg-name="([^"]*)".*group-title="([^"]*)".* part is the regular expression. There are a few special characters in this:
  • . dot matches any single character. It is a wildcard.
  • * star says "match zero or more of the preceding item", where the item can be a literal character, a dot or a character set. It is a repeat.
  • ( ) brackets are a capture group. What is matched between them is captured and can be referred back to in the replacement. In the replacement above \1 refers back to the first capture group, \2 to the second.
  • [ ] square brackets are a character set. It matches a single character from the set of characters between the brackets. If the first character is a ^ it inverts the character set—it matches any single character except those from the character set.
So what it does:
  • .* this matches everything from the start of the line until tvg-name= is found, effectively skipping over it because we don't care about that part.
  • tvg-name=" this literally matches that text
  • ([^"]*) this captures the value of tvg-name, using a character set where we want to find any character that is not a "
  • " this is the closing " of the tvg-name value
  • .* this skips over the next part until group-title is found
  • group-title=" matches that text
  • ([^"]*) this captures the value of group-title, using a character set where we want to find any character that is not a "
  • " this is the closing " of the group-title value
  • .* this skips over the rest of the line
And in the replacement \1\t\2 the \1 refers back to the tvg-name value we captured, \t stands for the tab character, \2 refers back to the group-title captured.
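
To see why the value is captured with [^"]* rather than .*, here is a small experiment on a single made-up line. The function names greedy_capture and bounded_capture and the sample text are hypothetical, and GNU sed is assumed for -r.

```shell
line='tvg-name="ABC" group-title="LOCALS"'

# .* is greedy: inside the capture group it runs on to the LAST double quote
# on the line, so the capture swallows the start of the next attribute too.
greedy_capture() {
    printf '%s\n' "$1" | sed -r 's/.*tvg-name="(.*)".*/\1/'
}

# [^"]* cannot cross a double quote, so the capture stops at the closing
# quote of the tvg-name value.
bounded_capture() {
    printf '%s\n' "$1" | sed -r 's/.*tvg-name="([^"]*)".*/\1/'
}

greedy_capture "$line"    # prints: ABC" group-title="LOCALS
bounded_capture "$line"   # prints: ABC
```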
Pecu1iar
Level 1
Posts: 39
Joined: Sun Nov 01, 2020 6:14 pm

Re: [SOLVED] Extract text from a file.

Post by Pecu1iar »

Thank you, xenopeek. I'm going to refer back to this and try to do some other things.

I appreciate that you take the time to help all of us who are not as gifted as you when it comes to things like this.
user6c57b8
Level 2
Posts: 52
Joined: Mon Aug 05, 2019 1:07 pm

Re: [SOLVED] Extract text from a file.

Post by user6c57b8 »

This thread inspired me to create a GitHub repository to help you (and me) become more skilled with bash and the terminal on Linux Mint and other Debian-based distributions, on Cygwin and on WSL2, plus Perl (which I think comes with all modern Linux distributions):

https://github.com/user95f85f/readme-NOW