OK, so I've been looking for a few days, but I'm not skilled enough to find what I need or I don't understand what I'm reading. For someone who knows, I imagine it wouldn't be hard at all.
What I have is an M3U text file that contains a few thousand lines. I want to extract two pieces of information from all relevant lines.
I have attached a bit of the text file.
I need to extract tvg-name="whatever the name is" and group-title="whatever the title is", effectively removing everything else. When finished, I would like to be left with just the tvg-name in column A and the corresponding group-title in column B. I would like this all saved in a format that can be manipulated in a spreadsheet program like Google Sheets or Excel.
I'm going to keep searching for some clue as to how to do this, but I know you all are a lot more skilled at this than I could ever be.
Thanks,
[SOLVED] Extract text from a file.
Attachments
M3U.txt (2.42 KiB)
Last edited by LockBot on Wed Dec 28, 2022 7:16 am, edited 2 times in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.
Re: Extract text from a file.
I often have to do things like that when extracting data from text files. Thanks for including an example file. This should work:
grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/'
That puts the tvg-name value in the first column and the group-title value in the second column, separated by a tab character.
For readability on the terminal add
| column -s $'\t' -t
at the end of the command, which neatly puts it in text columns:
Code: Select all
ABC (KOAT) Albuquerque NM REGIONAL LOCALS
ABC (KVII) Amarillo TX REGIONAL LOCALS
ABC (WOI) Ames IA REGIONAL LOCALS
ABC (KAEF) Arcata CA REGIONAL LOCALS
ABC (WLOS) Asheville NC REGIONAL LOCALS
ABC (WSB) Atlanta GA REGIONAL LOCALS
ABC (KAAL) Austin Minnesota REGIONAL LOCALS
ABC (KERO) Bakersfield CA REGIONAL LOCALS
ABC (KBMT) Beaumont TX REGIONAL LOCALS
ABC (WLOX) Biloxi MS REGIONAL LOCALS
ABC (WMAR) Blatimore MD REGIONAL LOCALS
So anyway, what the command I gave does:
grep 'tvg-name=' M3U.txt
this gets all the lines from the file that contain tvg-name=, discarding the other lines
|
pipes the output into the next command
sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/'
this searches for tvg-name="<something>" on the line and group-title="<something>" some place after it, skipping over everything before, between and after those, and then replaces the line with just the bits we want: the 1st and 2nd <something> that we captured.
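To watch the two stages separately, here is a small sketch that runs the same pipeline on a made-up sample. The sample lines and the example.invalid URLs are assumptions about what an M3U file of this kind looks like (the attachment is not reproduced here), and GNU sed is assumed for -r and for \t in the replacement.

```shell
# Hypothetical sample; the real M3U.txt may carry different attributes.
sample='#EXTM3U
#EXTINF:-1 tvg-id="" tvg-name="ABC (KOAT) Albuquerque NM" tvg-logo="" group-title="REGIONAL LOCALS",ABC (KOAT) Albuquerque NM
http://example.invalid/stream/1
#EXTINF:-1 tvg-id="" tvg-name="ABC (KVII) Amarillo TX" tvg-logo="" group-title="REGIONAL LOCALS",ABC (KVII) Amarillo TX
http://example.invalid/stream/2'

# Stage 1: keep only the lines that contain tvg-name=
kept=$(printf '%s\n' "$sample" | grep 'tvg-name=')

# Stage 2: reduce each kept line to "name<TAB>group"
result=$(printf '%s\n' "$kept" | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/\1\t\2/')

# Show how many lines survived grep, then the extracted columns.
printf '%s\n' "$kept" | wc -l
printf '%s\n' "$result"
```

The header line and the URL lines never reach sed, which is why the grep stage matters: sed would otherwise pass non-matching lines through unchanged.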
Re: Extract text from a file.
So as to output directly to CSV, i.e., importable into a spreadsheet, I'd slightly tweak that to be
Code: Select all
grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name="([^"]*)".*group-title="([^"]*)".*/"\1","\2"/'
Re: Extract text from a file.
Spreadsheet programs usually can also handle TSV files. For CSV, a slightly nicer tweak is to just move the brackets one character out. That captures the quotes.
grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name=("[^"]*").*group-title=("[^"]*").*/\1,\2/'
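To end up with a file that a spreadsheet can import, the output can be redirected with >. The following is only a sketch: channels.csv is a hypothetical file name, the header row is an optional extra that nobody in the thread asked for, and the tiny M3U.txt created here is a stand-in so the example can run on its own in a scratch directory.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical one-entry stand-in for the real M3U.txt.
cat > M3U.txt <<'EOF'
#EXTM3U
#EXTINF:-1 tvg-name="ABC (WSB) Atlanta GA" group-title="REGIONAL LOCALS",ABC (WSB) Atlanta GA
http://example.invalid/stream/1
EOF

# Header row first, then one quoted CSV row per channel, all into one file.
{
  echo '"tvg-name","group-title"'
  grep 'tvg-name=' M3U.txt | sed -r 's/.*tvg-name=("[^"]*").*group-title=("[^"]*").*/\1,\2/'
} > channels.csv

cat channels.csv
```

Because the captured values are matched with [^"]*, they cannot contain a double quote themselves, so the quoted fields should not need any further escaping. Commas inside a name are fine since every field is quoted.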
Re: Extract text from a file.
You all are awesome.
I've tested the original script and added a bit to output a new file. It imports into Google Sheets pretty easily.
"sed is using regular expressions. If you've not used regular expressions before it can look like mumbo jumbo but it's quite handy for text processing to know a bit. I can explain the regular expression in the sed command further if you want."
I'm always wanting to learn so if you have time to explain all the bits, I'll try to understand.
After looking at the file, I see that the tvg-name is actually also found after the group-title, preceded by a comma. I'm going to try to figure out how to extract and separate it that way just to see if I can.
Thank you all again.
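For anyone trying the same exercise, one possible way to take the name from after the comma is sketched below. It assumes the line layout is attributes, then a comma, then the display name, as in the made-up sample line; the real file may differ, and GNU sed is assumed again.

```shell
# Hypothetical sample line: attributes, then a comma, then the display name.
line='#EXTINF:-1 tvg-id="" tvg-name="ABC (WOI) Ames IA" group-title="REGIONAL LOCALS",ABC (WOI) Ames IA'

# Capture group-title first (\1), skip to the first comma after its closing
# quote, capture the rest of the line as the name (\2), then print name first.
result=$(printf '%s\n' "$line" | sed -r 's/.*group-title="([^"]*)"[^,]*,(.*)/\2\t\1/')

printf '%s\n' "$result"
```

Note that the replacement swaps the order (\2 before \1) so the name still lands in the first column. If another attribute with a comma in its value follows group-title, this pattern would split in the wrong place, so it is worth checking a few lines of output by eye.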
Re: Extract text from a file.
Wow, that's actually a thing... https://en.wikipedia.org/wiki/Tab-separated_values
Re: [SOLVED] Extract text from a file.
The sed command is in the form s/regular-expression/replacement/. We can split it like this for readability:
s
/
.*tvg-name="([^"]*)".*group-title="([^"]*)".*
/
\1\t\2
/

The
.*tvg-name="([^"]*)".*group-title="([^"]*)".*
part is the regular expression. There are a few special characters in this:
.
dot matches any single character. It is a wildcard.
*
star says "match zero or more of the preceding character". It is a repeat.
( )
brackets are a capture group. What is matched between them is captured and can be referred back to in the replacement. In the replacement above \1 refers back to the first capture group, \2 to the second.
[ ]
square brackets are a character set. It matches a single character from the set of characters between the brackets. If the first character is a ^ it inverts the character set: it matches any single character except those from the character set.

.*
this matches everything from the start of the line until tvg-name= is found, effectively skipping over it because we don't care about that part
tvg-name="
this literally matches that text
([^"]*)
this captures the value of tvg-name, using a character set where we want to find any character that is not a "
"
this is the closing " of the tvg-name value
.*
this skips over the next part until group-title is found
group-title="
matches that text
([^"]*)
this captures the value of group-title, using a character set where we want to find any character that is not a "
"
this is the closing " of the group-title value
.*
this skips over the rest of the line
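As a side note on why the capture uses [^"]* rather than .*, here is a small comparison on a made-up line (the attributes shown are assumptions, and GNU sed is assumed for -r). Since * matches as much as it can, .* inside the quotes would run on to the last " on the line instead of stopping at the closing quote of the value.

```shell
# Hypothetical line with several quoted attributes.
line='tvg-name="ABC (KOAT) Albuquerque NM" tvg-logo="" group-title="REGIONAL LOCALS"'

# [^"]* cannot cross a quote, so the capture ends at the value's closing quote.
precise=$(printf '%s\n' "$line" | sed -r 's/.*tvg-name="([^"]*)".*/\1/')

# .* is greedy: the capture runs on to the last quote on the line.
greedy=$(printf '%s\n' "$line" | sed -r 's/.*tvg-name="(.*)".*/\1/')

printf 'precise: %s\n' "$precise"
printf 'greedy:  %s\n' "$greedy"
```

The first result is just the name; the second drags the following attributes along with it, which is the mistake the character set avoids.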
\1\t\2
the \1 refers back to the tvg-name value we captured, \t stands for the tab character, \2 refers back to the group-title value we captured.
Re: [SOLVED] Extract text from a file.
Thank you Xenpeek. I'm going to refer back to this and try to do some other things.
I appreciate that you take the time to help all of us who are not as gifted as you when it comes to things like this.
Re: [SOLVED] Extract text from a file.
This thread inspired me to create a GitHub repository to help you (and me) become more skilled with bash/terminal/linux/mint/debian-based-linux, bash/terminal/cygwin, bash/terminal/wsl2 + perl (which comes with, I think, all modern Linux distributions):
https://github.com/user95f85f/readme-NOW