Parsing a HTML file and pulling in specific information

Jator
Level 2
Posts: 80
Joined: Sat Mar 13, 2021 10:58 am

Parsing a HTML file and pulling in specific information

Post by Jator »

Hi,

I'm trying to create a "snippet" in my conky file to pull in certain information. As an example, I'd like to fetch a page structured similarly to this one and pull out certain users' info (my name plus 3-4 other individuals), keeping only certain fields (such as name, blocks today and blocks yesterday).

I know how to download a local copy using wget into a local directory, and I'm working through how to have a "list.txt" file that I can extract into conky using {execi cat ~/.conky/list.txt | sed } commands, but creating the truncated information from the html to the text file is a bit beyond me right now. Can anyone point me to a good example of how this might work? (I'm a tactile learner, so examples are more meaningful for me than concepts. Once I understand it, it's locked in, but that's one of those things I didn't identify about my learning until I was nearly 50 years old.)

TIA.
AndyMH
Level 21
Posts: 13752
Joined: Fri Mar 04, 2016 5:23 pm
Location: Wiltshire

Re: Parsing a HTML file and pulling in specific information

Post by AndyMH »

I'd probably write a python script for that. You might be able to do it with sed or awk, but I always end up asking for help here when I use either of those utilities.

EDIT - maybe a combination of grep and cut? A bit on cut:
https://unix.stackexchange.com/question ... -text-file
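As a rough illustration of the grep and cut idea, here is a minimal sketch. It assumes the page keeps one table cell per line, and the helper names cell_value and host_cell are made up for this example:

```shell
#!/usr/bin/env bash

# Strip the surrounding tags from a one-cell line such as "<TD>89547</TD>".
# Assumption: exactly one cell per line, and no ">" inside the opening tag.
cell_value() {
    cut -d'>' -f2 | cut -d'<' -f1
}

# host_cell FILE HOST N  (hypothetical helper)
# Print the Nth line after the first line holding HOST's link, tags removed.
host_cell() {
    local file=$1 host=$2 offset=$3
    grep -F -m1 -A"$offset" -- ">$host</A>" "$file" | tail -n1 | cell_value
}
```

With a saved copy of the page, something like host_cell stats.html workstation 4 would print the fourth cell after the hostname. The file name stats.html is only a placeholder.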
Thinkcentre M720Q - LM21.3 cinnamon, 4 x T430 - LM21.3 cinnamon, Homebrew desktop i5-8400+GTX1080 Cinnamon 19.0
Jator
Level 2
Posts: 80
Joined: Sat Mar 13, 2021 10:58 am

Re: Parsing a HTML file and pulling in specific information

Post by Jator »

Thanks for the response. I've used php and a little perl, haven't tried python. This is a hobby, so I'm learning the basics all over again. I'll check out the link and see if I can make that work.
Jator
Level 2
Posts: 80
Joined: Sat Mar 13, 2021 10:58 am

Re: Parsing a HTML file and pulling in specific information

Post by Jator »

Here's the actual html I would like to extract info from:

Code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
 <HEAD>
  <LINK REL="stylesheet" TYPE="text/css" HREF="http://192.168.0.2/stats/ppstats.css">
  <TITLE>Team The Old Republic Bovine RC5-72 Statistics</TITLE>
  <META HTTP-EQUIV="refresh" CONTENT="900">
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
  <META NAME="description" CONTENT="Distributed.Net Bovine RC5-72 Personal Proxy Statistics for Team The Old Republic">
  <META NAME="keywords" CONTENT="Distributed.Net, RC5, RC5-64, DES, DES-II, DES-III, OGR, Encryption, Computers, Parallel Processing">
  <META NAME="robots" CONTENT="nofollow">
 </HEAD>
 <BODY>
  <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%">
   <TR>
    <TD VALIGN=TOP>
     <TABLE BORDER=0 CELLPADDING=1 CELLSPACING=0 WIDTH=130>
      <TR><TH ALIGN=LEFT>
       &nbsp;&nbsp;Team The Old Republic
      </TH></TR>
      <TR><TD CLASS=linkbar NOWRAP>
       &nbsp;&nbsp;<A HREF="http://www.usatoday.com">Team Home</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/index.html">Project Stats</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/exec.html">Team Summary</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byemail.html">By Email</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byhost.html">By Host</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byos.html">By OS</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/bycpu.html">By CPU</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byver.html">By Client ver</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byfull.html">By Full Detail</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/bydomain.html">By Domain</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byhour.html">By Hour of Day</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/bydate.html">By Date</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byweek.html">By Week</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/bymonth.html">By Month</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/byyear.html">By Year</A><BR>
       &nbsp;&nbsp;<A HREF="http://192.168.0.2/stats/bydaywk.html">By Day of Week</A><BR>
      </TD></TR>
      <TR><TH ALIGN=LEFT>
       &nbsp;&nbsp;Bovine RC5-72
      </TH></TR>
      <TR><TD CLASS=linkbar NOWRAP>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/">Distributed.Net</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/rc5/">Project Home</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/FAQ/">Distributed FAQs</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/clients.html">Client Download</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/proxies.html">Proxy Download</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/cgi/dnet-finger.cgi">Finger Gateway</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/lists/">Mailing Lists</A><BR>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/tmsummary.php3?team=263">Team Summary</A><BR>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/tmember.php3?team=263">Top 100 Overall</A><BR>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/tmember.php3?team=263&source=y">Top 100 Yesterday</A><BR>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/">Project Stats</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/statistics/">Graphical Stats</A><BR>
       &nbsp;&nbsp;<A HREF="http://www.distributed.net/rc5/proxyinfo.html">Keyservers</A><BR>
      </TD></TR>
      <TR><TH ALIGN=LEFT>
       &nbsp;&nbsp;Teams & Groups
      </TH></TR>
      <TR><TD CLASS=linkbar NOWRAP>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/tlist.php3?low=1&limit=100">Top 100 Overall</A><BR>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/tlist.php3?low=1&limit=100&source=y">Top 100 Yesterday</A>
       &nbsp;&nbsp;<FORM ACTION="http://stats.distributed.net/rc5-64/tsearch.php3" METHOD=GET>
       &nbsp;&nbsp;Search for team:<BR>
       &nbsp;&nbsp;<INPUT TYPE="text" NAME="st" VALUE="" SIZE=6 MAXLENGTH=60>
                   <INPUT TYPE="submit" VALUE="Go!">
                   </FORM>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/newteam1.php3">Register New Team</A><BR>
      </TD></TR>
      <TR><TH ALIGN=LEFT>
       &nbsp;&nbsp;Individuals
      </TH></TR>
      <TR><TD CLASS=linkbar NOWRAP>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/plist.php3?low=1&limit=100">Top 100 Overall</A><BR>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/rc5-64/plist.php3?low=1&limit=100&source=y">Top 100 Yesterday</A>
       &nbsp;&nbsp;<FORM ACTION="http://stats.distributed.net/rc5-64/psearch.php3" METHOD=GET>
       &nbsp;&nbsp;Search for email:<BR>
       &nbsp;&nbsp;<INPUT TYPE="text" NAME="st" VALUE="" SIZE=6 MAXLENGTH=60>
                   <INPUT TYPE="submit" VALUE="Go!">
                   </FORM>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/pedit.php3">Participant Details</A><BR>
       &nbsp;&nbsp;<A HREF="http://stats.distributed.net/pjointeam.php3?team=263">Join This Team</A><BR>
      </TD></TR>
      <TR><TD ALIGN=CENTER NOWRAP>
       Last Updated GMT:<BR>
       22 May 2021 21:45<BR>
       Hit Count: <IMG BORDER=0 SRC="/cgi-bin/Count.cgi?ft=0|tr=1|trgb=000000|srgb=00FF00|prgb=993333|md=8|dd=D|comma=T|df=rc5.dat" ALT="Hit Count" VSPACE=0 HSPACE=0><BR>
      </TD></TR>
     </TABLE>
    </TD>
    <TD VALIGN=TOP WIDTH="100%">
     <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%">
      <TR ALIGN=CENTER>
       <TD CLASS=section>Rank</TD>
       <TD CLASS=section NOWRAP>Hostname</TD>
       <TD CLASS=section>Blocks<BR>Wednesday</TD>
       <TD CLASS=section>Blocks<BR>Thursday</TD>
       <TD CLASS=section>Blocks<BR>Yesterday</TD>
       <TD CLASS=section>Blocks<BR>Today</TD>
       <TD CLASS=section>Blocks<BR>Total</TD>
       <TD CLASS=section>Last<BR>Seen</TD>
       <TD CLASS=section>Percent</TD>
      </TR>
      <TR CLASS=odd ALIGN=RIGHT>
       <TD NOWRAP ALIGN=LEFT>&nbsp;1</TD>
       <TD NOWRAP ALIGN=LEFT><IMG SRC="http://192.168.0.2/stats/icons/flags/unknown.gif" ALIGN=BOTTOM BORDER=0 WIDTH=14 HEIGHT=14 ALT="Unknown">&nbsp;<A HREF="http://192.168.0.2/stats/h1/index.html">workstation</A></TD>
       <TD>97345</TD>
       <TD>96976</TD>
       <TD>97570</TD>
       <TD>89547</TD>
       <TD>521953</TD>
       <TD CLASS=normal NOWRAP>33 secs</TD>
       <TD>51.62%</TD>
      </TR>
      <TR CLASS=even ALIGN=RIGHT>
       <TD NOWRAP ALIGN=LEFT>&nbsp;2</TD>
       <TD NOWRAP ALIGN=LEFT><IMG SRC="http://192.168.0.2/stats/icons/flags/unknown.gif" ALIGN=BOTTOM BORDER=0 WIDTH=14 HEIGHT=14 ALT="Unknown">&nbsp;<A HREF="http://192.168.0.2/stats/h2/index.html">htpc</A></TD>
       <TD>62613</TD>
       <TD>61830</TD>
       <TD>61353</TD>
       <TD>56743</TD>
       <TD>338125</TD>
       <TD CLASS=normal NOWRAP>58 secs</TD>
       <TD>33.44%</TD>
      </TR>
      <TR CLASS=odd ALIGN=RIGHT>
       <TD NOWRAP ALIGN=LEFT>&nbsp;3</TD>
       <TD NOWRAP ALIGN=LEFT><IMG SRC="http://192.168.0.2/stats/icons/flags/unknown.gif" ALIGN=BOTTOM BORDER=0 WIDTH=14 HEIGHT=14 ALT="Unknown">&nbsp;<A HREF="http://192.168.0.2/stats/h3/index.html">home</A></TD>
       <TD>21688</TD>
       <TD>21741</TD>
       <TD>21920</TD>
       <TD>19764</TD>
       <TD>118640</TD>
       <TD CLASS=normal NOWRAP>10 mins</TD>
       <TD>11.73%</TD>
      </TR>
      <TR CLASS=even ALIGN=RIGHT>
       <TD NOWRAP ALIGN=LEFT>&nbsp;4</TD>
       <TD NOWRAP ALIGN=LEFT><IMG SRC="http://192.168.0.2/stats/icons/flags/unknown.gif" ALIGN=BOTTOM BORDER=0 WIDTH=14 HEIGHT=14 ALT="Unknown">&nbsp;<A HREF="http://192.168.0.2/stats/h4/index.html">server</A></TD>
       <TD>3094</TD>
       <TD>3237</TD>
       <TD>3190</TD>
       <TD>2911</TD>
       <TD>20582</TD>
       <TD CLASS=normal NOWRAP>42 mins</TD>
       <TD>2.04%</TD>
      </TR>
      <TR CLASS=odd ALIGN=RIGHT>
       <TD NOWRAP ALIGN=LEFT>&nbsp;5</TD>
       <TD NOWRAP ALIGN=LEFT><IMG SRC="http://192.168.0.2/stats/icons/flags/unknown.gif" ALIGN=BOTTOM BORDER=0 WIDTH=14 HEIGHT=14 ALT="Unknown">&nbsp;<A HREF="http://192.168.0.2/stats/h5/index.html">connorpc</A></TD>
       <TD>1963</TD>
       <TD>2772</TD>
       <TD>2776</TD>
       <TD>2424</TD>
       <TD>11903</TD>
       <TD CLASS=normal NOWRAP>3 mins</TD>
       <TD>1.18%</TD>
      </TR>
      <TR ALIGN=RIGHT>
       <TD CLASS=section>&nbsp;</TD>
       <TD CLASS=section>Blocks Total</TD>
       <TD CLASS=section>186703</TD>
       <TD CLASS=section>186556</TD>
       <TD CLASS=section>186809</TD>
       <TD CLASS=section>171389</TD>
       <TD CLASS=section>1011203</TD>
       <TD CLASS=section>&nbsp;</TD>
       <TD CLASS=section>&nbsp;</TD>
      </TR>
      <TR>
       <TD COLSPAN=9>
        &nbsp;
       </TD>
      </TR>
      <TR ALIGN=CENTER>
       <TD CLASS=note COLSPAN=9>
        Please click on the highest level domain name for detailed host information.
       </TD>
      </TR>
     </TABLE>
    </TD>
   </TR>
  </TABLE>
  <HR SIZE=1>
  <P ALIGN=CENTER>
   <A HREF="http://www.distributed.net/"><IMG BORDER=0 ALT="Distributed Computing" WIDTH=400 HEIGHT=40 SRC="http://www.distributed.net/banners/image.cgi"></A>
  </P>
  <P ALIGN=CENTER>
   Personal Proxy Statistics
   <A HREF="http://usmcug.usm.maine.edu/rc5/files/ppstats-rc5.zip">ppstats-rc5.zip</A> v7.1<BR>
   Copyright (C) <A HREF="mailto:kpesce@netscape.net">Kevin Pesce</A> 1998, 1999<BR>
   University of Southern Maine Computer Users Group
  </P>
 </BODY>
</HTML>
What I am wanting to pull out and dump into the txt file is:

Hostname, BlocksToday, BlocksYesterday
workstation, 89547,97570 <---actual values from the html

There are other hosts I want to collect the same data on (server, home, htpc, connorpc), but I think this gives you the general idea. Ideally I'd like to create a comma-separated table in the txt file to easily pull into the conky script. The positive is that the hosts don't vary, so it will be a static list of machines, and the variables are the "blocks" to be counted periodically.

TIA.
AndyMH
Level 21
Posts: 13752
Joined: Fri Mar 04, 2016 5:23 pm
Location: Wiltshire

Re: Parsing a HTML file and pulling in specific information

Post by AndyMH »

If you google "python parse html" you will get lots of hits. I've done very little python, not my favourite language, but have used beautiful soup in the distant past. So your next hobby - learn python.

I'd suggest installing an IDE. I used IDLE in the past (install from software manager); it's fairly basic but met my needs. There will be more experienced python users here who can offer better alternatives.

I also bought a couple of books:
https://www.amazon.co.uk/gp/product/159 ... UTF8&psc=1
this was a good starter

and then more targeted towards what you want to do:
https://www.amazon.co.uk/gp/product/149 ... UTF8&psc=1

This is not to suggest that python is the best tool for the job, I've looked at perl in the past and didn't like it and haven't done any php.
1000
Level 6
Posts: 1040
Joined: Wed Jul 29, 2020 2:14 am

Re: Parsing a HTML file and pulling in specific information

Post by 1000 »

Code:

TODAY=$(cat html.table | grep -A4 workstation | tail -n1) ; echo "$TODAY"
       <TD>89547</TD>
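The value is in the table. grep -A4 prints the line matching "workstation" plus the four lines after it, and tail -n1 keeps the last of those.
TODAY = the fourth line after the "workstation" line, i.e. the "Blocks Today" cell = 89547 (still wrapped in its <TD> tags).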
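To go from one value to the comma-separated list asked for earlier, the same "count the cells after the hostname" idea can be sketched with awk. This is only a sketch, assuming the layout posted above stays fixed (one <TD> per line, host rows marked CLASS=odd or CLASS=even, cells in the order rank, hostname, two weekdays, yesterday, today). The function name stats_to_csv is made up for this example:

```shell
#!/usr/bin/env bash

# stats_to_csv [FILE]  (hypothetical helper; reads stdin when no file is given)
# Print one "Hostname,BlocksToday,BlocksYesterday" line per host row.
stats_to_csv() {
    awk '
        BEGIN { print "Hostname,BlocksToday,BlocksYesterday" }

        # Host rows are the only rows tagged CLASS=odd or CLASS=even.
        /<TR CLASS=(odd|even)/ {
            inrow = 1
            n = 0
            split("", cell)          # clear cells left over from the previous row
            next
        }

        # End of a host row: cell 2 is the hostname, 6 is today, 5 is yesterday.
        inrow && /<\/TR>/ {
            print cell[2] "," cell[6] "," cell[5]
            inrow = 0
            next
        }

        # One cell per line: drop tags, entities and surrounding blanks.
        inrow && /<TD/ {
            line = $0
            gsub(/<[^>]*>/, "", line)
            gsub(/&nbsp;/, "", line)
            gsub(/^[ \t]+|[ \t]+$/, "", line)
            cell[++n] = line
        }
    ' "$@"
}
```

Something like stats_to_csv html.table > ~/.conky/list.txt would then write the list in one go. If the column order on the page ever changes, the cell numbers need adjusting.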
Termy
Level 12
Posts: 4248
Joined: Mon Sep 04, 2017 8:49 pm
Location: UK

Re: Parsing a HTML file and pulling in specific information

Post by Termy »

I have many shell projects over here which might help give you some examples of data parsing. If you're familiar at all with PERL, there's also this, but this does sound more like a job for just BASH and wget(1)/curl(1).

I talk a lot about shell parsing in a way that is clean and efficient, on my YT channel, a link to which you'll find in my signature; that should help quite a bit, but if you're brand new to shell programming, you might find some of the topics a little much.

The likes of cat(1), uniq(1), wc(1), grep(1), cut(1), sed(1), awk(1), colrm(1), and most of those other parsing tools 'they' want you to think you need to repeat a million times between a million more pipes, are typically* redundant, because BASH itself alone is more than capable of these tasks. IE:

Code:

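# Viewing a file, like how people tend to use cat(1). The -t flag drops
# each trailing newline, so printf doesn't print a blank line after every line:
readarray -t Lines < ~/.bashrc
printf '%s\n' "${Lines[@]}"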

# Viewing a given line in a file, using glob pattern
# matching (MATCH), similar to how people tend to use grep(1):
while IFS= read -r Line; do
	case $Line in
		MATCH)
			# read stored the text in Line; REPLY is only set when no name is given.
			printf '%s\n' "$Line"
			break ;;
	esac
done < ~/.bashrc
# BASH supports REGEX too, if that's needed, but it's slower.

# Counting lines, like how people typically use wc(1):
readarray Lines < ~/.bashrc
printf '%d\n' ${#Lines[@]}

# Showing a given column in a given line, similar to
# grep(1) and cut(1) procedures:
while read -ra Line; do
	if [ "${Line[0]}" == 'Mem:' ]; then
		printf '%s\n' "${Line[2]}"
		break
	fi
done <<< "$(free -h)"

# Using BASH's associative arrays to show an unordered list of unique
# lines, similar to but far better than how people tend to use uniq(1):
declare -A Results
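while IFS= read -r Line; do
	# An empty string isn't a valid associative array key, so skip blank lines.
	[ -n "$Line" ] || continue
	Results[$Line]=1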
done < ~/file.txt
printf '%s\n' "${!Results[@]}"
# The above can be reordered by associating a number to each line.
# I use this method in CSI3:
# https://github.com/terminalforlife/Extra/blob/master/source/csi3/csi3
# uniq(1) only uniques if the duplicate lines are adjacent to each other.
# You can overcome this with AWK, but it's usually a bit overkill.
And so on it goes. Hope that helps.

* Note that I did say "typically", meaning that I'm well aware a pure-BSH and pure-BASH solution isn't always appropriate or efficient; it depends on things like the amount of data you're parsing, the project, and your capabilities.
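Applied to the stats page posted earlier in the thread, the same pure-BASH approach might look like the sketch below. It assumes one <TD> per line and the cell order shown in that HTML (yesterday is the third cell after the hostname, today the fourth). The function name host_blocks is made up for this example:

```shell
#!/usr/bin/env bash

# host_blocks FILE HOST  (hypothetical helper, no external commands)
# Print "HOST,TODAY,YESTERDAY", or return 1 if HOST isn't found.
host_blocks() {
    local File=$1 Host=$2 Line Count=-1 Yesterday=

    while IFS= read -r Line; do
        if (( Count < 0 )); then
            # Not in the wanted row yet; look for the hostname link.
            [[ $Line == *">$Host</A>"* ]] && Count=0
            continue
        fi

        Count=$(( Count + 1 ))   # Nth cell line after the hostname
        Line=${Line#*>}          # drop everything up to the first ">"
        Line=${Line%%<*}         # drop everything from the next "<" onwards

        case $Count in
            3) Yesterday=$Line ;;
            4) printf '%s,%s,%s\n' "$Host" "$Line" "$Yesterday"
               return 0 ;;
        esac
    done < "$File"

    return 1
}
```

Looping over the static host list, for example for Host in workstation htpc home server connorpc; do host_blocks html.table "$Host"; done, would give one line per machine. Each call re-reads the file, which should be fine for a page this small.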
I'm also Terminalforlife on GitHub.