Bulk rewriting relative links in HTML files?

somelurker
Level 4
Posts: 206
Joined: Fri Jul 01, 2016 6:10 pm

Bulk rewriting relative links in HTML files?

Post by somelurker »

I have a friend who saves webpages as HTML rather than bookmarking them. I'm trying to help her back up the contents of her drive, and much of it consists of these saved webpages. Unfortunately, there are pipe symbols (|) in the titles of many of those webpages. Mint accepts such symbols in ext4 partitions, but when I try to move the files into an NTFS partition of an external drive (for compatibility with Windows computers), the file system balks. I tried renaming all the files and folders with pipe symbols to something else, but doing so breaks the relative links in the HTML file that are supposed to point to local resources. Suppose a webpage is titled "3 best ways to raise a dog | cat.html". The folder would then presumably be named "3 best ways to raise a dog | cat". Replacing all the | characters with, say, "PIPE" breaks links to any images stored within the folder.

Is there an easy solution to this problem that uses a file system Windows can read? I couldn't find any scripts that can rewrite all relative links to the image folder in the same directory as an html file. Right now, the best I can do is to either store all the HTML files with broken image folders or skip the ones with pipe symbols in their name. Neither option is ideal. I can't do the renaming by hand because there are hundreds of such files with pipe symbols, if not thousands.
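
For the renaming half of the problem, a minimal bash sketch could look like the one below. It assumes GNU find and mv (as shipped with Mint). The function name rename_pipes and the default replacement "__" are made up for illustration; "PIPE" would work just as well. It only renames files and folders and does not touch the links inside the HTML files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename every file and directory below ROOT whose name
# contains "|", replacing each "|" with REPLACEMENT (default "__").
rename_pipes() {
    local root=${1:-.} replacement=${2:-__}
    local path dir base
    # -depth lists a directory's contents before the directory itself, so
    # paths read later in the loop stay valid after earlier renames.
    find "$root" -depth -name '*|*' -print0 |
    while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        # -n: do not overwrite an existing file that already has the new name
        mv -n -- "$path" "$dir/${base//\|/$replacement}"
    done
}
```

Calling rename_pipes /path/to/saved/pages would then rename everything under that folder. It is worth trying on a copy first.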
Last edited by LockBot on Fri Feb 17, 2023 11:00 pm, edited 1 time in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.
t42
Level 11
Posts: 3744
Joined: Mon Jan 20, 2014 6:48 pm

Re: Bulk rewriting relative links in HTML files?

Post by t42 »

To replace every | with __ (two underscores) inside all files in the current directory (this edits the file contents, not the file names):

Code:

sed -i 's/|/__/g' *
Why __? It is distinctive, so the replacements are easy to keep track of.
-=t42=-
somelurker
Level 4
Posts: 206
Joined: Fri Jul 01, 2016 6:10 pm

Re: Bulk rewriting relative links in HTML files?

Post by somelurker »

Your command would change all pipe symbols in the html files to __, right?

It looks like a great workaround, but should I be concerned about pipe characters in the files that are not part of a directory name? I can immediately tell if a pipe is part of a path if I eyeball it, but wouldn't I need some kind of HTML parser to identify all the links? The logic would go something like this:

1. Find all links in the HTML document
2. If the link is relative, change all the pipe symbols in the link to __
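
The two steps above can be roughly approximated without a real HTML parser. The sketch below is a regex shortcut, not a parser. It assumes GNU sed, double-quoted attribute values, and that only src and href attributes matter. It treats a link as relative when it does not start with "/" and has no ":" before the pipe. It also replaces %7C, on the assumption that the browser may have percent-encoded the pipe when saving the page. The function name rewrite_relative_links is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: in each given HTML file, replace "|" (or its
# percent-encoded form %7C) with "__", but only inside src="..." and
# href="..." values that look relative.
rewrite_relative_links() {
    local file
    for file in "$@"; do
        # The value must not start with "/" and must not contain ":" before
        # the pipe, which skips http:, https:, data:, mailto: and similar.
        # "ta" jumps back to ":a" after each successful substitution, so the
        # loop ends once no pipe is left in such an attribute.
        # The I flag makes the match case-insensitive (SRC=, %7c).
        sed -i -E \
            -e ':a' \
            -e 's/((src|href)="([^":/][^":]*)?)(\||%7C)/\1__/I' \
            -e 'ta' \
            -- "$file"
    done
}
```

Pipes in ordinary page text and in absolute URLs are left alone, so this is narrower than a blanket sed over the whole file. The folders and files would still have to be renamed with the same __ replacement for the links to resolve.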
somelurker
Level 4
Posts: 206
Joined: Fri Jul 01, 2016 6:10 pm

Re: Bulk rewriting relative links in HTML files?

Post by somelurker »

It looks like nobody has any easy solutions to this one yet. In case someone comes across this post in the future, I couldn't solve the harder problem of rewriting the links in HTML files to point to a different directory, and I couldn't find scripts that do what I want. I did manage to work around the problem; I zipped up all the files without any problems. I think the real problem will occur when trying to extract those zipped files into a drive with a format that doesn't support pipe symbols. But that's not a huge problem because the zip file only needs to be extracted into a file system that supports such file names. It's not a perfect solution, but it's the best I was able to come up with. Zipping at least spares me from having to skip any files.
weedeater64
Level 1
Posts: 44
Joined: Mon Jun 08, 2015 6:23 pm

Re: Bulk rewriting relative links in HTML files?

Post by weedeater64 »

I have no simple or easy way to do this, but I do have some suggestions.

Look at the html-xml-utils package. It is a suite of utilities for working with HTML and XML.

Also look at the lynx browser and its options for parsing links.

One thing that might be easy-ish: use pandoc to convert all the HTML files into markdown, then run markdown on them and see if it gets rid of the pipe symbols. This might be easy or a huge mess, IDK. Worth a try though, as it would be easy to try.

If you export Firefox's bookmarks, the file it gives you is the worst pile of crap HTML you've ever seen. I used the above method to clean it up into something way more readable and manageable.

If it works, just note the newly generated links and change the directory and file names accordingly.

Sounds like a nightmare.