How to find orphan files on your web site

sadhu · Post by **sadhu** » Fri May 12, 2017 9:57 am

A web site orphan is a file that is never referenced by any other page or file on the site. If none of the active files on your web site call up a particular file (e.g., an image that was left behind after a rename), then that file is an orphan. In HTML, such files would normally be named in href=, src=, or url= statements; PHP might use the include() statement.

The free application Xenu can find bad links, and its GUI implies it can find orphan files, but I've never gotten that function to work. I searched the Internet at length but could not find a way to get a list of orphan files.

Figuring that the names of the orphans I am looking for would not appear in any of the HTML or PHP text files, it occurred to me that grep could do the site-wide search. So I wrote my own little bash script that seems to do the job. If anyone can improve it, I would be grateful.

This 'findorphans' script works only on your local copy of the site, and file/directory names cannot contain spaces.

Code:

#!/bin/bash
# Checks the files in a directory for names that are NOT mentioned in any
# of the other files in the tree below the current directory (the site root).
# Usage: findorphans [ directory containing files to be checked ]
#        With no argument, the current directory is checked.
# Caveat: file and directory names must not contain spaces.
DIR=${1:-.}
echo "These files in $DIR could be orphans:"
# mkdir -p "orphans/$DIR"
for i in $( ls -1 "$DIR" ); do
    # grep -r does the same as rgrep; -F matches the name literally,
    # so a "." in the file name is not treated as a regex wildcard
    grep -r -q -F --exclude-dir={olib,zdb,dev,txp,working} -- "$i" *
    if [ $? -ne 0 ]
    then
        echo "$i"
#       mv "$DIR/$i" "orphans/$DIR/$i"
    fi
done
echo -e "Finished checking $DIR...\n"
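
The no-spaces restriction comes from the for i in $( ls ... ) loop, which splits the listing on whitespace. As one possible improvement, here is a minimal sketch that loops over a quoted glob instead, so a name with spaces stays in one piece. The function name find_orphans is made up for illustration. The sketch checks regular files only (subdirectories are skipped), leaves out the --exclude-dir list for brevity, and assumes a grep that supports -r (GNU and BSD grep both do).

```shell
#!/bin/bash
# Sketch only: find_orphans is a hypothetical name, not part of the script above.
# Run it from the root of the site, like findorphans.
find_orphans() {
    dir=${1:-.}                       # default to the current directory
    for path in "$dir"/*; do
        # skip subdirectories, and the literal "*" left by an empty directory
        [ -f "$path" ] || continue
        name=${path##*/}              # strip the directory part
        # -F: literal match; --: the name may start with a dash
        if ! grep -r -q -F -- "$name" . ; then
            printf '%s\n' "$name"     # no file mentions this name
        fi
    done
}
```

From the root of the site, "find_orphans images" would then print one candidate orphan per line, including names such as "old logo.png".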

Usage:

If there are any directories that are not part of your site proper, or that are accessed only via PHP (as the 'olib' directory in the code above is), put them in the --exclude-dir={...} list. Otherwise remove this parameter entirely.
In a terminal window, change directory to the root of your web site. Then execute <path to>findorphans [directory], e.g., "findorphans css", "findorphans images", and so forth. Running it without a parameter searches the current directory for orphans.
To see where the files in your web site tree are used, comment out the if ... fi block and remove the -q from the grep command. The output can be rather lengthy; you could redirect it to a file for inspection at leisure. Or, if you're only interested in the names of the files that reference them, add -l to the grep line. (rgrep produces the same result as grep -r.)
To automatically remove the orphans from the directory of interest, first create a directory called 'orphans' in the root directory of your site, then uncomment the mkdir and mv lines. The orphan files are then moved out of your main directory tree into a parallel tree under 'orphans', which can then be moved out of the web site entirely.
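
The "where is it used" variation described above can also be kept as a separate helper, so the main script does not have to be edited back and forth. This is only a sketch under the same assumptions as the script (run from the site root, a grep that supports -r); the name where_used is invented here.

```shell
#!/bin/bash
# Sketch only: where_used is a hypothetical helper name.
# For each regular file in the given directory, list the files that mention it.
where_used() {
    dir=${1:-.}                       # default to the current directory
    for path in "$dir"/*; do
        [ -f "$path" ] || continue    # skip subdirectories
        name=${path##*/}
        printf '%s is named in:\n' "$name"
        # -l: print only the names of the files that contain a match
        # -F: treat the name as a literal string, not a regex
        grep -r -l -F -- "$name" . | sed 's/^/    /'
    done
}
```

If you redirect the report to a file, put that file outside the site tree (e.g., "where_used css > /tmp/where-used.txt"). A report written inside the tree contains every file name it lists, so later grep runs would count it as a reference.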

I'm posting this in the hope that it helps others with similar orphan problems. Sooner or later it will be found and indexed by DuckDuckGo, et al.

-Sadhu!