rsync with find+awk regex filter instead of --include/--exclude

TI58C
Level 4
Posts: 354
Joined: Tue Jul 18, 2017 5:57 am

Post by TI58C » Sat Aug 03, 2019 9:50 am

Hi all,

Timeshift is based on rsync. That made me curious. Played around with grsync for a while, then tried my hand at rsync itself. Great !!

Only one thing I do not like: if you want to do any filtering that is a little more complicated than just including or excluding a few files or directories, the --include and --exclude options get confusing. First you have to include a lot of subdirectories and then exclude everything else, because otherwise the directory you want never gets visited? Oh dear ...

So I tried find and awk to produce a file list and filter it with regex patterns. To me at least, that is a much more straightforward way of doing things.
I used rsync's --files-from option. Rsync did not entirely play ball: the --delete option (to delete files from the destination that are no longer in the source) does not work with the --files-from option.
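To show what such a list looks like, here is a minimal sketch on a throwaway directory tree. The directory and file names are made up for the demo and are not taken from my real setup; it assumes GNU find (for -printf).

```shell
# Hypothetical demo tree created in a temporary directory.
demo=$(mktemp -d)
mkdir -p "$demo/src/.cache" "$demo/src/docs"
touch "$demo/src/.cache/junk" "$demo/src/docs/my notes.txt" "$demo/src/todo.txt"

# Print files and symlinks relative to the source root (%P strips the
# starting directory), drop everything under .cache, and sort the result.
file_list=$(find "$demo/src" \( -type f -o -type l \) -printf '%P\n' \
  | awk '!/^[.]cache\//' \
  | LC_ALL=C sort)

printf '%s\n' "$file_list"
rm -rf "$demo"
```

Written to a file, a list like this is what gets handed to rsync through --files-from, with the source root as the source argument.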

I solved that and wrote a small script (strictly for use in a terminal). Nothing complicated, but it does the trick. I hope it will be useful to other forum users as well. You will have to change the source and destination directories and the parameters in the script to suit your own needs, but if you can handle regexes and/or awk, it should be simple.
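The idea behind the workaround can be sketched in a few lines: compare the sorted list given to rsync with a sorted listing of the destination, and whatever shows up only in the destination is a candidate for removal. The file names below are invented for the demo, and nothing is actually deleted.

```shell
demo=$(mktemp -d)

# Sorted list of files that should be in the backup (what rsync was given).
printf '%s\n' 'a.txt' 'docs/my notes.txt' \
  | LC_ALL=C sort > "$demo/rsync_list.txt"

# Sorted list of files currently present in the destination.
printf '%s\n' 'a.txt' 'docs/my notes.txt' 'old/removed.txt' \
  | LC_ALL=C sort > "$demo/dest_list.txt"

# comm -13 suppresses lines unique to the first file (column 1) and lines
# common to both (column 3), leaving only lines unique to the second file.
stale=$(LC_ALL=C comm -13 "$demo/rsync_list.txt" "$demo/dest_list.txt")

printf 'would remove: %s\n' "$stale"
rm -rf "$demo"
```

Both lists must be sorted the same way for comm to work, which is why the sketch pins the locale with LC_ALL=C on both sort and comm.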

The script shows its results on screen and writes them to ./backup_script.log (in the directory where the script was started).


Please let me know what you think of it.


EDIT: I had pasted a previous version. The code below was corrected 2019-08-03 17:20 CEST.

Robert / TI58C

Code:

#!/bin/bash

# SCRIPT IS INTENDED TO BE RUN IN TERMINAL (CLI). For clicking in filemanager you'll have to adapt it.

# SOURCE AND DESTINATION DIR MUST NOT CONTAIN SPACES OR JUNK LIKE \N IN THEIR NAMES !!!!!

# Filenames may contain spaces, no problem. What happens with filenames containing \n (newline) ? This seems to be legitimate 
# Awk will split such names, and rsync --ignore-missing-args "should" just skip them.... (not tested)
# find -type f,l will find files, hardlinks and soft-links. See rsync options -x, -l, -H 

# ADAPT THESE DIRECTORIES TO YOUR OWN NEEDS. Mind the "/" at the end!
src=/home/rob/
dest=/media/rob/595eb89f-b110-4a98-9a26-daa0a050992c/home_rolling_backup/

# Any value of actual_run other than 1 will produce a "dry-run", simulation only.
actual_run=1

# Any value of delete_on_dest other than 1 will NOT DELETE files in destination directory
delete_on_dest=1

echo "Script will rsync source directory : ""$src"
echo "with destination directory         : ""$dest"
echo
if [[ "$actual_run" == "1" ]]
then
   echo "Script will do an ACTUAL RUN....Changes will be made to destination directory !!!"
   if [[ "$delete_on_dest" == "1" ]]
   then
      echo "Script will DELETE  files in destination directory that are not in the rsync_list !!!"
   else
      echo "Script will not delete files in destination directory."
   fi
else
   echo "Script will do a dry-run, simulation only"
   if [[ "$delete_on_dest" == "1" ]]
   then
      echo "Script will show files in destination directory that are not in the rsync_list."
   else
      echo "Script will not show files in destination directory that are not in the rsync_list."
   fi
fi
echo
echo "Script will use awk-regexes to filter contents of source directory. Are you sure awk filters are correct ? "
echo
read -p "To continue, type Y or y else type any other key to exit" -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]
then
   exit 1
fi
echo

if [[ "$actual_run" == "1" ]]
then
   date > "${src}000_timestamp_last_backup_update"
fi

# ADAPT THE AWK REGEX FILTERS BELOW TO YOUR OWN NEEDS
find "$src" -type f,l -printf '%P\n' \
| awk ' !/\
^[.]cache.*\
|^[.]googleearth.*unified_cache.*\
|^[.]local.*Trash\
|^[.]local.*gvfs\
|^[.]PlayOnLinux\
|^rsync_list[.]txt\
/ ' | sort > rsync_list.txt

if [[ "$actual_run" == "1" ]]
then
   # ACTUAL RUN
   rsync --itemize-changes --ignore-missing-args -R -t -p -o -g -x -v -l -H -s --files-from=rsync_list.txt "$src" "$dest" | tee ./backup_script.log
else
   # DRY RUN
   rsync --dry-run --itemize-changes --ignore-missing-args -R -t -p -o -g -x -v -l -H -s --files-from=rsync_list.txt "$src" "$dest" | tee ./backup_script.log
fi

# delete_on_dest
if [[ ! "$delete_on_dest" == "1" ]]
then
   rm -f rsync_list.txt
   exit 0
fi

echo | tee -a ./backup_script.log
echo | tee -a ./backup_script.log
echo | tee -a ./backup_script.log
echo "******* LIST OF FILES IN DESTINATION DIRECTORY : $dest THAT ARE NOT IN RSYNC_LIST AND WILL BE REMOVED IN ACTUAL RUN *******" | tee -a ./backup_script.log
echo | tee -a ./backup_script.log

# When using the rsync --files-from option, the rsync --delete option will not work. So $dest will be incremental, over time it will accrue files
# that were deleted from source. Solve this by using find and comm to compare, awk to add destination path and quotes for filenames containing spaces
# and finally xargs to remove all files from $dest that are not in rsync_list.txt

find "$dest" -type f,l -printf '%P\n' | sort > dest_list.txt

if [[ "$actual_run" == "1" ]]
then

   comm -13 rsync_list.txt dest_list.txt | awk -v dst="$dest" '{ print dst $0 }' | xargs -r -t -d '\n' rm -- 2>&1 | tee -a ./backup_script.log
else

   comm -13 rsync_list.txt dest_list.txt | awk -v dst="$dest" '{ print "\"" dst $0 "\""}' | tee -a ./backup_script.log
fi

rm -f rsync_list.txt dest_list.txt

exit
#-------------------------------------------------------------------------------------------------------------------------------------------------------


#                                   SOME PROBABLY UNNECESSARY REMARKS ABOUT AWK PATTERNS AND RSYNC OUTPUT

# NEXT AWK STATEMENT WILL EXCLUDE FILES MATCHING REGEX(ES) FROM LIST OF FILES FOR RSYNC
#| awk ' !/^[.]cache|^[.]googleearth.*[Cc]ache/ '

# NEXT AWK STATEMENT WILL >>> ONLY <<< INCLUDE FILES MATCHING REGEX(ES) IN LIST OF FILES FOR RSYNC, ALL OTHER FILES WILL NOT BE IN BACKUP
#| awk ' /.*[Kk]noppix|.*officejet.*85/ '

# combine patterns like this :
# awk '/foo/ && /bar/'    prints lines that match /foo/ and /bar/, in any order
# awk '/foo/ && !/bar/'   prints lines that match /foo/ but not /bar/
# awk '/foo/ || /bar/'    prints lines that match /foo/ or /bar/ (like grep -e 'foo' -e 'bar')

# or group them with parentheses (a bare /pattern/ is shorthand for $0~/pattern/, so the explicit $0~ form is optional but makes the intent obvious):
# awk '($0~/pattern/ && $0~/pattern1/) || ($0!~/pattern2/ && $0~/pattern3/)'

# Credits:  http://www.grymoire.com/Unix/Awk.html, especially http://www.grymoire.com/Unix/Awk.html#uh-53 
# and https://catonmat.net/ten-awk-tips-tricks-and-pitfalls along with several others over the years.

# Nice explanation of the rsync itemized changes list here : https://stackoverflow.com/questions/4493525/rsync-what-means-the-f-on-rsync-logs

# 20190803 Robert / TI58C
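
For anyone who wants to try the pattern combinations from the remarks above before touching a real backup, here is a small sketch on four made-up input lines. It should behave the same in gawk and mawk as far as I know. The last line illustrates why a | inside a bracket expression is not an "or": it is just one more literal character in the set.

```shell
# Four sample lines to filter; purely illustrative.
lines=$(printf '%s\n' 'foo bar' 'foo' 'bar' 'baz')

# Lines matching both patterns.
both=$(printf '%s\n' "$lines" | awk '/foo/ && /bar/')

# Lines matching the first pattern but not the second.
only_foo=$(printf '%s\n' "$lines" | awk '/foo/ && !/bar/')

# Parenthesised groups combined with ||.
grouped=$(printf '%s\n' "$lines" | awk '(/foo/ && !/bar/) || /baz/')

# Inside [...] the | is a literal character, so [C|c] also matches "|".
pipe_match=$(printf '%s\n' '|ache' | awk '/^[C|c]ache$/')

printf 'both: %s\nonly_foo: %s\npipe_match: %s\n' "$both" "$only_foo" "$pipe_match"
printf 'grouped:\n%s\n' "$grouped"
```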
Linux is like my late labrador lady-dog: loyal and loving if you treat her lady-like, disbehaving princess if you don't.


Re: rsync with find+awk regex filter instead of --include/--exclude

Post by TI58C » Sun Aug 04, 2019 4:03 am

Found a MUCH better way of achieving the same result here: https://stackoverflow.com/a/15383897

Results in this:

Code:

#!/bin/bash

# SCRIPT IS INTENDED TO BE RUN IN TERMINAL (CLI). For clicking in filemanager you'll have to adapt it.
# SOURCE AND DESTINATION DIR MUST NOT CONTAIN SPACES OR JUNK LIKE \N IN THEIR NAMES !!!!!
# Filenames may contain spaces, no problem. What happens with filenames containing \n (newline) ? This seems to be legitimate 
# Awk will split such names, and rsync --ignore-missing-args "should" just skip them.... (not tested)
# find -type f,l will find files, hardlinks and soft-links. See rsync options -x, -l, -H 

# ADAPT THESE DIRECTORIES TO YOUR OWN NEEDS. Mind the "/" at the end!
src=/home/rob/
dest=/media/rob/595eb89f-b110-4a98-9a26-daa0a050992c/home_rolling_backup/
temp=/media/rob/595eb89f-b110-4a98-9a26-daa0a050992c/home_temp/

# ADAPT THE AWK REGEX FILTERS BELOW TO YOUR OWN NEEDS
find "$src" -type f,l -printf '%P\n' \
| awk ' !/\
^[.]cache\
|^[.]googleearth.*unified_cache.*\
|^[.]local.*Trash\
|^[.]local.*gvfs\
|^[.]PlayOnLinux\
|^vaults\
|^rsync_list[.]txt\
/ ' | sort > rsync_list.txt

mkdir -p "$dest"
# ${temp:?} aborts the script instead of running rm -rf on an empty path
rm -rf -- "${temp:?}"

echo "This may take a while..."

rsync -RtpogxvlHs --itemize-changes --ignore-missing-args --log-file=./rsync_log1 --files-from=rsync_list.txt --link-dest="$dest" "$src" "$temp"
rsync -ra --delete --itemize-changes --ignore-missing-args --log-file=./rsync_log2 --link-dest="$temp" "${temp%/}/" "$dest"

printf "\n\n\n-------------------------------------------------------------------------------\n\n\n" >> ./rsync_log1
cat ./rsync_log1 ./rsync_log2 > rsync_log
rm -f ./rsync_log1 ./rsync_log2
xdg-open rsync_log
exit


#  Credits: https://stackoverflow.com/a/15383897
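
To see the two-pass idea in isolation, here is a stripped-down sketch that runs entirely inside a temporary directory. The function name two_pass_sync and all file names are made up for the demo, and it uses plain -a instead of the full option set from my script. The first pass copies only the listed files into a temp directory (hard-linking whatever is unchanged in the destination), and the second pass mirrors the temp directory onto the destination, where --delete works because that pass does not use --files-from.

```shell
# two_pass_sync SRC DEST TEMP LIST  (hypothetical helper for this demo)
two_pass_sync() {
    local src=$1 dest=$2 temp=$3 list=$4
    mkdir -p "$dest"
    rm -rf -- "${temp:?}"      # start from an empty temp directory
    mkdir -p "$temp"
    # Pass 1: only the listed files, hard-linked from DEST where unchanged.
    rsync -a --files-from="$list" --link-dest="$dest" "$src" "$temp" &&
    # Pass 2: mirror TEMP onto DEST and delete anything not in TEMP.
    rsync -a --delete --link-dest="$temp" "${temp%/}/" "$dest"
}

demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/dest"
echo keep  > "$demo/src/keep.txt"
echo new   > "$demo/src/new.txt"
echo cache > "$demo/src/skip.cache"     # filtered out: not in the list
echo stale > "$demo/dest/stale.txt"     # no longer present in the source
printf '%s\n' keep.txt new.txt > "$demo/list.txt"

if command -v rsync >/dev/null 2>&1; then
    two_pass_sync "$demo/src/" "$demo/dest/" "$demo/temp/" "$demo/list.txt"
    result=$(ls "$demo/dest" | LC_ALL=C sort)
else
    result='rsync not installed'
fi
printf '%s\n' "$result"
rm -rf "$demo"
```

The temp directory and the destination should sit on the same filesystem, otherwise the hard links cannot be made and the second pass has to copy the data.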
