Need to combine 3 files, removing duplicate lines
Forum rules
Topics in this forum are automatically closed 6 months after creation.
Need to combine 3 files, removing duplicate lines
I have been collecting quotes for quite some time, and I now have a file of them that is about 450 pages long. My main desktop computer developed troubles recently, so I was using my laptop for a while, and added a few quotes to the file on the laptop. Then I had to replace the hard drive in the laptop, and then I bought a new desktop computer. Long story short, the end result of all of this is I now have 3 versions of my quotes file. What I need is something that will merge these three files into one, and remove any duplicate lines in the process. The files are in LibreOffice .odt format, but they can easily be converted to plain text files if needed for processing, and I assume that a BASH command, one-liner, or script would easily handle this chore, but I am definitely NOT a BASH wizard. Can somebody help me with this? Thanks.
Last edited by LockBot on Wed Dec 28, 2022 7:16 am, edited 1 time in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
Re: Need to combine 3 files, removing duplicate lines
I found this:
awk '!a[$0]++' input.txt > output.txt
It removes the duplicate lines, but the original file had 2 blank lines between quotes. Of course, this removes those blank lines, and I would like to keep them. I also messed around with
sort and uniq, but I couldn't see a way to get what I want with those, either.
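Incidentally, the first blank line actually survives: to awk, an empty line is just another line, so the first one is kept and every later one is removed as a duplicate. A throwaway demo:

Code: Select all
```shell
printf 'q1\n\nq2\n\nq1\n' | awk '!a[$0]++'
```

This prints q1, a single blank line, and q2; only one blank line survives in total.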
Re: Need to combine 3 files, removing duplicate lines
Could it be as simple as changing each \n to \n\n afterwards, to put the blank lines back?
-=t42=-
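For example, sed can put blank lines back afterwards (a rough sketch, assuming one quote per line after deduplication; sed's G command appends one blank line after each line):

Code: Select all
```shell
printf 'q1\nq2\n' | sed G
```

Use sed 'G;G' instead to get the two blank lines you had before.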
Re: Need to combine 3 files, removing duplicate lines
Meld is a very nice tool (for me it is the best one).
https://www.fossmint.com/best-diff-merg ... for-linux/
But:
1. You need to get the text out of the binary .odt files into plain .txt files. I don't know of a merge tool for .odt files.
2. This works great if the lines are sorted. If not, it can (sometimes) be hard to tidy up.
If Meld does not work well for you, you can sort all the lines in the text files. Before doing that, make copies of the files, because sorting will scramble them.
Example:
Code: Select all
$ cat file
C
A
B
$ cat file | sort > new.file
$ cat new.file
A
B
C
Then you can use Meld again and you will see the real differences. But you cannot simply move lines back, because they are now sorted; you have to find each quote in the original file and move the whole quote by hand, unless you write your own script for sorting the quotes.
Re: Need to combine 3 files, removing duplicate lines
Code: Select all
#!/bin/bash
NUMBER=50000
# Create an empty work file
: > work.file
while IFS= read -r LINE; do
# counter
NUMBER=$((NUMBER+1))
# write the counter value instead of the empty line (each value is unique)
[ -z "$LINE" ] && echo "$NUMBER" >> work.file
# save the line if it is not empty
[ -z "$LINE" ] || echo "$LINE" >> work.file
done < input.txt
# remove duplicate lines, then turn the counter values (50001-59999) back into empty lines
awk '!a[$0]++' work.file | sed 's/^5[0-9][0-9][0-9][0-9]$//' > output.txt
# remove the work file, which is no longer needed
rm work.file
echo "Ready. Check output.txt"
# Open the result in the default text editor
#xdg-open output.txt
# Or open it in your text editor ( for example in xed )
#xed output.txt
Copy this, save it to a file (for example script), and run:
Code: Select all
bash script
My system's default text editor is a web browser, so I added an example with xed, a text editor.
Re: Need to combine 3 files, removing duplicate lines
This might work:
Code: Select all
cat f.1 f.2 f.3 | sort -u -m > f.new
Note that -m merges inputs that are each already sorted; if your files are not sorted, drop the -m and let sort do the full sort. If you don't care about the order of the quotes, just use -u.
Please edit your original post title to include [SOLVED] if/when it is solved!
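A quick demo of the merge with two throwaway files (the names f.1 and f.2 here are just examples):

Code: Select all
```shell
printf 'B\nA\n' > f.1
printf 'A\nC\n' > f.2
sort -u f.1 f.2
```

This prints A, B, C: one sorted copy of every line. Note that sort can read the files directly, so the cat is not strictly needed.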
Your data and OS are backed up....right?
Re: Need to combine 3 files, removing duplicate lines
Thanks for the suggestions. I will try them out over the next few days.
Re: Need to combine 3 files, removing duplicate lines
jimallyn wrote: ⤴Thu Jan 28, 2021 2:27 am .....
What I need is something that will merge these three files into one, and remove any duplicate lines in the process. The files are in LibreOffice .odt format, but they can easily be converted to plain text files if needed for processing, and I assume that a BASH command, one-liner, or script would easily handle this chore, but I am definitely NOT a BASH wizard. Can somebody help me with this? Thanks.
Hi Jimallyn,
A long time ago you were the first to welcome me on this forum. Been down (personally and on the forum) for a while, but back again.
Just saw your question. You were so close with awk ...
Export your .odt files to .txt. It does not matter if that gives you one, two, or any number of .txt files. This awk one-liner should do it:
Of course, change the number of input files to whatever you have.
Code: Select all
awk '/^[ \t]*$/ {next}; {if(!seen[$0]) {seen[$0]=1; printf "%s\n\n", $0}}' input1.txt input2.txt input3.txt
/../ {next} skips all empty (and whitespace-only) lines. You want your aphorisms, nothing else. How to print will follow later.
seen[$0] is an associative array: the line itself is the INDEX, and the value can be 0 or 1. !seen[$0] is true if the line is not already an index in the array (seen[$0] is still undefined). In that case we create a new element seen[] with the line ($0) as index and value 1, to catch later copies of the line, and we print the line followed by 2 newlines (printf "%s\n\n", $0). Change the number of \n's to change the interlineal space, or change "\n\n" to "\n\n-------- separation --------\n\n", whatever...
Oh, and of course you can add > output.txt.
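If you want to see it work first, feed it a toy input (here as a pipe instead of files):

Code: Select all
```shell
printf 'one\n\ntwo\n\none\n' | awk '/^[ \t]*$/ {next}; {if(!seen[$0]) {seen[$0]=1; printf "%s\n\n", $0}}'
```

The duplicate "one" and the empty lines are dropped, and each unique line comes out once, followed by a blank line.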
Kind regards,
Robert
PS: Do note that a minimal change between two lines that are essentially the same (an extra space, an extra comma, whatever) will make the (simple) script see a "unique" line.
PPS: The above assumes that each quote is on ONE line !
EDIT 1: Even better, more elegant and closer to what you already found would be this:
Code: Select all
awk 'BEGIN{ORS="\n\n---------------\n\n"} /^[ \t]*$/ {next}; !seen[$0]++' input1.txt input2.txt input3.txt
Linux is like my late labrador lady-dog: loyal and loving if you treat her lady-like, disbehaving princess if you don't.
Re: Need to combine 3 files, removing duplicate lines
The majority of them take up two or more lines.
Re: Need to combine 3 files, removing duplicate lines
Meld has already been suggested, a gui tool.
One other thing, if working off multiple PCs. I sync important files across my PCs with unison via my NAS. It means whatever I'm working on I have the latest files.
Thinkcentre M720Q - LM21.3 cinnamon, 4 x T430 - LM21.3 cinnamon, Homebrew desktop i5-8400+GTX1080 Cinnamon 19.0
Re: Need to combine 3 files, removing duplicate lines
I have three computers, two desktops and a laptop, and I use mega.nz cloud storage. By using mega I have all of my data in the cloud, and it syncs across all three systems.
Re: Need to combine 3 files, removing duplicate lines
Hi Jimallyn,
Meld might do the trick, but I suspect it would need a lot of manual work.
You said your quotes were separated by 2 blank lines. If that is always true, in all 3 files, then you could test this (will also work with 1 blank line) :
Code: Select all
awk 'BEGIN{RS="\n\n+"; ORS="\n\n\n"} !seen[$0]++' input1.txt input2.txt input3.txt
Awk will simply see a string $0 like "line1\nline2\nline3\n...." .
If a record has not been seen before, it is printed, followed by 3 linefeeds (\n): one after the last line of the quote, then 2 more to produce the 2 blank lines.
This means it does not matter whether a quote consists of 1,2 or more lines, as long as they are all separated by one or more empty lines (AND the quotes themselves do not contain empty lines). An empty line means "just a linefeed". Lines that contain only spaces or tabs are NOT considered empty.
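A toy run (I am assuming GNU awk or mawk here; treating a multi-character RS as a regular expression is not required by POSIX):

Code: Select all
```shell
printf 'line1\nline2\n\nline1\nline2\n\nother' | awk 'BEGIN{RS="\n\n+"; ORS="\n\n\n"} !seen[$0]++'
```

The duplicated two-line quote is printed once, then the single-line quote, each followed by 2 blank lines.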
Hope this works. I tested it on a few small samples, but I do not know the exact layout of your document, or whether its size (and thus the size of the array) might be a problem.
If no luck, could you give me a bit more info about that?
Robert
Re: Need to combine 3 files, removing duplicate lines
I saw this item in a MS help location, maybe it will help.
https://social.technet.microsoft.com/Fo ... d-document
Re: Need to combine 3 files, removing duplicate lines
This will do it, and efficiently enough, unless your files are incredibly large (i.e., tens of thousands of long lines):
Of course, replace the FILE_1 FILE_2 FILE_3 placeholders in the code with the paths that are actually relevant to you. You can use brace expansion to make things easier, if needed. Do note, though, that the order in this particular case will not be maintained, which I'm assuming doesn't matter here.
Code: Select all
declare -A Quotes
for File in FILE_1 FILE_2 FILE_3; do
	while IFS= read -r; do
		# Skip empty lines; an empty subscript is also an error in some BASH versions.
		[[ -n $REPLY ]] && Quotes[$REPLY]=1
	done < "$File"
done
printf '%s\n' "${!Quotes[@]}"
The reason this works is because it uses BASH's associative arrays, which do not allow for duplicate keys. If you're not familiar, an associative array is basically an array with a list of key=value pairs.
So, the above code can just be executed, then its output redirected to a desired file for storage.
The AWK solution is a good one too; probably more appropriate if it's just a quick one-off.
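If associative arrays are new to you, here is a tiny throwaway demo of that key uniqueness (the values are made up; the sort is only there to make the output order predictable, since key order is unspecified):

Code: Select all
```shell
declare -A Seen
for Line in apple banana apple; do
	Seen[$Line]=1
done
printf '%s\n' "${!Seen[@]}" | sort
```

The duplicate apple collapses into a single key, so apple and banana are each printed once.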
I'm also Terminalforlife on GitHub.