Need to combine 3 files, removing duplicate lines

About writing shell scripts and making the most of your shell
Forum rules
Topics in this forum are automatically closed 6 months after creation.
Locked
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Need to combine 3 files, removing duplicate lines

Post by jimallyn »

I have been collecting quotes for quite some time, and I now have a file of them that is about 450 pages long. My main desktop computer developed troubles recently, so I was using my laptop for a while and added a few quotes to the file on the laptop. Then I had to replace the hard drive in the laptop, and then I bought a new desktop computer. Long story short, I now have 3 versions of my quotes file. What I need is something that will merge these three files into one and remove any duplicate lines in the process. The files are in LibreOffice .odt format, but they can easily be converted to plain text files if needed for processing. I assume that a BASH command, one-liner, or script would easily handle this chore, but I am definitely NOT a BASH wizard. Can somebody help me with this? Thanks.
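
For the .odt-to-plain-text step, LibreOffice itself can do the conversion from the command line. A minimal sketch, assuming the three files are saved as quotes1.odt, quotes2.odt and quotes3.odt (placeholder names):

Code: Select all

# convert each .odt to a plain .txt file in the current directory
# (adjust the placeholder file names to the real ones)
libreoffice --headless --convert-to txt quotes1.odt quotes2.odt quotes3.odt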
Last edited by LockBot on Wed Dec 28, 2022 7:16 am, edited 1 time in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Re: Need to combine 3 files, removing duplicate lines

Post by jimallyn »

I found this:

awk '!a[$0]++' input.txt > output.txt

It removes the duplicate lines, but the original file had 2 blank lines between quotes. Of course, this removes those blank lines, and I would like to keep them. I also messed around with sort and uniq but couldn't see a way to get what I want with those, either.
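
If the goal is to keep the blank lines while still de-duplicating the quote lines, one possible tweak of that one-liner (a sketch, not tested against your actual file) is to let empty lines through unconditionally:

Code: Select all

# print blank lines as they are; de-duplicate every non-blank line
awk 'NF==0 || !seen[$0]++' input.txt > output.txt

Note that where a duplicate quote is dropped, its surrounding blank lines are still printed, so a few extra blank runs may remain to tidy up.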
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
t42
Level 11
Posts: 3742
Joined: Mon Jan 20, 2014 6:48 pm

Re: Need to combine 3 files, removing duplicate lines

Post by t42 »

Can it be as simple as changing each \n to \n\n afterwards?
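
If that idea is taken literally, sed can re-insert blank lines after the de-duplicated output has been produced. A rough sketch (assuming output.txt from the one-liner above):

Code: Select all

# G appends the empty hold space plus a newline after each line,
# so applying it twice leaves two blank lines between quotes
sed 'G;G' output.txt > spaced.txt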
-=t42=-
1000
Level 6
Posts: 1039
Joined: Wed Jul 29, 2020 2:14 am

Re: Need to combine 3 files, removing duplicate lines

Post by 1000 »

Meld is a very nice tool (for me it is the best tool).
https://www.fossmint.com/best-diff-merg ... for-linux/

But:
1. You need to move the text from the .odt binary files to .txt (plain text files).
I don't know of a merge tool for .odt files.

2. This works great if the lines are sorted.
If not, it can sometimes be hard to tidy up.

If Meld does not work very well, you can sort all the lines in the text files.
Before doing this, make copies of the files, because the sorting will change them.

Example

Code: Select all

$ cat file
C
A
B

$ sort file > new.file

$ cat new.file
A
B
C
Then you can use Meld again and see the real differences.
But you cannot just move lines across, because they are sorted; you need to find each one in the original file and move the whole quote by hand.

Unless you write your own script for sorting the quotes.
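
To follow that suggestion end to end, one possible sequence (a sketch; file1.txt, file2.txt and file3.txt stand in for the exported text files) would be:

Code: Select all

# sort a copy of each exported file, then compare them three-way in Meld
sort file1.txt > sorted1.txt
sort file2.txt > sorted2.txt
sort file3.txt > sorted3.txt
meld sorted1.txt sorted2.txt sorted3.txt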
1000
Level 6
Posts: 1039
Joined: Wed Jul 29, 2020 2:14 am

Re: Need to combine 3 files, removing duplicate lines

Post by 1000 »

Code: Select all

#!/bin/bash

# Counter used as a unique placeholder for every blank line,
# so the de-duplication step does not collapse them
NUMBER=50000

# Start with an empty work file
> work.file

while IFS= read -r LINE; do
    # increment the counter
    NUMBER=$((NUMBER + 1))
    # write the counter value in place of an empty line
    [ -z "$LINE" ] && echo "$NUMBER" >> work.file
    # save the line itself if it is not empty
    [ -z "$LINE" ] || echo "$LINE" >> work.file
done < input.txt

# remove duplicate lines, then turn the placeholder numbers (50001-59999) back into blank lines
awk '!a[$0]++' work.file | sed 's/^5[0-9][0-9][0-9][0-9]$//' > output.txt

# remove the work file, it is no longer needed
rm work.file

echo "Ready. Check output.txt"

# Open the file in the default text editor
#xdg-open output.txt

# Or open it in a specific text editor (for example xed)
#xed output.txt
You can test it. (I used the file name input.txt from your example above.)
Copy it, save it to a file, and run:

Code: Select all

bash script
If you want the file to open automatically, uncomment the appropriate line or edit it.
My system's default text editor is the web browser, so I added an example with xed, a text editor.
Flemur
Level 20
Posts: 10096
Joined: Mon Aug 20, 2012 9:41 pm
Location: Potemkin Village

Re: Need to combine 3 files, removing duplicate lines

Post by Flemur »

jimallyn wrote: Thu Jan 28, 2021 2:27 am Long story short, the end result of all of this is I now have 3 versions of my quotes file. What I need is something that will merge these three files into one, and remove any duplicate lines in the process.
This might work:
cat f.1 f.2 f.3 | sort -u -m > f.new
...or replace -m with -c
If you don't care about the order of the quotes, just use -u.
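For inputs that are not already sorted, a simpler form of the same idea (just a sketch) is to let sort read all three files directly and de-duplicate in one pass:

Code: Select all

# sorts the combined contents of all three files and drops duplicate lines
sort -u f.1 f.2 f.3 > f.new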
Please edit your original post title to include [SOLVED] if/when it is solved!
Your data and OS are backed up....right?
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Re: Need to combine 3 files, removing duplicate lines

Post by jimallyn »

Thanks for the suggestions. I will try them out over the next few days.
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
TI58C
Level 4
Posts: 389
Joined: Tue Jul 18, 2017 5:57 am

Re: Need to combine 3 files, removing duplicate lines

Post by TI58C »

jimallyn wrote: Thu Jan 28, 2021 2:27 am .....
What I need is something that will merge these three files into one, and remove any duplicate lines in the process. The files are in LibreOffice .odt format, but they can easily be converted to plain text files if needed for processing, and I assume that a BASH command, one-liner, or script would easily handle this chore, but I am definitely NOT a BASH wizard. Can somebody help me with this? Thanks.


Hi Jimallyn,

A long time ago you were the first to welcome me on this forum. Been down (personally and on the forum) for a while, but back again.
Just saw your question. You were so close with awk ...

Export your .odt files to .txt. It does not matter whether that gives you one, two, or any number of .txt files. This awk one-liner should do it (of course, change the number of input files to whatever you need):

Code: Select all

awk '/^[ \t]*$/ {next}; {if(!seen[$0]) {seen[$0]=1 ; printf "%s\n\n", $0}}' input1.txt input2.txt input3.txt

/.../{next} skips all empty lines (including lines of only spaces or tabs). You want your aphorisms, nothing else; how they get printed follows below.
seen[$0] is an associative array: the line is the INDEX and the value can be 0 or 1. !seen[$0] is true if the line is not already an index in the array (seen[$0] is still undefined). In that case we create a new element seen[] with the line ($0) as index and value 1, to catch later copies of the line, and we print the line followed by 2 newlines (\n). Change the number of \n's to change the spacing between quotes, or change "\n\n" to "\n\n-------- separation --------\n\n", whatever you like.

Oh, and of course you can add > output.txt.

Kind regards,
Robert


PS: Do note that a minimal change between two lines that are essentially the same (extra space, extra comma, whatever) will mean the (simple) script sees a "unique" line.

PPS: The above assumes that each quote is on ONE line !




EDIT 1: Even better, more elegant and closer to what you already found would be this:

Code: Select all

awk '{ORS="\n\n---------------\n\n"} ; /^[ \t]*$/ {next}; !seen[$0]++' input1.txt input2.txt input3.txt
ORS = Output Record Separator
Linux is like my late labrador lady-dog: loyal and loving if you treat her lady-like, disbehaving princess if you don't.
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Re: Need to combine 3 files, removing duplicate lines

Post by jimallyn »

TI58C wrote: Fri May 07, 2021 10:46 am PPS: The above assumes that each quote is on ONE line !
The majority of them take up two or more lines.
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
AndyMH
Level 21
Posts: 13739
Joined: Fri Mar 04, 2016 5:23 pm
Location: Wiltshire

Re: Need to combine 3 files, removing duplicate lines

Post by AndyMH »

Meld, a GUI tool, has already been suggested.

One other thing, if you are working off multiple PCs: I sync important files across my PCs with unison via my NAS. It means that whatever I'm working on, I have the latest files.
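
A minimal sketch of what such a sync command can look like (the local path and NAS address here are made up; unison also supports stored profiles):

Code: Select all

# two-way sync of a local folder with the same folder on a NAS over SSH
unison ~/Documents/quotes ssh://mynas//srv/sync/quotes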
Thinkcentre M720Q - LM21.3 cinnamon, 4 x T430 - LM21.3 cinnamon, Homebrew desktop i5-8400+GTX1080 Cinnamon 19.0
GELvdH
Level 5
Posts: 979
Joined: Tue Jan 08, 2019 10:10 am
Location: 3rd rock from Sun

Re: Need to combine 3 files, removing duplicate lines

Post by GELvdH »

I have three computers, two desktops and a laptop, and I use mega.nz cloud storage. By using MEGA I have all of my data in the cloud, and it syncs across all three systems.
TI58C
Level 4
Posts: 389
Joined: Tue Jul 18, 2017 5:57 am

Re: Need to combine 3 files, removing duplicate lines

Post by TI58C »

jimallyn wrote: Sun May 09, 2021 1:32 am
TI58C wrote: Fri May 07, 2021 10:46 am PPS: The above assumes that each quote is on ONE line !
The majority of them take up two or more lines.
Hi Jimallyn,

Meld might do the trick, but I suspect it would need a lot of manual work.
You said your quotes were separated by 2 blank lines. If that is always true, in all 3 files, then you could test this (it will also work with 1 blank line):

Code: Select all

awk 'BEGIN{RS="\n\n+" ; ORS="\n\n\n"} !seen[$0]++' input1.txt input2.txt input3.txt
Awk can work with "records" that span multiple lines. The code <RS=...> tells awk to consider consecutive lines of text (may include linefeed \n) as one "record" until awk sees one or more blank lines (blank lines are not included in record). This multiline record will function as "$0" in !seen[$0]++.
Awk will simply see a string $0 like "line1\nline2\nline3\n...." .
If record is not seen before, record will be printed, followed by 3 linefeeds (\n); first after last line of quote, then 2 more to produce 2 blank lines.

This means it does not matter whether a quote consists of 1, 2 or more lines, as long as they are all separated by one or more empty lines (AND the quotes themselves do not contain empty lines). An empty line means "just a linefeed"; lines that contain only spaces or tabs are NOT considered empty.
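
If some of those separator lines might contain stray spaces or tabs, one way to be safe (a sketch; the clean*.txt names are made up) is to blank them out first and then run the awk command on the cleaned files:

Code: Select all

# turn whitespace-only lines into truly empty lines before de-duplicating
sed 's/^[[:blank:]]*$//' input1.txt > clean1.txt
sed 's/^[[:blank:]]*$//' input2.txt > clean2.txt
sed 's/^[[:blank:]]*$//' input3.txt > clean3.txt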

Hope this works. I tested it on a few small samples, but I do not know the exact layout of your document, or whether the size of the document (and thus of the array) might be a problem.
If no luck, could you give me a bit more info about that?

Robert
Linux is like my late labrador lady-dog: loyal and loving if you treat her lady-like, disbehaving princess if you don't.
GELvdH
Level 5
Posts: 979
Joined: Tue Jan 08, 2019 10:10 am
Location: 3rd rock from Sun

Re: Need to combine 3 files, removing duplicate lines

Post by GELvdH »

I saw this item on a Microsoft help site; maybe it will help.
https://social.technet.microsoft.com/Fo ... d-document
Termy
Level 12
Posts: 4248
Joined: Mon Sep 04, 2017 8:49 pm
Location: UK

Re: Need to combine 3 files, removing duplicate lines

Post by Termy »

This will do it, and efficiently enough, unless your files are incredibly large (i.e. tens of thousands of long lines):

Code: Select all

# each unique line becomes a key in the associative array, so duplicates collapse
declare -A Quotes
for File in FILE_1 FILE_2 FILE_3; {
    # read each line verbatim (no backslash mangling, no whitespace trimming)
    while IFS= read -r; do
        Quotes[$REPLY]=1
    done < "$File"
}

printf '%s\n' "${!Quotes[@]}"
Of course, replace the file paths used above with what is actually relevant to you. You can use brace expansion to make things easier, if needed. Do note though, that the order in this particular case will not be maintained, which I'm assuming doesn't matter here.
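
For example, if the exported files happened to be named quotes1.txt, quotes2.txt and quotes3.txt (made-up names), the brace expansion would look like this:

Code: Select all

# quotes{1..3}.txt expands to quotes1.txt quotes2.txt quotes3.txt before the loop runs
for File in quotes{1..3}.txt; {
    echo "would read: $File"
}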

The reason this works is that it uses BASH's associative arrays, which do not allow duplicate keys. If you're not familiar with them, an associative array is basically an array holding a list of key=value pairs.
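
A tiny illustration of that behaviour (sketch only; the array and key names are made up):

Code: Select all

declare -A Example
Example[apple]=1
Example[pear]=1
Example[apple]=1                 # same key again: overwrites, never duplicates
printf '%s\n' "${!Example[@]}"   # prints the two unique keys (order not guaranteed)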

So, the above code can just be executed, then its output redirected to a desired file for storage.

The AWK solution is a good one too; probably more appropriate if it's just a quick one-off. :lol:
I'm also Terminalforlife on GitHub.
Locked

Return to “Scripts & Bash”