Need to combine 3 files, removing duplicate lines

About writing shell scripts and making the most of your shell
Forum rules
Topics in this forum are automatically closed 6 months after creation.
Locked
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Need to combine 3 files, removing duplicate lines

Post by jimallyn »

I have been collecting quotes for quite some time, and I now have a file of them that is about 450 pages long. My main desktop computer developed troubles recently, so I was using my laptop for a while and added a few quotes to the file on the laptop. Then I had to replace the hard drive in the laptop, and then I bought a new desktop computer. Long story short, I now have 3 versions of my quotes file. What I need is something that will merge these three files into one and remove any duplicate lines in the process. The files are in LibreOffice .odt format, but they can easily be converted to plain text files if needed for processing. I assume that a BASH command, one-liner, or script would easily handle this chore, but I am definitely NOT a BASH wizard. Can somebody help me with this? Thanks.
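
For the .odt-to-plain-text step, LibreOffice itself can do the conversion from the command line. A minimal sketch, assuming the three files are saved as quotes1.odt, quotes2.odt and quotes3.odt (placeholder names):

Code: Select all

# convert each .odt to a plain .txt file in the current directory
# (adjust the placeholder file names to the real ones)
libreoffice --headless --convert-to txt quotes1.odt quotes2.odt quotes3.odt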
Last edited by LockBot on Wed Dec 28, 2022 7:16 am, edited 1 time in total.
Reason: Topic automatically closed 6 months after creation. New replies are no longer allowed.
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Re: Need to combine 3 files, removing duplicate lines

Post by jimallyn »

I found this:

awk '!a[$0]++' input.txt > output.txt

It removes the duplicate lines, but the original file had 2 blank lines between quotes. Of course, this removes those blank lines, and I would like to keep them. I also messed around with sort and uniq but couldn't see a way to get what I want with those, either.
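
If the goal is to keep the blank lines while still de-duplicating the quote lines, one possible tweak of that one-liner (a sketch, not tested against your actual file) is to let empty lines through unconditionally:

Code: Select all

# print blank lines as they are; de-duplicate every non-blank line
awk 'NF==0 || !seen[$0]++' input.txt > output.txt

Note that where a duplicate quote is dropped, its surrounding blank lines are still printed, so a few extra blank runs may remain to tidy up.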
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
t42
Level 11
Posts: 3742
Joined: Mon Jan 20, 2014 6:48 pm

Re: Need to combine 3 files, removing duplicate lines

Post by t42 »

Can it be as simple as changing each \n to \n\n afterwards?
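
If that idea is taken literally, sed can re-insert blank lines after the de-duplicated output has been produced. A rough sketch (assuming output.txt from the one-liner above):

Code: Select all

# G appends the empty hold space plus a newline after each line,
# so applying it twice leaves two blank lines between quotes
sed 'G;G' output.txt > spaced.txt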
-=t42=-
1000
Level 6
Posts: 1039
Joined: Wed Jul 29, 2020 2:14 am

Re: Need to combine 3 files, removing duplicate lines

Post by 1000 »

Meld is a very nice tool (for me it is the best tool).
https://www.fossmint.com/best-diff-merg ... for-linux/

But:
1. You need to move the text from the .odt binary files to .txt (plain text files).
I don't know of a merge tool for .odt files.

2. This works great if the lines are sorted.
If not, it can sometimes be hard to tidy up.

If Meld does not work very well, you can sort all the lines in the text files.
Before doing this, make copies of the files, because the sorting will change them.

Example

Code: Select all

$ cat file
C
A
B

$ sort file > new.file

$ cat new.file
A
B
C
Then you can use Meld again and see the real differences.
But you cannot just move lines across, because they are sorted; you need to find each one in the original file and move the whole quote by hand.

Unless you write your own script for sorting the quotes.
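
To follow that suggestion end to end, one possible sequence (a sketch; file1.txt, file2.txt and file3.txt stand in for the exported text files) would be:

Code: Select all

# sort a copy of each exported file, then compare them three-way in Meld
sort file1.txt > sorted1.txt
sort file2.txt > sorted2.txt
sort file3.txt > sorted3.txt
meld sorted1.txt sorted2.txt sorted3.txt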
1000
Level 6
Posts: 1039
Joined: Wed Jul 29, 2020 2:14 am

Re: Need to combine 3 files, removing duplicate lines

Post by 1000 »

Code: Select all

#!/bin/bash

# Counter used as a unique placeholder for every blank line,
# so the de-duplication step does not collapse them
NUMBER=50000

# Start with an empty work file
> work.file

while IFS= read -r LINE; do
    # increment the counter
    NUMBER=$((NUMBER + 1))
    # write the counter value in place of an empty line
    [ -z "$LINE" ] && echo "$NUMBER" >> work.file
    # save the line itself if it is not empty
    [ -z "$LINE" ] || echo "$LINE" >> work.file
done < input.txt

# remove duplicate lines, then turn the placeholder numbers (50001-59999) back into blank lines
awk '!a[$0]++' work.file | sed 's/^5[0-9][0-9][0-9][0-9]$//' > output.txt

# remove the work file, it is no longer needed
rm work.file

echo "Ready. Check output.txt"

# Open the file in the default text editor
#xdg-open output.txt

# Or open it in a specific text editor (for example xed)
#xed output.txt
You can test it. (I used the file name input.txt from your example above.)
Copy it, save it to a file, and run:

Code: Select all

bash script
If you want the file to open automatically, uncomment the appropriate line or edit it.
My system's default text editor is the web browser, so I added an example with xed, a text editor.
Flemur
Level 20
Posts: 10096
Joined: Mon Aug 20, 2012 9:41 pm
Location: Potemkin Village

Re: Need to combine 3 files, removing duplicate lines

Post by Flemur »

jimallyn wrote: Thu Jan 28, 2021 2:27 am Long story short, the end result of all of this is I now have 3 versions of my quotes file. What I need is something that will merge these three files into one, and remove any duplicate lines in the process.
This might work:
cat f.1 f.2 f.3 | sort -u -m > f.new
...or replace -m with -c
If you don't care about the order of the quotes, just use -u.
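For inputs that are not already sorted, a simpler form of the same idea (just a sketch) is to let sort read all three files directly and de-duplicate in one pass:

Code: Select all

# sorts the combined contents of all three files and drops duplicate lines
sort -u f.1 f.2 f.3 > f.new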
Please edit your original post title to include [SOLVED] if/when it is solved!
Your data and OS are backed up....right?
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Re: Need to combine 3 files, removing duplicate lines

Post by jimallyn »

Thanks for the suggestions. I will try them out over the next few days.
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
TI58C
Level 4
Posts: 389
Joined: Tue Jul 18, 2017 5:57 am

Re: Need to combine 3 files, removing duplicate lines

Post by TI58C »

jimallyn wrote: Thu Jan 28, 2021 2:27 am .....
What I need is something that will merge these three files into one, and remove any duplicate lines in the process. The files are in LibreOffice .odt format, but they can easily be converted to plain text files if needed for processing, and I assume that a BASH command, one-liner, or script would easily handle this chore, but I am definitely NOT a BASH wizard. Can somebody help me with this? Thanks.


Hi Jimallyn,

A long time ago you were the first to welcome me on this forum. Been down (personally and on the forum) for a while, but back again.
Just saw your question. You were so close with awk ...

Export your .odt files to .txt. It does not matter whether that gives you one, two, or any number of .txt files. This awk one-liner should do it (of course, change the number of input files to whatever you need):

Code: Select all

awk '/^[ \t]*$/ {next}; {if(!seen[$0]) {seen[$0]=1 ; printf "%s\n\n", $0}}' input1.txt input2.txt input3.txt

/.../{next} skips all empty lines (including lines of only spaces or tabs). You want your aphorisms, nothing else; how they get printed follows below.
seen[$0] is an associative array: the line is the INDEX and the value can be 0 or 1. !seen[$0] is true if the line is not already an index in the array (seen[$0] is still undefined). In that case we create a new element seen[] with the line ($0) as index and value 1, to catch later copies of the line, and we print the line followed by 2 newlines (\n). Change the number of \n's to change the spacing between quotes, or change "\n\n" to "\n\n-------- separation --------\n\n", whatever you like.

Oh, and of course you can add > output.txt.

Kind regards,
Robert


PS: Do note that a minimal change between two lines that are essentially the same (extra space, extra comma, whatever) will mean the (simple) script sees a "unique" line.

PPS: The above assumes that each quote is on ONE line !




EDIT 1: Even better, more elegant and closer to what you already found would be this:

Code: Select all

awk '{ORS="\n\n---------------\n\n"} ; /^[ \t]*$/ {next}; !seen[$0]++' input1.txt input2.txt input3.txt
ORS = Output Record Separator
Linux is like my late labrador lady-dog: loyal and loving if you treat her lady-like, disbehaving princess if you don't.
jimallyn
Level 19
Posts: 9075
Joined: Thu Jun 05, 2014 7:34 pm
Location: Wenatchee, WA USA

Re: Need to combine 3 files, removing duplicate lines

Post by jimallyn »

TI58C wrote: Fri May 07, 2021 10:46 am PPS: The above assumes that each quote is on ONE line !
The majority of them take up two or more lines.
“If the government were coming for your TVs and cars, then you'd be upset. But, as it is, they're only coming for your sons.” - Daniel Berrigan
AndyMH
Level 21
Posts: 13739
Joined: Fri Mar 04, 2016 5:23 pm
Location: Wiltshire

Re: Need to combine 3 files, removing duplicate lines

Post by AndyMH »

Meld, a GUI tool, has already been suggested.

One other thing, if you are working off multiple PCs: I sync important files across my PCs with unison via my NAS. It means that whatever I'm working on, I have the latest files.
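
A minimal sketch of what such a sync command can look like (the local path and NAS address here are made up; unison also supports stored profiles):

Code: Select all

# two-way sync of a local folder with the same folder on a NAS over SSH
unison ~/Documents/quotes ssh://mynas//srv/sync/quotes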
Thinkcentre M720Q - LM21.3 cinnamon, 4 x T430 - LM21.3 cinnamon, Homebrew desktop i5-8400+GTX1080 Cinnamon 19.0
GELvdH
Level 5
Posts: 979
Joined: Tue Jan 08, 2019 10:10 am
Location: 3rd rock from Sun

Re: Need to combine 3 files, removing duplicate lines

Post by GELvdH »

I have three computers, two desktops and a laptop, and I use mega.nz cloud storage. By using MEGA I have all of my data in the cloud, and it syncs across all three systems.
TI58C
Level 4
Posts: 389
Joined: Tue Jul 18, 2017 5:57 am

Re: Need to combine 3 files, removing duplicate lines

Post by TI58C »

jimallyn wrote: Sun May 09, 2021 1:32 am
TI58C wrote: Fri May 07, 2021 10:46 am PPS: The above assumes that each quote is on ONE line !
The majority of them take up two or more lines.
Hi Jimallyn,

Meld might do the trick, but I suspect it would need a lot of manual work.
You said your quotes were separated by 2 blank lines. If that is always true, in all 3 files, then you could test this (it will also work with 1 blank line):

Code: Select all

awk 'BEGIN{RS="\n\n+" ; ORS="\n\n\n"} !seen[$0]++' input1.txt input2.txt input3.txt
Awk can work with "records" that span multiple lines. The code <RS=...> tells awk to consider consecutive lines of text (may include linefeed \n) as one "record" until awk sees one or more blank lines (blank lines are not included in record). This multiline record will function as "$0" in !seen[$0]++.
Awk will simply see a string $0 like "line1\nline2\nline3\n...." .
If record is not seen before, record will be printed, followed by 3 linefeeds (\n); first after last line of quote, then 2 more to produce 2 blank lines.

This means it does not matter whether a quote consists of 1, 2 or more lines, as long as they are all separated by one or more empty lines (AND the quotes themselves do not contain empty lines). An empty line means "just a linefeed"; lines that contain only spaces or tabs are NOT considered empty.
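
If some of those separator lines might contain stray spaces or tabs, one way to be safe (a sketch; the clean*.txt names are made up) is to blank them out first and then run the awk command on the cleaned files:

Code: Select all

# turn whitespace-only lines into truly empty lines before de-duplicating
sed 's/^[[:blank:]]*$//' input1.txt > clean1.txt
sed 's/^[[:blank:]]*$//' input2.txt > clean2.txt
sed 's/^[[:blank:]]*$//' input3.txt > clean3.txt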

Hope this works. I tested it on a few small samples, but I do not know the exact layout of your document, or whether the size of the document (and thus of the array) might be a problem.
If no luck, could you give me a bit more info about that?

Robert
Linux is like my late labrador lady-dog: loyal and loving if you treat her lady-like, disbehaving princess if you don't.
GELvdH
Level 5
Posts: 979
Joined: Tue Jan 08, 2019 10:10 am
Location: 3rd rock from Sun

Re: Need to combine 3 files, removing duplicate lines

Post by GELvdH »

I saw this item on a Microsoft help site; maybe it will help.
https://social.technet.microsoft.com/Fo ... d-document
Termy
Level 12
Posts: 4248
Joined: Mon Sep 04, 2017 8:49 pm
Location: UK

Re: Need to combine 3 files, removing duplicate lines

Post by Termy »

This will do it, and efficiently enough, unless your files are incredibly large (i.e. tens of thousands of long lines):

Code: Select all

# each unique line becomes a key in the associative array, so duplicates collapse
declare -A Quotes
for File in FILE_1 FILE_2 FILE_3; {
    # read each line verbatim (no backslash mangling, no whitespace trimming)
    while IFS= read -r; do
        Quotes[$REPLY]=1
    done < "$File"
}

printf '%s\n' "${!Quotes[@]}"
Of course, replace the file paths used above with what is actually relevant to you. You can use brace expansion to make things easier, if needed. Do note though, that the order in this particular case will not be maintained, which I'm assuming doesn't matter here.
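
For example, if the exported files happened to be named quotes1.txt, quotes2.txt and quotes3.txt (made-up names), the brace expansion would look like this:

Code: Select all

# quotes{1..3}.txt expands to quotes1.txt quotes2.txt quotes3.txt before the loop runs
for File in quotes{1..3}.txt; {
    echo "would read: $File"
}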

The reason this works is that it uses BASH's associative arrays, which do not allow duplicate keys. If you're not familiar with them, an associative array is basically an array holding a list of key=value pairs.
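
A tiny illustration of that behaviour (sketch only; the array and key names are made up):

Code: Select all

declare -A Example
Example[apple]=1
Example[pear]=1
Example[apple]=1                 # same key again: overwrites, never duplicates
printf '%s\n' "${!Example[@]}"   # prints the two unique keys (order not guaranteed)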

So, the above code can just be executed, then its output redirected to a desired file for storage.

The AWK solution is a good one too; probably more appropriate if it's just a quick one-off. :lol:
I'm also Terminalforlife on GitHub.
Locked

Return to “Scripts & Bash”