Is there a way to find out what is the most commonly appearing word within the file names located in the folder?


Post by RadioKaga »

Hi, I'm rearranging some image files on my external hard drive and I'm having a bit of a problem. I have a lot of images, and it takes time to move these files one by one. If I know a word that appears a lot in the file names, I can move many of the files in chunks to make the job easier. For example, if I have images of Porsches, I can use Catfish and type Porsche, and it will show all the files that have "Porsche" in their names, which I can then move to the destination folder.

But the problem is that I sometimes have trouble telling which word appears a lot among the file names, since there are so many of them. Sometimes a word I expect to yield a lot of hits produces only 4 matches, and sometimes another word that I had neglected, not thinking it would produce many, produces something like 88.

So I am asking: is there a way to find out the most commonly appearing word in the file names in a folder?

Post by xenopeek »

Run this command from the folder where you have the files:
ls | cut -d'.' -f1 | tr ' ' $'\n' | sort | uniq -c | sort -n

It will list all files, remove the extension (everything from the first dot onward), split the names into words (on the space character), then count each word and sort the words by that count.

If you need to split words on other characters than space or you have files separated into subfolders let me know and we can adjust the command easily for that.
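
To see what each stage of that pipeline does without touching a real folder, here is a minimal sketch that feeds three made-up file names through the same commands. The function name `count_words` and the sample names are assumptions for illustration only; `printf` stands in for `ls`, and `'\n'` is used instead of `$'\n'` so it also runs in plain `sh`.

```shell
# Hypothetical helper: reads one file name per line on stdin and
# prints "count word" lines, least frequent first.
count_words() {
    cut -d'.' -f1 |   # keep only the text before the first dot
    tr ' ' '\n' |     # put every space-separated word on its own line
    sort |            # group identical words together
    uniq -c |         # collapse each group to "count word"
    sort -n           # order by count, smallest first
}

# Made-up sample names in place of the output of ls
printf '%s\n' 'Porsche 911.jpg' 'Porsche 944.jpg' 'Ferrari F40.png' | count_words
```

With these names the last line of output is the most common word, Porsche, with a count of 2. In a real folder, `ls | count_words` should behave like the one-liner above.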

Post by RadioKaga »

xenopeek wrote:
Sun Jun 13, 2021 7:04 am
Run this command from the folder where you have the files:
ls | cut -d'.' -f1 | tr ' ' $'\n' | sort | uniq -c | sort -n

It will list all files, remove the extension, split them into words (split on space character) and then count each word and sort them on that count.

If you need to split words on other characters than space or you have files separated into subfolders let me know and we can adjust the command easily for that.
Almost good, but could it treat underscores as spaces instead? I have a lot of files named like "2000_Pontiac_Firebird_Trans_Am_WS6", and others have dots and hyphens that hamper the word isolation.

Post by vimes666 »

This will work with underscores as well. By comparing the two you will figure out how to handle the other characters :)

ls | cut -d'.' -f1 | tr ' ' $'\n' | tr '_' $'\n' | sort | uniq -c | sort -n

Post by xenopeek »

Slightly more complex:
ls | sed -r 's/\.[^.]*$//' | sed -r 's/[ _.-]/\n/g' | sort | uniq -c | sort -n
This will only remove the last dot and what follows from each filename and will split each filename on spaces, underscores, dots and hyphens.

For some hyphenated words you may want to count them as one word but that's a lot more complex. Hopefully this can be a starter list.
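
As a possible refinement of the command above, the sketch below also drops the empty "words" that appear when separators sit next to each other (as in `Pontiac - Firebird`) and prints the most frequent words first. It assumes GNU sed, as shipped with Linux Mint, because `\n` in the replacement is a GNU feature. The function name `top_words` and the sample names are hypothetical.

```shell
# Hypothetical helper: reads file names on stdin and prints the N most
# frequent words (default 10), most frequent first.
top_words() {
    sed -E 's/\.[^.]*$//; s/[ _.-]+/\n/g' |  # strip last extension, split on runs of separators
    grep -v '^$' |                           # drop empty lines left by leading/trailing separators
    sort | uniq -c |                         # count each distinct word
    sort -rn |                               # highest count first
    head -n "${1:-10}"
}

# Made-up sample names in place of the output of ls
printf '%s\n' \
    '2000_Pontiac_Firebird_Trans_Am_WS6.jpg' \
    '1999 Pontiac - Firebird.v2.png' \
    'Pontiac-GTO.jpeg' | top_words 3
```

For these names the first line of output reports Pontiac with a count of 3. For files spread over subfolders, `find . -type f -printf '%f\n' | top_words` (GNU find) could take the place of `ls | top_words`.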

Post by Flemur »

xenopeek wrote:
Sun Jun 13, 2021 10:14 am
Slightly more complex:
ls | sed -r 's/\.[^.]*$//' | sed -r 's/[ _.-]/\n/g' | sort | uniq -c | sort -n
That seems very different and much harder to understand than your first version; is there a way to change | tr ' ' $'\n' | to something like | tr '[ _.-]' $'\n' | (analogous to the sed -r 's/[ _.-]/\n/g')? (Just curious, since I'd probably run ls > somefile and then vi somefile to separate the words, and avoid learning stuff at the same time.)

Post by xenopeek »

Yes, but it becomes more error-prone if you want to add or remove a character to split on, because the second set needs one \n for each character in the first:
ls | sed -r 's/\.[^.]*$//' | tr ' _.-' '\n\n\n\n' | sort | uniq -c | sort -n
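
One way to avoid keeping the two sets the same length by hand is the `[c*]` form that `tr` accepts in its second set, which repeats a character as often as needed to match the first set. The sketch below uses it together with `-s` to squeeze runs of newlines; the function name `split_names` and the sample name are assumptions for illustration.

```shell
# Hypothetical helper: strips the last extension from each name on stdin,
# then puts each word on its own line.
split_names() {
    sed -E 's/\.[^.]*$//' |    # remove the last dot and what follows
    tr -s ' _.-' '[\n*]'       # '[\n*]' pads the second set with newlines; -s squeezes repeats
}

# Made-up sample name; append "| sort | uniq -c | sort -n" to count as before
printf '%s\n' 'Ford_Mustang-GT.jpg' | split_names
```

With this form, adding or removing a separator only means editing the first set, for example `' _.-,'` to split on commas as well. The hyphen stays last in the set so that `tr` reads it as a literal character rather than a range.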

Post by RadioKaga »

xenopeek wrote:
Sun Jun 13, 2021 10:14 am
Slightly more complex:
ls | sed -r 's/\.[^.]*$//' | sed -r 's/[ _.-]/\n/g' | sort | uniq -c | sort -n
This will only remove the last dot and what follows from each filename and will split each filename on spaces, underscores, dots and hyphens.

For some hyphenated words you may want to count them as one word but that's a lot more complex. Hopefully this can be a starter list.

Sorry for the late response. Thanks guys, that bit of code has been a life saver. Have a great rest of your day.