Files

Post by 1000 »

How can I check that a file contains only printable characters, spaces, and tabs?

Is this method any good?

Code: Select all

$ A=$(sed 's/[[:blank:][:print:]]//g' /bin/echo | tr -d '\n') ; [ -z "$A" ] && echo "This is only text" || echo "This is not text file"
bash: warning: command substitution: ignored null byte in input
This is not text file
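
The "ignored null byte" warning appears because a shell variable cannot hold NUL bytes. A minimal sketch that avoids storing the leftovers at all: delete every allowed byte with tr and count what remains. The function name only_printable is made up for illustration, and "printable" is assumed here to mean ASCII 0x20 to 0x7E plus tab and newline.

```shell
# Hypothetical helper: succeeds if FILE holds only printable ASCII,
# tabs and newlines. The byte ranges are octal: \11 = tab, \12 = newline,
# \40-\176 = space through tilde.
only_printable() {
    leftover=$(LC_ALL=C tr -d '\11\12\40-\176' < "$1" | wc -c)
    [ "$leftover" -eq 0 ]
}

# Usage:
# only_printable /bin/echo && echo "This is only text" || echo "This is not text file"
```

Because the count goes through wc -c, no file content is ever stored in a variable or printed to the terminal.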

Post by matrovska »

Use the file command:

Code: Select all

$ file .bashrc
.bashrc: ASCII text
http://manpages.ubuntu.com/manpages/foc ... ile.1.html
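
For use in a script, file can be asked for just its encoding guess instead of the free-form description. A rough sketch, assuming a file implementation that supports -b and --mime-encoding; the function names are hypothetical. Note that file works by heuristics, and its idea of "text" may be broader than "printable characters, spaces and tabs".

```shell
# Hypothetical helper: print file's guess at the character encoding,
# e.g. "us-ascii", "utf-8" or "binary".
encoding_of() {
    file -b --mime-encoding -- "$1"
}

# Treat anything that file labels as binary as "not text".
looks_like_text() {
    [ "$(encoding_of "$1")" != "binary" ]
}

# Usage:
# looks_like_text ~/.bashrc && echo "text" || echo "binary"
```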

Post by rene »

Note: sed operates on lines, i.e. the text between \n characters, and is as such fundamentally okay with \n as well. Assuming that you are too, I'd use something like

Code: Select all

sed -n '/^[[:blank:][:print:]]*$/!q1' "$FILE" && echo "This is only text" || echo "This is not a text file"
I.e., break out of sed with (as a GNU sed extension) exit code 1 as soon as a line containing anything other than a space, tab or (other) printable character is found in the file "$FILE".
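
For a sed without the GNU q1 extension, a variation on the same idea is to print the number of the first offending line and test whether the output is empty. A minimal sketch; the function name first_bad_line is made up here, and LC_ALL=C is an added assumption so that only ASCII counts as printable.

```shell
# Hypothetical helper: print the number of the first line that contains
# anything other than blanks and printable characters; print nothing if
# every line is clean. LC_ALL=C makes the character classes byte-based.
first_bad_line() {
    LC_ALL=C sed -n '/^[[:blank:][:print:]]*$/!{
=
q
}' "$1"
}

# Usage:
# [ -z "$(first_bad_line "$FILE")" ] && echo "This is only text" || echo "This is not a text file"
```

As a side effect, the line number says where to start looking in the file.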

Post by donalduck »

The cat -v command is good to know.

The -v option shows non-printing characters in a special notation.

From man cat:
-v, --show-nonprinting
use ^ and M- notation, except for LFD and TAB
A bash example I like, using process substitution:

Code: Select all

diff <(cat confused.tmp) <(cat -v confused.tmp)
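
In a script, the same comparison can be reduced to an exit status by letting cmp compare the file with its cat -v rendering. A sketch with a hypothetical function name; nothing is written to the terminal, since the cat -v output only travels through a pipe. Because -v leaves TAB and LFD alone, tabs and newlines pass; with GNU cat, bytes above 0x7F are rewritten in M- notation, so UTF-8 text would be flagged as well.

```shell
# Hypothetical helper: succeeds when cat -v leaves FILE unchanged, i.e.
# when there was nothing for it to rewrite in ^ or M- notation.
same_under_cat_v() {
    cat -v "$1" | cmp -s - "$1"
}

# Usage:
# same_under_cat_v confused.tmp && echo "clean" || echo "has non-printing characters"
```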

Post by 1000 »

Code: Select all

(cat confused.tmp)
From a security point of view, this is not a very good idea.
In my test, the file "confused.tmp" may be a binary file.
It is possible that even a non-malicious file will be partially "executed" by "cat".
This most often shows up as the system or terminal becoming unstable or freezing.

I guess a better idea is to display the characters in some encoded form (for example, in hexadecimal)
and check whether each character belongs to the "printable" group.
But I am still learning about characters and character conversion, and I don't know if I can do it.
So I tried a slightly different method with sed, as above.

The "file" command:
- it is problematic, e.g. for detecting binary files
- the list of different outputs this command can print in the terminal is huge.
In my free time I will try to check.
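
The idea of converting the bytes to numbers and checking each one can be sketched with od and awk. This is only an illustration: the function name is hypothetical, decimal values are used instead of hexadecimal because they are easier to compare in awk, and "printable" is assumed to mean ASCII 32 to 126 plus tab (9) and newline (10).

```shell
# Hypothetical helper: dump FILE as unsigned decimal byte values and let
# awk check each one. Allowed: 9 (tab), 10 (newline), 32..126 (printable).
all_bytes_printable() {
    od -An -v -tu1 "$1" | awk '
        {
            for (i = 1; i <= NF; i++)
                if (!($i == 9 || $i == 10 || ($i >= 32 && $i <= 126))) {
                    bad = 1
                    exit
                }
        }
        END { exit bad + 0 }
    '
}

# Usage:
# all_bytes_printable confused.tmp && echo "only text" || echo "not a text file"
```

Since od turns every byte into a number, NUL bytes are checked like any other value.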

Post by AndyMH »

If you want to strip non-printable characters out of a file, use strings. See man strings for more info.

Post by Aztaroth »

1000 wrote: Fri Dec 03, 2021 4:18 pm How can I check that the file contains only printable characters + spaces + tabs ?
Having had the same need some time ago, I found this here:
https://stackoverflow.com/questions/319 ... characters

Code: Select all

grep -qP "[^\x20-\x7E]" file && echo "weird ASCII" || echo "clean one"
which checks whether any character falls outside the hex 20 to hex 7E (i.e. printable) range; file is the name of the file you want to check.

One of the people who tried it had issues because their version of grep was too old.
LMDE4 has 3.3, which works fine.
There is also a standard POSIX version, which I didn't check because the first one was OK for me.

In fact, in my script, I treated the $? return value of grep -qP "[^\x20-\x7E]" file as a boolean: 1 (true) when the file has only printable ASCII characters, and 0 (false) when it has "weird" characters.
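
A sketch of using that exit status directly in an if, written with POSIX bracket classes instead of -P so it does not depend on PCRE support. The function name is hypothetical, and tab is allowed here in addition to the 0x20 to 0x7E range. grep works line by line, so newlines are never flagged, and NUL bytes may need separate handling (see the grep test later in this thread).

```shell
# Hypothetical helper: succeeds (exit 0) if FILE has at least one byte that
# is neither printable ASCII nor a blank (space or tab).
has_weird_bytes() {
    LC_ALL=C grep -q '[^[:print:][:blank:]]' "$1"
}

# Usage: grep -q exits 0 on a match, so "found something weird" is the
# true branch of the if.
# if has_weird_bytes somefile; then echo "weird ASCII"; else echo "clean one"; fi
```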

Post by dave0808 »

I use

Code: Select all

cat -vet filename
as I simply remember the options as the word "vet". It could be reduced further to simply -et but that never stuck in my head. :D

Post by rene »

dave0808 wrote: Tue Dec 07, 2021 7:27 am It could be reduced further to simply -et but that never stuck in my head. :D
Really? Show "alien characters" doesn't work for you? ;)

Post by Termy »

If I wanted to do this in a more programmatic way, on a file which isn't too big, I'd probably do the following, which checks each character as it's read. It's slow, but it works. Otherwise, I'd do this with grep(1), PERL, or AWK.

Code: Select all

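# Read the whole file into a variable (command substitution drops NUL bytes).
Data=$(< ~/.bashrc)

# Walk the data one character at a time.
for (( Offset = 0; Offset < ${#Data}; Offset++ )); do
    Char=${Data:Offset:1}

    case $Char in
        [[:print:][:space:]])
            ;;
        *)
            printf "Err: Character '%s' found.\n" "$Char" 1>&2
            exit 1 ;;
    esac
done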
1000 wrote: Mon Dec 06, 2021 10:56 am It is possible that the non-malicious file will be executed partially by "cat".
This is most often seen when the system or terminal is unstable or freezing.
I'm not sure how that would happen, as cat(1) should just read the data from the file, then display it onto the terminal. I mean, I suppose if cat(1) has some sort of vulnerability the code in a file could abuse, but I'd say that's something different.

What you're referring to there sounds more like data being interpreted as escape sequences, which, as I understand and have experienced it, is more of a benign inconvenience than something unsafe. Running `reset` almost always fixes the terminal.

That being said, it is possible to be a bit of a nuisance, and potentially do some somewhat unpleasant things with escape sequences, but I've never been all that concerned about that. I think that was more of an issue years ago, as software tends to prohibit the more alarming things by default.

If you're concerned, you could potentially minimize risk by reading from the file in a few other ways, using just the shell itself, such as:

Code: Select all

# Store the data into a simple variable, then display it.
Data=`< FILE`
printf '%s\n' "$Data"

# Use an array with `readarray` or `mapfile`, then dump the elements.
readarray Data < FILE
printf '%s' "${Data[@]}"

# Print the lines one at a time with a `while read` loop.
while read; do
	printf '%s\n' "$REPLY"
done < FILE
By default, `read` won't properly use an escape sequence, which you can see by doing:

Code: Select all

read <<< "this \e[91mis\e[0m a test."
printf '%b\n' "$REPLY"
Yet, if you use the `-r` flag with the `read` builtin, the escape sequence is allowed to come to life. In this case, using the `%b` format specification with the `printf` builtin interprets the escape sequences, as demonstrated with the below commands.

Code: Select all

read -r <<< "this \e[91mis\e[0m a test."
printf '%b\n' "$REPLY"
The `help` for the `read` builtin shows this:

Code: Select all

      -r	do not allow backslashes to escape any characters
Which I've always found grossly unhelpful and misleading, because it actually does allow escape sequences/characters to be interpreted by something like `echo` or `printf`. If you use `read` without `-r`, you will literally see the sequence without the backslash ('\'). If anyone knows why they said it like that, please let me know, because it's baffled me for 7 years. Lol
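
One way to read that help text: -r only concerns what `read` itself does with backslashes. Without -r, `read` consumes each backslash as an escape for the character after it; with -r, the backslashes are kept as data, and whatever interprets them later (such as `printf '%b'`) is a separate step. A small sketch to compare the two, using `printf '%s'` so that nothing is interpreted on output; the variable names are arbitrary.

```shell
input='this \e[91mis\e[0m a test.'

# Without -r: read removes each backslash and keeps the character after it.
plain=$(printf '%s\n' "$input" | { read line; printf '%s' "$line"; })

# With -r: backslashes are kept literally.
raw=$(printf '%s\n' "$input" | { read -r line; printf '%s' "$line"; })

printf 'without -r: %s\n' "$plain"   # this e[91mise[0m a test.
printf 'with -r:    %s\n' "$raw"     # this \e[91mis\e[0m a test.
```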

Post by 1000 »

I'm not sure how that would happen, as cat(1) should just read the data from the file
Of course, I don't mean anything bad happens either.
But whatever it is, if it can have an influence, I don't want it. :mrgreen:
E.g. the execution of control characters.
Displaying (executing) non-printable characters is unnecessary.

This is a bit of a bigger topic.
The "cat" program is a rather old program.
- The priority for all programs back then was to create a working and useful program, not safety.
It is true, though, that the whole Linux environment is getting more and more secure.
- "cat" has its own specific (and varied) uses,
and changing the rules in the program might break the scripts in which it is used.
So it has to stay.

I am not an expert in this field,
but it seems to me, and I have also read a lot saying, that
- if you only use the "cat" command for text files, that's fine
- if you use it on binary files, then it is not a good idea
1. Some characters are invisible (not printable, for example "control characters").
2. Binary files are really saved in the form of zeros and ones.
Lines of 0s and 1s are long, so ("we") use hexadecimal editors.

An example link about cat:
https://security.stackexchange.com/ques ... urity-risk

Also, "cat -et" and "cat -vet" are useful.

Off topic.
I'll change the subject
to the use of invisible characters in source code.

Attack examples include
- changing the order of characters
- replacing one character with another that looks very similar

Example links:
1. Example https://certitude.consulting/blog/en/in ... -backdoor/
2. homograph attack https://en.wikipedia.org/wiki/IDN_homograph_attack
3. PDF https://trojansource.codes/trojan-source.pdf

Therefore, this topic is not accidental :lol:

Post by dave0808 »

rene wrote: Tue Dec 07, 2021 7:40 am
dave0808 wrote: Tue Dec 07, 2021 7:27 am It could be reduced further to simply -et but that never stuck in my head. :D
Really? Show "alien characters" doesn't work for you? ;)
I know, right?! It seems the most obvious, but because I learnt "vet" first, it's the one that I keep coming back to. :roll:

Post by 1000 »

Just an observation.
When I tested different approaches, I found that "grep" is not the best tool for testing for non-printable characters.
https://unix.stackexchange.com/question ... ash-script
(However, "grep" can be used in another way,
e.g. for checking hex values, if we first convert the characters to hex.)

An example that prints the characters from decimal 0 to 31:

Code: Select all

$ LC_ALL=C ; while read -r LINE ; do printf '%s %b\n' "$LINE" "$LINE"  | grep -qP "[^\x20-\x7E]" && echo "weird ASCII" || echo "clean one"  ; done <<< $(printf '\\x%x\n' {0..31}) | cat -n
     1	clean one
     2	weird ASCII
     3	weird ASCII
     4	weird ASCII
     5	weird ASCII
     6	weird ASCII
     7	weird ASCII
     8	weird ASCII
     9	weird ASCII
    10	weird ASCII
    11	clean one
    12	weird ASCII
    13	weird ASCII
    14	weird ASCII
    15	weird ASCII
    16	weird ASCII
    17	weird ASCII
    18	weird ASCII
    19	weird ASCII
    20	weird ASCII
    21	weird ASCII
    22	weird ASCII
    23	weird ASCII
    24	weird ASCII
    25	weird ASCII
    26	weird ASCII
    27	weird ASCII
    28	weird ASCII
    29	weird ASCII
    30	weird ASCII
    31	weird ASCII
    32	weird ASCII
You will see 32 lines (edited) because
- I used "cat -n", which counts from 1 instead of 0
- one of the characters is a new line

Code: Select all

    11	\xa 
    12	
    13	\xb
So, for example, take line 11 above, which is reported as "clean one":

Code: Select all

LC_ALL=C ; while read -r LINE ; do printf '%s %b\n' "$LINE" "$LINE"  ; done <<< $(printf '\\x%x\n' {0..31}) | cat -n
     1	\x0 
     2	\x1 
     3	\x2 
     4	\x3 
     5	\x4 
     6	\x5 
     7	\x6 
     8	\x7 
     9	\x8 
    10	\x9 	
    11	\xa 
    12	
This is "10" in decimal and "a" in hex [ LF '\n' (new line) in man ascii ].

From "man ascii":

Code: Select all

       Oct   Dec   Hex   Char                  
       ────────────────
       000   0     00    NUL '\0' (null character)

       011   9     09    HT  '\t' (horizontal tab)   111   73    49    I
       012   10    0A    LF  '\n' (new line)         112   74    4A    J
       013   11    0B    VT  '\v' (vertical tab)     113   75    4B    K
       014   12    0C    FF  '\f' (form feed)        114   76    4C    L
I'll test the rest later.
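
A variation of the test above that sidesteps the NUL and newline special cases: each byte is written to a temporary file on its own and classified with tr, so no byte has to pass through a shell variable or a line-based tool. This is a sketch; the function name is hypothetical, and tab and newline are counted as "clean" on purpose.

```shell
# Hypothetical test harness: write each byte 0..31 to a temporary file on
# its own and classify it by deleting the allowed bytes and counting the rest.
classify_low_bytes() {
    tmp=$(mktemp) || return 1
    i=0
    while [ "$i" -le 31 ]; do
        # Build an octal escape such as \011 and let printf emit that byte.
        printf "\\$(printf '%03o' "$i")" > "$tmp"
        # Delete tab, newline and printable ASCII; count what is left.
        left=$(LC_ALL=C tr -d '\11\12\40-\176' < "$tmp" | wc -c)
        if [ "$left" -eq 0 ]; then
            echo "$i clean one"
        else
            echo "$i weird ASCII"
        fi
        i=$((i + 1))
    done
    rm -f "$tmp"
}

# Usage:
# classify_low_bytes
```

With this harness only 9 (tab) and 10 (newline) are reported as "clean one"; 0 (NUL) is reported as "weird ASCII". The first column is the decimal value, so no offset by one is needed when reading the list.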

Post by Termy »

1000 wrote: Mon Dec 20, 2021 5:17 pm When I tested different ways I found that "grep" is not best tool for test not printable chars.
Was that because of the binary character, and grep(1) not really being for that? If so, there's the -a flag, which might allow that to work properly.

Post by 1000 »

I mean that working precisely with grep -P "[^\xNUMBER-\xNUMBER]" is impossible,
because the "null" and "new line" characters are treated as special cases.
I don't know what the reason is, but the grep command probably has one.
If there is a way to do it with grep, I don't know it.

In my own example,
- I can ignore the new line problem, because it is a character that I expect to appear in text files.
- I'm not convinced about the "null" character. If it does nothing, that is fine. But if it is treated as the end of input, the rest may not be checked.
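
To settle the doubt about "null" separately from grep, the NUL bytes can be counted directly; tr works on bytes and does not stop at a NUL. A sketch, with a hypothetical function name:

```shell
# Hypothetical helper: print how many NUL bytes FILE contains.
# tr -cd '\000' deletes everything except NUL; wc -c counts what is left.
count_nul_bytes() {
    LC_ALL=C tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Usage:
# [ "$(count_nul_bytes somefile)" -eq 0 ] || echo "file contains NUL bytes"
```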

containsonlyprintable (cop)

Post by mmphosis »

This is a solution using the C programming language in a shell script. I assume we're talking about US 7-bit ASCII, and not UTF-8. I also assume that vertical-tab ^K, form-feed ^L, and carriage return ^M are non-printable. Tab ^I, newline ^J, and space to tilde are accepted as "printable."

Code: Select all

#!/usr/bin/env bash
cat <<'EOF' >cop.c
#include <stdio.h>
int main()
{
	int c;
	while ((c = getchar()) != EOF) {
		if (c < ' ' || c > '~') {
			if (c != '\t' && c != '\n') {
				return 1;
			}
		}
	}
	return 0;
}
EOF
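# Bail out with install instructions if no C compiler is available.
if ! command -v cc >/dev/null 2>&1; then
	echo "Install GCC compiler:"
	printf "\e7# \x1b[37msudo apt update\e8\n\e7# \x1b[37msudo apt install build-essential\e8\n"
	return 2>/dev/null || exit
fi
cc cop.c -o cop
# BIN is wherever you keep your local bin
export BIN="$HOME/bin"
mkdir -p "$BIN"
# Drop any existing $BIN entry from PATH, then put $BIN first.
PATH="$(/bin/echo ":$PATH" | sed -e "s@:$BIN@@g" -e "s/^://")"
PATH="$BIN:$PATH"
install cop "$BIN/"
# Exit status 0 means only printable characters, tabs and newlines were seen.
cop <"$(command -v ls)" && echo yes || echo no
cop <cop.c && echo yes || echo no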