How to read files with different encodings using Awk?

How can I correctly read files in encodings other than UTF-8 in Awk?
I have a file in Hebrew (Windows-1255) encoding.
A simple {print $0} awk program prints characters like �.
How can I make it read the file correctly?

awk itself has no option for converting between encodings. It honors the locale specified in the environment, but your best bet is to transcode the input to the encoding your locale expects before handing it to awk:
iconv -c -f cp1255 -t utf8 somefile | awk ...
-f is the encoding you want to convert from, -t is the target encoding, and -c drops any invalid characters, which would otherwise terminate iconv's operation prematurely. iconv --help gives more details.
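The transcoding pipeline from this answer can be sketched as follows. This is only an illustration, assuming an iconv that recognizes the names CP1255 and UTF-8 (glibc and GNU libiconv both do). The sample bytes and the awk program are made up: the four bytes spell the Hebrew word "shalom" in Windows-1255.

```shell
# Hypothetical sample: "shalom" as Windows-1255 bytes (0xF9 0xEC 0xE5 0xED),
# written with octal escapes so this script itself stays plain ASCII.
tmpfile=$(mktemp)
printf '\371\354\345\355\n' > "$tmpfile"

# Transcode to UTF-8 first, then let awk work on the converted stream.
out=$(iconv -c -f CP1255 -t UTF-8 "$tmpfile" | awk '{ print NR ": " $0 }')

printf '%s\n' "$out"
rm -f "$tmpfile"
```

In a UTF-8 terminal the last printf should show the line number followed by readable Hebrew text instead of replacement characters.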

Related

Grepping list of phpass hashes against a file

I'm trying to grep multiple strings which look like this (there's a few hundred) against a file which contains data:string
Example strings: (no sensitive data is provided, they have been modified).
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1
I've been researching how to grep a file of patterns against another file, and came across the following commands
grep -f strings.txt datastring.txt > output.txt
grep -Ff strings.txt datastring.txt > output.txt
But unfortunately, these commands do NOT work successfully, and only print out a handful of results to my output file. I think it may be something to do with the symbols contained in strings.txt, but I'm unsure. Any help/advice would be great.
To further mention, I'm using Cygwin on Windows (if this is relevant).
Here's an updated example:
strings.txt contains the following:
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1
datastring.txt contains the following:
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1:53491
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91:03221
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/:20521
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1:30142
So technically, all lines should be included in the OUTPUT file, but only this line is outputted:
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1:30142
I just don't understand.

You have shown the output of cat -A strings.txt elsewhere, and it includes ^M, which represents a CR (carriage return) character, at the end of each line.
This indicates your file has Windows line endings (CR LF) instead of the Unix line endings (only LF) that grep would expect.
You can convert files with dos2unix strings.txt and back with unix2dos strings.txt.
Alternatively, if you don't have dos2unix installed in your Cygwin environment, you can also do that with sed.
sed -i 's/\r$//' strings.txt # dos2unix
sed -i 's/$/\r/' strings.txt # unix2dos
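To see the effect of the stray CR, here is a small sketch with made-up file contents (the names and data are hypothetical, not the asker's real hashes). It uses tr -d '\r' instead of sed because \r in a sed expression is a GNU extension, while tr understands it everywhere.

```shell
# Hypothetical sample files; contents are illustrative only.
tmpdir=$(mktemp -d)
printf 'abc$1\r\ndef$2\r\n' > "$tmpdir/strings.txt"     # CR LF line endings
printf 'abc$1:111\ndef$2:222\n' > "$tmpdir/data.txt"   # LF line endings

# With the CR still attached, each pattern ends in a CR byte
# that never occurs in data.txt, so nothing matches.
before=$(grep -cFf "$tmpdir/strings.txt" "$tmpdir/data.txt" || true)

# Strip every CR from the pattern file, then retry.
tr -d '\r' < "$tmpdir/strings.txt" > "$tmpdir/strings.unix.txt"
after=$(grep -cFf "$tmpdir/strings.unix.txt" "$tmpdir/data.txt")

echo "matching lines before: $before, after: $after"
rm -rf "$tmpdir"
```

In the original question only the last line matched, which would be consistent with the last pattern being the only one not followed by a CR.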

Grep Tab, Carriage Return, & New Line

I'm trying to use Grep to find a string with Tabs, Carriage Returns, & New Lines. Any other method would be helpful also.
grep -R "\x0A\x0D\x09<p><b>Site Info</b></p>\x0A\x0D\x09<blockquote>\x0A\x0D\x09\x09<p>\x0A\x0D\x09</blockquote>\x0A\x0D</blockquote>\x0A\x0D<blockquote>\x0A\x0D\x09<p><b>More Site Info</b></p>" *

From this answer
If using GNU grep, you can use the Perl-style regexp:
$ grep -P '\t' *
Also from here
Use Ctrl+V, Ctrl+M to enter a literal Carriage Return character into your grep string. So:
grep -IUr --color "^M"
will work - if the ^M there is a literal CR that you input as I suggested.
If you want the list of files, you want to add the -l option as well.
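If typing a literal control character is awkward, one alternative is to let printf produce it and pass it to grep through a variable. This is a sketch with hypothetical file names; it relies on command substitution stripping only trailing newlines, so a lone CR or tab survives.

```shell
# Hypothetical sample files.
tmpdir=$(mktemp -d)
printf 'unix line\n' > "$tmpdir/unix.txt"
printf 'dos line\r\n' > "$tmpdir/dos.txt"
printf 'col1\tcol2\n' > "$tmpdir/tabbed.txt"

# Build the control characters without typing them.
cr=$(printf '\r')
tab=$(printf '\t')

# -l lists only the names of the files that contain a match.
cr_files=$(cd "$tmpdir" && grep -l "$cr" ./*.txt)
tab_files=$(cd "$tmpdir" && grep -l "$tab" ./*.txt)

echo "files with CR:  $cr_files"
echo "files with tab: $tab_files"
rm -rf "$tmpdir"
```

This only finds single-line patterns; matching across newlines still needs a multi-line tool such as the pcregrep approach quoted below.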

Quoting this answer:
Grep is not sufficient for this operation.
pcregrep, which is found in most of the modern Linux systems, can be used ...
Bash Example
$ pcregrep -M "try:\n fro.*\n.*except" file.py
returns
try:
from tifffile import imwrite
except (ModuleNotFoundError, ImportError):

grepping a not matching pattern with a pattern file and data from a pipe

I have an ignore.txt file:
cat ignore.txt
clint
when I do:
pip freeze | grep -v -f ignore.txt
I get:
GitPython==0.3.2.RC1
Markdown==2.2.1
async==0.6.1
clint==0.3.1
gitdb==0.5.4
legit==0.1.1
push-to-wordpress==0.1
python-wordpress-xmlrpc==2.2
smmap==0.8.2
but when I do:
pip freeze | grep -v clint
I do get the correct output:
GitPython==0.3.2.RC1
Markdown==2.2.1
async==0.6.1
gitdb==0.5.4
legit==0.1.1
push-to-wordpress==0.1
python-wordpress-xmlrpc==2.2
smmap==0.8.2
How can I achieve that with grep and command line tools?
Clarification edit: I use Windows with Cygwin, so I believe this is GNU grep 2.6.3 (from grep --version).

Your syntax looks correct and works on my system.
There may be a problem with your ignore.txt file.
In particular, check that:
there are no leading or trailing spaces, tabs and the like around the word you are trying to filter (as suggested by Kent above)
the file has Unix line endings
the file is terminated by a single newline
About the latter, the Single Unix Specification says:
Patterns in pattern_file shall be terminated by a <newline>.
Which means that a file with no terminator, or with a different terminator (e.g. CR LF), might behave unexpectedly (though that might be system-dependent).
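The three checks above can be automated roughly like this. check_pattern_file is a hypothetical helper written for this illustration, not a standard tool, and the sample files are made up.

```shell
# Hypothetical helper: report common problems in a grep pattern file.
check_pattern_file() {
    file=$1
    cr=$(printf '\r')
    if grep -q "$cr" "$file"; then
        echo "contains CR characters (Windows line endings?)"
    fi
    if grep -q -e '^[[:blank:]]' -e '[[:blank:]]$' "$file"; then
        echo "has leading or trailing blanks"
    fi
    # tail -c 1 prints the last byte; command substitution strips a newline,
    # so a non-empty result means the file does not end with a newline.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo "missing final newline"
    fi
}

tmpdir=$(mktemp -d)
printf 'clint\r\n' > "$tmpdir/crlf.txt"
printf 'clint' > "$tmpdir/no_newline.txt"
printf 'clint\n' > "$tmpdir/clean.txt"

crlf_report=$(check_pattern_file "$tmpdir/crlf.txt")
no_newline_report=$(check_pattern_file "$tmpdir/no_newline.txt")
clean_report=$(check_pattern_file "$tmpdir/clean.txt")

printf 'crlf: %s\nno newline: %s\nclean: %s\n' \
    "$crlf_report" "$no_newline_report" "$clean_report"
rm -rf "$tmpdir"
```

A clean file produces no report at all; running tr -d '\r' over the file is usually enough to repair the CR LF case.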

grep question using backslash

I have the following file:
asdasd
asd
asd
incompatible:svcnotallowed:svc\:/network/bmb/clerver\:default
incompatible:svcnotallowed:svc\:/network1/bmb/clerver\:default
incompatible:svcnotallowed:svc\:/network2/bmb/clerver\:default
asdasd
asd
asd
as
And now suppose I have the two variables v1="incompatible:svcnotallowed:" and v2="svc\:/network1/bmb/clerver\:default".
I would like to search the entire file using v1 and v2. I know the problem is caused by the file having a '\' in it; I just don't know how to get around it. I have tried single quotes, both when storing v1 and v2 and in the grep invocation, but in vain.
This is the series of commands I have tried :
grep "$v1$v2" file
grep '$v1$v2' file
I need this to work in ksh.
Please let me know the right way to use grep in this scenario.
Thanks.

grep -F "$v1$v2" file should do the trick: with the -F option, grep treats the pattern as a fixed string, so backslashes don't get interpreted as escapes or backreferences.
On older systems, fgrep "$v1$v2" file may be the more portable spelling. As tomkaith13 notes in his comment, the -F option to grep isn't universally supported: on Solaris, the default grep doesn't support -F, but the version in /usr/xpg4/bin does. Note that POSIX specifies grep -F, and GNU grep 3.8 and later print an obsolescence warning for fgrep, so prefer grep -F wherever it is available.
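A quick way to convince yourself, using a shortened copy of the file from the question (the temporary file is only for illustration). The -- guards against patterns that happen to start with a dash.

```shell
tmpfile=$(mktemp)
printf '%s\n' \
    'asdasd' \
    'incompatible:svcnotallowed:svc\:/network/bmb/clerver\:default' \
    'incompatible:svcnotallowed:svc\:/network1/bmb/clerver\:default' \
    'incompatible:svcnotallowed:svc\:/network2/bmb/clerver\:default' \
    'asd' > "$tmpfile"

# Single quotes keep the backslashes exactly as written.
v1='incompatible:svcnotallowed:'
v2='svc\:/network1/bmb/clerver\:default'

# -F: the pattern is a literal string, so '\' needs no escaping.
match=$(grep -F -- "$v1$v2" "$tmpfile")

printf '%s\n' "$match"
rm -f "$tmpfile"
```

Adding -x would additionally require the whole line to equal the string, which is closer to what the case statement in the ksh answer does.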

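Since you are using ksh, you can use the shell itself to read the file:
v1="incompatible:svcnotallowed:"
v2="svc\:/network1/bmb/clerver\:default"
# IFS= and -r keep leading blanks and backslashes in each line intact
while IFS= read -r line
do
    case "$line" in
        # the quoted pattern is matched literally against the whole line
        "$v1$v2" ) printf '%s\n' "$line";;
    esac
done < file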
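The same whole-line comparison can also be done in awk, as a sketch. The one trap is that awk -v processes backslash escapes in the assigned value, so the string is passed through the environment instead. The variable name needle and the sample file are hypothetical.

```shell
v1='incompatible:svcnotallowed:'
v2='svc\:/network1/bmb/clerver\:default'

tmpfile=$(mktemp)
printf '%s\n' \
    'asd' \
    'incompatible:svcnotallowed:svc\:/network/bmb/clerver\:default' \
    'incompatible:svcnotallowed:svc\:/network1/bmb/clerver\:default' > "$tmpfile"

# ENVIRON values are taken verbatim, so the backslashes survive.
# $0 == ENVIRON["needle"] is a plain string comparison of the whole line.
exact=$(needle="$v1$v2" awk '$0 == ENVIRON["needle"]' "$tmpfile")

printf '%s\n' "$exact"
rm -f "$tmpfile"
```

For a substring search rather than an exact line match, index($0, ENVIRON["needle"]) > 0 would be the analogous condition.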

Can grep show only words that match search pattern?

Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like this (emphasis mine):
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?

Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.

Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using older versions of grep (like 2.4.2) which do not include the -o option, then use the above. Else use the simpler to maintain version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh prints only the text that matches the regular expression (and not the filename), just as you would expect a regular expression match to behave in vim and similar tools. Which word or regular expression you search for is up to you, as long as you stay with POSIX syntax rather than Perl syntax (see below).
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
Support for \w varies from platform to platform, as it is an extension borrowed from Perl syntax. grep installations that only understand POSIX character classes need [[:alpha:]] instead (strictly speaking, \w also matches digits and the underscore, so its closest POSIX equivalent is [[:alnum:]_]). See the Wikipedia page on regular expressions for more.
Ultimately, the POSIX answer above is more reliable across platforms, since POSIX character classes are grep's original syntax.
As for grep versions without the -o option: the first grep outputs the relevant lines, tr turns the spaces into newlines, and the final grep keeps only the resulting lines that match.
(PS: I know most platforms by now have been patched for \w, but there are always those that lag behind.)
Credit for the -o workaround goes to @AdamRosenfield's answer.
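Both variants from this answer can be compared side by side on the sample sentences from the question (collected into one temporary file for illustration; -o is assumed to be available for the first command).

```shell
tmpfile=$(mktemp)
printf '%s\n' \
    'the cat sat on the mat' \
    'the quick brown fox' \
    'i hope this explains it thoroughly' > "$tmpfile"

# Variant 1: -o prints each match on its own line, -h suppresses filenames.
with_o=$(grep -oh '[[:alpha:]]*th[[:alpha:]]*' "$tmpfile")

# Variant 2: no -o needed. Split on spaces first, then keep matching tokens.
fallback=$(tr -s ' ' '\n' < "$tmpfile" | grep '[[:alpha:]]*th[[:alpha:]]*')

printf '%s\n' "$with_o"
rm -f "$tmpfile"
```

On this input the two give the same list. They can differ on real text, because the fallback prints whole space-separated tokens, including any punctuation attached to a word.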

It's simpler than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where,
egrep: grep with extended regular expressions (the same as grep -E).
-w : Matches only whole words instead of substrings.
-o : Displays only the matched text instead of the whole line.
-i : Ignores case.

You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th

Just awk; no need for a combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly
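The one-liner above only prints fields that start with "th". If you also want words that contain "th" anywhere, without attached punctuation, a match() loop is one way to get grep -o style output from plain awk. This is a sketch: extract_th_words is a hypothetical function name, and [A-Za-z] is used instead of [[:alpha:]] because some older awk builds lack POSIX character classes.

```shell
# Hypothetical helper: print every word containing "th", one per line.
extract_th_words() {
    awk '{
        line = $0
        # match() sets RSTART/RLENGTH to the leftmost-longest match
        while (match(line, /[A-Za-z]*th[A-Za-z]*/)) {
            print substr(line, RSTART, RLENGTH)
            # continue searching after the text just printed
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}

out=$(printf 'the cat sat on the mat\nbathe, then rest\n' | extract_th_words)
printf '%s\n' "$out"
```

Called with file names, the function reads those files; called without arguments, it reads standard input as shown.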

grep command for only matching and perl
grep -o -P 'th.*? ' filename

I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010

cat *-text-file | grep -Eio "th[a-z]+"

You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple
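The Wikipedia example can be reproduced as below (the temporary file is just for illustration). The hyphen is not a word character, so -w still accepts "apple-" and friends; if the goal is lines that consist of exactly "apple", -x is the option to reach for.

```shell
tmpfile=$(mktemp)
printf '%s\n' apple apples pineapple apple- apple-fruit fruit-apple > "$tmpfile"

# -w: the match must be bounded by non-word characters (or line ends).
word_matches=$(grep -w apple "$tmpfile")

# -x: the match must span the entire line.
line_matches=$(grep -x apple "$tmpfile")

printf 'with -w:\n%s\nwith -x:\n%s\n' "$word_matches" "$line_matches"
rm -f "$tmpfile"
```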

I had a similar problem: I wanted grep to print only the text matched by the regex.
In the end I used egrep with the -o option (the same regex with grep -e or -G didn't give me the same result as egrep).
So I think it could be something similar to this (I'm NOT a regex master):
egrep -o "the*|this{1}|thoroughly{1}" filename

To search for all the words that start with "icon-", the following command works perfectly. I am using ack here, which is similar to grep but with better options and nicer formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq

You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'

grep --color -o -E "Begin.{0,}?End" file.txt
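? - Match as few characters as possible until the End. Lazy quantifiers are not part of POSIX extended regular expressions, so this depends on the grep implementation; with GNU grep, use -P instead of -E.
Tested in the macOS terminal.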

$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.

ripgrep
Here is an example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It matches all words containing th.
