Unix md5sum vs Powershell Get-hash - md5sum

I am trying to generate md5 hash from Powershell. I installed Powershell Community Extension (Pscx) to get command : Get-Hash
However when I generate md5 hash using Get-Hash, it doesn't seem to match the hash generated using md5sum on an Ubuntu machine.
Powershell:
PS U:\> "hello world" | get-hash -Algorithm MD5
Path Algorithm HashString Hash
---- --------- ---------- ----
MD5 E42B054623B3799CB71F0883900F2764 {228, 43, 5, 70...}
Ubuntu:
root#LT-A03433:~# echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
I know that the one generated by Ubuntu is correct as a couple of online sites show the same result.
What am I going wrong with Powershell Get-Hash?

The difference is not obvious, but you are not hashing the same data. MD5 is a hashing algorithm, and it has no notion of text encoding – this is why you can create a hash of binary data just as easily as a hash of text. With that in mind, we can find out what bytes (or octets; strictly a stream of values of 8 bits each) MD5 is calculating the hash of. For this, we can use xxd, or any other hexeditor.
First, your Ubuntu example:
$ echo "hello world" | xxd
0000000: 6865 6c6c 6f20 776f 726c 640a hello world.
Note the 0a, Unix-style newline at the end, displayed as . in the right view. echo by default appends a newline to what it prints, you could use printf, but this would lead to a different hash.
$ echo "hello world" | md5
6f5902ac237024bdd0c176cb93063dc4
Now let's consider what PowerShell is doing. It is passing a string of its own directly to the get-hash cmdlet. As it turns out, the natural representation of string data in a lot of Windows is not the same as for Unix – Windows uses wide strings, where each character is represented (in memory) as two bytes. More specifically, we can open a text editor, paste in:
hello world
With no trailing newline, and save it as UTF-16, little-endian. If we examine the actual bytes this produces, we see the difference:
$ xxd < test.txt
0000000: 6800 6500 6c00 6c00 6f00 2000 7700 6f00 h.e.l.l.o. .w.o.
0000010: 7200 6c00 6400 r.l.d.
Each character now takes two bytes, with the second byte being 00 – this is normal (and is the reason why UTF-8 is used across the Internet instead of UTF-16, for example), since the Unicode codepoints for basic ASCII characters are the same as their ASCII representation. Now let's see the hash:
$ md5 < thefile.txt
e42b054623b3799cb71f0883900f2764
Which matches what PS is producing for you.
So, to answer your question – you're not doing anything wrong. You just need to encode your string the same way to get the same hash. Unfortunately I don't have access to PS, but this should be a step in the right direction: UTF8Encoding class.

This question is surely related to How to get an MD5 checksum in PowerShell, but it’s different and makes an important point.
Md5sums are computed from bytes. In fact, your Ubuntu result is, in a sense, wrong:
$ echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
$ echo -n "hello world" | md5sum
5eb63bbbe01eeed093cb22bb8f5acdc3 -
In the first case you sum the 12 bytes which make up the ASCII representation of your string, plus a final carriage return. In the second case, you don’t include the carriage return.
(As an aside, it is interesting to note that a here string includes a carriage return:)
$ md5sum <<<"hello world"
6f5902ac237024bdd0c176cb93063dc4
In Windows powershell, your string is represented in UTF-16LE, 2 bytes per character. To get the same result in Ubuntu and in Windows, you have to use a recoding program. A good choice for Ubuntu is iconv:
$ echo -n "hello world" | iconv -f UTF-8 -t UTF-16LE | md5sum
e42b054623b3799cb71f0883900f2764 -

md5sum is wrong-ish, in spite of other people agreeing with it. It is adding a platform-specific end-of-line characters to the input string, on unix an lf, on windows a cr-lf.
Verify this on a machine with powershell and bash and e.g. postgres installed for comparison:
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
echo 'A string with no CR or LF at the end' | md5sum.exe
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
Output first two lines:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
md5
----------------------------------
1b16276b75aba6ebb88512b957d2a198
PS> echo 'A string with no CR or LF at the end' | md5sum.exe
1b16276b75aba6ebb88512b957d2a198 *-
Output second two lines:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
md5
----------------------------------
68a1fcb16b4cc10bce98c5f48df427d4
PS> bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
68a1fcb16b4cc10bce98c5f48df427d4 *-

Related

how to avoid lookbehind assertion is not fixed length

I have a file that contains a version number that I need to output. This version number is apart of a string in this file, that looks something like this:
https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar"
I need to get the version number, which is 1.2.345.
This grep command works: grep -Po '(?<=/name-of-file_CXP123456-/)\d.\d.\d\d\d'. However, the CXP number changes and as such I thought I could do something like this: grep -Po '(?<=/name-of-file_*-/)\d.\d.\d\d\d' but that gives the following:
grep: lookbehind assertion is not fixed length
Is there anything I can add to the grep statement to avoid this?
Ultimately, this is part of a stage in Jenkins to get this version number. The sh command looks something like this:
VERSION = sh 'ssh -tt user#ip-address "cat dir/file*.content | grep -Po '(?<=/name-of-file_*-/)\d.\d.\d\d\d' 1>&2"'
You can use
grep -Po '/name-of-file_.*-\K\d+(?:\.\d+)+'
See the regex demo. Details:
/name-of-file_ - a literal text
.* - any zero or more chars other than line break chars as many as possible
- - a hyphen
\K - a match reset operator that omits all text matched so far from the memory buffer
\d+ - one or more digits
(?:\.\d+)+ - one or more sequences of a . and one or more digits.
You don't need lookbehind for this job. You also don't need PCREs, or grep at all.
#!/usr/bin/env bash
# ^^^^- bash, *not* sh
case $BASH_VERSION in '') echo "ERROR: bash required" >&2; exit 1;; esac
string="https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar"
regex='.*/name-of-file_CXP[[:digit:]]+-([[:digit:].]+)[.]jar'
if [[ $string =~ $regex ]]; then
echo "Version is ${BASH_REMATCH[1]}"
else
echo "No version found in $string"
fi
Maybe too long for a comment... It looks like the version number is the 2nd-to last field if you split on forward slash?
rev | cut -d/ -f 2 | rev
awk -F/ '{print $(NF-1)}'
perl -lanF/ -e 'print $F[-2]'
Or even something like: basename $(dirname $(cat filename))
For those that are really desperate there is another solution which requires you to pre-build your regex string.
It's not a solution I would recommend but if there is really no other way no one can stop you.
While even with this you won't have true dynamic look-behinds and it is still quite limited it is an option available to you.
The idea is to build the look-behind for each possible length you need it to be.
So for example only match if it's not preceded by a # (0 to a 100 characters look-behind).
reg='';
for ((i = 0 ; i <= 100 ; i++)); do reg+='(?<!#.{'"${i}"'})'; done;
reg+='someVariableName=.*?($|;|\\n)';
grep --perl-regexp "$reg" /usr/local/mgmsbox/msc/scripts/msc.cfg
This might not be the best example but it gets the idea across.
This solution has it's own pitfalls. For example you need to double escape \\ escape-sequences like \n and any character that should not be interpreted should be put in a single-quote string (or use printf).

Regex for line containing one or more spaces or dashes

I got .txt file with city names, each in separate line. Some of them are few words with one or multiple spaces or words connected with '-'. I need to create bash command which will echo those lines out. Currently I'm using cat piped with grep but I can't get both spaces and dash into one search and I had problems with checking for multiple spaces.
print lines with dash:
cat file.txt | grep ".*-.*"
print lines with spaces:
cat file.txt | grep ".*\s.*"
tho when I try to do:
cat file.txt | grep ".*\s+.*"
I get nothing.
Thanks for help
Something like that should work:
grep -E -- ' |\-' file.txt
Explanation:
-E: to interpret patterns as extended regular expressions
--: to signify the end of command options
' |\-': the line contains either a space or a dash
This does not directly address your question, but is too much to put in a comment.
You don't need the .* in your patterns. .* at the beginning or end of a pattern is useless, because it means "0 or more of any character" and so will always match.
These lines are all identical:
cat file.txt | grep ".*-.*"
cat file.txt | grep "-.*"
cat file.txt | grep "-"
Plus you don't need to cat and pipe:
grep "-" file.txt
When grep pattern matches, the default action is to print the whole line, so .* in all your patterns are redundant, you may delete them. Also, you don't have to use cat file | as you may specify the file to grep directly after pattern, i.e. grep 'pattern' file.txt.
Here are some more details:
grep ".*-.*" = grep -- "-" - returns any lines having a - char (-- singals the end of options, the next thing is the pattern)
grep ".*\s.*" = grep "\s" - matches and returns lines containing a whitespace char (only GNU grep)
grep ".*\s+.*" = grep "\s+" - returns line containing a whitespace followed with a literal + char (since you are using POSIX BRE regex here the unescaped + matches a literal plus symbol).
You want
grep "[[:space:]-]" file.txt
See the online demo:
#!/bin/bash
s='abc - def
ghi
jkl mno'
grep '[[:space:]-]' <<< "$s"
Output:
abc - def
jkl mno
The [[:space:]-] POSIX BRE and ERE (enabled with -E option) compliant pattern matches either any whitespace (with the [:space:] POSIX character class) or a hyphen.
Note that [\s-] won't work since \s inside a bracket expression is not treated as a regex escape sequence but as a mere \ or s.

Escape puzzle: why does grep ignore escaping of single quote?

(other quote/grep questions are about bash interpretation, this is not)
Apparently grep handles escaped single quotes differently than other escaped regex characters, but I don't understand why.
$ grep --version
grep (GNU grep) 2.25
$ cat data
a']
b']
c']
d]
e\']
$ cat patterns
a']
b\'\]
c'\]
d\']
e\']
$ grep -Ef patterns data
a']
c']
Because c is matched but b isn't, apparently grep does not interpret an escaped single quote \'as a single quote. But as what then?
d isn't matched, so it is not ignored.
e isn't matched, so it is not taken literally
TIA for solving this x-mas mystery! PS. Yes in this case I can use -F for literal matching, but my application requires regex.
\' in GNU tools means "end of string". See http://www.regular-expressions.info/gnu.html:
Additional GNU Extensions
....
The anchor \` (backtick) matches at the very start of the subject string,
while \' (single quote) matches at the very end.
Don't ask me why they introduced that as it seems to be exactly the same as $.

Linux Text Extraction GREP

I am running a Teamspeak 3 server on a Ubuntu server and I would like to fetch the clients currently connected using a script.
The script currently outputs this from the Teamspeak Server Query:
clid=1 cid=11 client_database_id=161 client_nickname=Music client_type=1|clid=3 cid=11 client_database_id=153 client_nickname=Music\sBot client_type=0|clid=5 cid=1 client_database_id=68 client_nickname=Unknown\sfrom\s127.0.0.1:52537 client_type=1|clid=12 cid=11 client_database_id=3 client_nickname=FriendlyMan client_type=0|clid=16 cid=11 client_database_id=161 client_nickname=Windows\s10\sUser client_type=0|clid=20 cid=11 client_database_id=225 client_nickname=3C2J0N47H4N client_type=0
How can I extract the nicknames from this mess?
More Specifically only the ones that contain "client_type=0".
Played around with GREP (grep -E -o 'client_nickname=\w+'), close to what I want.
client_nickname=Music
client_nickname=Music
client_nickname=Unknown
client_nickname=FriendlyMan
client_nickname=Windows
client_nickname=3C2J0N47H4N
Desired Output:
Music Bot,FriendlyMan,Windows 10 User,3C2J0N47H4N
Our input consists of a single line:
$ cat file
clid=1 cid=11 client_database_id=161 client_nickname=Music client_type=1|clid=3 cid=11 client_database_id=153 client_nickname=Music\sBot client_type=0|clid=5 cid=1 client_database_id=68 client_nickname=Unknown\sfrom\s127.0.0.1:52537 client_type=1|clid=12 cid=11 client_database_id=3 client_nickname=FriendlyMan client_type=0|clid=16 cid=11 client_database_id=161 client_nickname=Windows\s10\sUser client_type=0|clid=20 cid=11 client_database_id=225 client_nickname=3C2J0N47H4N client_type=0
Using grep + sed
Here is one approach that starts with grep and then uses sed to cleanup to the final format:
$ grep -oP '(?<=client_nickname=)[^=]+(?=client_type=0)' file | sed -nE 's/\\s/ /g; H;1h; ${x; s/ *\n/,/g;p}'
Music Bot,FriendlyMan,Windows 10 User,3C2J0N47H4N
Using awk
Here is another approach that just uses awk:
$ awk -F'[= ]' '/client_type=0/{gsub(/\\s/, " ", $8); printf (f?",":"")$8; f=1} END{print ""}' RS='|' file
Music Bot,FriendlyMan,Windows 10 User,3C2J0N47H4N
The awk code uses | as the record separator and awk reads in one record at a time. Each record is divided into fields with the field separator being either a space or an equal sign. If the record contains the text client_type=0, then we replace all occurrences of \s in field 8 with space and then print the resulting field 8.
Using bash
#!/bin/bash
sep=
( cat file; echo "|"; ) | while read -r -d\| clid cid db name type misc
do
[ "$type" = "client_type=0" ] || continue
name=${name//\\s/ }
printf "%s%s" "$sep" "${name#client_nickname=}"
sep=,
done
echo ""
This produces the output:
Music Bot,FriendlyMan,Windows 10 User,3C2J0N47H4N

Grep Tab, Carriage Return, & New Line

I'm trying to use Grep to find a string with Tabs, Carriage Returns, & New Lines. Any other method would be helpful also.
grep -R "\x0A\x0D\x09<p><b>Site Info</b></p>\x0A\x0D\x09<blockquote>\x0A\x0D\x09\x09<p>\x0A\x0D\x09</blockquote>\x0A\x0D</blockquote>\x0A\x0D<blockquote>\x0A\x0D\x09<p><b>More Site Info</b></p>" *
From this answer
If using GNU grep, you can use the Perl-style regexp:
$ grep -P '\t' *
Also from here
Use Ctrl+V, Ctrl+M to enter a literal Carriage Return character into your grep string. So:
grep -IUr --color "^M"
will work - if the ^M there is a literal CR that you input as I suggested.
If you want the list of files, you want to add the -l option as well.
Quoting this answer:
Grep is not sufficient for this operation.
pcregrep, which is
found in most of the modern Linux systems can be used ...
Bash Example
$ pcregrep -M "try:\n fro.*\n.*except" file.py
returns
try:
from tifffile import imwrite
except (ModuleNotFoundError, ImportError):

Resources