I am trying to generate an MD5 hash from PowerShell. I installed the PowerShell Community Extensions (Pscx) to get the Get-Hash command.
However when I generate md5 hash using Get-Hash, it doesn't seem to match the hash generated using md5sum on an Ubuntu machine.
Powershell:
PS U:\> "hello world" | get-hash -Algorithm MD5
Path Algorithm HashString Hash
---- --------- ---------- ----
MD5 E42B054623B3799CB71F0883900F2764 {228, 43, 5, 70...}
Ubuntu:
root#LT-A03433:~# echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
I know that the one generated by Ubuntu is correct as a couple of online sites show the same result.
What am I doing wrong with PowerShell Get-Hash?
The difference is not obvious, but you are not hashing the same data. MD5 is a hashing algorithm that operates on bytes and has no notion of text encoding, which is why you can hash binary data just as easily as text. With that in mind, we can find out which bytes (or octets; strictly, a stream of 8-bit values) MD5 is calculating the hash of. For this, we can use xxd, or any other hex editor.
First, your Ubuntu example:
$ echo "hello world" | xxd
0000000: 6865 6c6c 6f20 776f 726c 640a hello world.
Note the 0a, a Unix-style newline, at the end, displayed as . in the right-hand view. By default, echo appends a newline to what it prints. You could use printf instead, but that would lead to a different hash.
$ echo "hello world" | md5
6f5902ac237024bdd0c176cb93063dc4
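To see the effect of that trailing newline in isolation, here is a minimal sketch that hashes the same string with and without it. It assumes GNU coreutils' md5sum is available (on macOS, md5 plays the same role); the variable names are only for illustration.

```shell
#!/bin/sh
# printf '%s' writes the string exactly, with no trailing newline.
without_newline=$(printf '%s' "hello world" | md5sum | cut -d' ' -f1)

# printf '%s\n' appends a single LF (0a), like echo does by default.
with_newline=$(printf '%s\n' "hello world" | md5sum | cut -d' ' -f1)

echo "without newline: $without_newline"   # 5eb63bbbe01eeed093cb22bb8f5acdc3
echo "with newline:    $with_newline"      # 6f5902ac237024bdd0c176cb93063dc4
```

The second value is the one md5sum reported on Ubuntu, which confirms that the newline is part of what was hashed there.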
Now let's consider what PowerShell is doing. It is passing a string of its own directly to the get-hash cmdlet. As it turns out, the natural representation of string data on Windows is not the same as on Unix: Windows uses wide strings, where each character is represented (in memory) as two bytes. To reproduce this, we can open a text editor and paste in:
hello world
Save it with no trailing newline, as UTF-16, little-endian. If we examine the actual bytes this produces, we see the difference:
$ xxd < test.txt
0000000: 6800 6500 6c00 6c00 6f00 2000 7700 6f00 h.e.l.l.o. .w.o.
0000010: 7200 6c00 6400 r.l.d.
Each character now takes two bytes, with the second byte being 00 – this is normal (and is the reason why UTF-8 is used across the Internet instead of UTF-16, for example), since the Unicode codepoints for basic ASCII characters are the same as their ASCII representation. Now let's see the hash:
$ md5 < test.txt
e42b054623b3799cb71f0883900f2764
Which matches what PS is producing for you.
So, to answer your question – you're not doing anything wrong. You just need to encode your string the same way to get the same hash. Unfortunately I don't have access to PS, but this should be a step in the right direction: UTF8Encoding class.
This question is surely related to How to get an MD5 checksum in PowerShell, but it’s different and makes an important point.
Md5sums are computed from bytes. In fact, your Ubuntu result is, in a sense, wrong:
$ echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
$ echo -n "hello world" | md5sum
5eb63bbbe01eeed093cb22bb8f5acdc3 -
In the first case you sum the 11 bytes which make up the ASCII representation of your string, plus a final newline (line feed, 0x0a), 12 bytes in total. In the second case, you don’t include the newline.
(As an aside, it is interesting to note that a here string also appends a newline:)
$ md5sum <<<"hello world"
6f5902ac237024bdd0c176cb93063dc4 -
In Windows PowerShell, your string is represented in UTF-16LE, 2 bytes per character. To get the same result on Ubuntu and on Windows, you have to re-encode the text. A good choice for Ubuntu is iconv:
$ echo -n "hello world" | iconv -f UTF-8 -t UTF-16LE | md5sum
e42b054623b3799cb71f0883900f2764 -
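The byte counts behind these hashes can be checked with wc -c. This is only a sketch, assuming an iconv that knows the UTF-16LE encoding name (glibc's does); the variable names are made up for the example.

```shell
#!/bin/sh
# Count the bytes that reach md5sum in each case.
# tr -d ' ' strips the padding that some wc implementations add.
bytes_plain=$(printf '%s' "hello world" | wc -c | tr -d ' ')
bytes_newline=$(printf '%s\n' "hello world" | wc -c | tr -d ' ')
bytes_utf16=$(printf '%s' "hello world" |
  iconv -f UTF-8 -t UTF-16LE | wc -c | tr -d ' ')

echo "ASCII, no newline:    $bytes_plain"    # 11
echo "ASCII plus newline:   $bytes_newline"  # 12
echo "UTF-16LE, no newline: $bytes_utf16"    # 22
```

Three different byte sequences give three different hashes, even though all of them display as the same text.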
The md5sum result is wrong-ish, in spite of other people agreeing with it. The command feeding md5sum adds a platform-specific end-of-line sequence to the input string: LF on Unix, CR-LF on Windows.
Verify this on a machine with powershell and bash and e.g. postgres installed for comparison:
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
echo 'A string with no CR or LF at the end' | md5sum.exe
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
Output first two lines:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
md5
----------------------------------
1b16276b75aba6ebb88512b957d2a198
PS> echo 'A string with no CR or LF at the end' | md5sum.exe
1b16276b75aba6ebb88512b957d2a198 *-
Output second two lines:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
md5
----------------------------------
68a1fcb16b4cc10bce98c5f48df427d4
PS> bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
68a1fcb16b4cc10bce98c5f48df427d4 *-
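The same comparison can be sketched without PowerShell or postgres by writing the line ending explicitly with printf. This assumes md5sum from GNU coreutils; the variable names are illustrative. If the outputs above are right, the CR-LF and LF variants should reproduce the two hashes shown there.

```shell
#!/bin/sh
s='A string with no CR or LF at the end'

# Same text, three different endings: CR-LF, LF, and nothing at all.
crlf_hash=$(printf '%s\r\n' "$s" | md5sum | cut -d' ' -f1)
lf_hash=$(printf '%s\n' "$s" | md5sum | cut -d' ' -f1)
bare_hash=$(printf '%s' "$s" | md5sum | cut -d' ' -f1)

echo "CR-LF:  $crlf_hash"
echo "LF:     $lf_hash"
echo "none:   $bare_hash"
```

Only the last value is the hash of the string itself; the other two are hashes of the string plus its line ending.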
I am trying to extract the version from a colon-delimited list. The value I want is for foo, but there is another entry in the list called foo-bar, which causes both values to be returned. This is what I am doing:
LIST="foo:1.0.0
foo-bar:1.0.1"
VERSION=$(echo "${LIST}" | grep "\bfoo\b" | cut -s -d':' -f2)
echo -e "VERSION: ${VERSION}"
Output:
VERSION: 1.0.0
1.0.1
NOTE: Sometimes LIST will look like the following, which should result in version being empty (this is expected).
LIST="foo
foo-bar:1.0.1"
You may use a PCRE regex, enabled with the -P option, and a (?!-) negative lookahead that fails the match if there is a - right after the whole word foo:
grep -P "\bfoo\b(?!-)"
See online demo
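If grep -P is not available, a different technique (not the lookahead above) is to split on the colon and compare the key exactly. The following is a minimal sketch with awk; LIST_NO_VERSION and EMPTY_VERSION are names invented for the example.

```shell
#!/bin/sh
LIST="foo:1.0.0
foo-bar:1.0.1"

# Split each line on ':' and compare the first field to "foo" exactly,
# so foo-bar can never match. NF > 1 skips lines that have no colon.
VERSION=$(printf '%s\n' "$LIST" |
  awk -F: '$1 == "foo" && NF > 1 { print $2 }')
echo "VERSION: ${VERSION}"        # VERSION: 1.0.0

# The second case from the question: foo has no version at all.
LIST_NO_VERSION="foo
foo-bar:1.0.1"
EMPTY_VERSION=$(printf '%s\n' "$LIST_NO_VERSION" |
  awk -F: '$1 == "foo" && NF > 1 { print $2 }')
echo "VERSION: ${EMPTY_VERSION}"  # VERSION:
```

An exact field comparison also rules out keys such as bar-foo, which a word-boundary pattern would still accept.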
This regex should extract any number and optional dots at the end of each line. If the line ends with a colon, then it won't match.
grep -oE '(([[:digit:]]+[.]*)+)$'
I have another simple question...
I have a string containing a substring in the format xx:xx:xx, where the x's are digits. I want to extract that substring including the ":" symbols, so my output would be "xx:xx:xx".
I think it can be done with grep -Eo [0-9], but I'm not sure of the syntax... Any help?
echo "substring in the format 12:43:37 where the x's are numbers" |
grep -o '[0-9:]*'
Output:
12:43:37
If you have other numbers in the input string you can be more specific:
grep -o '[0-9]*:[0-9]*:[0-9]*'
even:
grep -o '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]'
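As a sketch of why the stricter pattern matters, the example below uses an invented input line that contains other numbers as well. It relies on grep -o and -E, which GNU and BSD grep support; the {2} repetition count is just a shorter way to write [0-9][0-9].

```shell
#!/bin/sh
# Hypothetical input with extra numbers around the time.
input="job 4242 started at 12:43:37 on port 8080"

# -E enables {2} repetition counts; -o prints only the matched text.
timestamp=$(printf '%s\n' "$input" |
  grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}')
echo "$timestamp"   # 12:43:37
```

With the looser '[0-9:]*' pattern, the same input would also yield 4242 and 8080, since those are runs of digits too.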
I need to only grep the md5 hash
this is the hash
MD5 (mt.pm) = adcddd9492c707642d2bcffbfc67b7a6
it needs to look like this
adcddd9492c707642d2bcffbfc67b7a6
or to do the reverse
crapb0c63a3cb776502fe03706b2fd540439 /home/mta.pm
and only get the hash
no clue how to
Any help?
To grep, do the following (this will not work in all grep implementations):
grep -o '[a-z0-9]*$'
or you can use sed:
sed 's/.*= *\([a-z0-9]*\)$/\1/'
Try this (GNU grep):
grep -oP '.* \K.*$'
Or better :
grep -o '[[:xdigit:]]\{32\}$'
Or with bash :
read -a arr <<< 'MD5 (mt.pm) = adcddd9492c707642d2bcffbfc67b7a6'
echo "${arr[-1]}"
With \{32\} the match is much stricter: an MD5 hash is always 32 hexadecimal characters, see http://en.wikipedia.org/wiki/MD5
[[:xdigit:]] is a POSIX character class that matches only hexadecimal digits.
FINALLY
If you want to match a run of 32 hex characters anywhere in a string:
grep -o '[[:xdigit:]]\{32\}'
will do the trick.
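As a quick check, the unanchored pattern can be run against both input formats from the question. This is a sketch using the -E spelling of the same expression, so the braces need no backslashes; the variable name is illustrative.

```shell
#!/bin/sh
# Both sample lines from the question: hash at the end, hash at the start.
hashes=$(printf '%s\n' \
  'MD5 (mt.pm) = adcddd9492c707642d2bcffbfc67b7a6' \
  'crapb0c63a3cb776502fe03706b2fd540439 /home/mta.pm' |
  grep -oE '[[:xdigit:]]{32}')

echo "$hashes"
# adcddd9492c707642d2bcffbfc67b7a6
# b0c63a3cb776502fe03706b2fd540439
```

In the second line the leading "crap" is skipped because r and p are not hexadecimal digits, so the first run of 32 hex characters starts at b0c6.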
How can I correctly read files in encodings other than UTF8 in Awk?
I have a file in Hebrew/Windows-1255 encoding.
A simple {print $0} awk prints stuff like �.
How can I make it read correctly?
awk itself doesn't have any support for handling different encodings. It will honor the locale specified in the environment, but your best bet is to transcode the input to the proper encoding before handing it off to awk.
-f is the encoding you want to convert from, -t is the target encoding, and -c skips over any invalid characters, which would otherwise terminate iconv's operation prematurely. Of course --help will give more details.
iconv -c -f cp1255 -t utf8 somefile | awk ...
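To check that the transcoding step does what is expected, here is a small sketch that feeds four Windows-1255 bytes through the same pipeline and dumps the result as hex. It assumes an iconv that recognizes the CP1255 and UTF-8 names (glibc's does); the sample word and the variable name are made up for the example.

```shell
#!/bin/sh
# The Hebrew word "shalom" as Windows-1255 bytes 0xF9 0xEC 0xE5 0xED,
# written with octal escapes so that any POSIX printf accepts them.
utf8_hex=$(printf '\371\354\345\355\n' |
  iconv -c -f CP1255 -t UTF-8 |
  awk '{ print $0 }' |
  od -An -tx1 | tr -d ' \n')

# Each Hebrew letter becomes two UTF-8 bytes (d7 xx); 0a is the newline.
echo "$utf8_hex"   # d7a9d79cd795d79d0a
```

Once the data is UTF-8 and the terminal uses a UTF-8 locale, the plain {print $0} from the question should display the letters rather than replacement characters.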