Problem with iconv (character encoding)

If you are on Mac OS X 10.6 and you are familiar with character encoding AND the terminal, please do this:
Open a terminal and type the following commands:
echo sørensen > test.txt
iconv -f UTF8 -t ISO-8859-1 test.txt
You will see the output: "sørensen". Can somebody explain what is going on?

UTF-8 is a multibyte encoding. The character ø is encoded as two bytes: C3 B8. In your terminal's encoding (ISO-8859-1), these bytes are decoded as ø. The conversion then turns those bytes into ø's single-byte ISO-8859-1 code. Any questions?
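You can watch this happen at the byte level (a quick sketch, assuming a UTF-8 terminal; od -txC from the answer below works just as well as xxd):
$ printf 'ø' | xxd
0000000: c3b8                                     ..
$ printf 'ø' | iconv -f UTF-8 -t ISO-8859-1 | xxd
0000000: f8                                       .
The first command shows ø's two-byte UTF-8 encoding; the second shows its single-byte ISO-8859-1 code.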

I tried the "iconv" command from one file to another, looking at the data with "od -txC" with the following results:
Input: c3 83 c2 b8 [ 2 utf8-chars Capital A tilde; Cedilla ]
Command: iconv -f utf-8 -t ISO-8859-1 < in.txt > out.txt
Output: c3 b8 [ 2 ISO-8859-1 characters, Capital A tilde; Cedilla ]
So, the iconv conversion is correct.
But, if you instead treat the converted data as utf-8 (which Terminal is apparently doing), C3-B8 is "ø" (o-slash).
If you change your character encoding in Terminal (Preferences // Advanced // Character Encoding) to "Western (ISO Latin 1)" you'll see C3-B8 as "ø"
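The same reinterpretation can be done without touching Terminal's settings by stripping one encoding layer with iconv (a sketch, again assuming a UTF-8 terminal):
$ printf 'ø' | iconv -f UTF-8 -t ISO-8859-1 | xxd
0000000: c3b8                                     ..
The doubly encoded ø collapses to the bytes C3 B8, which a UTF-8 terminal then displays as ø.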

Unix md5sum vs PowerShell Get-Hash

I am trying to generate an MD5 hash from PowerShell. I installed the PowerShell Community Extensions (PSCX) to get the Get-Hash command.
However, when I generate an MD5 hash using Get-Hash, it doesn't seem to match the hash generated using md5sum on an Ubuntu machine.
PowerShell:
PS U:\> "hello world" | get-hash -Algorithm MD5
Path Algorithm HashString                       Hash
---- --------- ----------                       ----
     MD5       E42B054623B3799CB71F0883900F2764 {228, 43, 5, 70...}
Ubuntu:
root@LT-A03433:~# echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
I know that the one generated by Ubuntu is correct, as a couple of online sites show the same result.
What am I doing wrong with PowerShell Get-Hash?
The difference is not obvious, but you are not hashing the same data. MD5 is a hashing algorithm, and it has no notion of text encoding – this is why you can create a hash of binary data just as easily as a hash of text. With that in mind, we can find out what bytes (or octets; strictly, a stream of 8-bit values) MD5 is calculating the hash of. For this, we can use xxd or any other hex editor.
First, your Ubuntu example:
$ echo "hello world" | xxd
0000000: 6865 6c6c 6f20 776f 726c 640a hello world.
Note the 0a, a Unix-style newline at the end, displayed as . in the right-hand view. echo appends a newline to what it prints by default; you could use printf to avoid it, but that would lead to a different hash.
$ echo "hello world" | md5
6f5902ac237024bdd0c176cb93063dc4
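For comparison, hashing the same string without the trailing newline (printf does not append one) gives a different digest; this value reappears in the answers below:
$ printf 'hello world' | md5
5eb63bbbe01eeed093cb22bb8f5acdc3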
Now let's consider what PowerShell is doing. It is passing a string of its own directly to the get-hash cmdlet. As it turns out, the natural representation of string data in much of Windows is not the same as on Unix – Windows uses wide strings, where each character is represented (in memory) as two bytes. More specifically, we can open a text editor and paste in:
hello world
With no trailing newline, and save it as UTF-16, little-endian. If we examine the actual bytes this produces, we see the difference:
$ xxd < test.txt
0000000: 6800 6500 6c00 6c00 6f00 2000 7700 6f00 h.e.l.l.o. .w.o.
0000010: 7200 6c00 6400 r.l.d.
Each character now takes two bytes, with the second byte being 00 – this is normal (and is the reason why UTF-8 is used across the Internet instead of UTF-16, for example), since the Unicode codepoints for basic ASCII characters are the same as their ASCII representation. Now let's see the hash:
$ md5 < test.txt
e42b054623b3799cb71f0883900f2764
Which matches what PS is producing for you.
So, to answer your question – you're not doing anything wrong. You just need to encode your string the same way to get the same hash. Unfortunately I don't have access to PS, but this should be a step in the right direction: UTF8Encoding class.
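For illustration, a minimal sketch using the .NET classes directly rather than PSCX, hashing the UTF-8 bytes of the string with no trailing newline:
PS> $md5 = [System.Security.Cryptography.MD5]::Create()
PS> $bytes = [System.Text.Encoding]::UTF8.GetBytes("hello world")
PS> ($md5.ComputeHash($bytes) | ForEach-Object { $_.ToString("x2") }) -join ""
5eb63bbbe01eeed093cb22bb8f5acdc3
This matches printf 'hello world' | md5sum on the Unix side.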
This question is surely related to How to get an MD5 checksum in PowerShell, but it’s different and makes an important point.
Md5sums are computed from bytes. In fact, your Ubuntu result is, in a sense, wrong:
$ echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
$ echo -n "hello world" | md5sum
5eb63bbbe01eeed093cb22bb8f5acdc3 -
In the first case you hash 12 bytes: the 11 bytes of the ASCII representation of your string, plus a final newline. In the second case, you don't include the newline.
(As an aside, it is interesting to note that a here-string includes a trailing newline:)
$ md5sum <<<"hello world"
6f5902ac237024bdd0c176cb93063dc4
In Windows PowerShell, your string is represented in UTF-16LE, two bytes per character. To get the same result in Ubuntu and in Windows, you have to use a recoding program. A good choice for Ubuntu is iconv:
$ echo -n "hello world" | iconv -f UTF-8 -t UTF-16LE | md5sum
e42b054623b3799cb71f0883900f2764 -
md5sum looks wrong-ish here, in spite of other people agreeing with it, but it is really the pipeline feeding it that appends platform-specific end-of-line characters to the input string: on Unix an LF, on Windows a CR-LF.
Verify this on a machine with PowerShell, Bash, and e.g. Postgres installed for comparison:
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
echo 'A string with no CR or LF at the end' | md5sum.exe
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
Output of the first two commands:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
md5
----------------------------------
1b16276b75aba6ebb88512b957d2a198
PS> echo 'A string with no CR or LF at the end' | md5sum.exe
1b16276b75aba6ebb88512b957d2a198 *-
Output of the second two commands:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
md5
----------------------------------
68a1fcb16b4cc10bce98c5f48df427d4
PS> bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
68a1fcb16b4cc10bce98c5f48df427d4 *-

Why doesn't grep give the matching line?

I've just noticed that
grep -rni 'a2}' *
does not print the matching line for all documents that contain the string a2}. Why is this the case?
I've tried to create a minimal example, but when I create a new file and paste the content in, it fails to reproduce the problem. So I've uploaded the file to a Git repository. Perhaps it's an encoding problem.
The content of the file is:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{KV-Diagramme}
\label{chap:a2}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\PsTexAbbildungOhneCaption{figures/a2-1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "skript"
%%% End:
The result of grep -rni 'a2}' * is
moose@pc08 ~/Downloads/algorithms/grep $ grep -rni "a2}" *
%%% End:master: "skript"%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
but I expected
moose@pc08 ~/Downloads/algorithms/grep $ grep -rni "a2}" *
\label{chap:a2}
Why do I get this result?
The file has CR line terminators, so it looks like a single-line file; when grep prints that single matching "line", the embedded CRs make the terminal overprint it, which is why the output above looks scrambled:
#> file anhang-2.tex
anhang-2.tex: LaTeX document, ASCII text, with CR line terminators
Convert it to Unix format:
#> mac2unix anhang-2.tex
mac2unix: converting file anhang-2.tex to Unix format ...
#> grep -rni 'a2}' anhang-2.tex
3:\label{chap:a2}
It's because your file is using Mac OS 9 line endings (CR only). You will need to translate them to Unix line endings first. How you do so depends on your scenario, but you can convert a single file with this:
tr '\r' '\n' < anhang-2.tex > anhang-2.txt
Then you will be able to grep that new file.
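If you'd rather not create a new file, the same translation works in a pipeline (a small sketch using the question's filename):
tr '\r' '\n' < anhang-2.tex | grep -ni 'a2}'
3:\label{chap:a2}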

Using grep to find filenames with accented characters or ñ in the search term

I'm trying unsuccessfully to use (e)grep to list files in a directory when my search term includes Spanish characters such as accented vowels or "ñ" (e.g. bebé, caña). In the case of accented vowels, if I use the unaccented vowel, the search works, as in the following:
$ ls | egrep 'bebe'
8 bebé
IMQ_bebé1.wav
IMQ_bebé2.wav
IMQ_bebé3.wav
IMQ_bebé4.wav
IMQ_bebé5.wav
IMQ_bebé6.wav
However, if I include the accented vowel in the search term (e.g. $ ls | egrep 'bebé'), I get nothing.
The reason using unaccented vowels in the search term is not an option is that this is part of a larger shell script that gets its search terms from a text file. Also, search terms with "ñ" don't work, nor do they work with just "n". Plus, I'm sure there must be a way to do this!
I'm working on Mac OS X 10.6.8. My locale is as follows:
$ locale
LANG="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_CTYPE="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_ALL=
You can use iconv to unaccent your list and match ASCII patterns:
ls | iconv -f utf8 -t ascii//TRANSLIT | egrep 'bebe'
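Since the terms come from a file, you can transliterate both sides so that accented search terms match too (a sketch; terms.txt is a hypothetical name for the script's pattern file):
$ ls | iconv -f utf8 -t ascii//TRANSLIT > /tmp/files.ascii
$ iconv -f utf8 -t ascii//TRANSLIT terms.txt | grep -F -f - /tmp/files.ascii
The likely culprit, by the way, is Unicode normalization: Mac OS X stores filenames with decomposed accents (e plus a combining accent) while a typed é is precomposed, so a byte-wise grep never matches; transliterating both sides to ASCII sidesteps this.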

character 0xc286 of encoding "UTF-8" has no equivalent in "WIN1252": Postgres restore crashes on conversion with iconv

I'm working on a piece of software that uses Delphi and Postgres 9.0. The original developer had chosen 'SQL_ASCII' as the database encoding, so we changed the encoding to UTF-8 for our database.
We then started getting this error on clicking one of the check boxes (the form is populated from the database). The query where the error comes up is
select * from diary where survey in ('2005407');
but this error only comes for a few of the check boxes, not ALL.
The change is straightforward, but we have a large amount of historical data that we will have to restore into the newly created UTF-8 database, so I followed the steps I found on the net and Stack Overflow:
1. Dump the database as SQL_Ascii_backup.backup
2. Use iconv to convert SQL_ASCII to UTF-8:
"C:\Program Files\GnuWin32\bin\iconv.exe" -f ISO8859-1 -t UTF-8 C:\SQL_Ascii_backup.backup>UTF_Backup.backup
3. Create a new database with encoding UTF-8 and restore the backup UTF_Backup.backup
But when I try to restore it, I get this error.
Then I tried dumping the original SQL_ASCII database as a plain SQL_Ascii_.sql file, again used iconv to change the encoding, and then restored it:
"C:\Program Files\PostgreSQL\9.0\bin\psql.exe" -h localhost -p 5434 -d myDB -U myDB_admin -f C:\converted_utf8.sql
This restores properly, but I'm still getting the error:
character 0xc286 of encoding "UTF-8" has no equivalent in "WIN1252"
C2 86 is the UTF-8 encoding of the character U+0086, an obscure C1 control character. This character exists in ISO-8859-1, but not in Windows' default code page 1252, which has printable characters in the space where ISO-8859-1 has the C1 controls.
Your iconv command to convert to UTF-8 has -f ISO8859-1, but you probably meant -f windows-1252 instead. This maps the byte 86 to the † character.
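You can see the difference directly (a quick sketch; GNU iconv accepts both encoding names):
$ printf '\x86' | iconv -f ISO8859-1 -t UTF-8 | xxd
0000000: c286                                     ..
$ printf '\x86' | iconv -f windows-1252 -t UTF-8
†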
I got rid of the error
character 0xc286 of encoding "UTF-8" has no equivalent in "WIN1252"
by following dan04's answer. To keep iconv from failing on the dumped file, I did the following:
1. Dump the database (do a plain dump, so you may be able to find the point of failure)
2. Use iconv to convert SQL_ASCII to UTF-8:
"C:\Program Files\GnuWin32\bin\iconv.exe" -f windows-1252 -t UTF-8 C:\MqPlainDump.sql>convertedDump.sql
3. Replace the '[]' character (which in my case was what was causing the trouble; it's a square character)
4. Restore the database
And the application is good to go (in my case).

How to read files with different encodings using Awk?

How can I correctly read files in encodings other than UTF-8 in Awk?
I have a file in Hebrew/Windows-1255 encoding.
A simple { print $0 } awk program prints stuff like �.
How can I make it read the file correctly?
awk itself doesn't have any support for handling different encodings. It will honor the locale specified in the environment, but your best bet is to transcode the input to the proper encoding before handing it off to awk.
-f is the format you want to convert from, -t is the target format, and -c skips over any invalid characters that would otherwise prematurely terminate iconv's operation. Of course, --help will give more details.
iconv -c -f cp1255 -t utf8 somefile | awk ...
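If the results need to end up back in the original encoding, you can transcode on both ends of the pipeline (a sketch; the awk program here is just a placeholder):
iconv -c -f cp1255 -t utf8 somefile | awk '{ print $0 }' | iconv -f utf8 -t cp1255 > outfile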
