German Character Encoding

I have a big CSV file containing contacts, and all the non-Latin characters are displayed like this:
Zürich (Zürich)
Grône (Grône)
Chesières (Chesières)
Genève (Genève)
I tried to replace them with the right characters, like:
str_replace('ü', 'ü', $string);
They don't change. I also tried inserting them into a MySQL database and then replacing them there, but they stay the same.
What should I do?

Please check the encoding of the file.
Once you know it, you can read it in the proper way.
After that, you can convert the encoding, e.g., to UTF-8.
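For example, a minimal C# sketch of that conversion, assuming the file turns out to be Windows-1252 (the file names are placeholders):
using System.IO;
using System.Text;
// On modern .NET, code page 1252 needs the System.Text.Encoding.CodePages provider:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var source = Encoding.GetEncoding(1252);                       // assumed source encoding
var text = File.ReadAllText("contacts.csv", source);           // read with the real encoding
File.WriteAllText("contacts-utf8.csv", text, Encoding.UTF8);   // write back out as UTF-8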

Picking this apart, let's look at the crux of the problem.
ü in UTF-8: 195, 188
ü in Windows-1252: 252
ü in UTF-8 misinterpreted as Windows-1252: Ã¼ (195, 188)
The key thing here is that with UTF-8 (multibyte) to Windows-1252 (single-byte) encoding errors, a single UTF-8 character often ends up as two nonsense characters. Seeing four here suggests a double mangling:
ü in UTF-8 misinterpreted as Windows-1252: Ã¼
Ã¼ in UTF-8 misinterpreted as Windows-1252: ÃƒÂ¼
So there it is. Somehow this was run through two layers of mangling. To undo it, take the mangled string, write it out as Windows-1252 bytes, read those bytes back as UTF-8, and then do the same thing once more.

Working from what @tadman described, and from the 132 encodings known to my system, there are several combinations that could have resulted in this mojibake:
encoding1 | encoding2 | encoding3 | encoding4
65001 utf-8 | 1252 iso-8859-1 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 1252 iso-8859-1 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 1254 iso-8859-9 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 1254 iso-8859-9 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 28591 iso-8859-1 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 28591 iso-8859-1 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 28599 iso-8859-9 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 28599 iso-8859-9 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 65000 utf-7 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 65000 utf-7 | 65001 utf-8 | 1254 iso-8859-9
So, once you are confident of the exact encodings involved and you check that they are reversible, you can reverse the mojibake like this:
var latin1 = Encoding.GetEncoding(1252, EncoderExceptionFallback.ExceptionFallback, DecoderExceptionFallback.ExceptionFallback);
var utf8 = Encoding.GetEncoding(65001, EncoderExceptionFallback.ExceptionFallback, DecoderExceptionFallback.ExceptionFallback);
utf8.GetString(latin1.GetBytes(utf8.GetString(latin1.GetBytes("Zürich")))).Dump();
The C# (LINQPad) query used to find those combinations:
Func<Encoding, String> format = (encoding) => $"{encoding.CodePage} {encoding.BodyName}";
var encodings = Encoding.GetEncodings().Select(e => e.GetEncoding()).ToList();
(
from encoding1 in encodings
from encoding2 in encodings
from encoding3 in encodings
from encoding4 in encodings
where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("ü")))) == "ü"
where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("ô")))) == "ô"
where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("è")))) == "è"
select new { encoding1 = format(encoding1), encoding2 = format(encoding2), encoding3 = format(encoding3), encoding4 = format(encoding4) }
).Dump();

Related

Are unused bytes set to zero in response to SDO reads?

I send an SDO request to read a 1-byte value like this:
|11 bit COB-ID | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
| 0x601 | 0x40 | index (lo) | index (hi) | subindex | 0x00 | 0x00 | 0x00 | 0x00 |
and the device responds with:
|11 bit COB-ID | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 | byte 6 | byte 7 |
| 0x581 | 0x4F | index (lo) | index (hi) | subindex | 0xFF | 0x00 | 0x00 | 0x00 |
0x4F means that the returned value is only 8 bits wide, so only byte 4 is set. What about bytes 5, 6, and 7? Are they guaranteed to be zero by the standard?
Yes, CAN frames involved in SDO requests always have an 8-byte payload. Unused bytes are set to 0 and should be ignored by the recipient.
This is guaranteed by CiA 301 section 7.2.4.3, which describes the SDO protocol.
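As a rough illustration of how that plays out when decoding the response, here is a C# sketch (the byte values are made up; only the command byte 0x4F is taken from the question):
using System;
// The 8 payload bytes of the SDO upload response (example values).
byte[] data = { 0x4F, 0x00, 0x21, 0x01, 0xFF, 0x00, 0x00, 0x00 };
byte cmd = data[0];
bool expedited     = (cmd & 0x02) != 0;   // e bit: data is in bytes 4-7
bool sizeIndicated = (cmd & 0x01) != 0;   // s bit: n is valid
int  unused        = (cmd >> 2) & 0x03;   // n: how many of bytes 4-7 carry no data
if (expedited && sizeIndicated)
{
    int length = 4 - unused;              // 0x4F -> n = 3 -> 1 data byte
    for (int i = 0; i < length; i++)      // read only the meaningful bytes, ignore the rest
        Console.WriteLine($"data byte {i}: 0x{data[4 + i]:X2}");
}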

How to differentiate between H.264 bitstream and HEVC bitstream?

I have two parsers, one for an H.264 bitstream and one for an HEVC bitstream. When I receive a bitstream, how can I tell which format it is so that I can use the correct parser?
Thanks for the help
For H.264 you are looking for:
(0x00) 0x00 0x00 0x01 [Access Unit Delimiter]
Where the Access Unit Delimiter must be: (byte & 0x1f) == 0x09
For H.265 you are looking for:
(0x00) 0x00 0x00 0x01 [Access Unit Delimiter | VPS | SPS]
Where the Access Unit Delimiter must be: (byte >> 1 & 0x3f) == 0x23, or
the VPS must be: (byte >> 1 & 0x3f) == 0x20, or
the SPS must be: (byte >> 1 & 0x3f) == 0x21
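A minimal C# sketch of that check (a real parser should examine every start code it finds, not just the first; the method name is just for illustration):
// Scan for an Annex B start code (00 00 01, optionally preceded by 00) and
// classify the NAL unit that follows it.
static string GuessCodec(byte[] buf)
{
    for (int i = 0; i + 3 < buf.Length; i++)
    {
        if (buf[i] != 0x00 || buf[i + 1] != 0x00 || buf[i + 2] != 0x01)
            continue;
        byte nal = buf[i + 3];
        if ((nal & 0x1f) == 0x09)                                      // H.264 access unit delimiter
            return "h264";
        int hevcType = (nal >> 1) & 0x3f;
        if (hevcType == 0x23 || hevcType == 0x20 || hevcType == 0x21)  // HEVC AUD, VPS, SPS
            return "hevc";
    }
    return "unknown";
}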

What's so special about this PNG file?

This PNG file cannot be uploaded from my app to a third-party server. It always reports this error:
does multipart has image?
I'm sure the multipart encoding is correct. Tens of thousands of images have been uploaded from my app without this issue; this is the first time.
I guessed there was something special about this PNG file, and I confirmed it:
The Dropbox iOS app cannot display the image.
Tweetbot cannot upload it; the error message is "media type unrecognized".
So this PNG file is indeed special, and quite a few apps and servers don't handle it properly. But I don't know what is so special about it and hope someone who knows PNG better than I do can help. Thanks.
It is a CgBI file, not a PNG, most likely made with Apple's rogue modified pngcrush.
Such files always contain "CgBI" in bytes 12-15, where "IHDR" belongs.
CgBI files can be converted to valid PNG files (except that the transparent areas are irreparably damaged) by several applications, including
Jongware's pngdefry
Apple's "pngcrush" (but not the real pngcrush)
others listed on the CgBI wiki page
Here are the first few bytes in your file:
$ od -c test.png | head -4
0000000 211 P N G \r \n 032 \n \0 \0 \0 004 C g B I
0000020 P \0 002 + 325 263 177 \0 \0 \0 \r I H D R
0000040 \0 \0 \0 ` \0 \0 \0 ` \b 006 \0 \0 \0 342 230 w
0000060 8 \0 \0 \0 c H R M \0 \0 z % \0 \0 200
Those bytes represent the following:
PNG signature 0-7
CgBI length 8-11
"CgBI" 12-15
CgBI data 16-19
CgBI CRC 20-23
IHDR length 24-27 (should be in 8-11)
"IHDR" 28-31 (should be in 12-15)
width 32-35 (should be in 16-19)
height 36-39 (should be in 20-23)
bit depth 40 (should be in 24)
color type 41 (should be in 25)
compression 42 (should be in 26)
filter method 43 (should be in 27)
interlace method 44 (should be in 28)
IHDR CRC 45-48 (should be in 29-32)
...
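A quick C# sketch of that check (the file name is a placeholder): read the start of the file and look at which chunk name sits in bytes 12-15:
using System;
using System.IO;
using System.Text;
var bytes = File.ReadAllBytes("test.png");
// Bytes 0-7 are the PNG signature; bytes 12-15 name the first chunk.
string firstChunk = Encoding.ASCII.GetString(bytes, 12, 4);
if (firstChunk == "CgBI")
    Console.WriteLine("Apple CgBI file, not a standard PNG");
else if (firstChunk == "IHDR")
    Console.WriteLine("Looks like a standard PNG");
else
    Console.WriteLine("First chunk is " + firstChunk);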

Format Table in JIRA

I am trying to pass a string into JIRA via an API call and have the string formatted as shown below. The string:
"This is a message with a table. \\\ ||A||B||C|| \\\ |1|2|3| \\\ |4|5|6|"
Expected Output:
This is a message with a table
| A | B | C |
| 1 | 2 | 3 |
| 4 | 5 | 6 |
Pretty much what is described at the URL below, but the line breaks in my message aren't working. Any help is appreciated.
https://jira.atlassian.com/secure/WikiRendererHelpAction.jspa?section=tables
Try \n or even \r\n in your submitted string instead of the \\. I've used \\ when I want to start a new line in JIRA's output, but I think you need the line break on the input here.
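For example, a sketch of what the submitted string would look like with real newlines (only the body string is shown; the rest of the API call is unchanged):
// JIRA wiki markup with \n between the table rows instead of \\
var body = "This is a message with a table.\n||A||B||C||\n|1|2|3|\n|4|5|6|";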

How to insert " - " (space before and after a dash) in Cassandra using CQL?

Hi, I need to insert text like this in Cassandra using CQL:
INSERT INTO "MediaCategory" ("MCategoryID", "SubMCategoryName", "PhotoRankID", "VirtualTourID", "LangID") VALUES (6,'Mur d'escalade - Intérieur',41004,141004, 1036);
but after the insert, the data in Cassandra shows up like this:
LangID | PhotoRankID | MCategoryID | SubMCategoryName | VirtualTourID
--------+-------------+-------------+-------------------------------+---------------
1036 | 41004 | 6 | Mur d'escalade\xa0- Intérieur | 141004
The \xa0 is getting into the data because of the space. Any idea how to escape it?
Your issue is that the space is not a space; it is a Unicode non-breaking space: '\xa0'.
I would guess you are copying and pasting the text from somewhere, and it is giving you a char(160) space instead of a char(32) space.
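If you want to clean the value before it reaches Cassandra, a small sketch (C#; the variable names are placeholders):
// Replace the non-breaking space (U+00A0) with an ordinary space
// before building the INSERT statement.
var name = "Mur d'escalade\u00A0- Intérieur";
var cleaned = name.Replace('\u00A0', ' ');   // "Mur d'escalade - Intérieur"
The same replacement works in whatever language the insert is actually built in; the point is just to normalize the character before it is written.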
