I typed in
file -I*
to look at all the encoding of all the CSV files in an entire directory. A lot of the file encodings are charset=binary. I'm not too familiar with this encoding format.
Does anyone know how to handle this encoding?
Thanks a lot for your time.
"Binary" encoding pretty much means that the encoding is unknown.
Everything is binary data under the hood. In text files each byte, or sequence of bytes, represents a specific character, and which character in particular depends on the encoding the file was encoded with/you're interpreting the file with. Some encodings are unambiguously recognisable, others aren't (e.g. any file is valid in any single-byte encoding, you can't easily distinguish one single-byte encoding from another). What file is telling you with charset=binary is that it doesn't have any more specific information than that the file contains bits and bytes (Capt'n Obvious to the rescue). It's up to you to interpret the file in the correct encoding/interpret it as the correct file format.
Related
While Compressing a file Using Huffmann coding,
After assigning Huffmann codes to each character in a file, these characters should be replaced with equivalent Huffmann codes in the compressed file. Then how the equivalent characters gets extracted with those Huffman codes from the compressed files while decompressing the file. Do the compressed file contains some extra information to decode the Huffmann codes?
Yes. You need to send a description of the Huffman code in order to decode them.
The usual implementation is to encode using a canonical Huffman code, and then sending just the lengths for each symbol. The description of the code can itself be compressed.
In my iOS app, I have a feature that writes data to a CSV file. This works fine in most cases with the following:
[csvString writeToFile: filePath atomically:YES encoding: NSUTF8StringEncoding error:&error];
I recently got an email from a Japanese user that the CSV file exported has weird symbols instead of Japanese characters. So I switched to using NSUTF16StringEncoding and it seems to work fine for Japanese characters as well.
So the question is: is it better to use NSUTF16StringEncoding, or are there any drawbacks to doing this? It seems that other examples I've seen for writing to CSV files (including CHCSVParser) use
NSUTF8StringEncoding, so I'm not sure which one to prefer.
Thanks.
There's no a "better" encoding.
UTF-8 uses a variable number of bytes per each character, from 1 to 4. UTF-16 uses always 2 bytes for every character. What is best, is really up to you and your business. In theory, if your users are mostly based in Asia and use primarily non-ASCII character, files encoded in UTF-16 are smaller. If your users are primarily living in the Western world and use Latin-based alphabets, using UTF-8 makes every file 50% smaller.
I believe your problem is not with the choice of the encoding, but rather with the presentation. Text editors cannot guess the encoding of a file, so it's possible that your Japanese user was using a text editor that defaults to UTF-16, and thus was unable to represent UTF-8 character sequences correctly.
The solution to this problem is to using the BOM sequence, as per this SO answer: https://stackoverflow.com/a/2585194/192024 (in short: just add those 3 bytes at the beginning of the file to tell editors what encoding to use)
I am trying to read a binary file in which i have been appending data using a BinaryWriter object. I keep getting this error:
"The output char buffer is too small to contain the decoded
characters, encoding 'Unicode (UTF-8)' fallback
'System.Text.DecoderReplacementFallback'."
My file has characters like | which i suspect are the problem but I don't know how to solve it.
The most probable reason is that your file contains some binary data, that does not represent valid UTF-8 codepoint, at the place from which you are trying to read UTF-8 character.
This can happen if your read algorithm lose "synchronization" with your write algorithm and tries to read character from the wrong place, where something else (not a character) was written.
I have a RAR file with non ASCII letters in filenames. I tried decoding it in Delphi. My code works fine for ASCII filenames but it failed on these. It is not WideChar, nor UTF8. I found RAR specs here:
http://ams.cern.ch/AMS/amsexch/arch/rar/technote.txt
but it says nothing about the character encoding.
I tried WOTSIT.org but all links to RARs are dead (almost every link is dead there; I even contacted admin but he didn't respond and didn't fix links).
It seems it is not an 8bit encoding, but no idea what could it be.
This is the only paragraph that says something about the name:
0x200 - FILE_NAME contains both usual and encoded
Unicode name separated by zero. In this case
NAME_SIZE field is equal to the length
of usual name plus encoded Unicode name plus 1.
If this flag is present, but FILE_NAME does not
contain zero bytes, it means that file name
is encoded using UTF-8.
It seems that it is UTF-8, but you say it is not. Can you try again?
I have a text file containing what I am told are unicode characters, for example:
\320\222\320\21015-25'ish per main or \320\222\320\21020-40'ish per starter
Which should read:
£15-25'ish per main or £20-40'ish per main starter
However, when viewing this text in Firefox, the output is mangled with various unwanted characters.
So, are these really unicode characters? And if so, how can I convert them to a form which is displayable correctly?
You need to:
know the encoding of the text file
read the data without losing information (either by reading it as binary or by reading it as text with the right encoding)
write the data with the right encoding (either by writing it out in binary and specifying the original encoding, or writing it out as text in an encoding which you also specify in the headers)
Try to separate out the problem into "reading" and/or "writing". Do you know the encoding of the file? What do you have to do with the file? When you've written it with backslashes, is that actually what's in the file (i.e. an escaped form) or is it actually just a "normal" text encoding such as UTF-8?