Visual Basic.NET "The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System - binaryreader

I am trying to read a binary file to which I have been appending data using a BinaryWriter object. I keep getting this error:
"The output char buffer is too small to contain the decoded
characters, encoding 'Unicode (UTF-8)' fallback
'System.Text.DecoderReplacementFallback'."
My file contains characters like |, which I suspect are the problem, but I don't know how to solve it.

The most probable reason is that your file contains some binary data that does not represent a valid UTF-8 code point at the position from which you are trying to read a UTF-8 character.
This can happen if your read algorithm loses "synchronization" with your write algorithm and tries to read a character from the wrong place, where something else (not a character) was written.
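To make the "lost synchronization" failure concrete, here is a minimal sketch in Python (the record layout below is hypothetical, not the exact .NET BinaryWriter format): a reader that mirrors the writer works, while a reader that treats everything as text ends up feeding raw binary bytes to the UTF-8 decoder.

import struct

# Hypothetical record layout (not the exact .NET BinaryWriter format):
# a 4-byte length prefix, the UTF-8 string bytes, then a 4-byte binary field.
with open("data.bin", "wb") as f:
    text = "entry|one".encode("utf-8")
    f.write(struct.pack("<I", len(text)))        # length prefix
    f.write(text)                                # string bytes
    f.write(struct.pack("<i", -1))               # binary field: FF FF FF FF

# In-sync reader: mirrors the writer field by field -- works fine.
with open("data.bin", "rb") as f:
    n = struct.unpack("<I", f.read(4))[0]
    print(f.read(n).decode("utf-8"))             # entry|one
    print(struct.unpack("<i", f.read(4))[0])     # -1

# Out-of-sync reader: treats the whole file as text, so the FF FF FF FF
# bytes reach the UTF-8 decoder and decoding fails (UnicodeDecodeError in
# Python; in .NET it surfaces as the error quoted in the question).
with open("data.bin", "rb") as f:
    f.read().decode("utf-8")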

Related

Multiple encryption append to a file

I have a log of a program's state. This log can be saved to a file manually or at a time interval, for persistent storage. Before it is saved to the file, it is encrypted with RNCryptor.
My current flow for appending (saving) to the file:
Read file
Decrypt the information from the read string
Concatenate the decrypted string with the new string
Encrypt the concatenated string
Write it to file
What I imagine:
Encode new string
Append to file
When I read this back, I will have to build a string from all the encoded blocks. But I don't know how to decrypt a file with multiple encrypted blocks in it: how do I tell where one ends and another begins?
Also, is this the best choice for performance? The text in the file could reach 100 MB at most (though it will probably never get that big).
Is using Core Data viable, with each append as a different record or something similar? Core Data can be encrypted, so there would be no need for RNCryptor.
I would appreciate code in Objective-C, if any.
There are many things you can do:
The easiest would be to encode the ciphertexts as text (e.g. with Base64) and write each encoded ciphertext on its own line; see the first sketch after this list. You need an encoding for that, because the ciphertext itself might contain bytes that could be interpreted as newline control characters, which cannot happen with a text encoding. The problem with this is that it blows up the logs unnecessarily (e.g. by 33% if Base64 is used).
You can prefix each unencoded ciphertext with its length (e.g. as a big-endian int32) and write both as-is to a file in binary mode; see the second sketch after this list. If you read the file from the beginning, you can distinguish each ciphertext, because you know how long the following ciphertext is and where the next encoded length starts. The overhead is only the encoded length per ciphertext.
Use a binary delimiter such as 0x0101 between ciphertexts. Such a delimiter might still appear inside a ciphertext, so you need to escape it wherever it occurs. This is a little tricky to get right.
If the amount of logs is small (a few MB), you can find a library that appends to a ZIP file.
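A minimal sketch of the first option (one Base64-encoded ciphertext per line), in Python for brevity since the framing is language-agnostic; the ciphertext bytes are assumed to come from your RNCryptor calls:

import base64

def append_ciphertext_line(path, ciphertext):
    # Base64 output contains no newline bytes, so a plain '\n' can frame records.
    with open(path, "ab") as f:
        f.write(base64.b64encode(ciphertext) + b"\n")

def read_ciphertexts(path):
    # Each non-empty line decodes back to exactly one ciphertext.
    with open(path, "rb") as f:
        return [base64.b64decode(line) for line in f if line.strip()]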
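And a sketch of the second option (length-prefixed binary records), again in Python; encrypting and decrypting with RNCryptor happens outside these helpers:

import struct

def append_record(path, ciphertext):
    # Prefix each ciphertext with its length as a big-endian int32, then the raw bytes.
    with open(path, "ab") as f:
        f.write(struct.pack(">I", len(ciphertext)))
        f.write(ciphertext)

def read_records(path):
    # Read from the beginning: each length prefix says where the next record starts.
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break                          # end of file
            (length,) = struct.unpack(">I", header)
            yield f.read(length)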
You can use an array to store the information and then read and write that array to a file; you can find an example here. A sketch of the steps follows below.
Steps:
Read the array from the file.
Add the new encrypted string to the array.
Write the array back to the file.
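A rough sketch of those three steps (Python, with a JSON array standing in for whatever array serialization the linked example uses; the file path and helper name are just placeholders):

import base64, json, os

def append_to_array_file(path, ciphertext):
    # Step 1: read the array from the file (start empty if it doesn't exist yet).
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    else:
        entries = []
    # Step 2: add the new encrypted string (Base64 so it fits in a text array).
    entries.append(base64.b64encode(ciphertext).decode("ascii"))
    # Step 3: write the array back to the file.
    with open(path, "w") as f:
        json.dump(entries, f)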

file encoding on a mac, charset=binary

I typed in
file -I *
to look at all the encoding of all the CSV files in an entire directory. A lot of the file encodings are charset=binary. I'm not too familiar with this encoding format.
Does anyone know how to handle this encoding?
Thanks a lot for your time.
"Binary" encoding pretty much means that the encoding is unknown.
Everything is binary data under the hood. In a text file each byte, or sequence of bytes, represents a specific character, and which character in particular depends on the encoding the file was written with / the encoding you're interpreting the file with. Some encodings are unambiguously recognisable, others aren't (e.g. any file is valid in any single-byte encoding, so you can't easily distinguish one single-byte encoding from another). What file is telling you with charset=binary is that it doesn't have any more specific information than that the file contains bits and bytes (Capt'n Obvious to the rescue). It's up to you to interpret the file in the correct encoding / as the correct file format.
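If you want to poke at such a file yourself, one rough approach is to try a few candidate encodings and eyeball the results (a Python sketch; the candidate list and file name are just guesses you'd adapt to where your files came from):

candidates = ["utf-8", "utf-16", "cp1252", "latin-1"]

with open("mystery.csv", "rb") as f:
    raw = f.read()

for enc in candidates:
    try:
        text = raw.decode(enc)
        # Note: single-byte encodings like latin-1 always "succeed",
        # so you still have to look at the output to judge it.
        print(f"{enc}: decodes, starts with {text[:60]!r}")
    except UnicodeDecodeError:
        print(f"{enc}: not valid in this encoding")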

SAS Special Characters Throwing Off Column Alignment of Input

I am reading a .dat data set into SAS, in an exercise teaching informat use. Here is what I have so far:
DATA companies;
INFILE "/folders/myshortcuts/Stat324/BigCompanies.dat" encoding='wlatin2';
INPUT rank 3. #6 company $UTF8X25. #35 country $17. #53 sales comma6. #60 profits comma8. #70 assets comma8. #82 marketval comma6.;
RUN;
This works for every line except for those containing special/international characters. Such as:
94 SociÈtÈ GÈnÈrale France $98.6B $3.3B $1,531.1B $25.8B
These lines trip up at the first currency value (#53 sales comma6.) and a warning is thrown indicating that invalid data was found for that input, and a missing value (.) is assigned.
Playing around with # pointers and informat w values seems to reveal that the special characters are throwing off the column alignment. Is this possible (a special character actually taking up 2 bytes/positions even though it prints as a single character)? Is there a simple solution?
Yes, you're exactly correct: if the characters are encoded in UTF-8, they may take between 1 and 4 bytes each, with many characters being one byte but some taking more (what you call "special characters" here). If SAS is reading the file as WLATIN1, it will assume each byte is a separate character.
Your code is a bit confusing to me: you specify that the file is WLATIN1, but then you instruct SAS to read in that field as UTF-8. Which is it?
If your session encoding is compatible with UTF-8, and the file to be read in is encoded UTF-8, then you likely need to simply switch the encoding on infile to UTF-8. If your file has mixed encoding, and there is a reason you can't use UTF-8 encoding to read it in, then you may have a complicated problem that will need to be handled with special code (i.e., to figure out how long the UTF8 portion actually is, and then advance the pointer to the right spot to read the next field in). You also may be able to use a delimiter to read this in; that depends some on the exact format of the data.
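To see why the columns shift, compare character count with byte count for the problem row; this is a quick Python illustration of the point above, not SAS code:

name = "Société Générale"
print(len(name))                  # 16 characters
print(len(name.encode("utf-8")))  # 20 bytes: each é takes 2 bytes in UTF-8
# Fixed byte-based column positions for the later fields are therefore
# off by 4 for this row, which is why the sales field reads as invalid.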

trying to figure out the charset

I'm downloading a CSV from Google Docs and in it characters like “ are saved as \xE2\x80\x9C and ” are saved as \xE2\x80\x9D.
My question is... what charset are those being saved in? How might I go about figuring that out?
It is UTF-8. You can tell by decoding it as UTF-8: it shows the correct characters.
UTF-8 also has a unique and very distinctive pattern: just 3 bytes with the highest bit set forming a valid UTF-8 sequence are enough to tell that something is UTF-8 with 99% confidence. Even 2 such bytes forming a valid UTF-8 sequence already get you to 90%.
If it weren't UTF-8 and were some 8-bit code page instead, it would be impossible to tell just by looking at the bytes alone. Without any other information, you would basically have to brute-force it: decode it with various 8-bit encodings and see which result looks correct. The other possibility is using an algorithm that goes through the encodings automatically and checks whether the result makes sense in any language.
With more information, like which operating system and locale the file was saved in, you could greatly reduce the number of possible encodings to try, though.
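You can check this yourself; a small Python illustration of the point above (decode as UTF-8 versus a single-byte guess such as cp1252):

raw = b"\xE2\x80\x9C"          # the bytes from the question

print(raw.decode("utf-8"))     # '“'   -- a sensible character, strong hint of UTF-8
print(raw.decode("cp1252"))    # 'â€œ' -- mojibake, so that guess looks wrong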

How are non-ASCII file names encoded in RAR files?

I have a RAR file with non-ASCII letters in the file names. I tried decoding it in Delphi. My code works fine for ASCII file names but fails on these. It is not WideChar, nor UTF-8. I found the RAR specs here:
http://ams.cern.ch/AMS/amsexch/arch/rar/technote.txt
but it says nothing about the character encoding.
I tried WOTSIT.org but all the links to RAR specs are dead (almost every link there is dead; I even contacted the admin, but he didn't respond and didn't fix the links).
It seems it is not an 8-bit encoding, but I have no idea what it could be.
This is the only paragraph that says something about the name:
0x200 - FILE_NAME contains both usual and encoded
Unicode name separated by zero. In this case
NAME_SIZE field is equal to the length
of usual name plus encoded Unicode name plus 1.
If this flag is present, but FILE_NAME does not
contain zero bytes, it means that file name
is encoded using UTF-8.
It seems that it is UTF-8, but you say it is not. Can you try again?
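Going only by the paragraph quoted above, the decision logic would look roughly like this (a Python sketch; flags and raw_name are hypothetical variables holding the header flags and the FILE_NAME bytes, the code-page fallbacks are guesses, and the "encoded Unicode name" branch is left unresolved because the technote does not describe that format):

def decode_rar_name(flags, raw_name):
    if flags & 0x200:
        zero = raw_name.find(b"\x00")
        if zero == -1:
            # Flag set but no zero byte: the whole name is UTF-8.
            return raw_name.decode("utf-8")
        # Zero byte present: usual name first, then RAR's own "encoded Unicode
        # name" (a custom encoding the quoted technote does not document).
        usual_name = raw_name[:zero]
        encoded_unicode_name = raw_name[zero + 1:]  # needs RAR-specific decoding
        return usual_name.decode("cp437", errors="replace")
    # Flag not set: a plain name in the archiver's OEM/ANSI code page (a guess).
    return raw_name.decode("cp437", errors="replace")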
