How to load TextAsset from txt file containing extended ASCII characters in Unity? - character-encoding

In my case specifically, I have bullet-points (• or #149) in my text file.
If I copy paste "•" into my Unity text field in the editor, it shows up, so I am pretty sure the bullet-point is lost in the reading process. (I checked in debug mode, and indeed the bullet-point is lost at reading).
This is how I read in my text file as a TextAsset:
TextAsset content = Resources.Load(SlideManager.slideLanguage+"\\"+fileName+" ("+SlideManager.slideNumber+")") as TextAsset;

It turns out that the way I read the file is completely fine. The file is read correctly, but it was saved with an ASCII-based encoding, so the resource loader cannot interpret the non-ASCII characters and drops them.
Since the bullet point is not a standard ASCII character but an extended ASCII character, you have to specify the encoding of your text files.
For example, set the encoding to UTF-8 and it will work. I used Notepad++ to set the encoding, but I am sure there are many other ways to do it.
To set the encoding in Notepad++:
Click on the tab named Encoding (fifth tab from the left on the top by default), and select Convert to UTF-8.
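If there are many files, the same conversion can be scripted. This is a minimal sketch in Python, assuming the original files were saved as Windows-1252 (where the bullet is byte 0x95, i.e. 149). The function name and the source encoding are assumptions for illustration, not part of Unity:

```python
def convert_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Byte 0x95 (decimal 149) is the bullet character in Windows-1252.
original = b"\x95 First item"
converted = convert_to_utf8(original)
print(converted.decode("utf-8"))
```

To convert a real file, read it with `open(path, "rb")`, pass the bytes through the function, and write the result back in binary mode. If the files were saved in some other code page, the `source_encoding` argument would need to change accordingly.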

Related

RStudio - Script opened with wrong encoding, how to get back original characters?

I have a file in Spanish, when seen on my teacher's PC a bit of text would display as
regresión cuantílica más
but now that I've opened on mine I see this:
regresiÃ³n cuantÃ­lica mÃ¡s
I have tried "Save with Encoding" to ISO-8859-1 and UTF-8 but it doesn't seem to change anything. Will I need to run some regex replacements on my file or is there a simpler way to fix this?
If you have already saved it and you've lost the original version of the file, it will be a pain to recover.
What you should have done when you noticed the bad characters was "Reopen with encoding", and chosen the "UTF-8" encoding. If you can still get the original file, do this now.
If you can't, then you're stuck with lots of manual fixing. Accented characters (and Euro signs, and a few other things) will show up as multi-character sequences. When you recognize one, use search and replace to replace that sequence with the correct character.
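When the damaged text still contains every multi-character sequence intact, the replacement can sometimes be automated instead of done by hand. The sketch below is in Python and assumes the file was UTF-8 that got read as Latin-1. The function name is hypothetical, and the approach fails if any character was already replaced by "?" during a save:

```python
def repair_mojibake(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo UTF-8 bytes having been decoded with a single-byte encoding."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# Simulate the damage: UTF-8 bytes misread as Latin-1.
garbled = "regresión cuantílica más".encode("utf-8").decode("latin-1")
print(repair_mojibake(garbled))
```

If the editor used Windows-1252 rather than Latin-1, passing `wrong_encoding="cp1252"` is the variant to try.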

Contents of executable files cannot be copied?

So, text files can be copied and pasted to another location by copying the contents of the original file into a blank text file. This can be done with a text editor. Highlight contents of text file, copy, create new blank text file, paste in to it.
But why can't image, audio, video, executable files, etc., be copied and pasted like this? For example, I open an executable file with a text editor, copy all of its contents, create a new blank text file, change the extension to .exe, and paste into it (through a text editor). But the file cannot be run. Why?
Also, I would like to be able to edit these types of files like I do with text files. Is there a way?
Because executable and media files are "binary" files. Text files are binary as well, but different. All files are created binary, but some are created more binary than others.
You're opening a binary file in a text editor. This immediately changes the semantics of the bytes. The main problem is bytes containing a value that happens to correspond to those of newline characters if it were a text file (0x0A and 0x0D), which will be rendered as a platform-dependent newline (\r\n on Windows, for example). When you copy that, you've changed either 0x0A or 0x0D to 0x0D 0x0A.
Then there's control characters or non-printable characters. Not all bytes between 0x00 and 0xFF can be represented as a character. They'll either be omitted or replaced with a displayable character.
So when you copy a text containing those, they'll be omitted or otherwise mangled.
In conclusion: you cannot reliably use text to display all possible byte values, unless you choose to encode the bytes' values, as is done using for example Base64 encoding.
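A small Python sketch of that point, using made-up sample bytes: pushing arbitrary bytes through a text decoding loses information, while Base64 survives the trip through text unchanged.

```python
import base64

# Arbitrary bytes, including newline bytes, a NUL, and a byte that is not valid UTF-8.
payload = bytes([0x4D, 0x5A, 0x0A, 0x0D, 0x00, 0xFF])

# Treating the bytes as text is lossy: the invalid byte becomes U+FFFD.
as_text = payload.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")
print(round_tripped == payload)  # False

# Base64 represents every byte value using only printable ASCII characters.
encoded = base64.b64encode(payload).decode("ascii")
print(encoded)
print(base64.b64decode(encoded) == payload)  # True
```

The Base64 string can be copied and pasted through any text editor, and decoding it gives back the exact original bytes.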
If you want to edit a binary file, use an editor that is aware of those bytes: a "hex editor". Do note that changing random byte values in a binary file does not guarantee the sanity of that file: there may be checksums built into the format, and your edit will invalidate that checksum.

What kind of char is this and how do I convert it to a text?

What kind of char is this and how do I convert it to a text in c#/vb.net?
I opened a .dat file in notepad, took a screenshot and attached it here.
Your screenshot looks like the digits "0003" in a box. This is a common way to display characters for which a glyph isn't available.
U+0003 is the "END OF TEXT" control character. It's unlikely to occur within a text file, but a ".dat" file might be a mixture of text and binary data.
You'll need to use a hex editor to find the exact ASCII code (assuming the file is ASCII, which seems to be an entirely incorrect assumption) that the file contains. It's safe to say that whatever byte sequence is contained in the file is not a printable character in whatever encoding the editor used to open the file, and that is why it used that graphic in place of the actual character.
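If no hex editor is at hand, a few lines of Python can list the byte values in the same spirit. This is only a sketch: the function name and the sample bytes are made up, and a real ".dat" file would be read with `open(path, "rb")`.

```python
def describe_bytes(data: bytes) -> list:
    """Return one line per byte: offset, hex value, and the character if it is printable ASCII."""
    lines = []
    for offset, value in enumerate(data):
        shown = chr(value) if 0x20 <= value <= 0x7E else "."
        lines.append(f"{offset:04x}  {value:02x}  {shown}")
    return lines


# Sample data with an END OF TEXT control byte (0x03) between letters.
sample = b"AB\x03C"
for line in describe_bytes(sample):
    print(line)
```

Non-printable bytes such as 0x03 show up as "." in the last column, with their actual value in the hex column.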

How to read unicode characters accurately

I have a text file containing what I am told are unicode characters, for example:
\320\222\320\21015-25'ish per main or \320\222\320\21020-40'ish per starter
Which should read:
£15-25'ish per main or £20-40'ish per starter
However, when viewing this text in Firefox, the output is mangled with various unwanted characters.
So, are these really unicode characters? And if so, how can I convert them to a form which is displayable correctly?
You need to:
know the encoding of the text file
read the data without losing information (either by reading it as binary or by reading it as text with the right encoding)
write the data with the right encoding (either by writing it out in binary and specifying the original encoding, or writing it out as text in an encoding which you also specify in the headers)
Try to separate out the problem into "reading" and/or "writing". Do you know the encoding of the file? What do you have to do with the file? When you've written it with backslashes, is that actually what's in the file (i.e. an escaped form) or is it actually just a "normal" text encoding such as UTF-8?
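As a sketch of the "read without losing information" step in Python, assume the backslash sequences in the question are octal escapes of raw bytes. Turning them back into bytes and decoding as UTF-8 gives the Cyrillic pair "ВЈ", which is what the UTF-8 bytes of "£" look like when misread as Windows-1251. That the data was damaged this way is an inference from the sample, not something the question confirms, and the function name is hypothetical.

```python
import re


def unescape_octal(escaped: str) -> bytes:
    """Turn backslash-octal escapes such as \\320 back into raw bytes."""
    def to_byte(match):
        return bytes([int(match.group(1), 8)])
    return re.sub(rb"\\([0-7]{3})", to_byte, escaped.encode("ascii"))


raw = unescape_octal(r"\320\222\320\21015-25'ish per main")

# Step 1: decode the bytes with the encoding they are actually in.
text = raw.decode("utf-8")
print(text)

# Step 2 (assumption): undo an earlier misreading of UTF-8 as Windows-1251.
repaired = text.encode("cp1251").decode("utf-8")
print(repaired)
```

When the result is written out again, the encoding should be stated explicitly, for example `open(path, "w", encoding="utf-8")`, and declared in the HTTP or HTML headers if the text is served to a browser.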

How to correctly display Japanese RTF Fonts

I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.
Typically, English and Japanese are mixed and mostly displayed without a problem, for example:
Inventory turns partnerships. 在庫回転率の
(my apologies if the Japanese text is broken incorrectly - I do not speak or read the language).
Quite frequently however, only the Japanese portion of the text will be gibberish, for example:
ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優 先順位と彼らに販売する知識)
From extensive online searching, it appears that the problem is a result of the fonts saved as part of the RTF. Fonts present on a Japanese language version of Windows are not necessarily the same as those on a US English version. It is possible to programmatically replace the fonts in the RTF file, which yields an almost acceptable result, i.e.
-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB
However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:
-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?
Clearly, the Unicode characters are rendered correctly, but for example the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.
Is there a generic, (relatively) foolproof way to take RTF containing Eastern languages and reliably display it again?
For completeness' sake, I updated the RTF font table in the following way:
Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
Replacing font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Update: Updating font names alone won't make a difference. The locale seems to be the big problem. I have seen a few sites discussing ways of converting Japanese RTF into something most readers can handle, but I haven't found a solution yet, see for example:
here and here.
My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether RTF you import is encoded Shift-JIS or Unicode, and also whether the machine you are running on (and therefore D2009 default input format) is Japanese or not. In Japan, if a text file has no Unicode BOM it would usually be Shift-JIS (but not always).
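To check whether a run of \'hh escapes is Shift-JIS, the bytes can be decoded directly. The Python sketch below assumes code page 932 (the Windows variant of Shift-JIS) and uses a hypothetical helper name. It only handles runs in which every byte is hex-escaped, as in the font-table entry from the question. Body text where the trail byte appears as a literal character (for example \'83I) would need a real RTF parser.

```python
import re


def decode_rtf_hex_escapes(rtf_fragment: str, encoding: str = "cp932") -> str:
    """Replace each run of \\'hh escapes with the text those bytes encode."""
    def replace_run(match):
        hex_pairs = re.findall(r"\\'([0-9a-fA-F]{2})", match.group(0))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        return raw.decode(encoding, errors="replace")
    return re.sub(r"(?:\\'[0-9a-fA-F]{2})+", replace_run, rtf_fragment)


# Font name taken from the font table replacement in the question.
font_entry = r"\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
print(decode_rtf_hex_escapes(font_entry))
```

If the decoded result is readable Japanese, as it is for this font name, the escapes are double-byte Shift-JIS characters rather than separate single-byte ones.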
I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.
What I did was run the entire RTF text through a conversion routine that swapped every character above ASCII 127 for an escape containing its Unicode code point. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file with no problem. Ironically, it re-encoded the \uc1\uxxx? sequences back to their RTF escaped equivalents.
Private Function ConvertRtfToUnicode(ByVal value As String) As String
    Dim ch As Char() = value.ToCharArray()
    Dim c As Char
    Dim sb As New System.Text.StringBuilder()
    Dim code As Integer
    For i As Integer = 0 To ch.Length - 1
        c = ch(i)
        code = Microsoft.VisualBasic.AscW(c)
        If code <= 127 Then
            'Don't need to replace if one of your typical ASCII codes
            sb.Append(c)
        Else
            'MR: Basic idea came from here http://www.eggheadcafe.com/conversation.aspx?messageid=33935981&threadid=33935972
            ' swaps the character for its Unicode decimal code point equivalent
            sb.Append(String.Format("\uc1\u{0:d}?", code))
        End If
    Next
    Return sb.ToString()
End Function
Not sure if that will help your problem, but it's working for me.
