I've noticed that emojis in my app have stopped displaying properly on a UIWebView in iOS 5.
All characters are HTML-encoded when displayed, and the output HTML is:
<p>Emoji (iOS 4): &#55357;&#56850;</p>
This UTF-8 encoded HTML is rendered correctly in a UIWebView on iOS 4, but not on iOS 5.
I understand there have been some changes in iOS 5 with regards to emoji, but the emoji character that has been encoded into &#55357;&#56850; was generated on iOS 5, so the surrogate-pair character references should be correct. No other changes have been made to the code, so it's definitely something introduced with iOS 5.
Any advice would be much appreciated and I'll happily provide more information if required. Thanks.
I've had a response from the developer forums:
The HTML parser in iOS 5 and Safari 5.1 has changed, and character references in the range 0xD800..0xDFFF (55296..57343) are treated as parse errors and produce a replacement character (U+FFFD, typically rendered as a diamond with a question mark). This change in behaviour is consistent with what HTML5 specifies. It means that you can no longer encode characters using surrogate-pair character references.
A relatively simple solution is to use a single character reference instead of a surrogate pair. In your example, instead of (0xD83D, 0xDE12), use 0x1F612. You can use either hex or decimal:
&#x1F612; or &#128530;
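If you have existing HTML with surrogate pairs already baked into it, a small pre-processing pass can collapse each pair into a single code-point reference before loading the page. A minimal sketch, assuming decimal character references; the function is illustrative, not part of the forum answer:

import re

def fix_surrogate_refs(html):
    # Match two adjacent decimal character references, e.g. "&#55357;&#56850;".
    pair = re.compile(r'&#(\d+);&#(\d+);')

    def repl(m):
        hi, lo = int(m.group(1)), int(m.group(2))
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            # Standard UTF-16 decoding: combine the pair into one code point.
            cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
            return '&#x%X;' % cp
        return m.group(0)  # leave non-surrogate references untouched

    return pair.sub(repl, html)

print(fix_surrogate_refs('<p>Emoji (iOS 4): &#55357;&#56850;</p>'))
# -> <p>Emoji (iOS 4): &#x1F612;</p>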
This explains the reason for the problem. I, however, worked around the issue by encoding only a smaller subset of characters, since the HTML document itself is Unicode.
I have an MVC application in which I am generating PDFs from HTML pages with Rotativa. In the HTML I display some strings which I take from the resources of my application. When they are displayed as plain HTML, all the strings look good, but when the conversion to PDF is made, the exponent characters are not formatted properly.
For powers less than 4 everything looks good, like in², but when I try to display powers of 4 or higher (in⁴), the output is altered, showing a tilde ~ instead of the expected number. I assume this is because of the character set supported by Rotativa.
Is it possible to make Rotativa display exponents higher than 3?
NOTE: I don't want to use <sup> x </sup> as it does not solve the problem of strings retrieved from resources.
I have tried changing the UTF encoding and font styles, but nothing worked.
I finally managed to solve this with a workaround. After I asked this question, I figured out how to convert the exponent characters into plain digit strings, and then displayed the values from resources using <sup>exponent</sup>.
Not so pretty, but it works!
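Incidentally, the reason powers up to 3 survive is that ¹, ² and ³ are Latin-1 characters (U+00B9, U+00B2, U+00B3) that nearly every font covers, while ⁴ and above live in the Superscripts and Subscripts block (U+2074 onwards), which the fonts used during PDF conversion often lack. Below is a rough sketch of the workaround, mapping superscript characters to <sup> markup before conversion; the function name and mapping table are my own, not from the original answer:

# Map Unicode superscript characters back to plain digits so they can be
# wrapped in <sup> markup that any font can render.
SUPERSCRIPTS = {
    '\u00b9': '1', '\u00b2': '2', '\u00b3': '3',   # Latin-1 superscripts
    '\u2070': '0', '\u2074': '4', '\u2075': '5',   # Superscripts block
    '\u2076': '6', '\u2077': '7', '\u2078': '8', '\u2079': '9',
}

def sup_to_html(text):
    out, run = [], []
    for ch in text:
        if ch in SUPERSCRIPTS:
            run.append(SUPERSCRIPTS[ch])
            continue
        if run:                                    # close an open <sup> run
            out.append('<sup>%s</sup>' % ''.join(run))
            run = []
        out.append(ch)
    if run:
        out.append('<sup>%s</sup>' % ''.join(run))
    return ''.join(out)

print(sup_to_html('in\u2074'))   # -> in<sup>4</sup>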
I have been using TCPDF for many years. Recently I had to work on Arabic language display. The client wanted the SakkalMajalla font (available in Windows/Fonts), and I converted it using the TCPDF conversion tool. The conversion process completed without error.
Now I am facing a small issue that I have not been able to solve for the last two months. One of the special characters (called tanween) is placed at the bottom of the preceding character whereas it should be on top. Everything else is working fine, but this little mark (ٍ) displayed in the wrong place changes the meaning of the word.
يمنع استخدام الهاتف الجوال داخل صالة الاختبار
منعاً باتاً
(I cannot upload an image as I need 10 reputation points for that, but please notice the little mark on top of the letter in تاً. Here it displays properly, but in the PDF it appears at the bottom of the letter.)
Is there any way to manually edit the positioning of this character?
I have been searching for a solution for the last two months. I even wrote two emails to the author of TCPDF, Nicola Asuni, but he did not give any response.
Please help.
Even though the font conversion process appeared to complete successfully, you should double-check with a font editor (such as FontForge) that the character is actually encoded correctly in the converted font file.
I have found, after many years of trying to convert all sorts of non-Latin fonts from one format to another, that the most reliable solution for font conversion is this site:
http://www.xml-convert.com/en/convert-tff-font-to-afm-pfa-fpdf-tcpdf
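One concrete way to perform that check is FontForge's Python scripting interface, which can show whether the mark glyph kept its anchor points (the data used to position marks above or below a base letter). A sketch only; the file path and glyph name are assumptions:

import fontforge

font = fontforge.open('SakkalMajalla.ttf')   # assumed path to the source font
glyph = font['uni064D']                      # assumed glyph name for the tanween mark
print(glyph.unicode == 0x064D)               # confirm the code point mapping survived
print(glyph.anchorPoints)                    # anchors used for mark positioning;
                                             # empty output suggests the data was stripped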
I am using UTF-8 encoding for parsing data from JSON. It works for quite a number of languages, but some languages that display fine in the console end up showing question marks or garbage characters in a UILabel.
e.g.
Amharic አማርኛ
Burmese မြန်မာစာ
You have to use a special font that provides these characters. "Abyssinica SIL" (http://software.sil.org/abyssinica/download/) worked in my case for Amharic.
For Burmese characters you can use "Zawgyi-One" (http://www.rfa.org/burmese/help/font_download_english.html).
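Before shipping a font, it can be worth verifying that it actually maps the code points of the script in question. A quick sketch with the fontTools library; the font file name is a placeholder:

from fontTools.ttLib import TTFont

font = TTFont('ZawgyiOne.ttf')              # placeholder file name
cmap = font['cmap'].getBestCmap()           # maps code points to glyph names
for ch in 'မြန်မာစာ':                        # the Burmese sample from the question
    status = 'OK' if ord(ch) in cmap else 'MISSING'
    print('U+%04X %s' % (ord(ch), status))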
For a work project I am using headless Squeak on a (displayless, remote) Linux server, and Squeak on a Windows developer machine.
Code on the developer machine is managed using Monticello. Unfortunately I have to copy the .mcz to the server using SFTP (e.g. having a push repository on the server is not possible for security reasons). The code is then merged by e.g.:
MczInstaller installFileNamed: 'name-b.18.mcz'.
Which generally works.
Unfortunately our code base contains strings with umlauts and other non-ASCII characters. During the Monticello re-import some of them get replaced with other characters and some get replaced with nothing.
I also tried e.g.
MczInstaller installStream: (FileStream readOnlyFileNamed: '...') binary
(note: .mcz files are actually .zip files, so binary should be appropriate; I guess it is the default anyway)
Finding out how to make Monticello's transfer preserve Squeak's internal encoding of non-ASCII characters is the main goal of my question. Changing all the source code to use only ASCII strings is (at least in this code base) much less desirable because manual labor is involved. If you are interested in why it is not a simple grep-replace in this case, read this side note:
(Side note: (a simplified/special case) The code base uses Seaside's #text: method to render strings that contain characters that have to be HTML-escaped. This works fine with our non-ASCII characters, e.g. it converts ä into &auml;. If we were to grep-replace the literal ä's by &auml; explicitly, then we would have to use the #html: method instead (else double-escape); however, that would then require that we replace all other characters that have to be HTML-escaped as well (e.g. &), but then again the source code itself contains such characters. And there are other cases, like some #text:'s that take third-party strings; they may not be replaced by #html:'s...)
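The double-escaping trap from the side note, made concrete with Python's html module standing in for Seaside's escaping:

from html import escape

# Escaping a string that already contains the entity mangles it:
print(escape('&auml;'))   # -> &amp;auml; (the browser then shows the literal text "&auml;")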
Squeak does use Unicode (ISO 10646) internally for encoding characters in a String.
It might use an extension like CP1252 for characters in the range 16r80 to: 16r9F, but I'm not really sure anymore.
The character codes are written as-is to the snapshot/source.st stream, and these codes are a single byte per character for a ByteString, when all characters are <= 16rFF. In this case the file looks as if it were encoded in ISO-8859-1 or CP1252.
If you ever have character codes > 16rFF, then a WideString is used in Squeak. Once again the codes are written as-is to source.st, but this time they are 32-bit codes (written in big-endian order). Technically, the encoding is thus UTF-32BE.
Now what does MczInstaller do? It uses the snapshot/source.st file and calls setConverterForCode to read it, which assumes either UTF-8 or MacRoman... So non-ASCII characters may get changed, and it is even worse in the case of a WideString, which will be re-interpreted as a ByteString.
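To make that failure mode concrete, here is a small demonstration (mine, not from the answer) of 32-bit big-endian character codes being re-read under a single-byte assumption:

# A WideString's codes are written as four bytes per character, big-endian.
s = '\u00e4'                                # ä, code 16rE4
utf32be = s.encode('utf-32-be')             # b'\x00\x00\x00\xe4'
# Re-reading those bytes as single-byte characters yields three NULs plus ä:
print(list(utf32be.decode('latin-1')))     # ['\x00', '\x00', '\x00', 'ä']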
MC itself doesn't use the snapshot/source.st member of the archive. It uses snapshot.bin instead (see the code in MCMczReader and MCMczWriter), which is a binary file whose format is governed by DataStream.
The snippet that you should use is rather:
MCMczReader loadVersionFile: 'YourPackage-b.18.mcz'
Monticello isn't really aware of character encoding. I don't know the present situation in Squeak, but the last time I looked into it there was an assumed character encoding of Latin-1. But that would mean it should work flawlessly in your situation.
It should work somehow anyway if you are writing and reading from the same kind of image. If the proper character encoding fails, usually the internal byte representation is written from memory to disk. While this prevents any cross-dialect exchange of packages, it should work when using the same kind of image.
Anyway, there are things that should or could work but often go wrong, so most projects try to avoid non-7-bit characters in their code.
You don't need to convert non-7-bit characters to HTML entities. You can use
Character value: 228
to produce an ä in your code without using non-7-bit characters. For every character you'd like to convert you can evaluate
$ä asciiValue => 228
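If you do end up hunting down the non-7-bit literals, a throwaway script can list each one along with the Character value: expression to substitute. A sketch that assumes the source files are Latin-1 on disk:

import sys

def report_non_ascii(path):
    # Print every character above 16r7F with a 7-bit-safe replacement.
    for lineno, line in enumerate(open(path, encoding='latin-1'), start=1):
        for ch in line:
            if ord(ch) > 127:
                print('%s:%d: %r -> (Character value: %d)'
                      % (path, lineno, ch, ord(ch)))

report_non_ascii(sys.argv[1])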
I know this is not the kind of answer some would want to get, but Monticello is one of those things that still needs to be adjusted for proper character encoding.
I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Language Support installed.
Typically, English and Japanese are mixed and mostly display without a problem, for example:
Inventory turns partnerships. 在庫回転率の
(my apologies if the Japanese text is broken incorrectly - I do not speak or read the language).
Quite frequently however, only the Japanese portion of the text will be gibberish, for example:
ŒÉñ?“]-¦Œüã‚Ì·•Ê‰?-vˆö‚ðŽû‰v‚ÉŒø‰?“I‚ÉŒ‹‚т‚¯‚é’mŽ¯‚ª‘÷Ý‚·‚é?(マーケットセクター、
見込み客の優 先順位と彼らに販売する知識)
From extensive online searching, it appears that the problem is a result of the fonts saved as part of the RTF. Fonts present on a Japanese-language version of Windows are not necessarily the same as on a US English version. It is possible to programmatically replace the fonts in the RTF file, which yields an almost acceptable result, i.e.
-D‚‚スƒIƒyƒŒ[ƒVƒ・“‚ニƒƒWƒXƒeƒBƒbƒN‚フƒpƒtƒH[ƒ}ƒ“ƒX‚-˜‰v‚ノŒ‹‚ム‚ツ‚ッ‚ネ‚「‚±ニ‚ヘ?A‘‚「‚ノ-ウ‘ハ‚ナ‚ ‚驕B‚サ‚‚ヘAl“セ‚オ‚ス・‘P‚フˆロ‚ƒƒXƒN‚ノ‚ウ‚‚キB
However, there are still quite a few "junk" characters in there which are not correctly recognized as Japanese characters. Looking at the raw RTF you'll see the following:
-D\'82\'82\u65405?\'83I\'83y\'83\'8c[\'83V\'83\u12539?\ldblquote\'82\u65414?
Clearly, the Unicode characters are rendered correctly, but, for example, the \'82\'82 pair of characters should be something else? My guess is that it actually represents a double-byte character of some sort, which was for some mysterious reason encoded as two separate characters rather than a single Unicode character.
Is there a generic, (relatively) foolproof way to take RTF containing Eastern languages and reliably display it again?
For completeness' sake, I updated the RTF font table in the following way:
Replaced the font name "?l?r ?o?S?V?b?N;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Updated font names by replacing "\froman\fprq1\fcharset0 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1\fcharset238 " with "\fnil\fprq1\fcharset128 "
Updated font names by replacing "\froman\fprq1 " with "\fnil\fprq1\fcharset128 "
Replaced the font name "?? ?????;" with "\'82\'6c\'82\'72 \'82\'6f\'83\'53\'83\'56\'83\'62\'83\'4e;"
Update: Updating font names alone won't make a difference. The locale seems to be the big problem. I have seen a few sites discussing ways to convert the display of Japanese RTF into something most readers would handle, but I haven't found a solution yet; see for example:
here and here.
My guess is that changing font names in the RTF has probably made things worse. If a font specified in the RTF is not a Unicode font, then surely the characters due to be rendered in that font will be encoded as Shift-JIS, not as Unicode. And then so will the other characters in the text. So treating the whole thing as Unicode, or appending Unicode text, will cause the corruption you see. You need to establish whether RTF you import is encoded Shift-JIS or Unicode, and also whether the machine you are running on (and therefore D2009 default input format) is Japanese or not. In Japan, if a text file has no Unicode BOM it would usually be Shift-JIS (but not always).
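A rough way to run that check before loading the file, sketched in Python (the heuristic is mine: an explicit BOM wins, then strict UTF-8 validation, then Shift-JIS as the no-BOM default the answer describes):

import codecs

def guess_encoding(path):
    raw = open(path, 'rb').read()
    # An explicit BOM settles the question immediately.
    for bom, name in ((codecs.BOM_UTF8, 'utf-8-sig'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if raw.startswith(bom):
            return name
    # No BOM: UTF-8 validates strictly, so try it first, then Shift-JIS.
    for candidate in ('utf-8', 'shift_jis'):
        try:
            raw.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            pass
    return 'unknown'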
I was seeing something similar, but not with Japanese fonts. Just special characters like micro (as in microliters) and superscripts. The problem was that even though the RTF string I was sending to the user from an ASP.NET webpage was correct (I could see the encoded RTF stream using Fiddler2), when MS Word actually opened the RTF, it added a bunch of garbage escape codes like what I see in your sample.
What I did was to run the entire RTF text through a conversion routine that swapped all characters above ASCII 127 for their Unicode code point equivalent. So I would get something like \uc1\u181? (micro) for the special characters. When I did that, Word was able to open the file with no problem. Ironically, it re-encoded the \uc1\uxxx? back to their RTF-escaped equivalents.
Private Function ConvertRtfToUnicode(ByVal value As String) As String
    Dim sb As New System.Text.StringBuilder()
    Dim code As Integer
    For Each c As Char In value
        code = Microsoft.VisualBasic.AscW(c)
        If code <= 127 Then
            'No replacement needed for the typical ASCII codes
            sb.Append(c)
        Else
            'MR: Basic idea came from here http://www.eggheadcafe.com/conversation.aspx?messageid=33935981&threadid=33935972
            'Swap the character for its Unicode decimal code point equivalent.
            'RTF's \uN parameter is a signed 16-bit value, so wrap code points
            'above 32767 into their negative two's-complement form.
            If code > 32767 Then code -= 65536
            sb.Append(String.Format("\uc1\u{0}?", code))
        End If
    Next
    Return sb.ToString()
End Function
Not sure if that will help your problem, but it's working for me.