I have a UITextField that takes input in both English and French.
After I type French accented characters, for example àÂèÎôÛç, and log the string, it looks like this:
\M-C\240\M-C\M^B\M-C\M-(\M-C\M^N\M-C\M-4\M-C\M^[\M-C\M-'
I've done a little research, but couldn't find what kind of encoding this is.
What kind of encoding is this?
How do people normally handle the encoding of special characters?
The return type of the text method of UITextField is NSString which according to the documentation is:
An NSString object encodes a Unicode-compliant text string, represented as a sequence of UTF–16 code units. All lengths, character indexes, and ranges are expressed in terms of 16-bit platform-endian values, with index values starting at 0.
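So the string is just Unicode, and the Xcode console simply isn't printing it correctly. What you see in the log is the string's UTF-8 bytes in an escaped form: for example, \M-C\240 is the byte pair 0xC3 0xA0, which is à in UTF-8.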
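To check this outside Xcode, here is a minimal Python sketch that undoes the escaping, assuming the log uses the vis/cat -v style notation (\M-x sets the high bit on x, \M^x is a control character with the high bit set, \ooo is an octal byte). The function name decode_vis is made up for illustration, and it only handles the escape forms that appear in this log.

```python
import re

# One alternative per escape form seen in the log, plus a catch-all for plain characters.
_TOKEN = re.compile(
    r"\\M-(.)"               # \M-x : byte is 0x80 | ord(x)
    r"|\\M\^(.)"             # \M^x : byte is 0x80 | (ord(x) ^ 0x40)
    r"|\\([0-3][0-7]{2})"    # \ooo : byte given in octal
    r"|(.)",                 # anything else is taken literally
    re.DOTALL,
)


def decode_vis(text):
    """Turn vis-style escaped text back into the raw bytes it stands for."""
    out = bytearray()
    for match in _TOKEN.finditer(text):
        meta, meta_ctrl, octal, plain = match.groups()
        if meta is not None:
            out.append(0x80 | ord(meta))
        elif meta_ctrl is not None:
            out.append(0x80 | (ord(meta_ctrl) ^ 0x40))
        elif octal is not None:
            out.append(int(octal, 8))
        else:
            out.extend(plain.encode("utf-8"))
    return bytes(out)


logged = r"\M-C\240\M-C\M^B\M-C\M-(\M-C\M^N\M-C\M-4\M-C\M^[\M-C\M-'"
raw = decode_vis(logged)
print(raw.hex(" "))
print(raw.decode("utf-8"))
```

Decoding the resulting bytes as UTF-8 gives back the characters that were typed, which suggests nothing is wrong with the string itself.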
I am stuck a bit in decoding. I got a base64-encoded .rtf file.
A little part of this looks like this: Bek\u252\''fcld\u337\''3f
Which represents: Beküldő
But my output data after decoding is: Bekuld?
If I manually replace the characters it works.
Result := StringReplace(Result, 'U337\''3F', '''F5', [rfReplaceAll, rfIgnoreCase]);
Does anyone know a general solution for this? Some conversion or something?
For instance, \u242 means Unicode character #242.
So you could search for \u in the RTF content (ignoring any \\ escaped sequence), then retrieve the following number, and use it as a character.
But RTF is a very complex beast.
Check what the RTF 1.5 specification says about encoding:
\uN This keyword represents a single Unicode character which has no
equivalent ANSI representation based on the current ANSI code page. N
represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in
ANSI representation. In this way, old readers will ignore the \uN
keyword and pick up the ANSI representation properly. When this
keyword is encountered, the reader should ignore the next N
characters, where N corresponds to the last \ucN value encountered.
Perhaps the easiest is to use a hidden RichEdit for decoding, under Windows/VCL.
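If a RichEdit control is not available, the \uN rule quoted above can be sketched directly. This is only a rough Python illustration of that rule for a text fragment, not a real RTF parser: the function name decode_rtf_text is hypothetical, the ANSI code page is an assumed parameter, other control words and groups are not interpreted, and surrogate pairs are not combined. It also assumes the doubled apostrophes in the question are Delphi string-literal escaping, so the actual RTF reads \'fc and \'3f.

```python
import re

_RTF_TOKEN = re.compile(
    r"\\uc(\d+) ?"             # \ucN : how many fallback characters follow each \uN
    r"|\\u(-?\d+) ?"           # \uN  : Unicode value as a (possibly negative) decimal
    r"|\\([\\{}])"             # \\ \{ \} : escaped literal characters
    r"|\\'([0-9a-fA-F]{2})"    # \'hh : one byte in the ANSI code page
    r"|(.)",                   # any other character is copied through
    re.DOTALL,
)


def decode_rtf_text(fragment, ansi_codepage="cp1252"):
    """Decode \\uN and \\'hh escapes in an RTF text fragment (sketch only)."""
    out = []
    uc = 1      # the RTF default is one fallback character per \uN
    skip = 0    # fallback characters still to be ignored
    for match in _RTF_TOKEN.finditer(fragment):
        uc_n, u_n, literal, hex_byte, plain = match.groups()
        if uc_n is not None:
            uc = int(uc_n)
        elif u_n is not None:
            code = int(u_n)
            if code < 0:            # values above 32767 are written as negative numbers
                code += 0x10000
            out.append(chr(code))
            skip = uc               # ignore the ANSI fallback that follows
        elif skip > 0:
            skip -= 1
        elif literal is not None:
            out.append(literal)
        elif hex_byte is not None:
            out.append(bytes([int(hex_byte, 16)]).decode(ansi_codepage, errors="replace"))
        else:
            out.append(plain)
    return "".join(out)


print(decode_rtf_text(r"Bek\u252\'fcld\u337\'3f"))
```

The key point is that the \'3f after \u337 is the "?" fallback for old readers, so a reader that understands \uN has to skip it rather than emit it.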
First off this solution doesn't work for ligatures:
Convert or Print CGPDFStringRef string
I'm reading text from a PDF and trying to convert it to an NSString. I can get a byte array of text using Apple's CGPDFScanner in the form of a CGPDFString. The "fi" ligature character is giving me trouble. When I look at my byte array in the debugger, I see a '\f'.
So, for simplicity's sake, let's say that I have this char:
unsigned char myLigatureFromPDF = '\f';
Ultimately I'd like to convert it to this (the unicode value for the "fi" ligature):
unichar whatIWant = 0xFB01;
This is my failed attempt (I copied this from PDFKitten btw):
const char str[] = {myLigatureFromPDF, '\0'};
NSString* stringEncodedLigature = [NSString stringWithCString:str encoding:NSUTF8StringEncoding];
unichar encodedLigature = [stringEncodedLigature characterAtIndex:0];
If anyone can tell me how to do this, that would be great.
Also, as a side note, how does the debugger interpret the unencoded byte array? In other words, when I hover over the array, how does it know to show a '\f'?
Thanks!
Every PDF parser is limited in its capabilities by one single important point of the PDF specifications: characters in literal strings are encoded as bytes or words, but the encoding does not need to be included in the file.
For example, if a subset of a font is included where the code "1" corresponds to the image (character glyph) of an "h" and the code "2" maps to a glyph "a", the string (\1\2\1\2) will show "haha", as expected. But if the PDF contains no further information on how the glyphs in that font correspond to Unicode, there is no way for a string decoder to find out the correct character codes for "glyph #1" and "glyph #2".
It seems your test PDF does contain that information -- else, how could it infer the correct characters for "regular" characters? -- but in this case, the "regular" characters were simply not remapped to other binary codes, for convenience. Also, again for convenience, the glyph for the single character "fi" was remapped to "0x0C" in the original font (or in the subset that got included into your file). But, again, if the file does not contain a translation table between character codes and Unicode values, there is no way to retrieve the correct code.
The above is true for all PDFs and strings. If the font definition in the PDF contains an encoding, your string extraction method should use it; if the PDF contains a /ToUnicode table for the font, again, your method should use it. If it contains neither, you get the literal string contents (and, presumably, you are not informed which method was used and how reliable it is).
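The order of preference described above can be sketched as follows. This is an illustration in Python rather than anything CGPDFScanner provides: decode_pdf_string, the dict standing in for a /ToUnicode table, and the single-byte character codes are all assumptions made for the example.

```python
def decode_pdf_string(codes, to_unicode=None, font_encoding=None):
    """Return (text, method) for a PDF literal string given as bytes.

    to_unicode    -- hypothetical dict {character code: Unicode string},
                     standing in for the font's /ToUnicode table
    font_encoding -- name of a Python codec standing in for the font's encoding
    """
    if to_unicode is not None:
        # Best case: the file tells us what each code means.
        text = "".join(to_unicode.get(code, "\ufffd") for code in codes)
        return text, "ToUnicode"
    if font_encoding is not None:
        return codes.decode(font_encoding, errors="replace"), "font encoding"
    # Nothing to go on: hand back the literal byte values unchanged.
    return codes.decode("latin-1"), "raw bytes"


raw = b"\x0cnd"  # "find" where the fi ligature is stored as code 0x0C
table = {0x0C: "\ufb01", ord("n"): "n", ord("d"): "d"}

print(decode_pdf_string(raw, to_unicode=table))
print(decode_pdf_string(raw))
```

Returning the method alongside the text is a design choice for the sketch, so the caller can tell how reliable the result is.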
As a final footnote: in TeX and LaTeX fonts, ligatures are mapped to lower ASCII codes (as well as a smattering of other non-ASCII codes, such as the curly quotes). It seems you are reading a PDF that was created through TeX here -- but that can only be inferred from this particular encoding. Also, even if you know in advance that the PDF was generated through TeX, it's not guaranteed that it does use this particular encoding, as the decision to translate or not translate is at the discretion of the PDF generator, not TeX itself.
Is there a simple way (a function, a method...) of validating a character that a user types to see if it's compatible with Mac OS Roman? I've read a few dozen topics to find out why an iOS application crashes in reference to CGContextShowTextAtPoint. I guess an application can crash if it tries to draw on an image a string (i.e. ©) containing a character that is not included in the Mac OS Roman set. There are 256 characters in this set. I wonder if there's a better way other than matching the selected character one by one with those 256 characters?
Thank you
You might give https://developer.apple.com/library/mac/#documentation/graphicsimaging/conceptual/drawingwithquartz2d/dq_text/dq_text.html a closer read.
You can draw any encoding using CGContextShowGlyphsAtPoint instead of CGContextShowTextAtPoint, so you can tell it what the encoding is. If the user types it, you'll be getting the string as an NSString, which is a Unicode string underneath. Probably the easiest approach is to get the UTF-8 encoding of that user-entered string via NSString's UTF8String method.
If you really want to stick with the very limited MacRoman for some reason, then use NSString's cStringUsingEncoding:, passing in NSMacOSRomanStringEncoding, to get a MacRoman string. Read the documentation on this in NSString, though. It returns NULL if the user string can't be encoded in MacRoman losslessly. As the documentation discusses, you can use dataUsingEncoding:allowLossyConversion: and canBeConvertedToEncoding: to check. Read the cautions in the Discussion for cStringUsingEncoding: about the lifecycle of the returned strings, though. getCString:maxLength:encoding: might end up being a better choice for you. All of this is discussed in the class documentation for NSString.
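The same check, "can this string be converted losslessly?", can be illustrated outside Cocoa. Below is a small Python sketch using Python's built-in mac_roman codec as a stand-in for NSMacOSRomanStringEncoding. The helper names are made up, and the two tables may not agree on every code point, so treat it as an illustration of the idea rather than a substitute for canBeConvertedToEncoding:.

```python
def can_encode_mac_roman(text):
    """True if every character of text has a Mac OS Roman equivalent."""
    try:
        text.encode("mac_roman")
        return True
    except UnicodeEncodeError:
        return False


def to_mac_roman_lossy(text):
    """Lossy conversion: unencodable characters become '?'."""
    return text.encode("mac_roman", errors="replace")


print(can_encode_mac_roman("Caf\u00e9 \u00a9 2012"))
print(can_encode_mac_roman("Bek\u00fcld\u0151"))
print(to_mac_roman_lossy("Bek\u00fcld\u0151"))
```

This avoids comparing against the 256 characters one by one: attempting the conversion is the test.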
This doesn't directly answer the question but this answer may be a solution to your problem.
If you have an NSString, instead of using CGContextShowTextAtPoint, you can do:
[someStr drawAtPoint:somePoint withFont:someFont];
where someStr is an NSString containing any Unicode characters a user can type, somePoint is a CGPoint, and someFont is the UIFont to use to render the text.
I am using PDFKitten to search for strings within PDF documents and highlight the results. FastPDFKit or any other commercial library is not an option, so I stuck with the one closest to my requirements.
As you can see in the screenshot, I searched for the string "in", which is always highlighted correctly except for the last occurrence. I have a more complex PDF document where the highlight box for "in" is wrong in nearly 40% of cases.
I read the whole source and checked the issue tracker, but apart from line height problems I found nothing regarding the width calculation. For the moment I don't see any pattern in where the calculation goes wrong, and I hope that someone else has had a problem similar to mine.
My current guess is that the coordinates and character widths are calculated incorrectly somewhere in the font classes or in RenderingState.m. The project is very complex, and maybe one of you has had a similar problem with PDFKitten in the past.
I have used the original sample PDF document from PDFKitten for my screenshot.
This might be a bug in PDFKitten when calculating the width of characters whose character identifier does not coincide with its unicode character code.
appendPDFString in StringDetector works with two strings when processing some string data:
// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];
// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];
stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a Unicode string.
Thus, in spite of the name of the variable, cidString is not a sequence of character identifiers but a sequence of Unicode characters. Nonetheless, its entries are used as the argument of didScanCharacter, which in Scanner is implemented to advance the position by the character width: it passes the value to widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.
So, if the CID and the Unicode character code don't coincide, the wrong character width is determined and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12, which is very different from its Unicode code 0xfb01.
I would propose enhancing PDFKitten to also define a didScanCID method in StringDetector, which appendPDFString should call alongside didScanCharacter for each processed character, forwarding its CID. Scanner should then use this new method instead to calculate the width by which to advance its cursor.
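To make the effect concrete, here is a toy Python model of the width lookup. It is not PDFKitten code: the width table, the default width, and the function names are all invented for the illustration. Only the CID 12 and the code 0xFB01 come from the case above.

```python
FI_CID = 12          # character identifier of the fi ligature in this font
FI_UNICODE = 0xFB01  # Unicode code point of the fi ligature

# Hypothetical width table, keyed by CID as a font's width array would be.
WIDTHS_BY_CID = {FI_CID: 556, ord("n"): 556, ord("d"): 556}
DEFAULT_WIDTH = 1000  # hypothetical fallback for codes missing from the table


def width_of_character(cid):
    """Width lookup that expects a CID, like the commented widthOfCharacter."""
    return WIDTHS_BY_CID.get(cid, DEFAULT_WIDTH)


def advance(codes):
    """Total horizontal advance for a run of character codes."""
    return sum(width_of_character(code) for code in codes)


cids = [FI_CID, ord("n"), ord("d")]               # what the font expects
unicode_codes = [FI_UNICODE, ord("n"), ord("d")]  # what actually gets passed

print(advance(cids), advance(unicode_codes))
```

For plain ASCII the two code sequences are identical, which would explain why most matches are highlighted correctly and only text after a ligature drifts.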
This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types) expect the argument to be a Unicode code after all, in spite of the comment...
(Sorry if I used the wrong vocabulary here or there, I'm a Java guy... :))
I have some link resources with non-Latin characters like åäö.
These are usually user-uploaded files.
The problem is that I am not successful in encoding them.
Using filename.encodeAsURL does not seem to encode them the right way.
For example, the character ö is turned into o%CC%88.
Typing the same thing in Firefox and copying the contents gives %C3%B6.
What is the difference between these encodings, and which should I use to get the correct encoding?
Both encodings are correct. You are actually seeing the encoding of two different strings.
The key here is noticing the o at the beginning of the string:
o%CC%88 is the letter o followed by Unicode Character Combining Diaeresis, which combines with the previous character when rendered.
%C3%B6 is Unicode Character Latin Small O With Diaeresis.
What you are seeing is that in the first case, the string entered is something like these two characters: o ¨, which are actually rendered as ö.
In the second case, it's the actual character ö.
My guess is you are seeing the difference between two different inputs.
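The two results are easy to reproduce with any UTF-8 percent-encoder. Here is a short Python illustration using the standard library's urllib.parse.quote, chosen only for the demonstration; it is not what encodeAsURL uses internally.

```python
from urllib.parse import quote

decomposed = "o\u0308"   # letter o followed by COMBINING DIAERESIS
precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS

# Both render as ö, but they are different code point sequences.
print(len(decomposed), quote(decomposed))
print(len(precomposed), quote(precomposed))
```

The encoder is doing the same thing in both cases. Only the input string differs.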
Update based on below discussion: If you are dynamically processing Unicode characters, and you do not have control over the input methods, you can try to normalize the Unicode, using java.text.Normalizer (Java 1.6 or newer).
Normalizing attempts to ensure that all characters are consistently represented, so that accented characters are always represented by a combined character or always by the character+combining mark.
Rough example:
String.metaClass.normalizeUnicode = {
return java.text.Normalizer.normalize(delegate, java.text.Normalizer.Form.NFC)
}
input = input.normalizeUnicode()
There are four forms of normalization. I picked the one that seems to be best for your case based on the description of how they work, but you may prefer to try the other ones and see what works most consistently.
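To compare the four forms quickly, here is a small Python sketch using unicodedata.normalize, which takes the same form names as java.text.Normalizer. The helper normalize_and_quote and the sample file name are made up for the example.

```python
import unicodedata
from urllib.parse import quote


def normalize_and_quote(name, form="NFC"):
    """Normalize a file name to the given Unicode form, then percent-encode it."""
    return quote(unicodedata.normalize(form, name))


name = "bo\u0308o.pdf"  # o + combining diaeresis, as in the question

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, normalize_and_quote(name, form))
```

For this input, NFC and NFKC produce the precomposed ö, while NFD and NFKD keep the o plus combining mark, so NFC is the form that matches what Firefox produced.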
All that being said, if you are trying to represent Unicode characters in a URL, and they are not being loaded and processed by the code directly, it's probably best to avoid using non-Latin characters altogether. Not only does this have the benefit of consistency, it also gives significantly shorter and more legible URLs. boo.pdf is a lot easier to read than bo%CC%88o.pdf.