PDF - Ligature mapping in CMap - iOS

I have a PDF which has the following mapping:
<019A> <0074>
<039E> <00A9>
<019F> <00740069>
<01B5> <0075>
<01C0> <0076>
<01C7> <0079>
<03EC> <0030>
In this mapping, the CID <019F> represents the ligature "ti": \u0074 maps to 't' and \u0069 maps to 'i', hence the ligature "ti".
How do I get the ligature's actual Unicode value? Or do I have to keep track of such patterns and replace the CID mapping with the actual Unicode code point of the ligature?
Thanks.

Essentially, you cannot assume that a character code maps to only one Unicode character; you have to take the output of all the mapped characters. There can even be more than two characters in a mapping: some fonts have a ligature for "ffl", for example.
Note that the Unicode specification also has special single-character definitions for ligatures: https://en.wikipedia.org/wiki/Typographic_ligature
It's possible that these special ligature code points are used in the mapping.
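If it helps, here is a minimal sketch of how such a multi-character destination could be decoded (the helper name is mine, not from any library); destination strings in a ToUnicode CMap are hex-encoded UTF-16BE:

#import <Foundation/Foundation.h>

// Decode a ToUnicode bfchar destination such as "00740069" (hex-encoded
// UTF-16BE) into an NSString, so <019F> -> @"ti" falls out naturally.
static NSString *StringFromBFCharHex(NSString *hex) {
    NSMutableData *bytes = [NSMutableData dataWithCapacity:hex.length / 2];
    for (NSUInteger i = 0; i + 2 <= hex.length; i += 2) {
        unsigned value = 0;
        NSScanner *scanner = [NSScanner scannerWithString:
                                  [hex substringWithRange:NSMakeRange(i, 2)]];
        if (![scanner scanHexInt:&value]) return nil;
        uint8_t byte = (uint8_t)value;
        [bytes appendBytes:&byte length:1];
    }
    // A single CID may decode to one, two, or more characters
    // (e.g. the "ti" or "ffl" ligatures).
    return [[NSString alloc] initWithData:bytes
                                 encoding:NSUTF16BigEndianStringEncoding];
}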

Related

Convert Unicode escape sequence into its corresponding character

I'm receiving a string from the server and it has the special characters in code. Here's the example:
"El usuario o las contrase\UOOOOfffda no son v\UOOOOfffdlidos"
The first one should be an "ñ" and the second one an "á".
I know it's not complicated but I can't find the answer. How can I get the string with the special characters correctly formatted?
Unicode U+FFFD (in your string, displayed as UTF-32 \U0000fffd) is "�", the replacement character. It is often substituted in strings when a system encounters unrecognized characters.
This character really shouldn't appear in string data since its purpose is to indicate an error in displaying or interpreting the string. Since your server is sending you that character for both ñ and á, there is no way to retrieve the correct character.
How are you "receiving" this string? It could be that you are accessing the server incorrectly so it isn't sending you an unmodified string.
Unicode escapes for those characters should look like this:
@"tilde-n is \u00f1, and accented-a is \u00e1"
But it's not clear that what you're getting from the server makes any sense. An Objective-C \u escape must have a lowercase leading "u" followed only by valid hex digits (0-9 and a-f), and I don't see a transformation that changes the literals you have into the ones you expect.
Once the characters are formatted properly, the built-in classes will just work: for example, assigning the string to a label's text property will show the user a nice glyph.
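For illustration, assuming the server actually sends valid UTF-8 (responseData and label are hypothetical names here), decoding and display would then just work:

// Decode the raw response bytes as UTF-8 and hand the result to a label;
// if the bytes are valid, "ñ" and "á" render correctly.
NSString *message = [[NSString alloc] initWithData:responseData
                                          encoding:NSUTF8StringEncoding];
label.text = message;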

CGPDFPageRef special characters

I am trying to parse some PDF files to get the text out of them, but I have problems reading special characters like ț ă â ' " and others.
I have the following operators:
CGPDFOperatorTableSetCallback(table, "MP", &op_MP);
CGPDFOperatorTableSetCallback(table, "DP", &op_DP);
CGPDFOperatorTableSetCallback(table, "BMC", &op_BMC);
CGPDFOperatorTableSetCallback(table, "BDC", &op_BDC);
CGPDFOperatorTableSetCallback(table, "EMC", &op_EMC);
CGPDFOperatorTableSetCallback(table, "TJ", arrayCallback);
CGPDFOperatorTableSetCallback(table, "Tj", stringCallback);
Instead of those special characters, I get Ñ Ó ß and so on...
Is there something I'm missing?
Thanks
The parameters of the TJ and Tj operators (and the other text-showing operators) are not actual strings but byte arrays. The bytes in these arrays have to be translated into characters based on the font's Encoding and its ToUnicode CMap (if available).
You also have to handle the Tf operator, which sets the active font. Based on the font ID provided as its parameter, you locate the font object in the /Resources dictionary. The font object contains the entries necessary for decoding the parameters of TJ/Tj correctly.
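As a rough sketch of what the Tj callback has to do (the decoding step itself is elided, since it depends on the font found in /Resources):

// Pops the Tj operand and exposes its raw bytes. These are character codes
// in the font's encoding, NOT text; they still have to be mapped through
// the font's Encoding / ToUnicode CMap before they mean anything.
static void stringCallback(CGPDFScannerRef scanner, void *info) {
    CGPDFStringRef pdfString = NULL;
    if (!CGPDFScannerPopString(scanner, &pdfString)) {
        return;
    }
    const unsigned char *bytes = CGPDFStringGetBytePtr(pdfString);
    size_t length = CGPDFStringGetLength(pdfString);
    // ... translate bytes[0..length-1] using the active font's tables ...
}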
The PDFKitten framework is a good starting point for inspiration.
Reading the PDF specification (section 9.10 and related) is a must for implementing text extraction from PDF files.
In a PDF file, characters are represented by their glyph codes in their font. Fonts can use an arbitrary encoding, so there is no guarantee that a glyph code corresponds to the Unicode code point for the glyph, or even that the glyph has a Unicode code point at all. (For example, many fonts include ligatures and alternate forms of certain letters.) It can get quite complicated.
There may (and should) be some indication of how to translate glyph codes to Unicode. There might be an explicit glyph-to-Unicode map, or the font might use a standard Unicode-to-glyph encoding. The information should be in the font dictionary, so you need to know which font the characters are being rendered with.
Unfortunately, I don't know how you would access this information using the Quartz 2D framework.
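For what it's worth, the raw dictionaries can at least be reached with Quartz; a sketch (page is your CGPDFPageRef, and "F1" stands in for whatever font ID the Tf operator supplied):

CGPDFDictionaryRef pageDict = CGPDFPageGetDictionary(page);
CGPDFDictionaryRef resources = NULL, fonts = NULL, font = NULL;
if (CGPDFDictionaryGetDictionary(pageDict, "Resources", &resources) &&
    CGPDFDictionaryGetDictionary(resources, "Font", &fonts) &&
    CGPDFDictionaryGetDictionary(fonts, "F1", &font)) {
    CGPDFStreamRef toUnicode = NULL;
    if (CGPDFDictionaryGetStream(font, "ToUnicode", &toUnicode)) {
        // parse this CMap stream to build the code -> Unicode map
    }
}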

How to escape strings with numeric character references in Java

Hello and thank you for reading my post.
The Apache Commons StringEscapeUtils.escapeHtml3() and StringEscapeUtils.escapeHtml4() functions make it possible, in particular, to convert characters with diacritics (like é, à...) in a string into
character entity references, which have the format &name; where name is a case-sensitive alphanumeric string.
How can I get the escaped version of a given string with numeric character references instead (&#nnnn; or &#xhhhh;, where nnnn is the code point in decimal form and hhhh is the code point in hexadecimal form)?
I actually need to escape strings for an XML document which doesn't know about entities such as &eacute;, &agrave;, etc.
Best regards.
To solve this problem, I wrote a method which takes a string as an argument and replaces, in that string, character entity references (like &eacute;) with their corresponding numeric character references (&#233; in this case).
I used this list of the W3C's XHTML Latin-1 entity definitions: http://www.sagehill.net/livedtd/xhtml1-transitional/xhtml-lat1.ent.html
Note: it would be great to be able to pass another argument to StringEscapeUtils.escapeHtml4() to tell it whether we want character entity references or numeric character references in the output string...
Create your CharSequenceTranslator:
CharSequenceTranslator XML_ESCAPE = StringEscapeUtils.ESCAPE_XML11.with(
        NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));
and use it:
XML_ESCAPE.translate(…)
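Put together as a runnable sketch (against commons-lang3, whose ESCAPE_XML11 constant requires version 3.3 or later):

import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
import org.apache.commons.lang3.text.translate.NumericEntityEscaper;

public class XmlNumericEscape {
    // Escape the XML specials, then replace everything above US-ASCII
    // with numeric character references instead of named entities.
    private static final CharSequenceTranslator XML_ESCAPE =
            StringEscapeUtils.ESCAPE_XML11.with(
                    NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));

    public static void main(String[] args) {
        // prints: caf&#233; &amp; cr&#232;me
        System.out.println(XML_ESCAPE.translate("café & crème"));
    }
}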

Showing wrong character for a Unicode value in iOS

I am working on an iOS app that handles Unicode characters, but there seems to be a problem translating a Unicode hex value (and its int value, too) into a character.
For example, I want to get the character 'đ', which has the Unicode value c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean character) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of the two pieces of code above are the same.
I can't understand what the problem is here. Please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last 2 lines uses C string literal instead of Objective-C string object literal construct #"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
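To make it concrete, a quick check (a minimal Foundation command-line tool; note that the stringWithCharacters: approach from the question also works once it is given the code point 0x0111 in a proper unichar variable):

#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSLog(@"%@", @"\u0111");                                     // đ
        NSLog(@"%@", [NSString stringWithUTF8String:"\xc4\x91"]);    // đ
        unichar ch = 0x0111;  // the code point, not the UTF-8 bytes
        NSLog(@"%@", [NSString stringWithCharacters:&ch length:1]);  // đ
    }
    return 0;
}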
Unicode escape sequences (Universal character names in C99)
According to this blog [1]:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that \unnnn or \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by ISO/IEC 10646", roughly means a hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confusing the code point U+0111 with the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give a different byte representation of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encodings for the Unicode character set, the code point for the same character is the same across all three encodings.
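A quick way to see this is to dump the same one-character string under three encodings:

NSString *d = @"\u0111";  // đ -- code point U+0111 in every Unicode encoding
NSLog(@"%@", [d dataUsingEncoding:NSUTF8StringEncoding]);            // <c491>
NSLog(@"%@", [d dataUsingEncoding:NSUTF16BigEndianStringEncoding]);  // <0111>
NSLog(@"%@", [d dataUsingEncoding:NSUTF32BigEndianStringEncoding]);  // <00000111>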
Footnote
[1]: I think the blog is correct, but if anyone can find official documentation from Apple on this point, that would be better.

Lua pattern match around comma

I have several small place marks such as 'א,א' and 'א,ב'. Using the comma as the center point, I need at most 2 characters before the comma, and everything up to the next space after the comma.
I have (.-,.-)%s but it's not doing what I need. Any idea?
Also, as you can see, these are not Latin letters, so using %l will not work.
There are a couple of issues here. First, a minor one: .-, will match as little as possible before the comma, that is, zero characters. You should anchor the beginning of the matched string.
The more complicated issue is that you use Hebrew letters. The problem is that Lua has no concept of multi-byte characters.
If you use an 8-bit encoding such as Windows-1255 or ISO-8859-8, then you can probably simply match against a character class [ת-א]. If you have a properly set Hebrew locale, %l should work fine for you.
If you use UTF-8 or any other encoding that uses multi-byte characters, then you must construct a pattern that has all of the Hebrew alphabet escaped as a sequence of octets. Aleph is U+05D0, which in UTF-8 is represented as 0xD7 0x90. Tav is U+05EA, which is encoded as 0xD7 0xAA.
In Lua you can escape any 8-bit character with a backslash plus its decimal code. All the Hebrew characters encoded in UTF-8 have the same first byte, 0xD7, that is "\215". The second byte can be anything from "\144" to "\170". Thus, the pattern that matches a single Hebrew letter is "\215[\144-\170]". Put that into your original pattern wherever you had single dots that match any character.
Of course, the above reasoning must be modified for encodings other than UTF-8. Hebrew's right-to-left writing direction is another thing to keep in mind.
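Putting it together, a minimal sketch (assuming UTF-8 input; since a Lua pattern cannot make a multi-byte group optional, the two-letter and one-letter cases are tried separately):

-- "\215[\144-\170]" matches one UTF-8-encoded Hebrew letter (U+05D0..U+05EA).
local heb = "\215[\144-\170]"
-- Up to two letters before the comma, then everything up to the next space.
local two = "(" .. heb .. heb .. ",%S*)"
local one = "(" .. heb .. ",%S*)"

local s = "point \215\144,\215\145 end"  -- contains the mark 'א,ב'
local mark = s:match(two) or s:match(one)
print(mark)                              -- -> א,ב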
