CGPDFPageRef special characters - iOS

I am trying to parse some PDF files to extract their text, but I have problems reading special characters such as ț, ă, â, ', and ".
I have registered the following operators:
CGPDFOperatorTableSetCallback(table, "MP", &op_MP);
CGPDFOperatorTableSetCallback(table, "DP", &op_DP);
CGPDFOperatorTableSetCallback(table, "BMC", &op_BMC);
CGPDFOperatorTableSetCallback(table, "BDC", &op_BDC);
CGPDFOperatorTableSetCallback(table, "EMC", &op_EMC);
CGPDFOperatorTableSetCallback(table, "TJ", arrayCallback);
CGPDFOperatorTableSetCallback(table, "Tj", stringCallback);
Instead of those special characters, I get Ñ, Ó, ß, and so on.
Is there something I'm missing?
Thanks

The parameters of the TJ and Tj operators (and of the other text-showing operators) are not actual strings but byte arrays. The bytes in these arrays must be translated into characters based on the font's Encoding and its ToUnicode CMap (if available).
You also have to handle the Tf operator, which sets the active font. Based on the font ID provided as its parameter, you locate the font object in the /Resources dictionary. The font object contains the entries needed to decode the parameters of TJ/Tj correctly.
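To illustrate the decoding step outside the CGPDF callbacks, here is a minimal sketch in Python (not the Quartz C API), assuming a hypothetical, already-parsed single-byte ToUnicode map; a real CMap also contains bfrange entries and multi-byte codes:

# Hypothetical map from single-byte character codes to Unicode text,
# as it would be extracted from the active font's ToUnicode CMap.
to_unicode = {0x1A: "ț", 0x1B: "ă", 0x1C: "â"}

def decode_tj_operand(data: bytes) -> str:
    # Each byte is a character code in the font's encoding, not a
    # Unicode codepoint; unmapped codes are flagged rather than guessed.
    return "".join(to_unicode.get(b, "?") for b in data)

print(decode_tj_operand(b"\x1a\x1b\x1c"))  # -> țăâ

The Ñ/Ó/ß garbage typically appears when the raw bytes are interpreted directly in a single-byte encoding such as MacRoman or Latin-1 instead of being routed through this per-font lookup.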
The PDFKitten framework is a good starting point for inspiration.
Reading the PDF specification (section 9.10 and related) is a must for implementing text extraction from PDF files.

In a PDF file, characters are represented by their glyph codes in the font being used. Fonts can use arbitrary encodings, so there is no guarantee that a glyph code corresponds to the Unicode codepoint for the glyph, or even that the glyph has a Unicode codepoint at all. (For example, many fonts include ligatures and alternate forms of certain letters.) It can get quite complicated.
There may (should) be some indication of how to translate glyph codes to Unicode. There might be an explicit glyph-to-Unicode map, or the font might use a standard Unicode-to-glyph encoding. This information should be in the font dictionary, so you need to know which font the characters are being rendered with.
Unfortunately, I don't know how you would access this information using the Quartz 2D framework.

Related

How to convert a formatted string into plain text

Users copy, paste, and send data in the following format: "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖"
I need to convert it into plain text (ASCII characters, say), like 'jovy debbie'.
It comes in different fonts and formats, for example:
'𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔'
'𝙶𝚎𝚟𝚒𝚎𝚕𝚢𝚗 𝙽𝚒𝚌𝚘𝚕𝚎 𝙻𝚞𝚖𝚋𝚊𝚐'
Any help will be appreciated; I already referred to other Stack Overflow questions, but no luck :(
Those letters are from the Mathematical Alphanumeric Symbols block.
Since they have a fixed offset to their ASCII counterparts, you could use tr to map them, e.g.:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".tr("𝕒-𝕫", "a-z")
#=> "jovy debbie"
The same approach can be used for the other styles, e.g.
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
#=> "Jenica Dugos"
This gives you full control over the character mapping.
Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".unicode_normalize(:nfkc)
#=> "jovy debbie"
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".unicode_normalize(:nfkc)
#=> "Jenica Dugos"

How to tokenize/parse/search & replace a document by font AND font style in LibreOffice Writer?

I need to update a bilingual dictionary written in Writer by first parsing all entries into their parts, e.g.:
main word (font 1, bold)
foreign equivalent transliterated (font 1, italic)
foreign equivalent (font 2, bold)
part of speech (font 1, italic)
Each line of the document is the main word followed by the parts listed above, each separated by a space or punctuation.
I need to automate the process of walking through the whole file, line by line, and placing a delimiter between each part, ignoring spaces and punctuation, so I can mass-import it into a Calc file. In other words, "each part" is a sequence of characters (ignoring spaces and punctuation) that has the same font AND font style.
I have tried the standard Search & Replace feature and the AltSearch extension, but neither is able to complete the task. The main problem is that I am not able to write a search query that says:
Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation
Replace: term found above + "delimiter"
Any suggestions how I can write a script for this, or if an existing tool can solve the problem?
Thanks!
Pseudocode for the desired effect:
var $delimiter = "|"
Go to beginning of document
While not end of document do:
    var $currLine = get line from doc
    var $currChar = get next character which is not space or punctuation
    var $font = $currChar.font
    var $font_style = $currChar.font_style (e.g. bold, italic, normal)
    While not end of line do:
        $currChar = next character which is not space or punctuation
        if ($currChar.font != $font || $currChar.font_style != $font_style) { // font or style has changed
            print $delimiter
            $font = $currChar.font
            $font_style = $currChar.font_style
        }
    end While
end While
Here are tips for each of the things your pseudocode does.
First, the easiest way to move line by line is with the TextViewCursor, although it is slow. Notice the XLineCursor section. For the while loop, oVC.goDown() will return false when the end of the document is reached. (oVC is our variable for the TextViewCursor).
Get each character by calling oVC.goRight(0, False) to deselect, followed by oVC.goRight(1, True) to select. The selected value is then obtained with oVC.getString(). To ignore spaces and punctuation, perhaps use Python's isalnum() or the re module.
To determine the font of the character, call oVC.getPropertyValue(attr). Values for attr could simply be CharAutoStyleName and CharStyleName to check for any changes in formatting.
Or grab a list of specific properties such as 'CharFontFamily', 'CharFontFamilyAsian', 'CharFontFamilyComplex', 'CharFontPitch', 'CharFontPitchAsian' etc. Character properties are described at https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Formatting.
To insert the delimiter into the text: oVC.getText().insertString(oVC, "|", 0).
This Python code from GitHub shows how to do most of these things, although you'll need to read through it to find the relevant parts.
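Pulling these tips together, here is a minimal, untested sketch (it assumes it runs as a Writer macro, where XSCRIPTCONTEXT is provided, and uses isalnum() as a crude stand-in for the space/punctuation test):

def delimit_style_runs():
    doc = XSCRIPTCONTEXT.getDocument()
    oVC = doc.getCurrentController().getViewCursor()
    oVC.gotoStart(False)
    while True:
        oVC.gotoStartOfLine(False)
        prev = None                        # style of the previous significant character
        while not oVC.isAtEndOfLine():
            oVC.goRight(0, False)          # collapse the selection
            if not oVC.goRight(1, True):   # select the next character
                break
            ch = oVC.getString()
            if not ch.isalnum():           # skip spaces and punctuation
                continue
            style = (oVC.getPropertyValue("CharStyleName"),
                     oVC.getPropertyValue("CharAutoStyleName"))
            if prev is not None and style != prev:
                # Formatting changed: insert the delimiter at the start of
                # the selection, i.e. before the character that opens the
                # new part.
                oVC.getText().insertString(oVC, "|", False)
            prev = style
        if not oVC.goDown(1, False):       # False once the last line is done
            break

Note that inserting into the text you are iterating over shifts the cursor, so expect to tweak the cursor handling around insertString before this works reliably.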
Alternatively, instead of using the LibreOffice API, unzip the .odt file and parse content.xml with a script.
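For the content.xml route, a sketch of dumping each paragraph's style runs (element names follow the ODF schema; dictionary.odt is a placeholder, and resolving automatic style names like T1 back to actual fonts via office:automatic-styles is left out):

import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"

with zipfile.ZipFile("dictionary.odt") as odt:
    root = ET.fromstring(odt.read("content.xml"))

# Each paragraph is a text:p element; each differently formatted run
# inside it is a text:span whose text:style-name attribute identifies
# the (automatic) character style, and therefore the font.
for p in root.iter(TEXT_NS + "p"):
    parts = ["".join(span.itertext()).strip()
             for span in p.iter(TEXT_NS + "span")]
    print("|".join(part for part in parts if part))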

Where should my brackets be in relation to the text for Arabic languages?

Our application automatically modifies the layout of Arabic text when it is followed by a bracket, and I was wondering whether this is the correct behaviour or not.
The application shows items in the following format:
[ID of structure](version)
So version 1.5 of the English structure "stackoverflow" would be displayed as:
stackoverflow(1.5)
Note: the brackets need to be displayed. There is no space between the ID and the first bracket. The brackets simply encompass the version. The brackets could have been any character but it's far too late to switch to a different character now!
This works fine for left to right languages, but for Arabic languages the structures appear in the form:
ستاكوفيرفلوو(1.0)
I am not an Arabic speaker, and I need to know if this is actually correct. Is the Arabic format equivalent to the English format, or has something gone horribly wrong?
The text in Arabic should be shown like:
ستاكوفيرفلوو(1.0) ‏
I added the HTML entity for the RLM (Right-to-Left Mark, &rlm;) in order to fix the text. You should do the same if your application doesn't support bidi natively. You can add the RLM in any of these ways:
HTML entity (decimal):            &#8207;
HTML entity (hex):                &#x200F;
HTML entity (named):              &rlm;
How to type in Microsoft Windows: Alt+200F
UTF-8 (hex):                      0xE2 0x80 0x8F (e2808f)
UTF-8 (binary):                   11100010:10000000:10001111
UTF-16 (hex):                     0x200F
UTF-16 (decimal):                 8,207
UTF-32 (hex):                     0x0000200F
UTF-32 (decimal):                 8,207
C/C++/Java source code:           "\u200F"
Python source code:               u"\u200F"
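In application code, the fix amounts to appending the mark after the closing bracket. A tiny sketch (the variable names are made up):

# U+200F is the Right-to-Left Mark; appending it keeps the "(1.5)"
# on the expected side when the ID is Arabic.
structure_id = "ستاكوفيرفلوو"
version = "1.5"
label = f"{structure_id}({version})\u200F"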
(Note: the correct transliteration of StackOverflow is ستاك-أوفرفلو.)

PDF - Ligature mapping in CMap

I have a PDF which has the following mapping:
<019A> <0074>
<039E> <00A9>
<019F> <00740069>
<01B5> <0075>
<01C0> <0076>
<01C7> <0079>
<03EC> <0030>
In this mapping, CID <019F> represents the ligature ti.
In the mapping, \u0074 -> t and \u0069 -> i, hence the ligature ti.
How do I get the actual ligature Unicode? Or do I have to keep track of such patterns and replace the CID mapping with the actual Unicode codepoint of the ligature?
Thanks.
Essentially, you cannot assume that each character code maps to only one Unicode character. You will have to output both characters. It can even be more than two characters; some fonts have a ligature for "ffl" as well.
Note that the Unicode specification also defines single-character codepoints for some ligatures: https://en.wikipedia.org/wiki/Typographic_ligature
It is possible that these special ligature characters are used in the mapping.
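For the mechanical part, the destination values in a ToUnicode CMap are UTF-16BE, so one CID can decode to several characters. A sketch of building the lookup table from bfchar entries like the ones above (bfrange entries are not handled):

import re

cmap_src = """
<019A> <0074>
<039E> <00A9>
<019F> <00740069>
"""

cid_to_text = {}
for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_src):
    # The destination hex is UTF-16BE and may decode to multiple characters.
    cid_to_text[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")

print(cid_to_text[0x019F])  # -> 'ti' (two characters from a single CID)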

Delphi 2009: Search skipping diacritics in Unicode UTF-8

I have a UTF-8 encoded file containing Arabic text, and I have to search it.
My problem is diacritics: how do I search while skipping them?
For example, if you load that text in Internet Explorer (converting the text to HTML, of course), IE skips those diacritics.
Any help?
Edit 1: The search is simply performed by the following code:
var
  m1: TMemo; // contains the UTF-8 data
  m2: TMemo; // contains the results
  s: string;
...
m2.Lines.BeginUpdate;
for s in m1.Lines do
begin
  if Pos(eSearch.Text, s) > 0 then
    m2.Lines.Add(s);
end;
m2.Lines.EndUpdate;
Edit 2: Example of Unicode data:
قُلْ هُوَ اللَّهُ أَحَدٌ
If you search only for the letters without diacritics, قل, the word قُلْ won't be found.
On Vista+ you can probably (I have no experience with Arabic) use CompareString with the LINGUISTIC_IGNOREDIACRITIC option.
NORM_IGNORENONSPACE may also help. Then again, it may not.
Alternatively (but I'm just guessing) you may be able to parse your strings with GetStringTypeEx and manually remove diacritics. Probably you'd have to call FoldString or MultiByteToWideChar with the MAP_COMPOSITE flag first.
I find that diacritics are not the only problem.
I would do character replacements, replacing the diacritics with empty strings. I would also normalize the text: 'أ', 'إ' and 'آ' are all converted to 'ا', and the same is done for ى ئ ي ؤ و ة ه ...
For searching I'd also use a light stemmer like the Khoja stemmer (Java source here).
A more advanced way is to do it like TREC:
Remove punctuation
Remove diacritics (mainly the weak vowels). Most of the corpus did not contain weak vowels, but some of the dictionary entries did, so removing them made everything consistent.
Remove non-letters
Replace initial إ or أ with a bare alif ا
Replace آ with ا
Replace the sequence ىء with ئ
Replace final ى with ي
Replace final ة with ه
Strip 6 prefixes, the definite articles (ال، وال، بال، كال، فال) and the conjunction و (and), from the beginnings of normalized words
Strip 10 suffixes (ها، ان، ات، ون، ين، يه، ية، ه، ة، ي) from the ends of words
I would index the text by this modified form (for memos I'd store each word's position in the original text), and apply the same transformation to the search query.
I would also search in Memo1.Text rather than line by line, since the search phrase could consist of several words, with one at the end of a line and the rest wrapped onto the next.
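The diacritic stripping and the alif normalization are easy to express with Unicode decomposition; here is a sketch in Python for illustration (in Delphi you would lean on the Windows APIs mentioned in the other answer):

import unicodedata

def normalize_arabic(text: str) -> str:
    # NFD splits أ, إ and آ into a bare alif plus a combining hamza or
    # madda, and splits off the short-vowel marks (harakat); dropping
    # all combining marks therefore handles both rules at once.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Plain letter substitutions still need an explicit table.
    # (TREC restricts ة -> ه to final position; simplified here.)
    return text.replace("ة", "ه")

print(normalize_arabic("قُلْ هُوَ اللَّهُ أَحَدٌ"))  # -> قل هو الله احد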
