Why does HarfBuzz shape two single chars into one glyph?

I'm new to both Skia and HarfBuzz. My project relies on Skia to render text (and Skia relies on HarfBuzz to shape text).
So, if I try to render the text "ff" or "fl" or "fi" (or maybe some other combinations, I don't know), instead of rendering two separate glyphs, Skia renders one glyph composed of the two characters ("ff", "fl" or "fi"). It becomes much more obvious if I set the letter-spacing property.
By stepping through with breakpoints, I tracked this down to HarfBuzz's shaping result: HarfBuzz returns one glyph when the text is "ff", "fl" or "fi".
It seems I can avoid this with some HarfBuzz configuration, but I don't know how; please give me some hints.
PS: The shaping result differs when I use a different font file, so this is also related to the font file I use for shaping.

What you are observing is the result of ligature glyph substitutions that occur during text layout.
Harfbuzz is performing advanced text layout using OpenType Layout features in a font. OpenType features are a mechanism for enabling different typographic capabilities that are supported in the font.
Some features are required for correct display of a script. For example, certain features are used to trigger word-contextual forms in Arabic script, or to trigger positioning of vowel marks in Bangla script or diacritic marks in Latin script. These are mandatory for correct display of these scripts.
Other features trigger optional typographic capabilities of a font; they're not essential for correct display of the script, but may be desired for high-quality typography. Small caps or superscript forms are two examples of optional features. Many optional features should not be applied by default in applications. For instance, small caps should only be used when the content author explicitly wants them.
But in OpenType some optional features are recommended for use by default since they are intended to provide good quality typography for any body text. One such feature is "Standard Ligatures".
Your specific cases, "ff", "fi", etc., are considered standard ligatures. In any publication that has high quality typography, these will always be used in body text. Because the OpenType spec recommends that Standard Ligatures be applied by default, that's exactly what Harfbuzz is doing.
You can read the HarfBuzz documentation to find out more about how to enable or disable OpenType features. And you can find descriptions of all OpenType features in the OpenType Layout Tag Registry (part of the OpenType spec).
OpenType features use data contained directly in the fonts. Harfbuzz will enable the Standard Ligatures feature by default, but not all fonts necessarily have data that pertains to that feature. That's why you see the behaviour with some fonts but not others.
When a font does support features, the font data describe glyph-level actions to be performed. HarfBuzz (or any OpenType layout engine) reads that data and performs the described actions. There are several types of actions that can be performed. One is ligature substitution, that is, substituting a sequence of glyphs with a single glyph, the ligature glyph. Ligature substitution actions can be used in fonts for a variety of purposes. Forming an "ff" ligature is one example. But a font might also substitute the default glyphs for a base letter and a following combining mark with a single glyph that incorporates the base letter and the mark, with precise positioning of the mark for that combination. That is something that would be essential for correct display of the script, not something that should be optional.
Thus, it would be a bad idea to disable all ligature substitutions. That's why OpenType has features as a trigger/control mechanism: features are organized around distinct typographic results, not the specific glyph-level actions used to achieve those results. So, you could disable a feature like Standard Ligatures without blocking ligature substitution actions that get used by the font for other typographic purposes.
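If you shape with HarfBuzz directly (rather than only through Skia), you can switch Standard Ligatures off by passing a feature list to the shaper. Below is a minimal sketch using the uharfbuzz Python binding; the font file name is a placeholder, and how much of this you can reach from inside Skia depends on how your build drives HarfBuzz.

import uharfbuzz as hb

# Placeholder font path; use the font file you hand to Skia.
with open("MyFont.ttf", "rb") as f:
    blob = hb.Blob(f.read())
face = hb.Face(blob)
font = hb.Font(face)

def shape(text, features):
    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # fill in script, language, direction
    hb.shape(font, buf, features)
    return [info.codepoint for info in buf.glyph_infos]  # glyph IDs

# With many fonts, "fi" shapes to a single ligature glyph by default...
print(shape("fi", {}))
# ...and back to two glyphs once Standard Ligatures ("liga") is disabled.
print(shape("fi", {"liga": False}))

The C API equivalent is to fill an hb_feature_t (for example with hb_feature_from_string("liga=0", -1, &feature)) and pass it in the features array of hb_shape(). Keep the earlier caveat in mind: disable the specific feature, not ligature substitutions in general.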

Related

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
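For reference, the box-splitting workaround just described amounts to something like the following (purely illustrative; it cuts the ligature's box into equal horizontal slices, which is only a rough approximation of where the component letters sit):

# Split one ligature glyph's bounding box (x0, y0, x1, y1) into one
# equal-width slice per component character, e.g. "ffi" -> 3 slices.
def split_ligature_bbox(bbox, ligature):
    x0, y0, x1, y1 = bbox
    step = (x1 - x0) / len(ligature)
    return [(x0 + i * step, y0, x0 + (i + 1) * step, y1)
            for i in range(len(ligature))]

print(split_ligature_bbox((10.0, 0.0, 40.0, 12.0), "ffi"))
# [(10.0, 0.0, 20.0, 12.0), (20.0, 0.0, 30.0, 12.0), (30.0, 0.0, 40.0, 12.0)]

As noted, this fails precisely when you cannot tell from the SVG how the word was split into glyphs in the first place.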
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
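The command-line form is pdf2txt.py -t xml yourfile.pdf. If you prefer calling the library from Python, here is a rough sketch of the same per-glyph information through the pdfminer.six layout API (the file name is a placeholder, and details vary between pdfminer versions):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

# "document.pdf" is a placeholder path.
for page in extract_pages("document.pdf"):
    for element in page:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            for obj in line:
                if isinstance(obj, LTChar):
                    # Decoded text, font name, and bounding box for each glyph.
                    print(obj.get_text(), obj.fontname, obj.bbox)

Note that obj.get_text() returns the decoded text for the glyph, so a ligature glyph may come back as a multi-character string such as "ffi" with a single bounding box.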

Add forbidden words to TexStudio / Latex

I have some words in my language (German) that seem to be valid according to TeXstudio's spellchecker.
However, they must not be used in my thesis (and, for me at least, anywhere else).
Is it possible to add words to a list that triggers an (ideally huge) "DO NOT USE THIS!" warning, or that even prevents compilation in LaTeX when such words are used?
I'm looking for something like a negative dictionary.
I've seen files like "badwords" or "stopwords" but don't know when/how they are used. I can still use those words freely even though "check for bad words" is turned on.
In case anyone else has this problem: badwords files are named after the main language. In my case the dictionary was set to "de_DE_frami", so TeXstudio did not pick up "de_DE.badwords".
For good highlighting: one can change the appearance in the options dialog (syntax highlighting -> badwords) and make it e.g. red background, 200% size.
I'd still like to have a distinction between "bad" words and "impossible" words, since "bad" words sometimes can't be avoided, or aren't bad in all contexts.
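If you want the hard "prevent compilation" behaviour rather than just highlighting, one option outside TeXstudio's badwords mechanism is a small pre-compile check in your build script. This is a hypothetical sketch in Python; the word list and file names are placeholders:

import re
import sys

# Hypothetical list of words that must never appear in the thesis.
FORBIDDEN = ["offensichtlich", "natürlich"]  # placeholder entries

def check(path):
    hits = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for word in FORBIDDEN:
                if re.search(rf"\b{re.escape(word)}\b", line, re.IGNORECASE):
                    hits.append((path, lineno, word))
    return hits

if __name__ == "__main__":
    problems = [hit for tex in sys.argv[1:] for hit in check(tex)]
    for path, lineno, word in problems:
        print(f"{path}:{lineno}: DO NOT USE THIS! -> {word}")
    sys.exit(1 if problems else 0)  # non-zero exit aborts the build

Run it before the compiler, e.g. python check_badwords.py thesis.tex && pdflatex thesis.tex, so any hit stops the build.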

iOS Font Not Properly Displaying All Unicode Characters

Example of two characters: U+22FF, U+23BA... and many others.
Is this an encoding layer that I'm misunderstanding for iOS? Like, at some point it no longer can properly display codes beyond 22B...?
I'm capturing this in an NSString, and trying to simply update a text field. Something like
NSString *test = @"\u23ba";
[displayText setText:test];
This will display a standard type error like a box with a question mark in it, or just a box (depending on the font).
Is there a way to expand the Unicode options for iOS? Because these can be displayed on my Mac. Or is my only option some variant of the NSAttributedString route?
U+22FF and U+23BA are valid codepoints (assigned to characters). But they are supported by a few fonts only. So you should first check which font(s) are being used, or available.
For example, U+22FF is included in Asana-Math, Cambria, Cambria Math, Code2000, DejaVu Sans (oddly, only Bold Oblique typefaces), FreeSerif, GNU Unifont, Quivira, Segoe UI Symbol, STIXMath, STX, Sun-ExtA, Symbola, XITS, XITSMath. U+23BA is included in Cambria, Cambria Math, Code2000, FreeMono, FreeSerif, GNU Unifont, Quivira, Segoe UI Symbol, Sun-ExtA, Symbola. Many of these are free fonts. Typographic quality varies a lot. Cambria fonts and Segoe UI Symbol are commercial, shipped with some Microsoft products. There are probably some other fonts that cover those characters, but not many (Everson Mono, I suppose, I don’t currently have it).
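If you want to verify coverage yourself, the quickest check is whether a font file's cmap maps the codepoint at all. A small sketch with the fontTools Python library (the font path is a placeholder; on the device you would test the actual font your text field resolves to):

from fontTools.ttLib import TTFont

def supports(font_path, *codepoints):
    # getBestCmap() returns the font's preferred Unicode codepoint -> glyph map.
    cmap = TTFont(font_path).getBestCmap()
    return {hex(cp): cp in cmap for cp in codepoints}

# Placeholder path; try the font you intend to set on the label/text field.
print(supports("STIXGeneral.otf", 0x22FF, 0x23BA))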

Finding System Fonts with Delphi

What is the best way to find all the system fonts a user has available so they can be displayed in a dropdown selection box?
I would also like to distinguish between Unicode and non-Unicode fonts.
I am using Delphi 2009 which is fully Unicode enabled, and would like a Delphi solution.
The Screen.Fonts property is populated via the EnumFontFamiliesEx API function. Look in Forms.pas for an example of calling that function.
The callback function that it calls will receive a TNewTextMetricEx record, and one of the members of that record is a TFontSignature. The fsUsb field indicates which Unicode subranges the font claims to support.
The system doesn't actually have "Unicode fonts." Even the fonts that have the word Unicode in their names don't have glyphs for all Unicode characters. You can distinguish between bitmap, printer, and TrueType fonts, but beyond that, the best you can do is to figure out whether the font you're considering supports the characters you want. And if the font isn't what you'd consider a "Unicode font," but it supports all the characters you need, then what difference does it make? To get this information, you may be interested in GetFontUnicodeRanges.
The Microsoft technology for displaying text with different fonts based on which fonts contain which characters is Uniscribe, particularly font fallback. I'm not aware of any Delphi support for Uniscribe; I started writing a set of import units for it once, but my interests are fickle, and I moved on to something else before I completed it. Michael Kaplan's blog talks about Uniscribe sometimes, so that's another place to look.
I can answer half your question: you can get a list of the fonts that your current environment has access to, as a string list, from the global Screen object,
i.e.
Listbox1.Items.AddStrings(Screen.Fonts);
You can look in the Forms.pas source to see how CodeGear fills Screen.Fonts by enumerating the Windows fonts. The returned LOGFONT structure has a charset member, but this does not provide a simple 'Unicode' determination.
As far as I know Windows cannot tell you explicitly if a font is 'Unicode'. Moreover if you try to display Unicode text in a 'non-Unicode' font Windows may substitute a different font, so it is difficult to say whether a font will or will not display Unicode. For example I have an ancient Arial Black font file which contains no Unicode glyphs, but if I use this to display Japanese text in a D2009 memo, the Japanese shows up correctly in Arial and the rest in Arial Black. In other examples, the usual empty squares may show up.

What things should be localized in an application

When thinking about what areas should be taken into account for a localized version of an application a number of things pop up right away:
Text display
Date and time
Units
Numbers and decimals
User input formats
LeftToRight support
Dialog and control sizes
Are there other things/areas to remember or keep in mind when building a localizable application? Are there any resources out there which provide a listing of best practices not just for text localization but for all things around localization?
After Kudzu's talk about l10N I left the room with way more questions than I had before, and none of my old questions answered. But it gave me something to think about, and it brought the message "depends on how far you can/want to go" across.
Translate text bodies with aforementioned things
Test all your controls for length/alignment in LTR/RTL, TTB (top-to-bottom), BTT, and all their combinations.
Look out for special characters and encodings
Look out for combinations of different alignments (LTR, RTL, TTB, BTT) and how they affect punctuation and quotation marks.
Align controls according to text alignment (Hebrew Windows has its Start menu on the right).
Take string lengths into account. They can overflow in other languages.
Put labels at the correct side of icons (LTR, TTB etc)
Translate language selection controls
No texts in images (can't be translated)
Translate EVERYTHING (headers, logos, some languages use different brand names, product names etc)
Does the region have a 24:00 or a 00:00 (changes the AM/PM that goes with it too)
Does the region use AM/PM or the 24:00 system
What calendar system are they using
What digit is for what part of the date (day, month, year in all its combinations)
Try to avoid "copying [number] files" equivalents. Some languages have different rules about changing words according to quantities. (This is an extremely complicated topic that I will elaborate on if desired; see the plural-forms sketch after this list.)
Translate sentences, not words. Syntax rules are too complicated to put in your business logic.
Don't use flags for regions. Languages != countries
Consider what languages / dialects you can support (e.g. India has a gazillion languages)
Encoding
Cultural rules (some Western images showing a businesswoman can be nearly offensive in some other cultures)
Look out for language generalizations (e.g. boot(UK) != boot(US))
Those are the ones off the top of my head. The list just went on and on...
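On the plural-rules point above (the "copying [number] files" item): most i18n toolchains provide a plural-aware lookup so the translation catalog, not your code, decides which form to use. A minimal sketch with Python's standard gettext module; the "myapp" domain and locale directory are placeholders:

import gettext

# Placeholder domain/locale directory; falls back to the source strings
# if no catalog is found.
t = gettext.translation("myapp", localedir="locale",
                        languages=["de"], fallback=True)

def copying_message(n):
    # ngettext picks the right plural form for the target language;
    # languages with more than two forms are handled by the catalog's
    # Plural-Forms header, not by application code.
    return t.ngettext("Copying {n} file", "Copying {n} files", n).format(n=n)

for n in (1, 2, 5):
    print(copying_message(n))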
Don't forget the overhead of converting all documentation and help files.
A couple of hints from my J2ME app days:
Don't translate separate words; translate whole phrases, even if there are matching repetitions. Sooner or later you'll have to translate into a language where words are modified differently in different contexts, and you may end up with the equivalent of "color: greenish".
Right-to-left support includes numbering of lists, alignment, and alternative scroll bars.
Arabic script writes the same letter differently based on the surrounding letters. You can't just print a string from a character buffer; you'll need a special control to output those, or support from your platform.
Alphabetical sorting is HARD. No native Chinese speaker could ever explain the rules to me, but they will always spot wrongly sorted words. There appear to be a number of options for sorting Chinese. I guess other languages may have the same problem.
