Sequence processing texts - why is padding 'pre' the default? - machine-learning

What's the intuition behind adding padding at the start of the text strings in preprocessors such as keras.preprocessing.sequence.pad_sequences ?
Would I be correct to assume that it is related to the fact that default truncation removes the start of strings? On that point, I guess I also don't understand why we think that ends of text strings would be more meaningful than the start.

Related

How can I standardize the the varying truncating dot characters of UILabel?

I have a plist file which I decode to load data onto my application.
This plist file contains String type values that gets mapped to UILabel's text property.
I noticed that the truncating behavior of the text in the label is not always the same.
To be more specific, the three dots that are added when the text is truncated are, as opposed to my expectation, two kinds: one being ... and the other being ⋯ which appears to be this unicode character in this link.
I checked UILabel's attribute settings but I was unable to find any settings related to this behavior.
Has anyone else experienced this problem and standardized the truncating character to be ...?
Here is the image describing the problem mentioned above. Both labels have 2 lines and have new line escape character inserted between the first line and the second line of text. I am posting a link to this image because apparently I don't have enough reputation to post an image.
varying truncating characters of UILabel
IMO this is a bug in UILabel, and it may be worth opening a Feedback about it.
TL;DR: I recommend using TTTAttributedLabel.
Long-winded answer, because this was such an interesting question:
UILabel uses a different ellipsis based on the language script being truncated. As you've noticed, for most scripts, they use HORIZONTAL ELLIPSIS (…), or something very similar. But for Chinese, Japanese, and Korean (CJK), they use MIDLINE HORIZONTAL ELLIPSIS (⋯), or again, something very similar. The only other exception I've found is Burmese, which uses three circles that I don't recognize.
In my tests, all the following used …: Latin, Cyrillic, Bengali, Arabic, Hebrew, Hindi, Thai, Kannada, Nepali, and Mongolian (I kid. iOS can't layout Mongolian. Nobody can layout Mongolian, but it still uses …). UILabel even uses … for Lao, even though I thought ຯ was specifically for that, but I guess eventually everything becomes Latin.
The problem with UILabel being so clever for CJK and Burmese is that it decides what character to use exclusively by looking at the first character being removed. And it thinks SPACE is Latin (or at least not "special").
So what to do? My recommendation is probably to use TTTAttributedLabel, since it lets you configure the truncation character, and more importantly, is open source so you can fix it if it's not working the way you want.
The second option would be to truncate the text by hand using techniques like the one described in How to change truncate characters in UILabel?. There are probably better ways to do it using CTFrameGetVisibleStringRange instead of constantly shrinking the string until it fits, but I don't know if it's worth the effort. (If that path sounds useful, I could probably write up something that does it. It's just probably not worth the trouble.)
And the final option I know is to replace the SPACE character with an "equivalent" CJK character. The closest I've found that works is HANGUL FILLER (U+3164), but I don't like it. It's too wide, and I expect that it will make Korean uncomfortable to read (but I rarely try to read Korean, so I may be wrong here):
With SPACE: 안녕 하세요
With FILLER: 안녕ㅤ하세요
There's also HALFWIDTH HANGUL FILLER (U+FFA0), which is better, but UILabel seems to make it zero width (this may be a font issue, so maybe worth trying):
With SPACE: 안녕 하세요
With HALF: 안녕ᅠ하세요
let string = "안녕 하세요"
let filler = "\u{3164}"
label.text = string.replacingOccurrences(of: " ", with: filler)
OTOH, you may run into the same problem if you use any other non-CJK characters, like Latin punctuation or Arabic numerals. So this solution may not scale. And you should make sure that Voice Over properly ignores it.

Change the height of individual characters using core text

I have a font where unfortunately the numbers and letters are different heights. I need to display a reference code which is a mix of letters and numbers and the uneven heights of the characters looks jarring. Is it possible with core text (or another technology on iOS) to render certain characters with a slightly stretched height so that it looks even numbers and letters are displayed together.
E.g i have the string '23Rt59RQ' I need the 2,3,5,9 to be rendered with a larger height.
AFAICT, there's nothing in the CGContext API (which is what you'd want to use for laying out sets of glyphs) which would directly, easily facilitate this.
If it's really very important to use the font you are using, you could make separate calls to CGContextShowGlyphsAtPositions for alphabetic and numeral characters, calling CGContextSetFontSize each time so that the end result ends up matching, but this is a lot of overhead for just drawing text, and will probably result in undesirable performance.
My real advice would be to pick a better font so that this isn't even an issue :)
In the end of used regex to identify the character groups and then created an attributed string varying the font size in the font given in the NSFontAttributeName attribute according to which characters were to be displayed.
Kinda hacky but it had the desired effect.

MathJax overarrow too short

I am trying to place an overarrow over a piece of text in MathJax.
I am using a custom font that I declare in the code-
\(\overrightarrow{\style{font-family: mysans, TeX, Arial, sans-serif;}{\text{" + tString + "}}}\)"
It works ok for most letters- for capital W or M , using a couple in a row like "WWW" the overbar is too short.
For lowercase i , using a couple in a row, ie "iii" it is too long. My hunch is that MathJax is using a standard character width size to figure out the length of the overarrow and when the character is much longer or shorter than that size, it calculates the overarrow incorrectly. Is there any way around this?
Thanks!
First off, you generally cannot use custom fonts with MathJax. As the documentation says
Since browsers do not provide APIs to access font metrics, MathJax has to ship with the necessary font data; this font data is generated during development and cannot be generated on the fly. In addition, most fonts do not cover the relevant characters for mathematical layout. Finally, some fonts (e.g. Cambria Math) store important glyphs outside the Unicode range, making them inaccessible to JavaScript.
However, if you are only looking to use custom fonts in text elements, then there is a way to work around this: style the surrounding context and set mtextFontInherit:true for the output jax, cf. e.g. here for HTML-CSS.
Unfortunately, this won't actually help you right now. There's a minor regression in MathJax 2.5 (see this discussion leading to the result you describe). This will be fixed in 2.5.1 and in the mean time you could set noReflows:false for the HTML-CSS output.

In a UILabel, is it possible to force a line NOT to break in a certain place

I have a UILabel that is supposed to be two lines long. The text is localized into French and English.
I'm setting the attributedText property of the label. The text is constructed from three appended strings, say textFragment1, textFragment2 and textFragment3. The middle string is actually an image created with an NSTextAttachment.
I want to make sure that textFragment2 and textFragment3 are on the same line. The first string may or may not wrap depending on how long it is. The problem is that my French text is currently fairly short, so the line is wrapping after textFragment2.
I have fixed this temporarily by adding a line break symbol in the localized French text for textFragment1. I don't really love this solution though. I was wondering if there is a way to treat textFragment2 and textFragment3 so that they will always be together on the same line.
You could use a non-breaking space (\u00a0) to join textFragment2 and textFragment3. This character looks just like a normal space—i.e. it results in the same amount of whitespace—but line breaking will not take place on either side of it.
You could also use a zero-width space (\u2060). Using this character will not result in any whitespace, but it will still prevent line breaking on either side. This is what you want if you don’t want any space between textFragment2 and textFragment3 but you still want to prevent line breaking there. (It’s also useful if you have a word with a hyphen in the middle of it but you want to prevent the line from being broken after the hyphen.)
You can read more about these kinds of characters on Wikipedia.

Efficiently Computing Text Widths

I need to compute the width of a column with many rows (column AutoSize feature). Using Canvas.TextWidth is far too slow.
Current solution: My current solution uses a text measurer class that builds a lookup table for a fixed alphabet once and then computes the width of a given string very fast by adding up character widths retrieved from the lookup table. For characters not contained in the lookup table, the average character width is used (also computed once).
Problem: This works well for European languages but not for Asian languages.
Question: What's the best way to tackle this problem? How can such an AutoSize feature be realized without the relatively slow Canvas functions and without depending on a specific alphabet?
Thanks for any help.
You said you want to get the maximum text width for a column. Can't you, say, take only the 4 or 5 longest strings and get their widths? That way you won't have to find the width for all items and can save quite some time.
Or you use your cache to find the rough length of the strings and then refine that by getting the actual width for the top 4 or 5 items you found.
I don't think it matters a lot whether you use Canvas.TextWidth or GetTextExtentPoint32. Just use one of these to get the exact widths, after you used one of the methods above to guesstimate the longest/widest strings.
To those who think this doesn't work
If the poster of the original question thinks it could work, I have no reason to think it won't. He knows best what kind of strings can be in the columns he has.
But that is not my main argument. He already wrote that he does a preliminary textwidth by adding the predetermined individual widths of the characters. That does not take into account any kerning. Well, kerning can only make a string narrower, so it still makes sense to check only the top 4 or 5 items for the exact width. The biggest problem that can occur is that the column could be a few pixels too wide, no more. But it will be a lot faster than using TextWidth or GetTextExtentPoint32 or similar functions on each entry (assuming more than 5 entries), and that is what the original poster wanted. I suggest that those who don't believe me simply try it out.
As for using the pure string length: even that is probably good enough. Yes, 'WWW' is probably wider than '!!!!!', but the original poster will probably know best wat kind of string material he has, and if it is feasible. '!!!!!' or 'WWW' are not the usual entries one expects. Especially if you consider that not only one single string is checked, but the longest 4 or 5 strings (or whatever number turns out to be optimal). It is very unlikely that the widest string is not among them. But the original poster can tell if that is possible or feasible. He seems to think it is.
So stop the downvoting and try it out for yourself.
I'm afraid you have to use Canvas.TextWidth, or your implementation will be imprecise. The width of text depends on the font kerning, where different character sequences may have different widths (not just the total of individual character widths).
Me, I cut out the middle-man and use the Windows API directly. Specifically, I use GetTextExtentPoint32 with the .Handle of the Canvas. There's nothing you can do to be faster, other than caching results in some way, and frankly you'll just add overhead.

Resources