Swift's Vision framework not recognizing Japanese characters - iOS

I would like to read Japanese characters from a scanned image using Swift's Vision framework. However, when I attempt to set the recognition language of VNRecognizeTextRequest to Japanese using
request.recognitionLanguages = ["ja", "en"]
the output of my program becomes nonsensical Roman letters. For every image of Japanese text, the recognized output is not what I expect. However, when the request is set to other languages such as Chinese or German, the text output is as expected. What could be causing this unexpected output, which seems peculiar to Japanese?
I am building from the GitHub project here.

As they said in the WWDC 2019 video, Text Recognition in Vision Framework:
First, a prerequisite, you need to check the languages that are supported by language-based correction...
Look at supportedRecognitionLanguages for VNRecognizeTextRequestRevision2 for “accurate” recognition, and it would appear that the supported languages are:
["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant"]
If you use “fast” recognition, the list is shorter:
["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR"]
And if you fall back to VNRecognizeTextRequestRevision1, it is even shorter (lol):
["en-US"]
It would appear that Japanese is not a supported language at this point.
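You can confirm this at runtime on your own deployment target. A minimal sketch using the class method that takes a recognition level and revision (it throws, and newer SDKs supersede it with the instance method used in the next answer):

import Vision

do {
    let accurate = try VNRecognizeTextRequest.supportedRecognitionLanguages(
        for: .accurate,
        revision: VNRecognizeTextRequestRevision2
    )
    let fast = try VNRecognizeTextRequest.supportedRecognitionLanguages(
        for: .fast,
        revision: VNRecognizeTextRequestRevision2
    )
    print("accurate:", accurate)   // no "ja" entry on older OS releases
    print("fast:", fast)
} catch {
    print("Could not query supported languages:", error)
}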

VisionKit supports more languages after the Mac is updated to macOS Ventura.
You need to rebuild the app using Xcode 14.
try VNRecognizeTextRequest().supportedRecognitionLanguages()
["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant", "yue-Hans", "yue-Hant", "ko-KR", "ja-JP", "ru-RU", "uk-UA"]

Related

Multiple language OCR (Santali+Odia+English) combination is not working with gImageReader

I am trying to scan a Santali book containing multiple scripts (Ol Chiki + English + Odia) with gImageReader 3.3.1 (17fa17), which uses Tesseract 4.1.0, but I am unable to get satisfactory results.
I have tried English + Odia and they work fine, giving very good results. But when I use Santali + Odia, English + Santali, or Santali + Odia + English, the output text becomes Odia, English, or Odia and English respectively, instead of showing Ol Chiki text where it should be. I have a file available for testing.
Also, when only using the Santali tessdata, it transliterates English and Odia words into Ol Chiki script.
When I use "sat.tessdata" to scan a normal Santali image, it works well.
Note: Ol Chiki is the main writing script of the Santali people, approved by the government of India. I think Ol Chiki is a new script that is not yet well supported by much software, so the processed image's text output always shows boxes; I solved this problem by copying it into Notepad and saving. Exporting it to PDF is OK; I created editable text from it with no problem. I have created many OCR-editable PDFs with gImageReader.
My question is how to get combined multiple-language output in Santali, Odia, and English. I also want to know why the processed image's text output covers English and Odia but not Santali, or vice versa.
I have tried to train the language, but it is taking a lot of time, and I have little knowledge of coding. If there is any problem with sat.tessdata, then I can take up learning Tesseract training.
I have used tessdata from:
Santali - https://github.com/indic-ocr/tessdata/tree/master/sat
Odia - https://github.com/indic-ocr/tessdata/tree/master/ori
English - default of gImageReader
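Since gImageReader just drives Tesseract, one way to sanity-check whether the combination itself works is to run the Tesseract CLI directly with the models joined by +, bypassing gImageReader. A sketch using Foundation's Process; the executable path, image name, and tessdata folder are assumptions to adjust for your system:

import Foundation

let tesseract = Process()
tesseract.executableURL = URL(fileURLWithPath: "/usr/bin/tesseract")
tesseract.arguments = [
    "page.png", "page",              // input image, output base name (writes page.txt)
    "--tessdata-dir", "./tessdata",  // folder holding the sat, ori and eng .traineddata files
    "-l", "sat+ori+eng"              // combine Santali, Odia and English in one pass
]
try tesseract.run()
tesseract.waitUntilExit()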

Translating and localizing technical words into other languages

I'm currently translating a website from English into other languages but have a problem when it comes to technical terms (non-words) like "crontab".
Should I keep the English term, or is there another way to find an equivalent?
These aren't actually English words, and when it comes to languages like Japanese, I'm at a loss as to what to do.
Here's an example sentence:
"Use crontab to schedule scripts."
which translated into Japanese via Google Translate becomes:
"スクリプトをスケジュールするcrontabを使用してください。"
You can see how bizarre this looks, and I'm wondering if the sentence could even be understood by a Japanese speaker.
What do I do in these situations?
Using English words in Japanese
Talking about the word crontab, I think it's not bizarre to write it in English in a Japanese sentence like this:
crontabを使用してください
(please use crontab)
On the Japanese Wikipedia, you can see how crontab is used without being translated into Japanese.
http://ja.wikipedia.org/wiki/Crontab
In Japanese technical writing, especially when you mention the names of tools, it is common to use the English term as is, without translating it into Japanese.
Using Katakana
You could also write the sentence like below using Katakana.
クーロンタブを使用してください
(please use crontab).
Japanese usually writes words borrowed from English in Katakana. Katakana is phonetic; in other words, each character represents a sound (not a meaning). But in this case, it doesn't look natural.
Mistranslation
There is a mistranslation in your Japanese sentence.
スクリプトをスケジュールするcrontabを使用してください。
(Please use the crontab which schedules a script.)
To correct this, you could go like this:
スクリプトをスケジュールするには、crontabを使用してください。
(In order to schedule a script, please use crontab.)
Hope this helps.

Is there a standard for localized locale codes?

I am currently working on a project that would benefit from localized locale codes. For example, RFC 5646 and the parent-standard BCP 47 define locale codes for various locales, such as en-GB for British English and zh-Hans-SG for Singaporean Chinese using simplified Chinese characters. Unfortunately, these codes use only a small subset of the Latin alphabet.
I am looking for a similar standard or commonly used system that defines a set of language codes in the respective writing system of each language (somewhat akin to an autoglossonym).
EDIT: I am strictly seeking localized locale codes since in the problem's context (URI i18n/l10n), it would be unreasonable to use an autoglossonym or other verbose equivalent.
Locale codes as specified by RFC 5646 and BCP 47 are meant to be machine parseable. Thus, en-GB is "English (Great Britain)" and zh-Hans-SG is "Chinese (Singapore, Simplified Chinese Script)".
They are designed so that web pages, e-books, and other documents can specify, in a standard way, the language and script they are written in.
Thus, each language, script and country is given a unique code from the respective standards and collated in the IANA Language Subtag Registry (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).
For a localized version of this, you are better off mapping the codes to a localized name (e.g. localizing the Description field of the subtag registry database, or using a project like iso-codes) and formatting that in a presentable way, keeping the locale code as an internal representation.
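On Apple platforms, for example, Foundation already exposes the CLDR display names, so this mapping can be done without shipping your own table. A small sketch; the tag list is just an example:

import Foundation

let tags = ["en-GB", "zh-Hans-SG", "ja-JP"]
for tag in tags {
    // Ask each locale to describe its own identifier, which yields an
    // autoglossonym-style display name; keep the BCP 47 tag internally.
    let name = Locale(identifier: tag).localizedString(forIdentifier: tag) ?? tag
    print("\(tag) -> \(name)")
}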

How to implement Transliteration of Indic language in iOS

I am trying to implement transliteration of standard English text into one of the Indic (Devanagari) scripts.
According to this post, there is a CFStringTransform function in iOS which is capable of handling this if the proper ICU constants are passed. I checked a few of the built-in constants for available scripts like Arabic and Greek, and it works perfectly, but there is no built-in constant for Indic languages; ICU's official page also does not describe a definite constant.
Kindly let me know any pointers to resolve this problem.
Try http://demo.icu-project.org/icu-bin/translit with Source1=Any and Target1=Devanagari - this will give you an idea of what an Any-Devanagari or en-Devanagari (English specific, leaving other languages alone) transliteration would give.
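If you want to try this from Swift rather than the demo page: applyingTransform (the Swift wrapper over CFStringTransform) accepts raw ICU transform IDs in addition to the built-in constants, so a Devanagari target can be named directly. A sketch; whether a given ID is available depends on the ICU data shipped with the OS:

import Foundation

let latin = "namaste duniya"
// "Latin-Devanagari" is an ICU transform ID, not one of the predefined
// StringTransform constants; "Any-Devanagari" should behave similarly.
let transformed = latin.applyingTransform(
    StringTransform(rawValue: "Latin-Devanagari"),
    reverse: false
)
print(transformed ?? "transform not available")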

If you have an application localized in pt-br and pt-pt, which language should you choose if the system reports only the "pt" code?

If you have an application localized in pt-br and pt-pt, which language should you choose if the system reports only the pt code (generic Portuguese)?
This question is independent of the nature of the application, desktop, mobile or browser based. Let's assume you are not able to get region information from another source and you have to choose one language as the default one.
The question applies to more cases as well, including:
pt-pt and pt-br
en-us and en-gb
fr-fr and fr-CA
zh-cn, zh-tw, .... - in fact, in this case I know that zh can be used as the predominant language for Simplified Chinese, where the full code is zh-hans. For Traditional Chinese, with codes like zh-tw, zh-hant-tw, zh-hk, and zh-mo, the proper (canonical) code should be zh-hant.
Q1: How do I determine the predominant language for a specified meta-language?
I need a solution that will include at least Portuguese, English and French.
Q2: If the system reports Simplified Chinese (PRC) (zh-cn) as the user's preferred language and I have translations only for English and Traditional Chinese (en, zh-tw), which of the two options should I choose: en or zh-tw?
In general you should separate the "guess the missing parameters" problem from the "matching a list of locales I want vs. a list of locales I have" problem. They are different.
Guessing the missing parts
These are all tricky areas, and even (potentially) politically charged.
But with very few exceptions the rule is to select the "original country" of the language.
The exceptions are mostly based on population.
So fr-FR for fr, es-ES for es, etc.
Some exceptions: pt-BR instead of pt-PT, en-US instead of en-GB.
It is also commonly accepted (and required by the Chinese standards) that zh maps to zh-CN.
You might also have to look at the country to determine the script, or the other way around.
For instance az => az-AZ but az-Arab => az-Arab-IR, and az_IR => az_Arab_IR
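This "fill in the most likely script and region" step is what CLDR calls likely subtags, and recent Apple platforms expose it directly. A sketch assuming iOS 16 / macOS 13 or later (older systems would need ICU's uloc_addLikelySubtags or a hand-written table):

import Foundation

for tag in ["pt", "zh", "az", "az-Arab"] {
    let language = Locale.Language(identifier: tag)
    // maximalIdentifier applies the CLDR likely-subtags data,
    // e.g. pt -> pt-Latn-BR, zh -> zh-Hans-CN, az-Arab -> az-Arab-IR.
    print("\(tag) -> \(language.maximalIdentifier)")
}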
Matching 'want' vs. 'have'
This involves matching a list of want vs. a list of have languages.
Dealing with lists makes it harder, and the result should also be sorted in a smart way, if possible (for instance, if want = [ fr ro ] and have = [ en fr_CA fr_FR ro_RO ], then you probably want [ fr_FR fr_CA ro_RO ] as the result).
There should be no match between language with different scripts. So zh-TW should not fallback to zh-CN, and mn-Mong should not fallback to mn-Cyrl.
Tricky areas: sr-Cyrl should not fallback to sr-Latn in theory, but it might be understood by users. ro-Cyrl might fallback to ro-Latn, but not the other way around.
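On Apple platforms, a ready-made want-vs-have matcher is Bundle.preferredLocalizations(from:forPreferences:), though note it returns only the best match rather than the fully sorted list described above. A sketch with the example lists from this answer:

import Foundation

let have = ["en", "fr-CA", "fr-FR", "ro-RO"]   // localizations you actually ship
let want = ["fr", "ro"]                        // the user's ordered preferences
let best = Bundle.preferredLocalizations(from: have, forPreferences: want)
print(best)   // the chosen localization(s), e.g. a French variant for this input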
Some references
RFC 4647 deals with language fallback (but is not very useful in this case, because it follows the "cut from the right" rule).
ICU 4.2 and newer (draft in 4.0, I think) has uloc_addLikelySubtags (and uloc_minimizeSubtags) in uloc.h. That implements http://www.unicode.org/reports/tr35/#Likely_Subtags
Also in ICU uloc.h there are uloc_acceptLanguageFromHTTP and uloc_acceptLanguage that deal with want vs have. But kind of useless as they are, because they take a UEnumeration* as input, and there is no public API to build a UEnumeration.
There is some work on language matching going beyond the simple RFC 4647. See http://cldr.unicode.org/development/design-proposals/languagedistance
Locale matching in ActionScript at http://code.google.com/p/as3localelib/
The APIs in the new Flash Player 10.1 flash.globalization namespace do both tag guessing and language matching (http://help.adobe.com/en_US/FlashPlatform/beta/reference/actionscript/3/flash/globalization/package-detail.html). It works on TR-35 and can look beyond the # and consider the operation. For instance, if have = [ ja ja#collation=radical ja#calendar=japanese ] and want = [ ja#calendar=japanese;collation=radical ] then the best match depends on the operation you want. For date formatting ja#calendar=japanese is the better match, but for collation you want ja#collation=radical
Do you expect to have more users in Portugal or in Brazil? Pick accordingly.
For your general solution, you can find out by reading up on Ethnologue.
