Language detection using Apache Tika in StormCrawler - apache-tika

Does the Apache Tika integration for StormCrawler support language detection for a document? Is there a list of variables that Tika produces that I can include in the output of StormCrawler?

The short answer is no, but you can use the langid module instead; last time I checked it was faster, covered more languages, and was more accurate than the detector in Tika.
I am not aware of an exhaustive list of values returned by Tika.
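For illustration, here is a minimal sketch of standalone language identification with the Python langid package. This is not the StormCrawler langid module itself (that is a separate component wired into the topology), it just shows the kind of output a langid-style detector produces:

    import langid

    # Classify a snippet of text; returns an ISO 639-1 code plus a score.
    text = "Ceci est un exemple de texte en francais."
    lang, score = langid.classify(text)
    print(lang, score)  # e.g. 'fr' and a confidence score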

Related

(Mis)using OpenAI Whisper for text-to-text translation

I noticed that when transcribing speech in multiple languages with the OpenAI Whisper speech-to-text library, it sometimes accurately recognizes inserts in another language and provides the expected output, for example: 八十多个人 is the same as 八十几个人. So 多 and 几 are interchangeable here and both can mean "several".
Yet the same audio input on a different pass (with the same model, or a smaller/bigger model) would intermittently result in glitches where the entire sentence is translated rather than transcribed. That is, a fragment would be translated into either the first or the second language that appears in the audio. With the example input above, either the entire sentence would be in English (with the Chinese bits translated to English), or the entire sentence would be in Chinese (with the English bits translated to Chinese). Important: in both cases no input language was specified and no task type was passed (which implies the default --task transcribe).
The docs for Whisper mention translation to English as the only available target language (via the option --task translate in the command-line version), but there is no mention of translating to other target languages. Yet the behavior described above indicates that the models are capable of translating to other languages too.
The question is whether there is a known way to configure the models to do just text-to-text translation, or whether this behavior is merely a glitch that cannot be 'exploited' or configured at a lower level in a way that would allow using the models purely for text translation between any of the supported languages.
According to a comment in Whisper's issue tracker, this might be a possible answer:
From the paper, the dataset that was used did not include any English-audio-to-Polish-text samples. The dataset was cleaned by using a different model to match the spoken language with the text language; if they did not match, the sample was excluded. An exception was made for a portion of the training data, which matched any spoken language to English text (X->en translation).
So unfortunately there is no direct way; the model wasn't trained for it. For your use case it can transcribe to English text, but there has to be some outside system to translate from English text to Polish text.
The --language parameter is defined in the CLI as:
--language
language spoken in the audio, specify None
to perform language detection (default: None)
Yet, despite the help text above, this parameter can have potentially useful undocumented side effects.
The 'exploit'
The undocumented glitch that was observed is that if you set a source language, e.g. es, but the audio input contains English, then the English part of the input will be translated to Spanish. Parts of the audio input that are not in English will be transcribed, although depending on the language this might not always work, or it might generate garbage translations.
So the 'exploit' is that the models can be used to parse English audio and then translate it to a supported language.
The behaviour above occurs with the regular transcribe mode (the default, i.e. --task transcribe), and is reproducible with both the original Whisper implementation in Python and the CPU-optimized C++ port whisper.cpp, which uses the same models but apparently with different parameters.
The quality of the non-English translation depends on the language, and seems to be generally lower than when translating from English with the open-source Hugging Face models (e.g. Helsinki-NLP/opus-mt-es-en, facebook/m2m100_418M, facebook/m2m100_1.2B, etc.).
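For reference, here is a minimal sketch of the behaviour described above using the openai-whisper Python package; the file name mixed_audio.mp3 and the model size are placeholders, and language="es" is what forces English portions of the audio to come out in Spanish even though the task stays at the default transcribe:

    import whisper

    # Load one of the multilingual models (placeholder choice).
    model = whisper.load_model("small")

    # Force a source language while keeping the default transcribe task;
    # English segments in the audio tend to be rendered in Spanish.
    result = model.transcribe("mixed_audio.mp3", language="es", task="transcribe")
    print(result["text"])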

Can we give a file for translation, e.g. give an English file to a translator/third-party service for French translation?

I am using ngx-translate for internationalization, so how do I hand a file off for translation, e.g. give the English file to a translator/third-party service for a French translation? Considering accuracy, can we achieve this? Please help.
If you have JSON files, most translators will have a CAT (computer-aided translation) tool which can be used to translate JSON, and it will respect the integrity of the JSON format. If the CAT tool does not filter the JSON correctly out of the box, there are tweaks which can be made. If you're able to display a sample of one of your files, I can provide further feedback. I'm not a translator, but I 'prepare' many JSON projects for translators to work on without them having to worry about breaking the file.
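As a sanity check after the files come back from the translator, a small script along these lines (not part of ngx-translate or any CAT tool; en.json and fr.json are placeholder names) can confirm that the translated file still has exactly the same keys as the English source:

    import json

    def flatten(d, prefix=""):
        # Flatten nested ngx-translate style dictionaries into dotted keys.
        keys = set()
        for k, v in d.items():
            full = prefix + k
            if isinstance(v, dict):
                keys |= flatten(v, full + ".")
            else:
                keys.add(full)
        return keys

    with open("en.json", encoding="utf-8") as f:
        source_keys = flatten(json.load(f))
    with open("fr.json", encoding="utf-8") as f:
        target_keys = flatten(json.load(f))

    print("missing in fr.json:", sorted(source_keys - target_keys))
    print("unexpected in fr.json:", sorted(target_keys - source_keys))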

Web Page From English To Urdu converter

I need to convert my website's pages from English to Urdu. For this I was using Google's Translation API, but the Google Translate API is not returning correct translations of the pages.
What should I use to get 99% accurate results when translating pages from English to Urdu?
There are only a few parameters you can specify when using the Google Translate API that can make a difference to your results: the source and model parameters.
Source is the language of the source text. If you don't specify it, it will be detected automatically. As your source language is English, I don't think this will be causing any trouble.
Model: as Urdu is supported by the Neural Machine Translation model, if you don't specify the model, the nmt model will be used. You can try the base model, however the nmt one is supposed to "provide improved translation for longer and more complex content".
Finally, expecting the model to get 99% accuracy is expecting it to be almost perfect, which is probably not realistic.
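For what it's worth, here is a sketch of how the source and model parameters are passed with the google-cloud-translate v2 Python client; it assumes credentials are already configured for your project, and the example text is a placeholder:

    from google.cloud import translate_v2 as translate

    client = translate.Client()
    result = client.translate(
        "How are you today?",
        source_language="en",   # omit this to let the API auto-detect
        target_language="ur",   # Urdu
        model="nmt",            # or "base" for the phrase-based model
    )
    print(result["translatedText"])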

Is there a Way to localize an Application on Various Platforms

We are developing an application which runs on various platforms (Windows, Windows RT, Mac OS X, iOS, Android).
The problem is how to manage the different localizations on the different platforms in an easy way. The language files on the different platforms have various formats (some are XML-based, others are simple key-value pairs, and others are totally crazy formats, like on Mac OS).
I'm sure we aren't the first company with this problem, but I wasn't able to find an easy-to-use solution to achieve the goal of having one "datasource" where the strings are collected in different languages (ideally with a user interface for the translators) which can then be exported to the different formats for the different platforms.
Does anybody have a solution for this problem?
Greetings
Alexander
I recommend using the GNU Gettext toolchain for management, and at runtime use either:
- some alternate implementation for runtime reading, like Boost.Locale,
- your own implementation (the .mo format is pretty trivial), or
- the Translate Toolkit to convert the message catalogs to some other format of your liking.
You can't use the libintl component of GNU Gettext, because it is licensed under the LGPL, and the terms of both the Apple App Store and the Windows Live Store are incompatible with that license. But it is really trivial to reimplement the bit you need at runtime.
The Translate Toolkit actually reimplements all or most of GNU Gettext and supports many additional localization formats, but the Gettext .po format has the most free tools available for it (e.g. Poedit for local editing and Weblate for online editing), so I recommend sticking with it anyway. And read the GNU Gettext manual; it describes the intended process and the rationale behind it well.
I have quite good experience with this toolchain. The Translate Toolkit is easy to script when you need some special processing, like extracting translatable strings from your custom resource files, and Weblate is easy for your translators to use, especially when you rely on business partners and testers in various countries for most translations, like we do.
The Translate Toolkit also supports extracting translatable strings from HTML, so the same process can be used for translating your website.
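As an illustration of how little runtime code is actually needed, here is a sketch that reads a Gettext catalog with the third-party polib Python package instead of linking against libintl; the file names are placeholders:

    import polib

    # .po files are the editable catalogs; polib.mofile() reads compiled .mo files.
    catalog = polib.pofile("messages.po")
    translations = {e.msgid: e.msgstr for e in catalog if e.msgstr}

    def _(msgid):
        # Tiny gettext-style lookup that falls back to the original string.
        return translations.get(msgid, msgid)

    print(_("Hello, world!"))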
I did a project for iPhone and Android which had many translations, and I think I have exactly the solution you're looking for.
The way I solved it was to put all translation texts in an Excel spreadsheet and use a VBA macro to generate the .strings and .xml translation files from there. You can download my example Excel sheet plus VBA macro here:
http://members.home.nl/bas.de.reuver/files/multilanguage.zip
Just recently I've also added preliminary Visual Studio .resx output, although that's untested.
Edit: my JavaScript Xcode/Eclipse converter might also be of use.
You can store your translations on https://l10n.ws and retrieve them via their API.
Disclaimer: I am the CTO and Co-Founder at Tethras, but will try to answer this in a way that is not just "Use our service".
As loldop points out above, you really need to normalize your content across all platforms if you want to have a one-stop solution for managing your localized content. This can be a lot of work, and would require much coding and scripting and calling of various tools from the different SDKs to arrive at a common format that would service the localization needs of all the various file formats you need to support. The length and complexity of my previous sentence is inversely proportional to the amount of work you would need to do to arrive at a favorable solution for all of this.
At Tethras, we have built a platform that alleviates the need for multi-platform software publishers to have to do this. We support all of the native formats from the platforms you list above, and can leverage translations from one file format to another. For example, translate the content in Localizable.strings from your iOS app into a number of languages, then upload your equivalent strings.xml file from Android or foo.resx from Windows RT to the system, and it will leverage translations for you automatically. Any untranslated strings will be flagged and you can order updates for these strings.
In effect, Tethras is a CMS for localized content across many different native file formats.

Open-source OCR package that can handle unknown characters?

I want to find a (preferably) open-source OCR package (for any OS) that is capable of handling a new character set.
The language is Latin, but with some scribal abbreviations, about 10 different abbreviations that aren't in Unicode.
The text has been printed using specially-developed fonts, and I have high-res images of the text.
I'm assuming some training is going to be needed, first to map the scribal abbreviations to ASCII, and then presumably corpus-specific training for the software to learn where the abbreviations tend to appear within words.
Could anyone recommend a (preferably) open-source package capable of handling this?
AFAIK there is no library (free or commercial) that can be used as-is for what you describe (a language with characters not representable in Unicode). But as a good starting point there is an open-source OCR engine called Tesseract which you could take and modify for your special scenario. Another interesting base could be OCRopus. But beware: this will mean a lot of work.
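To give an idea of what the recognition step might look like once training is done, here is a sketch that runs Tesseract through the pytesseract wrapper with a custom traineddata file; the language name lat_abbrev and the image path are hypothetical and assume you have already trained data that maps the abbreviation glyphs to placeholder codes:

    from PIL import Image
    import pytesseract

    # High-resolution scan of one page (placeholder path).
    image = Image.open("folio_001.png")

    # "lat_abbrev" stands for a custom traineddata file installed in Tesseract's
    # tessdata directory; out of the box only standard languages such as "lat" exist.
    text = pytesseract.image_to_string(image, lang="lat_abbrev")
    print(text)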
