I want to find a (preferably) open-source OCR package (for any OS) that is capable of handling a new character set.
The language is Latin, but with some scribal abbreviations, about 10 different abbreviations that aren't in Unicode.
The text has been printed using specially-developed fonts, and I have high-res images of the text.
I'm assuming some training is going to be needed, first to map the scribal abbreviations to ASCII, and then presumably corpus-specific training for the software to learn where the abbreviations tend to appear within words.
Could anyone recommend a (preferably) open-source package capable of handling this?
AFAIK there is no library (free or commercial) that can be used as-is for what you describe (a language with characters not representable in Unicode)... BUT as a good starting point there is an open-source OCR engine called Tesseract which you could take and modify for your special scenario... another interesting base could be OCRopus... but beware: this will mean lots of work.
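For example, once a custom Tesseract traineddata file has been built for the extended character set (that is where most of the work goes), actually calling it is simple. A minimal sketch using pytesseract; the language code lat_abbrev and the file name are made-up placeholders:

    # Minimal sketch: run Tesseract with a custom traineddata file via pytesseract.
    # Assumes "lat_abbrev.traineddata" (a made-up name) has been trained and placed
    # in Tesseract's tessdata directory.
    from PIL import Image
    import pytesseract

    page = Image.open("folio_001.png")  # high-res scan of one page
    text = pytesseract.image_to_string(page, lang="lat_abbrev")
    print(text)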
I noticed that, when transcribing speech in multiple languages with the OpenAI Whisper speech-to-text library, it sometimes accurately recognizes inserts in another language and provides the expected output, for example: 八十多个人 is the same as 八十几个人. So 多 and 几 are interchangeable and they can both mean "several".
Yet the same audio input on a different pass (with the same model, or a smaller/bigger model) would intermittently result in glitches where the entire sentence is translated rather than transcribed. That is, a fragment would be translated either into the first or the second language that appears in the audio. With the example input above, either the entire sentence would be in English (with the Chinese bits translated to English), or the entire sentence would be in Chinese (with the English bits translated to Chinese). Important: in both cases no input language was specified, and no task type was passed (which implies the default --task transcribe).
The docs for whisper mention translation to English as the only available target language (with the option --task translate in the command line version), but there is no mention of translating to other target languages. Yet the behavior mentioned above indicates that the models are capable of doing translation to other languages too.
The question is whether there is a known way to configure the models to do just text-to-text translation. Or is the behavior just some sort of glitch that cannot be 'exploited' or configured at a lower level to use the models purely for text translation between any of the supported languages?
According to a comment in the Whisper issue tracker, this might be a possible answer:
From the paper, the dataset that was used did not use any English audio to Polish text samples. The dataset was cleaned by using a different model to match spoken language with text language. If they did not match, the sample was excluded. An exception was made for a portion of the training data to match any spoken language to English text (X->en) translation.
So unfortunately there is no direct way; the model wasn't trained on it. For your use case, this can get you English text, but there has to be an outside system to translate from English text to Polish text.
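As a sketch of that two-step approach: use Whisper's built-in X->en translation, then feed the English text to one of the open-source translation models mentioned further below (facebook/m2m100_418M here) for the English->Polish step. The model size and file name are placeholders:

    import whisper
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    # Step 1: Whisper's supported direction, any spoken language -> English text.
    asr = whisper.load_model("medium")
    english = asr.transcribe("audio.mp3", task="translate")["text"]

    # Step 2: English text -> Polish text with an outside translation model.
    tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    mt = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    tok.src_lang = "en"
    batch = tok(english, return_tensors="pt")
    out = mt.generate(**batch, forced_bos_token_id=tok.get_lang_id("pl"))
    polish = tok.batch_decode(out, skip_special_tokens=True)[0]
    print(polish)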
The --language parameter is defined in the CLI as:
--language
language spoken in the audio, specify None
to perform language detection (default: None)
Yet, despite the help text above, this can have potentially useful undocumented side effects.
The 'exploit'
The undocumented glitch that was observed is that if you set a source language, e.g. es, but the audio input contains English, then the English part of the input will be translated to Spanish. Parts of the audio input that are not in English will be transcribed, although depending on the language this might not always work or it might generate garbage translations.
So the 'exploit' is that the models can be used to parse English audio and then translate it to a supported language.
The behaviour above occurs with the regular transcribe mode (the default, i.e. --task transcribe), and is reproducible with both the original Whisper implementation in Python and the CPU-optimized C++ port whisper.cpp, which uses the same models but apparently with different default parameters.
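For reference, a minimal Python sketch that triggers the behaviour with the original implementation; the model size and file name are placeholders:

    import whisper

    # Mostly-English audio, but force "es" as the source language.
    model = whisper.load_model("small")
    result = model.transcribe("mixed_audio.mp3",
                              language="es",      # claim the audio is Spanish
                              task="transcribe")  # default task, no --task translate
    # English passages in the audio come out translated into Spanish.
    print(result["text"])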
The quality of the non-English translation depends on the language, and seems generally lower than translating from English with the open-source Hugging Face models (e.g. Helsinki-NLP/opus-mt-es-en, facebook/m2m100_418M, facebook/m2m100_1.2B, etc.).
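For comparison, the Helsinki-NLP model named above can be run in a couple of lines (the input sentence is just an example):

    from transformers import pipeline

    es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
    print(es_to_en("Hola, ¿cómo estás?")[0]["translation_text"])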
My question is: which LaTeX features aren't supported by MathJax? For example, in LaTeX I can write $\today$ and it will return the current date. This is not possible in MathJax.
In KaTeX, a MathJax alternative, there seem to be more troublesome limitations, such as \overrightarrow{AB} not working. I was wondering what the current limitations of MathJax are, in terms of LaTeX rendering, before using it in a website instead of converting TeX equations to PNG images and inserting those. I have noticed that Wikipedia uses the tex2png approach instead of MathJax and was wondering whether they just did not want to depend on MathJax, whether it's not fully supported by all browsers, whether it's too slow, whether the limited feature set of MathJax is a problem, or whether it's just legacy.
First and foremost, MathJax, as its name suggests, supports mathematics typesetting for the web and is not a web implementation of general-purpose LaTeX. Here's what this means, most notably feature-wise:
No tables
No tikzpictures
No bibliographies
No support for units, e.g. \SI{10}{\hertz} is not possible (this requires the siunitx package in LaTeX)
No special packages, for example no \uwave from the ulem package
Within the math world, MathJax covers almost everything. Here is a list of features that are not supported for mathematics typesetting:
Items that require the mathtools package, for example H \xrightharpoondown[under]{over} I.
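For contrast, ordinary math input like the following renders fine in MathJax's default TeX support (as far as I can tell), including the \overrightarrow{AB} case mentioned in the question:

    \overrightarrow{AB}, \qquad
    \sum_{i=1}^{n} \frac{x_i}{1 + x_i^2}, \qquad
    \begin{pmatrix} a & b \\ c & d \end{pmatrix}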
The other question was why Wikipedia isn't using MathJax but has chosen to convert equations into PNGs. I think it's because they already had a working solution when MathJax became popular and don't really have an incentive to switch. MathJax especially shines when you need an out-of-the-box solution for rendering math on the web.
My program can read several dozen file formats, using the traditional approach where I write procedural code for each file format. Most of these formats have their own unique loader library, their own bugs, their own limitations, and the whole thing is a huge time sink for me. I'd like to support a ton of other formats, but they're mostly not worth my time because they're not popular enough.
I'd like to replace my existing loaders with a single loader powered by a file format descriptor. I'm certain that someone has created software to learn file formats by example. My existing loaders would make excellent fitness functions for those formats, and I can write fitness functions for new formats too.
My question is, what software can I use to "learn" file formats by example, and how can I convert that "learning" into a descriptor for use with a generic loader?
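To make the goal concrete, here is a rough Python sketch of what I mean by a descriptor-driven generic loader; the descriptor format, field names and file name are made up for illustration:

    import struct

    # Hypothetical descriptor: an ordered list of (field name, struct format code).
    # A real descriptor would also need arrays, offsets, conditionals, etc.
    HEADER_DESCRIPTOR = [
        ("magic",  "2s"),
        ("size",   "<I"),
        ("width",  "<I"),
        ("height", "<I"),
    ]

    def load_with_descriptor(path, descriptor):
        """Generic loader: read fixed-size header fields as described."""
        fields = {}
        with open(path, "rb") as f:
            for name, fmt in descriptor:
                raw = f.read(struct.calcsize(fmt))
                (fields[name],) = struct.unpack(fmt, raw)
        return fields

    # Usage (file name is hypothetical):
    # print(load_with_descriptor("image.dat", HEADER_DESCRIPTOR))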
Unless you limit it in some massive ways, I don't think you're likely to get very far. This would be ideal, but it's beyond the current state of the art. For arbitrary formats you cannot do this: for example, if I give you 200 JPGs, PNGs, BMPs and GIFs, it is very unlikely that a learning system can learn the formats.
Here are some problems researchers have looked at:
Learning a regular expression from examples: see, for example, the question "Is it possible for a computer to "learn" a regular expression by user-provided examples?".
Information extraction: I give you a list of classified ads from the newspaper, for example apartments for rent. You need to extract the number of bedrooms, the rent, the deposit and the size of the unit. You can read more about it here: http://en.wikipedia.org/wiki/Information_extraction
We are developing an application which runs on various platforms (Windows, Windows RT, Mac OS X, iOS, Android).
The problem is how to manage the different localizations on the different platforms in an easy way. The language files on the different platforms have various formats (some are XML-based, others are simple key-value pairs, and others are totally crazy formats like on Mac OS).
I'm sure we aren't the first company with this problem, but I wasn't able to find an easy-to-use solution that offers one "datasource" where the strings are collected in different languages (the best would be a user interface for the translators) and can then be exported to the different formats for the different platforms.
Does anybody have a solution for this problem?
Greetings
Alexander
I recommend using the GNU Gettext toolchain for management, and at runtime use one of the following:
an alternate implementation for runtime reading, like Boost.Locale,
your own implementation (the .mo format is pretty trivial), or
the Translate Toolkit to convert the message catalogs to some other format of your liking.
You can't use the libintl component of GNU Gettext, because it is licensed under the LGPL and the terms of both the Apple App Store and the Windows Live Store are incompatible with that license. But it is really trivial to reimplement the bit you need at runtime.
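As an illustration of how small that runtime bit is, Python's standard-library gettext module reads .mo catalogs with no libintl at all; a sketch, with the domain and directory names as placeholders:

    import gettext

    # Load ./locale/de/LC_MESSAGES/myapp.mo ("myapp" and the layout are placeholders).
    t = gettext.translation("myapp", localedir="locale", languages=["de"], fallback=True)
    _ = t.gettext
    print(_("Save file"))  # the German translation if present, otherwise "Save file"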
The Translate Toolkit actually reimplements all or most of GNU Gettext and supports many additional localization formats, but the Gettext .po format has the most free tools for it (e.g. Poedit for local editing and Weblate for online editing), so I recommend sticking with it anyway. And read the GNU Gettext manual; it describes the intended process and the rationale behind it well.
I have quite good experience with the toolchain. The Translate Toolkit is easy to script when you need some special processing like extracting translatable strings from your custom resource files and Weblate is easy to use for your translators, especially when you rely on business partners and testers in various countries for most translations like we do.
Translate Toolkit also supports extracting translatable strings from HTML, so the same process can be used for translating your web site.
I did a project for iPhone and Android which had many translations and I think I have exactly the solution you're looking for.
The way I solved it was to put all translation texts in an Excel spreadsheet and use a VBA macro to generate the .strings and .xml translation files from there. You can download my example Excel sheet plus VBA macro here:
http://members.home.nl/bas.de.reuver/files/multilanguage.zip
Just recently I've also added preliminary Visual Studio .resx output, although that's untested.
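For those who would rather not use Excel/VBA, the same idea can be sketched in a few lines of Python; the CSV columns and output file names below are just placeholders, not the ones from my spreadsheet:

    import csv
    from xml.sax.saxutils import escape

    # Expects a CSV with columns "key" and "de" (placeholder names).
    with open("translations.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # iOS: Localizable.strings
    with open("Localizable.strings", "w", encoding="utf-8") as ios:
        for r in rows:
            ios.write('"%s" = "%s";\n' % (r["key"], r["de"]))

    # Android: strings.xml
    with open("strings.xml", "w", encoding="utf-8") as droid:
        droid.write('<?xml version="1.0" encoding="utf-8"?>\n<resources>\n')
        for r in rows:
            droid.write('    <string name="%s">%s</string>\n' % (r["key"], escape(r["de"])))
        droid.write("</resources>\n")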
edit:
By the way, my JavaScript Xcode/Eclipse converter might also be of use.
You can store your translations on https://l10n.ws and get them via their API.
Disclaimer: I am the CTO and Co-Founder at Tethras, but will try to answer this in a way that is not just "Use our service".
As loldop points out above, you really need to normalize your content across all platforms if you want to have a one-stop solution for managing your localized content. This can be a lot of work, and would require much coding and scripting and calling of various tools from the different SDKs to arrive at a common format that would service the localization needs of all the various file formats you need to support. The length and complexity of my previous sentence is inversely proportional to the amount of work you would need to do to arrive at a favorable solution for all of this.
At Tethras, we have built a platform that alleviates the need for multi-platform software publishers to have to do this. We support all of the native formats from the platforms you list above, and can leverage translations from one file format to another. For example, translate the content in Localizable.strings from your iOS app into a number of languages, then upload your equivalent strings.xml file from Android or foo.resx from Windows RT to the system, and it will leverage translations for you automatically. Any untranslated strings will be flagged and you can order updates for these strings.
In effect, Tethras is a CMS for localized content across many different native file formats.
Working in academia publishing CS/math, you sooner or later find yourself trying to publish in a journal that will only accept .doc/.rtf. This means tedious, boring hours of translating line after line, especially equations, from LaTeX to an inferior format. Over the years I have tried a number of export tools for LaTeX, but none, at least of the free ones, that I have been very satisfied with. I'd like this page to collect and monitor the best import/export tools for LaTeX, to .doc/.rtf, or to other useful (e.g. HTML, MATHML) formats.
Thus, what is your one favorite import or export LaTeX tool?
AFAIK there isn't really a convenient and effective way to achieve what you're trying to do. What I usually do on those rare occasions is export to PDF, select all the text, and paste it into Word. It's horrible, it messes things up, and of course it doesn't adjust your citations.
To this day I don't understand how people writing in scientific fields can write and publish in Word. It is common in some human-computer interaction literature but I have not seen it in other conferences and journals. May I ask which one it is?
Also, some places, once you've already been accepted, will be willing to accept a PDF if you push them on it. You may have to make small adjustments yourself. Negotiation sometimes works here.
The UK TeX FAQ has been collecting answers on this for quite some time now. :)
See Conversion from (La)TeX to HTML and Other conversions to and from (La)TeX. There is another FAQ specifically about Converters between LaTeX and PC Textprocessors maintained by Wilfried Hennings.
For LaTeX to HTML there are LaTeX2HTML, TtH, TeX4ht, TeXpider and Hevea; in my experience TeX4ht is the best. For LaTeX to Word, you can go through RTF with TeX2RTF (not so good), or through Adobe Acrobat, which can produce PDF that Word can read (not good either), or go through HTML as above; but best is to use TeX4ht, which can generate OpenOffice ODT format, from which conversion to Word is easy.
The UK TeX FAQ also has many other useful things; you should take a look.