Does WorkFusion have support for extracting data from Chinese-language documents by using OCR and machine learning? Please advise.
Regards,
Sunil Prabakar C
In case you are looking for a generic answer: yes, Chinese is supported.
A bit more detail:
The WorkFusion OCR module supports about 200 languages (including Chinese Simplified and Chinese Traditional).
The WorkFusion ML module is language-agnostic. A larger training set may be required for less common languages, and the configuration may need to include language-specific features for better results.
The WorkFusion RPA module is language-agnostic. It can interact with applications regardless of the language of the user interface. To be more technically precise, it is character-encoding-agnostic: more than 100 character encodings are supported, including all widely used ones (various versions of Unicode/UTF, ISO, ASCII, IBM, Windows, and many more).
What would be the best .NET language for a new ISO 8583 parser, if we're talking about dev time and comprehensibility (not so much about performance)?
I'm currently mostly involved with C#, but have also worked with VB.Net, and have introductory level knowledge of functional programming (so F# is not completely out of the question).
If the problem domain is parsing finite-depth textual data structures with bitmaps of fields, which language is the best fit? (If your suggestion is JavaScript or Perl, I'll take note of that too.)
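To clarify what I mean by "bitmaps of fields", here is a minimal sketch of the core parsing step: reading the primary bitmap and working out which fields are present. (It is shown in Java purely for illustration, since the question is about .NET, and the example bitmap value is made up.)

import java.util.ArrayList;
import java.util.List;

// Sketch only: decode an ISO 8583 primary bitmap (16 hex characters) into the
// list of field numbers present in the message. Bit 1, if set, would indicate
// that a secondary bitmap follows.
public class Iso8583BitmapSketch {

    static List<Integer> presentFields(String hexBitmap) {
        List<Integer> fields = new ArrayList<>();
        for (int i = 0; i < hexBitmap.length(); i++) {
            int nibble = Integer.parseInt(hexBitmap.substring(i, i + 1), 16);
            for (int bit = 0; bit < 4; bit++) {
                // ISO 8583 numbers bits from 1, left to right.
                if ((nibble & (0x8 >> bit)) != 0) {
                    fields.add(i * 4 + bit + 1);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up bitmap: prints [2, 3, 4, 7, 11, 12]
        System.out.println(presentFields("7230000000000000"));
    }
}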
I'm currently developing a system that supports several languages. I want to specify these languages as precisely as possible in the database in case of future integrations. (Yes, I know it's a bit YAGNI.)
I've found several ways to define a language:
nb-NO
nb_NO
nb-no
nb_no
nb
These can all mean "Norwegian Bokmål". Which one, if any, is the most correct?
The Locale article on the ArchLinux Wiki specifies a locale as language[_territory][.codeset][#modifier]. I guess the codeset and modifier are only relevant for input, but language is the minimum, and territory may be nice to have should we implement cultural differences regarding currency, decimal points, etc.
Am I overthinking it?
Look at BCP 47:
https://tools.ietf.org/html/bcp47
In this day and age you would need to support at least language, script, and region (with only the language being mandatory).
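For example, here is a minimal sketch of how such a tag breaks down into language, script and region, assuming java.util.Locale (which understands well-formed BCP 47 tags); the tag values are just examples:

import java.util.Locale;

// Sketch only: parsing BCP 47 tags with java.util.Locale.
public class Bcp47Sketch {
    public static void main(String[] args) {
        Locale nb = Locale.forLanguageTag("nb-NO");
        System.out.println(nb.getLanguage() + " / " + nb.getCountry()); // nb / NO
        System.out.println(nb.toLanguageTag());                         // nb-NO

        // Language + script + region, e.g. Traditional Chinese as written in Taiwan.
        Locale zh = Locale.forLanguageTag("zh-Hant-TW");
        System.out.println(zh.getLanguage() + " / " + zh.getScript() + " / " + zh.getCountry());
        // zh / Hant / TW
    }
}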
It depends a lot on what you use these tags for.
If it is spoken content you might care about dialect (for instance Cantonese vs Mandarin Chinese), but not script. In written form you will care about script (Traditional vs. Simplified Chinese), but not dialect.
The complete stack you use to process things also matters a lot. You might use a hyphen as the separator, use grandfathered IDs, or the -u- extension (see BCP 47), then discover that your programming language "chokes" on it. Or you use "he" for Hebrew, but your language (cough Java cough) wants the deprecated "iw".
So you might decide to use the same locale IDs as your tech stack, or have a "conversion layer".
If you want things accessible from several technologies, then a conversion layer is your only (reasonable) option.
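To make the Java quirk and the "conversion layer" idea concrete, here is a minimal sketch; note that what getLanguage() returns for Hebrew depends on the JDK version, as the comments indicate:

import java.util.Locale;

// Sketch only: store and exchange BCP 47 tags, and convert at the Java boundary.
public class LocaleConversionSketch {

    static String toExternalTag(Locale locale) {
        return locale.toLanguageTag();     // yields "he", not the legacy "iw"
    }

    static Locale fromExternalTag(String tag) {
        return Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        Locale hebrew = fromExternalTag("he");
        // Older JDKs print the deprecated "iw" here; newer ones can be configured
        // either way (java.locale.useOldISOCodes).
        System.out.println(hebrew.getLanguage());
        System.out.println(toExternalTag(hebrew));   // always "he"
    }
}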
What is the usual method of encoding documents in foreign alphabets for the purposes of programs that do terminal to terminal communications? There are two parts to this question: Latin alphabets and non-Latin.
I know that 8859-1 can handle most European languages, so is the usual practice in, say, Danish to just set your computer to 8859-1 and be done? What about French and Polish?
For non-Latin alphabets like Russian, Armenian and Korean, obviously you cannot use 8859-1. Do they just write documents in some other code page with their computer set to that code page, or do they use Unicode or UTF-8, or all three? What is the standard practice?
I am only interested in alphabetic systems. I know how the non-alphabetic systems (Chinese/Japanese) work, so no need to explain what they do.
My need here is to understand what kind of support to build into a terminal-based communication system that will be used by people talking to each other in different countries. For example, imagine you are writing an instant messaging system and need it to be interoperable between people in different countries.
For any system set up in this decade, you should expect and demand Unicode (though not necessarily UTF-8) and be done.
Historically, you would see all three of (1) use a legacy codepage or even (gasp) official character set for your locale (much depending on your OS and vendor -- Windows and Mac would traditionally gravitate to their own proprietary code tables, while Linux would use ISO-8859-x where available and applicable); (2) use something "close enough" and just wing it in corner cases (for example, ISO-8859-1 is in principle insufficient for Finnish, but people would just refrain from using the handful of words where it matters, or write them unaccented); and (3) use a local convention such as pr"efix"ed "accents or LaTeX \"acc\"ents or uetterly uenreaedaeble digraphs (these have some base in tradition e.g. in Germany where it is still okay to write "umlaeute" as a variant of "umläute").
It is not really correct to say that ISO-8859-1 "can handle most European languages". It is sufficient for most of the official national languages of Western Europe (especially if you are willing to compromise a bit, like the French grudgingly did) but completely inadequate for the majority of European languages. There are ISO-8859-2, ISO-8859-3, etc., which cater to the needs of groups of other European languages, but in many settings interoperability with ISO-8859-1 was also desirable, so these were always a bit problematic.
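To make that concrete, here is a minimal Java sketch using Polish as the example (the word itself is arbitrary): characters outside the ISO-8859-1 repertoire are simply replaced when you encode.

import java.nio.charset.StandardCharsets;

// Sketch only: ISO-8859-1 cannot represent all the letters of Polish.
public class Latin1LimitsSketch {
    public static void main(String[] args) {
        String polish = "\u0142\u00F3d\u017A";   // "łódź"
        byte[] encoded = polish.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(encoded, StandardCharsets.ISO_8859_1));
        // prints "?ód?" -- ł (U+0142) and ź (U+017A) have no ISO-8859-1 code point
    }
}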
For the specific character sets you ask about, there is ArmSCII for Armenian, a variety of Cyrillic encodings for Russian -- depending on where you look and who you ask, Windows codepage 1251 or KOI-8R would be regarded as the dominant one --, and similarly a variety of Korean standards, though KSC 5601 seems to dominate at least for email (the link has forward pointers to several others).
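To illustrate why it matters which Cyrillic encoding was used, here is a minimal Java sketch; it assumes the JDK's extended charsets (windows-1251, KOI8-R) are available, which they are in standard builds, and the word is arbitrary:

import java.nio.charset.Charset;

// Sketch only: the same raw bytes read as completely different text under
// two common Cyrillic encodings, and as Latin mojibake under ISO-8859-1.
public class CyrillicDecodingSketch {
    public static void main(String[] args) {
        String word = "\u043C\u0438\u0440";                      // "мир"
        byte[] raw = word.getBytes(Charset.forName("windows-1251"));

        System.out.println(new String(raw, Charset.forName("windows-1251"))); // мир
        System.out.println(new String(raw, Charset.forName("KOI8-R")));       // other Cyrillic letters
        System.out.println(new String(raw, Charset.forName("ISO-8859-1")));   // Latin mojibake
    }
}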
While Korean is nominally analyzable as a roughly alphabetic writing system, the traditional encoding approach has been to create glyphs for each possible combined syllable, resulting in a large character set which has more in common with Chinese or Japanese than with typical 8-bit alphabetic encodings. I believe the composing jamo characters have only become available for practical use when they were included in Unicode.
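To make the jamo point concrete, here is a minimal Java sketch: Unicode normalization can decompose a precomposed Hangul syllable into its constituent jamo (the syllable is an arbitrary example).

import java.text.Normalizer;

// Sketch only: NFD decomposes a precomposed Hangul syllable into conjoining jamo.
public class HangulJamoSketch {
    public static void main(String[] args) {
        String syllable = "\uD55C";   // 한, one precomposed syllable
        String jamo = Normalizer.normalize(syllable, Normalizer.Form.NFD);
        for (char c : jamo.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // U+1112, U+1161, U+11AB -- the three jamo letters of the syllable
    }
}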
For a messaging system in particular, you have two choices, only one of which makes proper sense, really: Design the protocol to tag the encoding of every transmitted character, and implement transcoding in all clients; or just use Unicode everywhere.
The one remaining challenge is to ensure that each client has the necessary fonts to display the glyphs they receive. Things are slowly improving, but this is a complex matter.
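A minimal sketch of the second option, in Java and with made-up helper names: every client encodes outgoing text as UTF-8 and decodes incoming bytes as UTF-8, so nothing on the wire needs per-character encoding tags.

import java.nio.charset.StandardCharsets;

// Sketch only: UTF-8 everywhere for a messaging protocol.
public class Utf8WireSketch {

    static byte[] toWire(String message) {
        return message.getBytes(StandardCharsets.UTF_8);
    }

    static String fromWire(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Mixed-script message: Latin, Cyrillic, Armenian and Korean all round-trip.
        String msg = "hi / \u043F\u0440\u0438\u0432\u0435\u0442 / \u0562\u0561\u0580\u0565\u0582 / \uC548\uB155";
        System.out.println(fromWire(toWire(msg)).equals(msg)); // true
    }
}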
There is no out-of-band information. With the legacy systems you are expected to know the usual encodings and try them yourself until one works, which is why anything but Unicode is dumb today (and anything but UTF-8 as the Unicode encoding is a failure too). It is not true that no one uses Unicode today: UTF-8 is the default XML encoding, the default IETF and W3C encoding, the default Linux encoding, etc. Building a new multilingual system around anything but UTF-8 today is a big mistake.
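To make the "try them yourself until one works" approach concrete, here is a minimal Java sketch of strict trial decoding; the candidate list and sample bytes are made up, and a successful strict decode only proves the bytes are valid in that encoding, not that the guess is actually right (ISO-8859-1, for instance, accepts any byte sequence, so it goes last).

import java.nio.ByteBuffer;
import java.nio.charset.*;
import java.util.List;

// Sketch only: strict decoding makes wrong guesses fail loudly instead of
// silently producing mojibake.
public class GuessEncodingSketch {

    static String decodeWithFallback(byte[] data, List<String> candidates) {
        for (String name : candidates) {
            try {
                CharsetDecoder decoder = Charset.forName(name)
                        .newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT);
                return decoder.decode(ByteBuffer.wrap(data)).toString();
            } catch (CharacterCodingException e) {
                // Not this encoding; try the next candidate.
            }
        }
        throw new IllegalArgumentException("No candidate encoding matched");
    }

    public static void main(String[] args) {
        byte[] data = "\u00E6\u00F8\u00E5".getBytes(StandardCharsets.ISO_8859_1); // "æøå"
        System.out.println(decodeWithFallback(data, List.of("UTF-8", "ISO-8859-1")));
    }
}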
I have several projects I've worked on that are set up for internationalization.
From the programming perspective, I have everything pretty much set up and have put all of the strings into an XML file or properties file. I wish to get these files translated into other languages, such as: Italian (it), Spanish (es), German (de), Brazilian Portuguese (pt-br), Chinese Simplified (zh-cn), Chinese Traditional (zh-tw), Japanese (ja), Russian (ru), Hungarian (hu), Polish (pl), and French (fr).
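(For context, this is roughly how those externalized strings are looked up at runtime; a minimal Java sketch assuming .properties bundles on the classpath, with the bundle name "messages" and the key "greeting" made up for the example.)

import java.util.Locale;
import java.util.ResourceBundle;

// Sketch only: ResourceBundle picks the best-matching properties file, e.g.
// messages.properties (default), messages_de.properties, messages_zh_CN.properties.
public class MessagesSketch {
    public static void main(String[] args) {
        ResourceBundle bundle =
                ResourceBundle.getBundle("messages", Locale.forLanguageTag("zh-CN"));
        System.out.println(bundle.getString("greeting"));
    }
}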
I've considered using services like Google Translate; however, I feel that these automatic translation tools are still a bit weak.
In summary, I'm curious whether others have used professional translation services for their programs; if so, which ones would people recommend, and how did you coordinate translation updates with the translation teams? Any idea what I should expect to pay? Or is there a better way of doing this that I'm not aware of?
Machine translation services like Google, Bing etc. are not a good choice. As you mention, these services are in reality still in their infancy, and more importantly using them will most likely give your non-English customers a bad impression of your application.
If you want top quality translation, you will need to employ the services of a professional translation agency. Translators need to understand your application in order to translate the text correctly, so providing them with the application itself or screen captures of the English product will help.
You will pay per word - the rates vary from agency to agency, and also from language to language.
The other alternative is using crowd-sourced translations, from GetLocalization for example.
To summarize, proper localization is not just a matter of translating the text - you need to build a relationship with your translators, and ensure they understand your application and the context of the strings that they are translating, otherwise you will end up with a linguistically poor application, that will reflect badly on your company.
I'd like to learn the foundations of encodings, characters and text. Understanding these is important for dealing with large sets of text, whether that is log files or text sources for building algorithms for collective intelligence. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."
I don't say I need to learn about advanced topics right away. But I need to know:
Bit- and byte-level knowledge of encodings.
Characters and alphabets not used in English.
Multi-byte encodings. (I understand some Chinese and Japanese, and parsing them is important; see the small byte-level sketch after this list.)
Regular expressions.
Algorithms for text processing.
Parsing natural languages.
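To illustrate the level I mean by the first and third bullets, here is a minimal sketch (in Java, with arbitrary example characters) of how the same characters occupy different numbers of bytes in different encodings:

import java.nio.charset.StandardCharsets;

// Sketch only: byte-level view of one ASCII character and one Chinese character.
public class ByteLevelSketch {

    static void dump(String label, byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(label + ": " + bytes.length + " bytes -> " + hex.toString().trim());
    }

    public static void main(String[] args) {
        String ascii = "A";          // U+0041
        String chinese = "\u4E2D";   // 中, U+4E2D

        dump("A in UTF-8 ", ascii.getBytes(StandardCharsets.UTF_8));       // 1 byte:  41
        dump("A in UTF-16", ascii.getBytes(StandardCharsets.UTF_16BE));    // 2 bytes: 00 41
        dump("中 in UTF-8 ", chinese.getBytes(StandardCharsets.UTF_8));     // 3 bytes: E4 B8 AD
        dump("中 in UTF-16", chinese.getBytes(StandardCharsets.UTF_16BE));  // 2 bytes: 4E 2D
    }
}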
I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time) needs the processing, parsing and analysis of large amounts of text.
I'm looking for some resources (maybe books?) to get me started on some of these bullets. (I find many helpful discussions on regular expressions here on Stack Overflow, so you don't need to suggest resources on that topic.)
In addition to Wikipedia, Joel Spolsky's article on encodings is really good too.
This free character map is a nice resource for all Unicode characters.
This regular expression tutorial can be helpful.
Specifically on NLP and Japanese, you could take a look at this Japanese NLP project.
On text processing, this Open Source project can be useful.
As is usual for most general "I want to learn about X topic" questions, Wikipedia is a good place to start:
http://en.wikipedia.org/wiki/Character_encoding
http://en.wikipedia.org/wiki/Natural_language_processing