I have a bit of an unorthodox question and cannot think of an approach to tackle it. I have some letters written like this:
  /\      |---\    /---\
 /  \     |___/   |
/----\    |   \   |
/    \    |___/    \---/
Now, the idea is to read this content (possibly from a text file) and parse it into the real letters it represents. So this should be parsed to ABC.
I understand this is not OCR, but I have no idea if something like that is possible. I am not asking for a solution, but rather: how would you best attack this problem? What would be a good criterion for distinguishing when a 'letter' starts and when it ends?
Based on the comments it sounds like you could store a character font map (2-dimensional array for each character) and then read the input file and buffer a number of lines equal to the height of the characters.
Then, for each group of lines you would want to segment the input based on the width of the characters and slide across horizontally, looking for matches against your font map.
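As a very rough illustration, here is a minimal Python sketch of that font-map matching, assuming a single fixed-size font; the glyph patterns and cell width are hypothetical and would need to match your actual letter art:

    FONT_HEIGHT = 4
    FONT_WIDTH = 8  # cell width, including padding between letters

    FONT_MAP = {
        "A": ["  /\\    ",
              " /  \\   ",
              "/----\\  ",
              "/    \\  "],
        # ... one entry per supported letter
    }

    def recognize(lines):
        width = max(len(line) for line in lines)
        lines = [line.ljust(width) for line in lines]  # pad so slicing is safe
        text = ""
        for col in range(0, width, FONT_WIDTH):
            cell = [line[col:col + FONT_WIDTH] for line in lines]
            for letter, pattern in FONT_MAP.items():
                if cell == [row.ljust(FONT_WIDTH) for row in pattern]:
                    text += letter
                    break
            else:
                text += "?"  # no match in the font map
        return text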
If you need to support multiple fonts then things get more complicated, and you'd benefit more from some kind of neural-net approach to character recognition.
One important aspect to keep in mind about how OCR typically works is that it takes an arbitrary image and "pixelates" it, generating a much lower-resolution image. In your case you've already got a "pixelated" representation of the image, and all you'd have to do is read in the input and feed it into the rest of the pipeline.
I would still approach this as an OCR-esque problem.
You could first draw the characters onto an image and run it through an available OCR library.
Or you could do it yourself.
Pre-process it by first converting vertical and horizontal characters into lines.
Then where there are forward and backslashes, approximate start and finish points of the curve by where they meet the previous horizontal and vertical (a different approach would be needed for letters such as 'o' or 'e').
Once you have this image, a simple pattern-analysis approach such as naive Bayes should be able to produce reliable results.
Whether the pre-processing would actually improve accuracy, I'm not sure.
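For the naive Bayes idea, a rough sketch using scikit-learn, assuming each character has already been segmented into a fixed-size text cell (training_glyphs, training_labels, and unknown_glyph are placeholders for labelled data):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    def to_features(cell_lines, width=8, height=4):
        # flatten a text cell into a 0/1 vector: 1 for ink, 0 for whitespace
        grid = [cell_lines[r].ljust(width)[:width] if r < len(cell_lines)
                else " " * width
                for r in range(height)]
        return np.array([0 if ch == " " else 1 for row in grid for ch in row])

    X = np.array([to_features(g) for g in training_glyphs])
    clf = BernoulliNB().fit(X, training_labels)
    predicted_letter = clf.predict([to_features(unknown_glyph)])[0]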
A homograph is a word that shares the same written form as another word but has a different meaning, like "right" in the sentences below:
Success is about making the right decisions.
Turn right after the traffic light.
The English word "right" is translated to Swedish as "rätt" in the first case and as "höger" in the second. The correct translation is possible by looking at the context (surrounding words).
Question 1. I wonder if fastText aligned word embeddings can help with translating these homographs, or other words with several possible translations, into another language?
[EDIT] The goal is not to query the model for the right translation. The goal is to pick the right translation when the following information is given:
the two (or several) possible translation options in the target language, like "rätt" and "höger"
the surrounding words in the source language
Question 2. I loaded the English pre-trained vector model and the English aligned vector model. While both were trained on Wikipedia articles, I noticed that the distances between two words were roughly preserved, but the sizes of the dataset files (wiki.en.vec vs wiki.en.align.vec) differ noticeably (by about 1 GB). Wouldn't it make sense to only use the aligned version? What information is not captured by the aligned dataset?
For question 1, I suppose it's possible that these 'aligned' vectors could help translate homographs, but they still face the problem that any token only has a single vector – even if that one token has multiple meanings.
Are you assuming that you already know that right[en] could be translated into either rätt[se] or höger[se], from some external table? (That is, you're not using the aligned word-vectors as the primary means of translation, just an adjunct to other methods?)
If so, one technique that might help would be to see which of rätt[se] or höger[se] is closer to the other words surrounding your particular instance of right[en]. (You might tally each candidate's rank-closeness to every word within n spots of right[en], or calculate its cosine-similarity to the average of the n words around right[en], for example.)
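A sketch of that context-similarity check using gensim, assuming the aligned English and Swedish vectors share one space (file names are illustrative):

    import numpy as np
    from gensim.models import KeyedVectors

    en = KeyedVectors.load_word2vec_format("wiki.en.align.vec")
    sv = KeyedVectors.load_word2vec_format("wiki.sv.align.vec")

    def pick_translation(context_words, candidates):
        # average the aligned vectors of the source-language context words
        ctx = np.mean([en[w] for w in context_words if w in en], axis=0)
        ctx /= np.linalg.norm(ctx)
        # pick the target-language candidate closest to that context
        def score(c):
            v = sv[c]
            return float(np.dot(ctx, v / np.linalg.norm(v)))
        return max(candidates, key=score)

    pick_translation(["turn", "after", "traffic", "light"], ["rätt", "höger"])
    # one would hope this favours "höger" for the spatial context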
(You could potentially even do this with non-aligned word vectors, if your more-precise words have multiple, alternate, non-homograph/non-polysemous translations in English. For example, to determine which sense of right[en] is more likely, you could use the non-aligned English word vectors for correct[en] and rightward[en] – less polysemous correlates of rätt[se] & höger[se] – to check for similarity-to-surrounding words.)
A write-up that might spark other ideas is "Linear algebraic structure of word meanings" which, quite surprisingly, is able to tease out alternate meanings of homograph tokens even when the original word-vectors training was not word-sense-aware. (Might the 'atoms of discourse' in their model be equally findable across merged/aligned multi-language vector spaces, and then the closeness-of-context-words to different atoms a good guide to word-sense disambiguation?)
For question 2, you imply the aligned word set is smaller in size. Have you checked if that's just because it includes fewer words? That seems the simplest explanation, and just checking which words are left out would let you know what you're losing.
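Checking that is a one-off with gensim (attribute names as of gensim 4.x; loading these files takes a while):

    from gensim.models import KeyedVectors

    plain = KeyedVectors.load_word2vec_format("wiki.en.vec")
    aligned = KeyedVectors.load_word2vec_format("wiki.en.align.vec")
    print(len(plain.key_to_index), "vs", len(aligned.key_to_index))
    missing = set(plain.key_to_index) - set(aligned.key_to_index)
    print(len(missing), "words only in the plain vectors")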
I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the PDF to SVG (via pdf2svg), and then parse the resulting SVG to extract single-character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word, and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to SVG I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script, pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
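Beyond the CLI (pdf2txt.py -t xml input.pdf), pdfminer.six also exposes the same per-glyph information through its Python API. A minimal sketch (the file name is illustrative):

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

    for page in extract_pages("input.pdf"):
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        for obj in line:
                            if isinstance(obj, LTChar):
                                # bbox is (x0, y0, x1, y1) in page coordinates
                                print(obj.get_text(), obj.fontname, obj.bbox)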
For those who are not familiar with what a homophone is, I provide the following examples:
our & are
hi & high
to & too & two
While using the Speech API included with iOS, I am encountering situations where a user may say one of these words, but it will not always return the word I want.
I looked into the alternativeSubstrings property, wondering if it would help, but in my testing of the above words, it always comes back empty.
I also looked into the Natural Language API, but could not find anything in there that looked useful.
I understand that as a user adds more words, the Speech API can begin to infer context and correct for these, but my use case will not work well with this, since it will often want only one or two words at most, limiting the effectiveness of context.
An example of contextual processing:
Using the words above on their own, I get these results:
are
hi
to
However, if I put together the following sentence, you can see they are all wrong:
I am too high for our ladder
Ideally, I would either get a list back containing [are, our], [to, too, two], [hi, high] for each transcription segment, or would have a way to compare a string against a function that supports homophones.
An example of this would be:
if myDetectedWord == "to" then { ... }
Where myDetectedWord can be [to, too, two], and this function would return true for each of these.
This is a common NLP dilemma, and I'm not sure what your desired output in this application might be. However, you may want to bypass this problem in your design/architecture process, if at all possible; otherwise, this problem turns into a real challenge.
That being said, if you wish to really get into it, I like this idea of yours:
string against a function
This might be more efficient and performance-friendly.
One way I'd like to solve this problem is through RegEx processing, instead of endless loops and arrays. You could prototype with loops and arrays to begin with and see how it works, then switch to regular expressions to gain performance.
You could, for instance, define fixed sets of alternatives in regular expressions and quickly check them against your string (word by word, maybe using back-referencing), and you can add as many boundaries to your expressions as you wish.
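A minimal sketch of that fixed-set idea (in Python for brevity; the same regex carries over to Swift via NSRegularExpression, and the homophone groups are placeholders for a real lookup table):

    import re

    HOMOPHONES = [
        {"to", "too", "two"},
        {"hi", "high"},
        {"our", "are"},
    ]

    def same_homophone_group(detected, expected):
        # true when both words belong to the same homophone set
        d, e = detected.lower(), expected.lower()
        return d == e or any(d in g and e in g for g in HOMOPHONES)

    # regex variant: match any member of one group as a whole word
    to_group = re.compile(r"\b(?:to|too|two)\b", re.IGNORECASE)

    print(same_homophone_group("two", "to"))            # True
    print(bool(to_group.search("I walked two miles")))  # True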
Your fixed sets can also be designed based on the probability of certain words occurring in a certain part of a string. For instance,
^I
vs
^eye
The probability of "I" being the first word is much higher than that of "eye".
The probability of "I" appearing in any part of a string is also higher than that of "eye".
You might want to weight words based on that.
I'd say the key is to narrow your desired outputs down as far as possible to increase accuracy (maybe even to a vocabulary of 100 words, if possible) if you wish to have a good, working application.
Good project though, I hope you like/enjoy the challenge.
I often work with scanned papers. The papers contain tables (similar to Excel tables) which I need to type into the computer manually. To make the task worse, the tables can have different numbers of columns. Manually entering them into Excel is mundane, to say the least.
I thought I could save myself a week of work if I put a program to OCR it. Would it be possible to detect header text areas with OpenCV, and then OCR the text at the detected image coordinates?
Can I achieve this with OpenCV, or do I need an entirely different approach?
Edit: The example table is really just a standard table, similar to what you can see in Excel and other spreadsheet applications.
This question seems a little old, but I was also working on a similar problem and came up with my own solution, which I am explaining here.
For reading text using any OCR engine there are many challenges in getting good accuracy, including the following main cases:
Presence of noise due to poor image quality or unwanted elements/blobs in the background region. This requires some pre-processing, like noise removal, which can easily be done using a Gaussian filter or a standard median filter. Both are available in OpenCV.
Wrong orientation of the image: because of wrong orientation, the OCR engine fails to segment the lines and words in the image correctly, which gives the worst accuracy.
Presence of lines: while doing word or line segmentation, the OCR engine sometimes tries to merge words and lines together, thus processing the wrong content and giving wrong results.
There are other issues as well, but these are the basic ones.
In this case I think the scan quality is quite good and the layout is simple, so the following steps can be used to solve the problem.
Simple image binarization will remove the background content, leaving only the necessary content.
Now we have to remove the lines, which in this case form the tabular grid. The grid can be identified using connected components, and the large connected components can be removed. This gives us the final image that needs to be fed to the OCR engine.
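A minimal OpenCV sketch of those two steps, assuming a reasonably clean grayscale scan (file names are illustrative):

    import cv2

    img = cv2.imread("table_scan.png", cv2.IMREAD_GRAYSCALE)
    # Otsu binarization: ink becomes white (255) on a black background
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # the tabular grid tends to form one huge connected component; erase
    # anything that spans more than half the page in either direction
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    cleaned = binary.copy()
    for i in range(1, n):  # label 0 is the background
        w = stats[i, cv2.CC_STAT_WIDTH]
        h = stats[i, cv2.CC_STAT_HEIGHT]
        if w > img.shape[1] // 2 or h > img.shape[0] // 2:
            cleaned[labels == i] = 0

    # back to black-on-white for the OCR engine
    cv2.imwrite("for_ocr.png", 255 - cleaned)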
For OCR we can use the Tesseract open-source OCR engine. I got the following results from OCR:
Caption title
header! header2 header3
row1cell1 row1cell2 row1cell3
row2cell1 row2cell2 row2cell3
As we can see, the result is quite accurate, but there are some issues, like
header!, which should be header1; this is because the OCR engine misread the 1 as !. This problem can be solved by further processing the result using regex-based operations.
After post-processing, the OCR result can be parsed to read the row and column values.
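For instance, a tiny regex clean-up pass (the substitution table is illustrative, and ocr_text stands for the raw Tesseract output):

    import re

    FIXES = [
        (r"(?<=[a-zA-Z])!", "1"),   # 'header!' -> 'header1'
        (r"(?<=[a-zA-Z])\|", "l"),  # hypothetical: pipes misread inside words
    ]

    def clean(token):
        for pattern, repl in FIXES:
            token = re.sub(pattern, repl, token)
        return token

    rows = [[clean(t) for t in line.split()] for line in ocr_text.splitlines()]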
Also, in this case, font information can be used to classify the sheet title, headings, and normal cell values.
Summary
I am trying to design a heuristic for matching up sentences in a translation (from the original language to the translated language) and would like guidance and tips. Perhaps there is a heuristic that already does something similar? So given two text files, I would like to be able to match up the sentences (so I can pick out a sentence and say this is the translation of that sentence).
Details
The input text would be translated novels. So I do not expect the translations to be literal, although, using something like google translate might be a good way to test the accuracy of the heuristic.
To help me, I have a library that will gloss the contents of the translated text and give me the definitions of the words in the sentence. Other things I know:
Chapters and order are preserved; I know that the first sentence in chapter three will match with the first sentence in chapter three of the translation (Note, this is not strictly true; the first sentence might match up with the first two sentences, or even the second sentence)
I can calculate the overall size (characters, sentences, paragraphs), which could give me an idea of the average difference in sentence size (for example, the translation might be 30% longer).
Looking at some books I have, the translated version has about 30% more sentences than the original text.
Implementation
(if it matters)
I am planning to do this in Java - but I am not that fussed - any language will do.
I am not greatly concerned about speed.
I guess to be sure of the matches, some user feedback might be required, like saying "Yes, this sentence definitely matches with that sentence." This would give the heuristic some more ground to stand on, though it would mean that the user needs a little proficiency in the languages.
Background
(for those interested)
The reason I want to make this is that I want it to assist with my foreign language study. I am studying Japanese and find it hard to find "good" material (where "good" is defined by what I like). There are already tools to do something similar with subtitles from videos (an easier task - using the timing information of the video). But nothing, as far as I know, for texts.
There are tools called "sentence aligners" used in NLP research that do exactly what you want.
I recommend hunalign:
http://mokk.bme.hu/resources/hunalign/
and MS sentence aligner:
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
Both are quite OK, but remember that nothing is perfect. Sentences that are too hard to align will be dropped, and some sentences may be wrongly aligned.
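For intuition about what these aligners do, here is a toy length-based aligner in the spirit of Gale & Church (roughly what such tools build on): dynamic programming over sentence lengths, allowing 1-1, 1-2, 2-1 matches plus skips. The 1.3 ratio stands in for the "translation is about 30% longer" observation from the question; this is a sketch, not a substitute for hunalign.

    def align(src, tgt, ratio=1.3):
        def cost(a, b):
            # penalty for pairing a source chars with b target chars
            return abs(a * ratio - b)

        INF = float("inf")
        n, m = len(src), len(tgt)
        best = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if best[i][j] == INF:
                    continue
                # match shapes: 1-1, 1-2, 2-1, and skipping a sentence
                for di, dj in ((1, 1), (1, 2), (2, 1), (1, 0), (0, 1)):
                    if i + di <= n and j + dj <= m:
                        a = sum(len(s) for s in src[i:i + di])
                        b = sum(len(t) for t in tgt[j:j + dj])
                        c = best[i][j] + cost(a, b)
                        if c < best[i + di][j + dj]:
                            best[i + di][j + dj] = c
                            back[i + di][j + dj] = (i, j)
        # walk back to recover the aligned sentence groups
        pairs, i, j = [], n, m
        while (i, j) != (0, 0):
            pi, pj = back[i][j]
            pairs.append((src[pi:i], tgt[pj:j]))
            i, j = pi, pj
        return list(reversed(pairs))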