Is there any way to replace an incorrect word with the correct one (for example, "affect" instead of "effect") in a given input file using the fastText library created by Facebook?
Not really; the library learns the text as it is. There are no facilities for pre-evaluation and editing. You would also have a hard time learning the problem from context: how do you determine that the actual usage is, indeed, incorrect?
I am dealing with tons of PDF documents (petitions data) filled with text, numbers, tabular data, etc. The client's objective is to summarize any such document to reduce the manual effort of reading the entire thing. I have tried conventional methods like LSA, the Gensim summarizer, the BERT extractive summarizer, and PySummarizer.
The results are not good at all. Please suggest any industry-level summarizer (extractive/abstractive) that would give me a good start on solving this issue.
First, you will need to know exactly what data the company wants extracted from the documents. After that, you may be able to convert the documents to raw text using OCR or some other PDF tool, and then extract the data you need. If the company isn't being clear on how they want you to summarize the data, that is something to talk to them about. It might be as simple as setting a title for the document, or classifying it. If it's document classification, I can help you with that; I made a repo for that purpose a little while ago.
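If the PDFs have a selectable text layer (rather than being scanned images), a minimal way to get raw text to experiment with is something like pdfminer.six; the file name below is just a placeholder:

```python
# A minimal sketch using pdfminer.six (pip install pdfminer.six).
# "petition.pdf" is a placeholder file name; scanned PDFs would need OCR instead.
from pdfminer.high_level import extract_text

text = extract_text("petition.pdf")   # returns the document's text layer as one string

# Crude first pass: keep non-empty lines so you can inspect what the extractor produced.
lines = [line.strip() for line in text.splitlines() if line.strip()]
print("\n".join(lines[:40]))          # preview the first 40 lines
```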
Local travel cards in Saint Petersburg, Russia have long ID numbers that aren't easy to read and type into a web page when topping up the card online. So I want to build a small app that takes a photo of a travel card and parses the number out.
The task is a bit easier than free-form recognition:
the card is a well-known size
the ID numbers have a known length, are located in a well-known position on the card, and are digits only, no letters (okay, there are two variations I think, and maybe they will add one or two more in the future)
even the font is known in advance
even the first several digits are the same for most cards (so far only two prefixes are used)
How would you do it? Are there any libraries tuned not for general OCR, but for a "hinted" OCR like I need?
Best regards,
Artem.
P.S.
Actually, a free/cheap web service for this task would also be good enough.
Yes. Google has a library called Tesseract, and there is an iOS SDK on GitHub that you can import into your application. The SDK comes with documentation that explains how to set it up in your app. It has methods that return a string with the text of the card, BUT it will be ALL of the text from the card. So the best thing to do would be to:
1 "clip" the original image to extract a sub image that displays only the portion of the card you wish to get the numbers from.
2 Process this sub image through Tesseract to retrieve the string you are looking for.
3 Then parse through the string and pick out the data that you need.
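As a rough sketch of those three steps (in Python with pytesseract rather than the iOS SDK, since the pipeline is the same; the crop box and the digit pattern are placeholders you would tune to the actual card layout):

```python
# Minimal sketch of the clip -> OCR -> parse pipeline using Pillow + pytesseract.
# The file name, crop box, and digit-length range are placeholders, not real values.
import re
from PIL import Image
import pytesseract

card = Image.open("card_photo.jpg")

# 1. Clip the region of the card that contains the ID number (left, top, right, bottom in pixels).
number_region = card.crop((50, 400, 600, 470))

# 2. Run Tesseract on the cropped region, restricting it to a single line of digits.
raw = pytesseract.image_to_string(
    number_region,
    config="--psm 7 -c tessedit_char_whitelist=0123456789",
)

# 3. Parse the string and keep only the digit sequence.
match = re.search(r"\d{10,19}", raw)
card_number = match.group(0) if match else None
print(card_number)
```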
But just be warned: it can be a bit quirky. This SDK tends to recognize words best from images that are scanned rather than photographed, because although it is an advanced piece of technology, it isn't perfect. So to get it to work as well as possible, try to get scanned copies of the originals.
Best of luck.
The ideal solution for you would have three components:
1) Detection of the card. This is useful because, if you have detection, end users have a much easier time actually using the scanner, since they can hold the phone above the card in an arbitrary orientation.
2) An accurate OCR component. Ideally one that is customizable for the exact font on the card and the exact position on the card.
3) A parsing mechanism. This lets you obtain the exact string written on the card without writing a huge amount of OCR parsing code.
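For component 3, even without a dedicated parser class, the constraints you listed (digits only, a couple of known prefixes) map directly onto a regular expression. A minimal sketch in Python, where the prefixes and the 19-digit length are made-up examples, not the real card specification:

```python
# Hypothetical validation of an extracted card number against the expected format.
# The prefixes ("9643", "9640") and the 19-digit length are illustrative guesses.
import re

KNOWN_PREFIXES = ("9643", "9640")
CARD_NUMBER_RE = re.compile(r"^\d{19}$")

def parse_card_number(ocr_text):
    """Strip non-digits from OCR output and accept it only if it matches the expected format."""
    digits = re.sub(r"\D", "", ocr_text)
    if CARD_NUMBER_RE.match(digits) and digits.startswith(KNOWN_PREFIXES):
        return digits
    return None

print(parse_card_number("9643 9100 0012 3456 789"))  # -> 9643910000123456789
```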
The BlinkID SDK has all three of these. It has a preset for detecting cards in the ID-1 format, it has an integrated OCR engine, and it provides a RegexParser, where you can define the exact format of the text you're trying to extract from the document.
BlinkID was initially built for scanning ID documents, which have properties very similar to the problem you're trying to solve.
Note: I'm one of the developers working on BlinkID.
I have a service that allows users (admins) to change the terminology the site uses. My designer wants me to use the format "A Group". The problem is that, for some terms, it should be "An", not "A".
Is there any way to reliably detect which to use? What about localization?
I can brute-force it and get 90% of the way there by checking whether the first letter is a consonant or a vowel. That won't work for all words, though, and it doesn't cover any language except English.
In my opinion you've got only two options:
1. Check the first letter, and scan the rest of the word to make sure it doesn't contain any non-English letters.
2. Provide a dictionary of English nouns; then you can easily look up your word to find whether it needs "a" or "an".
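As a starting point, the first-letter check plus small exception lists fits in a few lines; the exception words below are only examples, not a complete dictionary:

```python
# Crude "a"/"an" chooser: vowel-vs-consonant first letter plus a few known exceptions.
# The exception sets are illustrative and far from complete; English only.
VOWEL_SOUND_EXCEPTIONS = {"hour", "honest", "heir"}         # consonant letter, vowel sound
CONSONANT_SOUND_EXCEPTIONS = {"user", "university", "one"}  # vowel letter, consonant sound

def indefinite_article(word):
    w = word.lower()
    if w in VOWEL_SOUND_EXCEPTIONS:
        return "an"
    if w in CONSONANT_SOUND_EXCEPTIONS:
        return "a"
    return "an" if w[:1] in "aeiou" else "a"

print(indefinite_article("Group"), "Group")  # a Group
print(indefinite_article("Event"), "Event")  # an Event
print(indefinite_article("User"), "User")    # a User
```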
Although the "a versus an" issue is very specific, what you're describing here is a natural language processing issue. Essentially you are being asked to write code that generates a grammatically correct piece of text.
I think you should try to explain the implications to the designer, especially if you end up localizing into other languages. Your time is probably better spent working on your app's business logic than on language processing.
I downloaded the Evernote API Xcode project, but I have a question regarding the OCR feature. With their OCR service, can I take a picture and show the extracted text in a UILabel, or does it not work like that?
Or is the extracted text not shown to me at all and only used for the photo search function?
Has anyone ever had any experience with this or any ideas?
Thanks!
Yes, but it looks like it's going to be a bit of work.
When you get an EDAMResource that corresponds to an image, it has a property called recognition that returns an EDAMData object containing the XML that defines the recognition info. For example, I attached an image to a note, inspected the recognition info attached to the corresponding EDAMResource object, and found this:
(the XML is on pastie.org, because it's too big to fit in an answer)
As you can see, there's a LOT of information here. The XML format is defined in the API documentation, so that is where you would parse the XML and extract the relevant information yourself. Fortunately, the structure of the XML is quite simple (you could write a parser in a few minutes). The hard part will be figuring out which parts you want to use.
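As a rough idea of that parsing step (sketched in Python for brevity rather than Objective-C, and assuming the recoIndex/item/t element names described in the recognition-format documentation; verify them against the XML you actually get back):

```python
# Sketch of pulling candidate words out of Evernote recognition XML.
# Element/attribute names (item, t, w) are assumed from the documented recognition
# format; double-check them against the XML you actually receive.
import xml.etree.ElementTree as ET

def candidate_words(reco_xml, min_weight=50):
    """Yield (text, weight) pairs for recognition candidates above a confidence threshold."""
    root = ET.fromstring(reco_xml)
    for item in root.iter("item"):       # each item is a recognized region (rectangle)
        for t in item.findall("t"):      # each t is one candidate reading of that region
            weight = int(t.get("w", "0"))
            if weight >= min_weight:
                yield t.text, weight
```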
It doesn't really work like that. Evernote doesn't do "OCR" in the pure sense of turning document images into coherent paragraphs of text.
Evernote's recognition XML (which you can retrieve via the technique that @DaveDeLong shows above) is most useful as an index to search against; the service provides sets of rectangles and sets of possible words/text fragments with probability scores attached. This makes a great basis for matching search terms, but a terrible one for constructing a single string that represents the document.
(I know this answer is like 4 years late, but Dave's excellent description doesn't really address this philosophical distinction that you'll run up against if you try to actually do what you were suggesting in the question.)
I've got a bunch of .DOC documents. I'm not even positive they are Word documents, but even if they are, I need to open and parse them with, e.g., Python to extract information from them.
The problem is, I couldn't figure out how they were encoded: UltraEdit's conversion function wouldn't correct the text no matter which encoding I tried, and OpenOffice 3.2 also failed to display the contents correctly (guessing Windows-1252).
Here's an example, in the hope that someone recognizes which code page it is:
"lÕAssemblŽe gŽnŽrale" instead of "l'Assemblée générale"
Thank you for any tip.
The Greenstone digital library (http://www.greenstone.org/) provides pretty good text extraction from Word documents, including encoding detection.
Running MS Word in server mode gives you a range of scripting options; I'm sure detecting the encoding will be possible.
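If you end up scripting this in Python anyway, one low-tech way to hunt for the code page is to re-encode the garbled sample with the code page your viewer assumed (Windows-1252, per the question) and then decode the resulting bytes with a handful of candidates; the candidate list below is just a guess:

```python
# Quick-and-dirty probe: undo the wrong decoding, then try other code pages and
# eyeball which result looks like correct French. The candidate list is a guess.
garbled = "lÕAssemblŽe gŽnŽrale"
raw = garbled.encode("cp1252")  # recover the original bytes, assuming the viewer used Windows-1252

for codec in ("mac_roman", "cp850", "cp437", "latin-1", "utf-8"):
    try:
        print(f"{codec:10} -> {raw.decode(codec)}")
    except (UnicodeDecodeError, LookupError):
        print(f"{codec:10} -> (failed)")
```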