Is it possible to read the text on a gif with python? I don't think it is but if it is if you could show me how to do it that would be great. Heres what I've been trying to do
def getContents(url):
x = urllib.request.urlopen(url).read()# would add decode, but it gives errors
return x
doing "getContents(gif)" returns weird characters, that look like bytes "\x00" etc. Not sure if their is anything in python that can read them. I apologize if my description is too vague, but its the best I can describe it the problem is open a gif then reading it or opening a jpg and reading it would give read text that's returned.
This is called OCR. There are a number of such libraries, give tesseract a try:
http://code.google.com/p/pytesser/
Related
I often work with scanned papers. The papers contain tables (similar to Excel tables) which I need to type into the computer manually. To make the task worse the tables can be of different number of columns. Manually entering them into Excel is mundane to say the least.
I thought I can save myself a week of work if I can put a program to OCR it. Would it be possible to detect headers text areas with the OpenCV and OCR the text behind the detected image coordinates.
Can I achieve this with the help of OpenCV or do I need entirely different approach?
Edit: Example table is really just a standard table similar to what you can see in Excel and other spread-sheet applications, see below.
This question seems a little old but i was also working on a similar problem and got my own solution which i am explaining here.
For reading text using any OCR engine there are many challanges in getting good accuracy which includes following main cases:
Presence of noise due to poor image quality / unwanted elements/blobs in the background region. This will require some pre-processing like noise removal which can be easily done using gaussian filter or normal median filter methods. These are also available in opencv.
Wrong orientation of image: Because of wrong orientation OCR engine fails to segment the lines and words in image correctly which gives the worst accuracy.
Presence of lines: While doing word or line segmentation OCR engine sometimes also tries to merge the words and lines together and thus processing wrong content and hence giving wrong results.
There are other issues also but these are the basic ones.
In this case i think the scan image quality is quite good and simple and following steps can be used solve the problem.
Simple image binarization will remove the background content leaving only necessary content as shown here.
Now we have to remove lines which in this case is tabular grid. This can also be identified using connected components and removing the large connected components. So our final image that is needed to be fed to OCR engine will look like this.
For OCR we can use Tesseract Open Source OCR Engine. I got following results from OCR:
Caption title
header! header2 header3
row1cell1 row1cell2 row1cell3
row2cell1 row2cell2 row2cell3
As we can see here that result is quite accurate but there are some issues like
header! which should be header1, this is because OCR engine misunderstood ! with 1. This problem can be solved by further processing the result using Regex based operations.
After post processing the OCR result it can be parsed to read the row and column values.
Also here in this case to classify the sheet title, heading and normal cell values their font information can be used.
I'm running a genetic algorithm program and can find the best individual at the end of the run (hof[0]), but i want to know which generation produced it. Is there any attributes of hof[0] that will help print the individual and the generation that created it.
I tried looking at the manuals and Google for answers but could not find it anywhere.
I also couldn't find a list of the attributes of individuals that I could print. Can someone point to the right link and documentation to that.
Thanks
This deap post suggest tracking the logbook, or explicitly adding the generation to the individual along with fitness:
https://groups.google.com/g/deap-users/c/r7fZbMwHg3I/m/BAzHh2ogBAAJ
For the latter:
If you are working with the algo locally(recommended if working beyond a tutorial as something always comes up like adding plotting or this very questions) then you can modify the fitness update line to resemble:
fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
for ind, fit in zip(invalid_ind, fitnesses):
ind.fitness.values = fit
ind.generation = gen # now we can: print(hof[0].gen)
if halloffame is not None:
halloffame.update(population)
There is no built in way to do this (yet/to the best of my knowledge), and implementing this so would probably be quite a large task. The simplest of which (simplest in thought, not in implementation) would be to change the individual to be a tuple, where tup[0] is the individual and tup[1] is the generation it was produced in, or something similar.
If you're looking for a hacky way, you could maybe try writing the children of each generation to a text file and cross-checking your final solution with the text file; but other than that I'm not sure.
You could always try posting on their Google Group, though it can take a couple of days for a reply.
Good luck!
I'm finding it difficult to parse a pdf file that's created in a non-english language. I used pdfbox and itext but couldn't find anything in there that could help parse this file. Here's the pdf file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The pdf says that it's created use LaTeX and Tikkana font. I have Tikkana font installed on my machine, but that didn't help. Please help me in this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There's no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to PDF by itself using the same LaTeK/GhostScript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted.
Get the glyph name for the given byte based on the existing encoding array
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
I recorded a radio signal into a .wav, I can open it in audacity and see that there is binary data encoded using a certain algorithm. Does anyone know of a way to process the signal that is contained within the .wav? so that i can extract the binary data from it?
I know that I need to know the encoding algorithm for it to work properly, anyone know of any program that does something like that?
Thanks
sox will convert most audio formats to most other audio formats - including raw binary.
The .wav format is generally very simple and wav files usually don't have compressed data. It's quite feasible to parse it yourself, but much easier to use something already made. So the short answer is to find something that can read wav files in your language of choice.
Here's an example in Python, using the wave module:
import wave
w = wave.open("myfile.wav", "rb")
binary_data = w.readframes(w.getnframes())
w.close()
Now where you go depends on what else you want to do. binary_data is now a python string of the raw bytes. If you just want to chop this and repackage it, it's probably easiest to leave it in this form. If you want to manipulated the data, such as scale it, interpolate, filter, etc, you would probably want to convert this into a sequence of numbers, and for this, in Python, you'd want to convert it to a numpy array. You could do this yourself using the struct module, which is for interpreting strings as packed binary data, or you could just have read in the data using scipy.io.wave module which does this for you. As you can see, most of this becomes fairly language dependent quickly.
My school has Matlab but I can't use it at home so I am trying to learn Octave. I am having trouble saving plots as png files so I can put them in a report.
I read you can use print("filename.png") to save the plots, but I am getting some kind of error I am assuming is due to using latex in my labels
I am using
xlabel('\omega')
Error message: gdImageStringFT: Could not find/open font while printing string w with font Symbol
The plot still saves, but any label with latex in it just doesn't print at all. I know I could just avoid formatting the text, but it just looks so much nicer with latex.
Anyone know what I can do? (ps I am not very advanced with linux just fyi)
So what happens here is that for the png format Octave needs to have the Symbols font at its disposal if you want to include, e.g., greek letters. This is because png is a bitmap format and the letters are rasterized and printed into the picture.
The correct way, or at least the way most people circumvent Octave's / Mathematica's / etc. poor labeling, is to output encapsulated postscript (.eps) with dummy labels. These labels are kept separate in the eps format and one can then use the psfrag package in LaTeX to replace the dummy labels for correct labels. This allows for much better control over the label and gives you access to all of LaTeX's formatting and formulas.
Here or here is a hands on tutorial how to do this with Octave and gnuplot.