Is there any standard for weekly digital magazines, one that understands the metadata of a magazine (e.g. issue number, date, author, editorial columns, series), so that content is searchable and presentable in a more intuitive GUI?
First of all, you need to clarify your question. The first part seems to refer to a file format for digital magazines, the second part to a file format for the metadata associated with a digital publication.
With respect to the metadata: there are several standards used in the industry. For example:
ONIX: http://www.editeur.org/8/ONIX/
Dublin Core: http://dublincore.org/
MARC: http://www.loc.gov/marc/
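To illustrate how magazine fields might map onto one of these schemes, here is a minimal sketch that builds a Dublin Core record in Python. The `record` wrapper element, the sample values, and the choice to carry the issue in `dc:relation` and `dc:identifier` are assumptions made for illustration; only the Dublin Core element namespace is taken from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_article_record(fields):
    """Return an XML string with one Dublin Core element per (name, value) pair."""
    root = ET.Element("record")  # hypothetical wrapper element, not part of Dublin Core
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample data for a single article in a weekly magazine.
article = [
    ("title", "Example column title"),
    ("creator", "Jane Doe"),
    ("date", "2014-03-07"),
    ("identifier", "example-weekly:issue-42:article-3"),
    ("subject", "Editorial"),
    ("relation", "Example Weekly, issue 42"),
]

xml_text = build_article_record(article)
print(xml_text)
```

A reader application could index such records by creator, date, or subject to drive search and navigation.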
With respect to the file format for the publication itself, some formats that might be suitable for digital magazines:
PDF
EPUB (3) supports Fixed Layout, and the IDPF has a working group focussing on comics and digital magazine issues (Advanced/Hybrid Layout, or EPUB 3 AHL WG)
Amazon Kindle has its own KF8 format
HTML5 allows you to deliver pre-paginated contents by setting the viewport dimensions
Local travel cards in Saint-Petersburg, Russia have got huge id numbers that aren't easy to read and type into a web page when topping up the card online. So I want to build a small app that would take a photo of a travel card and parse the number out.
The task is a bit easier than a free form recognition:
the card is of a well-known size
the id numbers are of a known length, are located in a known position on the card and consist of digits only, no letters (okay, there are two variations I think, and maybe they will add 1-2 more in the future)
even the font is known in advance
even the first several digits are the same for most of the cards (so far there are only two prefixes in use)
How would you do it? Are there any libraries tuned not for the general OCR, but for a "hinted" OCR like I need?
Best regards,
Artem.
P.S.
Actually a free/cheap web service for this task would also be good enough
Yes. Tesseract is an open-source OCR library whose development has been sponsored by Google, and there is an iOS SDK for it on GitHub that you can import into your application. Its documentation explains how to set it up in your app. Its methods return the recognized text as a string, BUT that string will contain ALL of the text on the card. So the best thing to do would be to:
1. "Clip" the original image to extract a sub-image that shows only the portion of the card you wish to get the numbers from.
2. Process this sub-image through Tesseract to retrieve the string you are looking for.
3. Parse the string and pick out the data that you need.
But be warned, it can be a bit quirky. The SDK tends to recognize text best in images that were scanned rather than photographed, because although it is an advanced piece of technology, it isn't perfect. To get it to work as well as possible, try to work from scanned copies of the originals.
Best of luck.
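The parsing step can use the "hints" from the question (digits only, known length, known prefixes). Below is a minimal Python sketch of that idea, operating on whatever string the OCR engine returns. The prefixes and the 19-digit length are hypothetical placeholders, since the thread does not state the real values, and the look-alike table is only a guess at typical OCR confusions.

```python
import re

# Hypothetical values: the thread does not give the real prefixes or length.
KNOWN_PREFIXES = ("9643", "7800")
CARD_NUMBER_LENGTH = 19

# Letters that OCR engines often confuse with digits (an assumed, partial list).
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})

# One alternative per prefix: the prefix followed by the remaining digits.
CARD_PATTERN = re.compile(
    "|".join(
        re.escape(prefix) + r"\d{%d}" % (CARD_NUMBER_LENGTH - len(prefix))
        for prefix in KNOWN_PREFIXES
    )
)


def extract_card_number(ocr_text):
    """Return the first plausible card number found in OCR output, or None."""
    for line in ocr_text.splitlines():
        # Map look-alike letters to digits, then drop separators between digit groups.
        candidate = re.sub(r"[\s\-.]", "", line.translate(LOOKALIKES))
        match = CARD_PATTERN.search(candidate)
        if match:
            return match.group(0)
    return None


print(extract_card_number("9643 1OO2 0034 5678 9l2"))
```

Because the pattern is anchored on a known prefix and an exact length, stray characters around the number are ignored and a misread that breaks the format is rejected instead of being passed on.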
The ideal solution for you would have three components:
1) Detection of the card. This is useful because with detection the end users have a much easier time actually using the scanner: they can hold the phone above the card in an arbitrary orientation.
2) An accurate OCR component. Ideally, one that is customizable for the exact font used on the card and for the exact position of the number on the card.
3) A parsing mechanism. This would enable you to obtain the exact string written on the card without writing a huge amount of OCR parsing code.
BlinkID SDK has all this. It has a preset for detecting cards in the ID-1 format. It has an integrated OCR engine. And it provides RegexParser, where you can define the exact format of the text which you're trying to extract from the document.
BlinkID was initially built for scanning ID documents, which have very similar properties to the problem you're trying to solve.
Note. I'm one of the developers working on BlinkID.
My program can read several dozen file formats, using the traditional approach where I write procedural code for each file format. Most of these formats have their own unique loader library, their own bugs, their own limitations, and the whole thing is a huge time sink for me. I'd like to support a ton of other formats, but they're mostly not worth my time because they're not popular enough.
I'd like to replace my existing loaders with a single loader powered by a file format descriptor. I'm certain that someone has created software to learn file formats by example. My existing loaders would make excellent fitness functions for those formats, and I can write fitness functions for new formats too.
My question is, what software can I use to "learn" file formats by example, and how can I convert that "learning" into a descriptor for use with a generic loader?
Unless you limit the problem in some massive ways, I don't think you're likely to get very far. This would be ideal, but it is beyond the current state of the art. You cannot do this for arbitrary formats: for example, if I give you 200 JPGs, PNGs, BMPs and GIFs, it is highly unlikely that a learning system could learn the formats.
Here are some problems researchers have looked at:
Learning a regular expression from examples: see, for example, the question "Is it possible for a computer to "learn" a regular expression by user-provided examples?"
Information extraction: I give you a list of classified ads from the newspaper, for example apartments for rent. You need to extract the number of bedrooms, the rent, the deposit and the size of the unit. You can read more about it here: http://en.wikipedia.org/wiki/Information_extraction
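To make the information extraction task concrete, here is a small Python sketch with hand-written patterns for the apartment example. Nothing here is learned: the sample ad and the patterns are invented for illustration, and a learning system would have to induce rules like these from labeled examples.

```python
import re

# Invented sample ad, used only to show the shape of the task.
SAMPLE_AD = "Sunny 2 bedroom apartment, 850 sq ft, rent $1,450/month, deposit $500."

# Hand-written extraction rules, one per field (assumed wording of the ads).
FIELD_PATTERNS = {
    "bedrooms": re.compile(r"(\d+)\s*(?:bedroom|bed|br)\b", re.IGNORECASE),
    "rent": re.compile(r"rent\s*\$?(\d[\d,]*)", re.IGNORECASE),
    "deposit": re.compile(r"deposit\s*\$?(\d[\d,]*)", re.IGNORECASE),
    "size_sqft": re.compile(r"(\d[\d,]*)\s*sq\.?\s*ft", re.IGNORECASE),
}


def extract_fields(ad_text):
    """Return a dict of integer field values; a field is None if its pattern fails."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ad_text)
        # Remove thousands separators before converting to an integer.
        result[name] = int(match.group(1).replace(",", "")) if match else None
    return result


print(extract_fields(SAMPLE_AD))
```

Even this narrow domain needs one rule per field and breaks on unexpected wording, which gives a sense of how much harder inferring an arbitrary binary file format would be.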
I would like to know what technologies are out there for generating high-resolution, print-quality vector artwork from the web, and what the pros and cons of each are.
For example, a customer designs their own artwork for printing on some item using a combination of pre-defined images and free text. I want to output this in a 300dpi vector-based PDF that can be sent direct to a print production team who can then run it through the printer in whatever format without having to create the artwork.
Apologies if this has been asked, but I couldn't find an answer that clearly stated what technologies are there.
Many thanks
Ben
I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string so I don't have to type each one by hand. I have used GOCR to convert the images into text, but I want to use it with a Ruby script so I can automate the process and do the same for my classmates.
I can navigate to the site. How can I save the image to a file on my computer (running Ubuntu), convert the image with GOCR, and finally save the result to a file so I can access it again with my Ruby script?
GOCR seems to be a good choice at first, but from what I can tell from my own "research", its quality isn't quite sufficient for daily use. This could become a problem, depending on the image input. If it doesn't work out for you, try the "new" feature of Google Docs, which allows you to upload images for OCR. You can then retrieve the results using one of the Google API libraries (there are tons out there; I'm using gdata-ruby-util, which requires some hacking, though).
You could also use tesseract-ocr for the OCR part; it's also open source and in active development.
For the retrieval part, I would stick with hpricot as well; it is super-powerful and flexible.
Sounds like a cool project, and shouldn't be too hard if the ISBN images are stored in individual files.
This can all be run in the background:
download web page (net/http)
save metadata + image file for each book (paperclip)
run GOCR on all the images
All you need is a list of urls or a crawler (mechanize) and then you probably need to spend a few minutes writing a parser (see joe's post) for the university html pages.
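Whichever OCR tool is used, its output is worth checking before it is pasted into a store search, and an ISBN carries a check digit that makes this easy. Here is a minimal sketch of the OCR and validation steps, written in Python rather than Ruby (it should translate readily). It assumes the `gocr` command-line tool is on the PATH and can read the image format, and that each image shows only the number itself, with no "ISBN" label; the helper names are hypothetical.

```python
import re
import subprocess


def run_gocr(image_path):
    """Run the gocr command-line tool on one image and return its raw text output."""
    completed = subprocess.run(
        ["gocr", "-i", image_path], capture_output=True, text=True, check=True
    )
    return completed.stdout


def normalize_isbn(ocr_text):
    """Fix common letter/digit mix-ups, then keep only digits and X.

    Assumes the text contains just the number, with no "ISBN" label.
    """
    fixed = ocr_text.upper().replace("O", "0").replace("I", "1").replace("L", "1")
    return re.sub(r"[^0-9X]", "", fixed)


def is_valid_isbn(isbn):
    """Check the ISBN-13 or ISBN-10 check digit of an already normalized string."""
    if re.fullmatch(r"\d{13}", isbn):
        # ISBN-13: weights alternate 1, 3 and the total must be divisible by 10.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
        return total % 10 == 0
    if re.fullmatch(r"\d{9}[\dX]", isbn):
        # ISBN-10: weights run from 10 down to 1, X counts as 10, total divisible by 11.
        values = [10 if c == "X" else int(c) for c in isbn]
        total = sum(v * (10 - i) for i, v in enumerate(values))
        return total % 11 == 0
    return False


# Simulated OCR output with typical misreads (O for 0, l for 1).
isbn = normalize_isbn("978-0-3O6-4O6l5-7\n")
print(isbn, is_valid_isbn(isbn))
```

In the full pipeline, `run_gocr` would be called once per downloaded image, and any number that fails `is_valid_isbn` could be flagged for manual entry.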
I'm looking for an internal representation format for text which would support basic formatting (font face, size, weight, indentation, basic tables) and also the following features:
Bidirectional input (Hebrew, Arabic, etc.)
Multi-language input (i.e. UTF-8) in same text field
Anchored footnotes (i.e. a superscript number that's a link to that numbered footnote)
I guess TEI or DocBook are rich enough, but here's the snag -- I want these text buffers to be Web-editable, so I need either an edit control that eats TEI or DocBook, or reliable and two-way conversion between one of them and whatever the edit control can eat.
UPDATE: The edit control I'm thinking of is something like TinyMCE, but AFAICT, TinyMCE lacks footnotes, and I'm not sure about its scalability (how about editing 1 or 2 megabytes of text?)
Any pointers much appreciated!
FCKeditor has a great API, supports several programming languages (considering it is JavaScript, this isn't hard to achieve), and can be loaded through HTML or instantiated in code. Most of all, it allows easy access to the underlying form field, so having a jQuery or Prototype Ajax buffer shouldn't be terribly difficult to achieve.
The load time is very quick compared to previous versions. I'd give it a whirl.
In my experience a two-way conversion between HTML and XML formats like TEI or DocBook is very hard to make 100% reliable.
You could use Xopus (demo) to have your users directly edit TEI or DocBook XML. Xopus is a commercial browser based XML editor designed specifically for non-technical users. It supports bidi and UTF-8. The WYSIWYG view is rendered using XSLT, so that gives you sufficient control to render footnotes the way you describe.
As TEI and DocBook don't have means to store styling information, those formats will not allow your users to change font face, size and weight. But I think that is a good thing: users should insert headers and emphasis, designers should pick font face and size.
Xopus has a powerful table editor and indentation is handled by nesting sections or lists and XSLT reacting to that.
Unfortunately Xopus 3 will only scale to about 200KB of XML, but we're working on that.
I can't really decide on one of them. IMHO they are all not very good and complete. They all have their advantages and clear disadvantages. If TinyMCE is your favorite then afaik, it also does tables.
This list will probably come in handy: WysiwygEditorComparision.
I've also used FCKEditor and it performed well and was easy to integrate into my project. It's worth checking out.
Small correction to laurens' answer above: as of now (May 2012), Xopus supports UTF-8 but not BiDi editing. Right-to-left text is displayed fine if it came from another source, but it cannot be edited correctly.
Source: I was recently asked to evaluate this, so have been testing it.