Extracting ePub Excerpt - epub

I've read about the ePub format, standard, structure, readers, tools and available developer techniques to manipulate/convert/create ePubs but there is no such thing as a magical function (so far) to extract a particular length of characters to create an excerpt of the book. And that's precisely what I'm looking for: A way to extract the first X words of an ePub.
The first approach I'm considering (not my favorite, by the way) is to write a parser that reads the ePub metadata and then parses the XML files in the right order until I have enough words for the excerpt of a given ePub. (I would appreciate some feedback in this direction.)
The second way (which I haven't found so far) is an existing tool, function or parser (in any language) that returns the plain text of the ePub, so I can collect the first X words to create my excerpt.
Do you know about any tool which can help me achieve the second option?

You should have a look at Apache Tika: http://tika.apache.org/
You can use it from the command line, as a Java library, or even in server mode to extract text from an ePub.
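As a rough illustration of the command-line route, here is a minimal Python sketch that shells out to the Tika application jar and keeps the first X words. The jar location is a placeholder, and I am assuming the tika-app `--text` option (plain-text output on stdout) and a `java` binary on the PATH; check both against your Tika version.

```python
import subprocess


def build_tika_command(tika_jar, epub_path):
    # --text asks tika-app to print the plain-text content to stdout.
    # tika_jar is a placeholder path to a downloaded tika-app jar.
    return ["java", "-jar", tika_jar, "--text", epub_path]


def first_words(text, limit):
    # split() with no arguments collapses any run of whitespace,
    # so line breaks in the extracted text do not matter.
    return " ".join(text.split()[:limit])


def epub_excerpt(tika_jar, epub_path, limit):
    # Runs Tika and truncates its output; raises if Tika exits non-zero.
    result = subprocess.run(
        build_tika_command(tika_jar, epub_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return first_words(result.stdout, limit)
```

Note that Tika returns the text of the whole book, front matter included, so the excerpt may start with the title page or copyright notice.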
Hope this will help,
F.

Jose,
I'm not aware of any tool to do what you want. Let me comment on your first approach, though. If you do find a tool I hope these comments allow you to evaluate it.
I think your approach is fine and, if you want to do a good job of creating an extract, you may want to own this step anyway. I would suggest that you:
Grab the OPF file and look for a GUIDE section. If a GUIDE section exists, check the types that are given. Some are probably not relevant for an excerpt (cover, title-page, copyright-page). Many books will not have the types explicitly stated, but this should help where they do.
Now go through the files in sequence in the SPINE section, excluding anything that is irrelevant, and read through enough XHTML files to get your excerpt.
While in the OPF file, grab any metadata that is relevant for the excerpt (title, identifier and language are the mandatory elements in EPUB 2; creator and date are optional but usually present, and some authors also add other metadata such as keywords).
If you are creating a mini-EPUB with this excerpt you will need to pick up any CSS, Audio, Video, Image and Custom Font files that get referenced in the XHTML files used to make your excerpt. You may even choose to use the original cover file for the cover file of your excerpt epub.
If you are working with fixed-layout books with fun stuff like Read Aloud AND you want to create a mini-EPUB as an excerpt, you may be better off going with a page count rather than a word count. Don't forget to include any SMIL files in your excerpt. To make it look nice: (i) don't split a two-page spread and (ii) make sure that the first page is an odd-numbered page if it was odd in the original, or even if it was even in the original. To do this you may need to add a blank filler page (get the odd/even wrong and subsequent two-page spreads won't be facing each other).
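The word-count steps above (find the OPF, skip the GUIDE types that are not real content, walk the SPINE in order) can be sketched in Python with only the standard library. This is a minimal sketch under a few assumptions: an EPUB 2 style package, UTF-8 XHTML files, and a skip list of guide types that I picked for illustration. It does not handle encryption, fixed layout or the mini-EPUB packaging.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}
# Guide types treated as "not excerpt material" (an illustrative choice).
SKIP_GUIDE_TYPES = {"cover", "title-page", "copyright-page"}
IGNORED_TAGS = ("head", "script", "style")


class TextCollector(HTMLParser):
    """Collects visible words, ignoring <head>, <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def epub_excerpt(epub_path, limit):
    with zipfile.ZipFile(epub_path) as book:
        # META-INF/container.xml points at the OPF package file.
        container = ET.fromstring(book.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        package = ET.fromstring(book.read(opf_path))

        # Manifest maps item ids to hrefs; the spine gives reading order.
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        skipped = {
            ref.get("href", "").split("#")[0]
            for ref in package.findall("opf:guide/opf:reference", NS)
            if ref.get("type") in SKIP_GUIDE_TYPES
        }

        words = []
        for itemref in package.findall("opf:spine/opf:itemref", NS):
            href = manifest.get(itemref.get("idref"))
            if href is None or href in skipped:
                continue
            # Manifest hrefs are relative to the OPF file's directory.
            name = posixpath.normpath(posixpath.join(opf_dir, unquote(href)))
            collector = TextCollector()
            collector.feed(book.read(name).decode("utf-8", errors="replace"))
            words.extend(collector.words)
            if len(words) >= limit:
                break
        return " ".join(words[:limit])
```

Calling `epub_excerpt("book.epub", 300)` (with a file name of your own) returns the first 300 words in spine order. Books without a GUIDE section are simply read from the first spine item.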
I hope that helps.

Related

Parsing XPS or PDF and inserting data into a Word Template?

So, I have an option of sending a document from a database to print either in PDF or XPS. I need to be able to extract specific data, such as name, date, etc., from one of those formats and insert that data into a Word template. The Word template is not editable: you can only type within fields, and each field has a heading before it, such as name, dob, etc.
Basically I need to be able to automate transferring that information from the PDF or XPS file into the Word template.
I'm familiar enough with C++, Python and Java, so I have no language preference. Whatever gets the job done.
Could you suggest a way to accomplish this? I'm having a bit of difficulty figuring out how to parse/extract data from one of those file types, and which file type would be the better candidate. And I definitely have no idea how to automate the population of fields in the Word template.
Oh, and I forgot to mention: this is on Windows 7 (and maybe 8, but mostly 7) machines.
Thanks a lot for your help in advance!
For anyone who has the same sort of question, this is how I did it:
I used PDFBox (http://pdfbox.apache.org/) to parse the document and extract the needed data, and then I used docx4j (http://www.docx4java.org/trac/docx4j) to insert the data into the Word template. Both are incredible tools and have excellent communities that help out almost immediately.
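Whichever library does the text extraction, the step in the middle is turning the extracted text into named values. The following Python sketch shows one way that step might look; it is not PDFBox or docx4j code. It assumes (hypothetically) that the extracted text contains lines shaped like `Label: value`, which may not match your documents.

```python
import re


def extract_fields(text, labels):
    """Return {label: value} for lines shaped like 'Label: value'.

    The 'Label: value' layout is an assumption for illustration;
    adjust the pattern to whatever the extracted text really looks like.
    """
    fields = {}
    for label in labels:
        match = re.search(
            rf"^\s*{re.escape(label)}\s*:\s*(.+?)\s*$",
            text,
            flags=re.IGNORECASE | re.MULTILINE,
        )
        if match:
            fields[label] = match.group(1)
    return fields
```

The resulting dictionary is what you would then hand to the template-filling step, keyed by the same names the template fields use.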

EverNote OCR feature?

I downloaded the EverNote API Xcode Project but I have a question regarding the OCR feature. With their OCR service, can I take a picture and show the extracted text in a UILabel or does it not work like that?
Or is the extracted text not shown to me, and only used for the photo search function?
Has anyone ever had any experience with this or any ideas?
Thanks!
Yes, but it looks like it's going to be a bit of work.
When you get an EDAMResource that corresponds to an image, it has a property called recognition that returns an EDAMData object containing the XML that defines the recognition info. For example, I attached an image to a note, inspected the recognition info attached to the corresponding EDAMResource object, and found a large XML document (I put it on pastie.org, because it's too big to fit in an answer).
As you can see, there's a LOT of information here. The XML is defined in the API documentation, so this would be where you parse the XML and extract the relevant information yourself. Fortunately, the structure of the XML is quite simple (you could write a parser in a few minutes). The hard part will be to figure out what parts you want to use.
It doesn't really work like that. Evernote doesn't really do "OCR" in the pure sense of turning document images into coherent paragraphs of text.
Evernote's recognition XML (which you can retrieve via the technique that @DaveDeLong shows above) is most useful as an index to search against; the service will provide you sets of rectangles and sets of possible words/text fragments with probability scores attached. This makes a great basis for matching search terms, but a terrible one for constructing a single string that represents the document.
(I know this answer is like 4 years late, but Dave's excellent description doesn't really address this philosophical distinction that you'll run up against if you try to actually do what you were suggesting in the question.)
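To make the "rectangles plus weighted candidates" shape concrete, here is a small Python sketch that picks the highest-weighted candidate for each rectangle. The element and attribute names (`item` with `x`, `y`, `w`, `h`, and `t` children carrying a weight in `w`) reflect my understanding of the recognition XML and should be checked against the API documentation; the sample data in the usage note is made up.

```python
import xml.etree.ElementTree as ET


def best_guesses(reco_xml, min_weight=0):
    """Return [(box, text, weight)] with the top candidate per rectangle.

    box is (x, y, width, height). Element and attribute names are
    assumptions about the recognition XML, not taken from this thread.
    """
    results = []
    for item in ET.fromstring(reco_xml).iter("item"):
        candidates = [
            (int(t.get("w", "0")), (t.text or "").strip())
            for t in item.iter("t")
        ]
        if not candidates:
            continue
        # Keep only the candidate the service scored highest.
        weight, text = max(candidates, key=lambda c: c[0])
        if weight >= min_weight:
            box = tuple(int(item.get(k, "0")) for k in ("x", "y", "w", "h"))
            results.append((box, text, weight))
    return results
```

With a made-up document containing one rectangle whose candidates are "helo" (weight 41) and "hello" (weight 65), the function keeps "hello". Joining the winners into one string still gives only a rough guess at the text, which is exactly the limitation described above.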

ePub specifications clarification needed

I wanted some help with the ePub specification. Is it mandatory for toc.ncx to carry the src (i.e. the XHTML file) for each entry? I have noticed that the same content sources are also listed in the .opf file.
Yes, that is mandatory, and that's a design problem:
Overlap between NCX and OPF metadata
Because the NCX is borrowed from another standard, there is some
overlap between the information encoded in the NCX and that in the
OPF. This is rarely a problem when you generate EPUBs
programmatically, where the same code can output to two different
files. Take care to put the same information in both places, as
different EPUB readers might use the values from one or the other.
Source
That's no longer the case in the upcoming version of the standard (ePub 3), but please note that no device supports it currently.
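Since the same targets have to appear in both files, a quick consistency check can catch drift between them. The Python sketch below lists NCX `src` targets that have no matching `href` in the OPF manifest. It assumes the NCX and OPF files sit in the same directory, so their relative paths are directly comparable; the function name is my own.

```python
import xml.etree.ElementTree as ET

NCX = "{http://www.daisy.org/z3986/2005/ncx/}"
OPF = "{http://www.idpf.org/2007/opf}"


def ncx_targets_missing_from_manifest(ncx_xml, opf_xml):
    """Return sorted NCX content targets that the OPF manifest lacks.

    Assumes both files live in the same directory, so src and href
    values can be compared without resolving paths.
    """
    manifest = {
        item.get("href")
        for item in ET.fromstring(opf_xml).iter(OPF + "item")
    }
    targets = {
        # Drop the #fragment: the manifest lists files, not anchors.
        content.get("src", "").split("#")[0]
        for content in ET.fromstring(ncx_xml).iter(NCX + "content")
    }
    return sorted(targets - manifest)
```

An empty list means every navigation entry points at a file the manifest knows about.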

TeX: Add blank page after every content page

I'm currently writing my bachelor thesis and my university wants a one sided print. The printing and binding will be done by a professional print company. They only accept two sided manuscripts.
Because of that I need to add a blank page after every page of content. I don't want to do this manually using \newpage or \clearpage because there are too many pages. Is there any, maybe low level, TeX command or package to do this? Or can you suggest another tool that does this without breaking the PDF?
Thanks for your help!
One option you might look into is to use a double sided layout that allows separate formatting for the even vs. odd pages: e.g. the book class allows this. Then you will need to define the even pages to be blank (presumably you don't want headers printed, or the page count to increment).
An alternative (if you can't get this to look correct for what you need) would be to do the layout in single sided (so that page numbering, etc. is all taken care of), then have a separate latex document which includes the pages, one at a time (pdfpages may be a good package to do this properly), and then insert blank pages (with no headers/etc.) in-between. This may end up being more work, but if you have trouble with formatting, it may be the easier way to go.
I suspect that you'd be better off doing this by manipulating the output PDF, rather than changing the LaTeX.
For example, if you're able to print to a file on your platform, there might be options in the print dialogue to tweak this. Your PDF viewer may be able to arrange this, if only by inserting blanks every second page. Or there may be a GUI or command-line tool to do the reshuffling for you.
Having said that, I've no specific recommendations for what tool you could use. A quick look around suggests strongly that the pstops tool might be able to do something along these lines, but that only helps if you're generating your PDF from postscript.
So no recipe, I'm afraid, but this'll probably be a better direction to look.
(or, meta answer: find a different print shop, or phone again and hope you get someone who gives you a different answer!)
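For the "manipulate the output PDF" direction, one possibility is a short Python script using the third-party pypdf package, which is my suggestion and not a tool named in this thread. The sketch below copies each page and follows it with a blank page of the same size; it ignores page rotation and does not preserve bookmarks.

```python
def page_plan(page_count):
    """Return source page indexes, with None marking each inserted blank."""
    plan = []
    for index in range(page_count):
        plan.extend([index, None])
    return plan


def interleave_blank_pages(src_path, dst_path):
    # Imported here so page_plan stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    size = None
    for entry in page_plan(len(reader.pages)):
        if entry is None:
            # Blank page sized like the content page just before it.
            writer.add_blank_page(width=size[0], height=size[1])
        else:
            page = reader.pages[entry]
            size = (page.mediabox.width, page.mediabox.height)
            writer.add_page(page)
    with open(dst_path, "wb") as out:
        writer.write(out)
```

Something like `interleave_blank_pages("thesis.pdf", "thesis-print.pdf")` (file names are placeholders) then puts every content page on a front side when the result is printed double-sided.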

Setting up help for a Delphi app

What's the best way to set up help (specifically HTML Help) for a Delphi application? I can see several options, all of which have disadvantages. Specifically:
I could set HelpContext in the forms designer wherever appropriate, but then I'm stuck having to track numbers instead of symbolic constants.
I could set HelpContext programmatically. Then I can use symbolic constants, but I'd have more code to keep up with, and I couldn't easily check the text DFMs to see which forms still need help.
I could set HelpKeyword, but since that does a keyword lookup (like Application.HelpKeyword) rather than a topic jump (like Application.HelpJump), I'd have to make sure that each of my help pages has a unique, non-changing, top-level keyword; this seems like extra work. (And there are HelpKeyword-related VCL bugs like this and this.)
I could set HelpKeyword, set an Application.OnHelp handler to convert HelpKeyword requests to HelpJump requests so that I can assign help by topic ID instead of keyword lookup, and add code such as my own help viewer (based on HelpScribble's code) that fixes the VCL bugs and lets HelpJump work with anchors. By this point, though, I feel like I'm working against the VCL rather than with it.
Which approach did you choose for your app?
When I first started researching how to do this several years ago, I first got the "All About help files in Borland Delphi" tutorial from: http://www.ec-software.com/support_tutorials.html
In that document, the section "Preparing a help file for context sensitive help" (which in my version of the document starts on page 28) describes a nice numbering scheme you can use to organize your numbers into sections, e.g. starting with 100000 for your main form and continuing with 101000 or 110000 for each secondary form, etc.
But then I wanted to use descriptive string IDs instead of numbers for my Help topics. I started using THelpRouter, which is part of EC Software's free Help Suite at: http://www.ec-software.com/downloads_delphi.html
But then I settled on a Help tool that supports string IDs directly for topics (I use Dr. Explain: http://www.drexplain.com/), so now I simply use HelpJump, e.g.:
Application.HelpJump('UGQuickStart');
I hope that helps.
We use symbolic constants. Yes, it is a bit more work, but it pays off. Especially because some of our dialogs are dynamically built and sometimes require different help IDs.
I create the help file, which gets the help topic ID, and then go around the forms and set their HelpContext values to them. Since the level of maintenance needed is very low - the form is unlikely to change help file context unless something major happens - this works just fine.
We use Help&Manual. It's a wonderful tool, outputting almost any format you could want (doc, rtf, html, pdf), all from the same source. It will even read in or paste from RTF (e.g. MS Word). It uses topic IDs (strings), which I just keep a list of and manually put into each form (or class) as it suits me. It sounds difficult, but trust me, you'll spend far longer hating the wrong authoring tool. I spent years finding it!
Brian
