DOCBOOK to EPUB File Size - epub

Some eBook reading devices (like older Kindles) perform better with OEBPS/Text file sizes in the 350KB range. When you go over that, page load and scrolling can be a miserable user experience.
Question: If you have a large text, 4 MB for example, will the DocBook to EPUB publishing flow put it into OEBPS/Text as a monolithic 4 MB file, or will it split it into smaller files for you?
If it splits the file, does it repair the anchor IDs to reflect the new file name?
I couldn't find the answer to this at docbook.org.

Question: If you have a large text, 4 MB for example, will the DocBook to EPUB publishing flow put it into OEBPS/Text as a monolithic 4 MB file, or will it split it into smaller files for you?
The "DocBook to EPUB publishing flow" (DocBook XSL) will split the input XML into smaller output files.
This process is called "chunking" and is described in detail here: http://www.sagehill.net/docbookxsl/Chunking.html (this is a section from the book DocBook XSL: The Complete Guide).
If it splits the file, does it repair the anchor IDs to reflect the new file name?
I am not completely sure what you mean by "repair the anchor IDs", but the chunking process does ensure that cross-references and the entries that go into the *.opf and *.ncx files are correct.
EPUB is one of many output formats that can be created from DocBook sources. If you have never used DocBook XSL before, you should read "DocBook XSL: The Complete Guide" (see link above). This book does not cover EPUB output specifically (it was written before the EPUB stylesheets had been developed).
DocBook XSL provides stylesheets for both EPUB 2 and EPUB 3 (most of the effort goes into EPUB 3 these days):
README for EPUB 2: http://sourceforge.net/p/docbook/code/HEAD/tree/trunk/xsl/epub/README.
README for EPUB 3: http://sourceforge.net/p/docbook/code/HEAD/tree/trunk/xsl/epub3/README.

Best practices are to create separate HTML files for chapters (and sometimes sections).
As long as your DocBook source divides its content into chapter or section elements, you can use chunking to produce the results you want.
All the anchor IDs will work like a charm. Even the indices will work!
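To check whether chunking actually kept every content file under the size the question mentions, a minimal Python sketch like the following could be run against the generated EPUB. The function name `oversized_entries` and the 350 KB default are assumptions taken from the question, not part of DocBook XSL.

```python
import zipfile

def oversized_entries(epub_path, limit_bytes=350 * 1024):
    """Return (name, uncompressed size) for content documents above limit_bytes."""
    with zipfile.ZipFile(epub_path) as epub:
        return [
            (info.filename, info.file_size)
            for info in epub.infolist()
            # Only look at the (X)HTML content documents, not images or CSS.
            if info.filename.lower().endswith((".xhtml", ".html", ".htm"))
            and info.file_size > limit_bytes
        ]
```

An empty result means no chunk exceeds the limit; otherwise the listed files are candidates for chunking at a finer level (for example sections instead of chapters).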

Related

OpenXml get page number to which each paragraph belongs in a .docx file

I have a Word docx file and I want to retrieve all the paragraphs with OpenXml in C#.
I need to know:
1. The number of pages in the document.
2. The page number to which each paragraph belongs.
Can you show an example where the paragraphs of the document are read?
Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, a docx file does not contain reliable page number information. The XML files carry no page numbers until Microsoft Word opens the document and renders it dynamically. You will not find one even if you read the Open XML documentation, for example https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 .
You can check this yourself: unzip some docx files and search for "page" or "pg". I did this with different kinds of docx files, and they all showed the same thing. Glad if this helps.
A few months ago, I reworked a Python package called docx2python to do something similar. I produced a structured (leveled) XML file from a docx file. As far as I know, a paragraph contains several Runs, and each Run contains only one text. You can read this document to see how to do it; plain paragraphs are not hard to read: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1 . Glad if this helps.
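The paragraph/run structure described above can be sketched with only the Python standard library (this is not the C# Open XML SDK from the question, and `read_paragraphs` is a hypothetical helper name). It assumes the main body lives in word/document.xml and simply concatenates the text of each paragraph's runs; paragraphs nested inside text boxes may be reported twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:p, w:r and w:t elements.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_paragraphs(docx_path):
    """Return the plain text of every w:p paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_path) as docx:
        root = ET.fromstring(docx.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph holds runs (w:r); the visible text sits in w:t nodes.
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        paragraphs.append(text)
    return paragraphs
```

Note that nothing in this structure says which page a paragraph ends up on, which is the point the answer makes.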

Extracting structural data from ODP or ODF files

I'm trying to extract the information hierarchy within ODP (OpenDocument Presentation) files : Titles, subtitles, body text...
Do you know any tool or technique that would do the job?
Otherwise, is there a way to parse those ODP documents in order to extract styling information?
That way I could later deduce the document structure from its styling.
I'm afraid the structure of the XML file inside the ODP file could depend on the software or version that produced it, so I'd rather find a high-level solution than parse this XML file directly.
As I couldn't find any tool that could extract the outline, titles, text, and so on from presentation files, I created Exide, an open source API supporting ODP, PPTX and Beamer files. It enables:
Slide title extraction
Slide body text extraction
Named-entity recognition (inaccurate)
Emphasized text recognition
URLs recognition
Structure detection and outline generation
Recognition of the following slide types:
Introduction
Conclusion
Definition
Example
Table of contents
References
Section header
For more information, check out the github page of the project.

How to create a reflowable content from the PDF?

I am going to develop an application that is an EPUB. I have PDF files, and I need to turn them into reflowable content (EPUB); only then will they be viewable on mobiles, tablets, etc. Please suggest solutions for making reflowable content from the PDFs.
If you don't mind using an open source software, go with Sigil.
If you want to learn the innards of how to create one by hand, or with a tool of your own, follow this. (This is a one-month course, so you will not get all the content in one day, though.)
1. Create the folder structure. In a folder of your choice, create the following: META-INF (folder), OEBPS (folder), mimetype (a file with exactly that name).
2. Put application/epub+zip in the file mimetype. No spaces, no line breaks.
3. Convert your PDF to text format. In Adobe Acrobat, look under File > Export.
4. Read the content from the PDF; you will get an idea of how to split it into chapters or smaller reading topics. Split it according to your understanding of the book, and make that many text files.
5. Make the subfolder structure. Inside the OEBPS folder, create Images, Text, Styles (folders) and content.opf, toc.ncx (files).
6. Put all your split files in the Text folder created in step 5.
7. Put all images extracted from the PDF in the Images folder.
8. Put any styles (not described here) in the Styles folder.
9. In the META-INF folder created in step 1, create a file called container.xml and fill it with the following:
   <?xml version="1.0"?>
   <container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
     <rootfiles>
       <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
     </rootfiles>
   </container>
If you manage to do all of this, ping me again and I will try to tell you what you should put in the content.opf and toc.ncx files created in step 5.
As an example, you can use some of the samples from my site. Download from here and use them with caution. Do not distribute.
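The folder layout from the steps above could be created with a short Python sketch such as the one below. The function name `create_epub_skeleton` is hypothetical, and content.opf and toc.ncx are only created as empty placeholders, since the answer does not say what goes in them.

```python
from pathlib import Path

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def create_epub_skeleton(root):
    """Create the unzipped EPUB folder structure under root and return its Path."""
    root = Path(root)
    for sub in ("META-INF", "OEBPS/Images", "OEBPS/Text", "OEBPS/Styles"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # The mimetype file holds exactly this string: no spaces, no line breaks.
    (root / "mimetype").write_bytes(b"application/epub+zip")
    (root / "META-INF" / "container.xml").write_text(CONTAINER_XML, encoding="utf-8")
    # Empty placeholders; their content is not covered by the steps above.
    for name in ("content.opf", "toc.ncx"):
        (root / "OEBPS" / name).touch()
    return root
```

Calling `create_epub_skeleton("mybook")` leaves a directory that still has to be filled with the split text files, images and styles before it is zipped.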
We're opening up a beta for our web based pdf reflow viewer at the beginning of 2015. Feel free to sign up to be part of our beta test. More info here:
http://flexpaper.devaldi.com/reflow-pdf-documents.jsp

Search Words in pdf files

Is it possible to search "words" in pdf files with delphi?
I have code with which I can search in many others files like (exe, dll, txt) but it doesn't work with pdf files.
It depends on the structure of the specific PDF.
If the PDF is made of images (scanned pages), then you have to OCR each image and build a full-text index inside the PDF. (To see if it's image based, open it with Notepad and look for obj tags full of random chars.) There are a few utilities and apps that do this kind of work for you; CVision PDF Compressor is one that I have used before.
If the PDF is a standard PDF, then you should be able to open it like any other text file and search for the words.
Here is a page that details some of the structure of a PDF. This is an SO post on the same topic.
The components/libraries mentioned in the answer to this question should do what you need.
I'm just working on a project that does this. The method I use is to convert the PDF file to plain text (with pdftotext.exe) and create an index on the resulting text. We do the same with Word and other Office files, and it works pretty well!
Searching directly in PDF files from Delphi (without an external app) is more difficult, I think. If you find anything, please update here as I would also be very interested in that!
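The convert-then-index approach could look roughly like this in Python (used here for illustration instead of Delphi). It assumes the `pdftotext` command-line tool is installed and on the PATH; `pdf_to_text` and `build_index` are hypothetical helper names, and the index is a plain in-memory word-to-documents map rather than a real full-text engine.

```python
import re
import subprocess
from collections import defaultdict

def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to plain text by shelling out to pdftotext (assumed installed)."""
    subprocess.run(["pdftotext", pdf_path, txt_path], check=True)

def build_index(documents):
    """Map each lowercase word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index
```

After converting each PDF, the text files can be read into a dict of name to text and passed to `build_index`; looking up `index["word"]` then gives the files that contain that word.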
One option I have used is Microsoft's IFilter technology, which is used by Windows Desktop Search and many other products such as SharePoint and SQL Server full-text search.
It supports almost any office/office-like file format, even dwg, msg, pdf, and files in zip/rar archives.
The easiest way to use it is to run FiltDump.exe on any files you have, and index the text output.
To know about the filters installed on your PC, you can use ifilter explorer.
Wikipedia has some links on its ifilters page.
Quick PDF Library's GetPageText function can give you the words from a PDF as well as the page number and the co-ordinates of those words - sometimes useful for highlighting.
PDF is not just a binary representation. Think of it as a tree of objects, where an object node has some metadata and some content information. Some of these objects have string data, some don't. Some of these are even encrypted, and some are compressed. So, there's very little chance your string finder will work on any arbitrary PDF.
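To illustrate why a plain byte search often misses text, here is a naive Python sketch that also looks inside Flate-compressed streams. It is only a demonstration under strong assumptions: the file is not encrypted, streams are delimited by the literal `stream`/`endstream` keywords, and the text is stored as readable bytes (real PDFs often use font-specific encodings or split words across operators). `find_in_pdf_streams` is a hypothetical name, not a library function.

```python
import re
import zlib

# Naive stream locator; a real parser would follow the object's /Length entry.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def find_in_pdf_streams(pdf_bytes, needle):
    """Return True if needle occurs in the raw bytes or in a zlib-decodable stream."""
    if needle in pdf_bytes:
        return True
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or encrypted
        if needle in data:
            return True
    return False
```

A compressed content stream typically does not contain the searched word as raw bytes, which is why a generic string finder that works on exe, dll or txt files tends to come up empty on PDFs.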

Reading ePub format

I am trying to develop an iPhone application to read ePub files. Is there any framework available for this? I have no idea how to read this file format. I tried to parse a sample file with the .epub extension using NSXMLParser, but that fails.
The EPUB format brings together a bunch of different specifications / formats:
one to say what the content of the book should look like (a subset of XHTML 1.1 + CSS)
one to define a "manifest" that lists all of the files that make up that content (OPF, which is an XML file)
one to define how everything is packaged up (OCF: a zip file of everything in the manifest plus a few extra files)
The specs look a bit daunting but actually once you've got the basics (unzipping, parsing XML) down it's not particularly difficult or complex.
You'll need to work out how to download the EPUB, to unzip it somewhere, to parse the manifest and then to display the relevant content.
Some pointers if you're just starting out:
parse xml
unzip
To display content just use a UIWebView for now.
Here's a high level step by step for your code:
1) create a view with a UIWebView
2) download the EPUB file
3) unzip it to a subdirectory in your app's documents folder using the zip library, linked above
4) parse the XML file at META-INF/container.xml (if this file doesn't exist the EPUB is invalid) using TBXML, linked above
5) In this XML, find the first "rootfile" with media-type application/oebps-package+xml. This is the OPF file for the book.
6) parse the OPF file (also XML)
7) now you need to know what the first chapter of the book is.
a) each <item> in the <manifest> element has an id and an href. Store these in an NSDictionary where the key is the id and the object is the href.
b) Look at the first <itemref> in the <spine>. It has an idref attribute which corresponds to one of the ids in (a). Look up that id in the NSDictionary and you'll get an href.
c) this is the file of the first chapter to show the user. Work out what the full path is (hint: it's wherever you unzipped the zip file to in (3) plus the base directory of the OPF file in (6))
8) create an NSURL using fileURLWithPath:, where the path is the full path from (7c). Load this request using the UIWebView you created in (1).
You'll need to implement forward / backward buttons or swipes or something so that users can move from one chapter to another. Use the <spine> to work out which file to show next: the <itemref> elements in the XML are in the order they should appear to the reader.
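Steps 4 to 7 and the spine ordering can be sketched in Python (for illustration only; the answer itself targets Objective-C with TBXML and a zip library). The function name `reading_order` is hypothetical, and the sketch reads straight from the zip archive instead of unzipping to disk. It assumes hrefs are plain relative paths without URL escaping.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
OPF_NS = "{http://www.idpf.org/2007/opf}"

def reading_order(epub_path):
    """Return the archive paths of the content documents in spine order."""
    with zipfile.ZipFile(epub_path) as epub:
        # Step 4: META-INF/container.xml must exist (KeyError if it does not).
        container = ET.fromstring(epub.read("META-INF/container.xml"))
        # Step 5: the first rootfile with the OPF media type.
        rootfile = next(
            rf for rf in container.iter(CONTAINER_NS + "rootfile")
            if rf.get("media-type") == "application/oebps-package+xml"
        )
        opf_path = rootfile.get("full-path")
        # Step 6: parse the OPF file.
        opf = ET.fromstring(epub.read(opf_path))

    # Step 7a: manifest id -> href.
    manifest = {item.get("id"): item.get("href") for item in opf.iter(OPF_NS + "item")}
    # Step 7b/7c: resolve each spine itemref relative to the OPF's directory.
    base = posixpath.dirname(opf_path)
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.get("idref")]))
        for ref in opf.iter(OPF_NS + "itemref")
    ]
```

The first entry of the returned list is the first chapter to display; moving forward or backward is a matter of stepping through the list.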
Apparently EPUB is "just" an XML format, so if you have an XML parser and the spec it should be okay.
Plus a little tutorial? Have fun!
EDIT: you could also read some code here. It is for generating EPUB files, not reading them, but the code may be useful.
EDIT again: Also see the links to related questions in the right sidebar; some of the answers there link to free ebook readers that support ePub.
EDIT 3: You should add a comment when you edit your question so the people who answered can continue the discussion (if you don't comment, we're not notified of your edit).
So, the parsing fails because you didn't read the spec or the related questions on Stack Overflow: an *.epub file is a zipped folder containing XML file(s), not plain XML.
I read through this tutorial once (free registration required, sorry) and it gave me a great introduction to ePub. developerWorks tutorial here
I highly suggest you look at some of the XML processing libraries. If you just want to get specific information out of the XML file, then you can pick the right parsing strategy.
There is an open source project, FBReader,
which also supports the iPhone:
http://www.fbreader.org/about.php
I'm playing around with creating an EPUB framework for iPhone apps.
At the moment (I really just started) I can generate a title page with links to the chapters.
My approach is:
Use the QuickConnect iPhone framework as a layer (maybe I'll change to PhoneGap), which basically allows JavaScript apps to run as iPhone apps
Add the unzipped EPUB as a resource to the project
Parse the whole thing with a customized version of epub.js (somewhere on Google Code)
Right now I'm looking into page flipping, some kind of GUI, and minor usability issues (saving the page currently being viewed).
I hope that gives you an idea of how to start.
Jonathan Wight (schwa) has developed an ObjC solution for parsing and displaying ePub documents on the iPhone. It's part of his TouchCode open source repository.

Resources