Extracting structural data from ODP or ODF files - parsing

I'm trying to extract the information hierarchy within ODP (OpenDocument Presentation) files: titles, subtitles, body text, and so on.
Do you know of any tool or technique that would do the job?
Otherwise, is there a way to parse those ODP documents in order to extract styling information,
so that I can later deduce the document structure from its styling?
I'm afraid the structure of the XML file inside the ODP file could depend on the software or version that produced it, so I'd rather find a high-level solution than parse this XML file directly.

Since I couldn't find any tool that extracts the outline, titles, text, and so on from presentation files, I created Exide, an open source API supporting ODP, PPTX and Beamer files. It enables:
Slide title extraction
Slide body text extraction
Named-entity recognition (inaccurate)
Emphasized text recognition
URLs recognition
Structure detection and outline generation
Recognition of the following slide types:
Introduction
Conclusion
Definition
Example
Table of contents
References
Section header
For more information, check out the GitHub page of the project.
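To give an idea of what a direct parse involves, here is a minimal Python sketch (it is not Exide's implementation). It assumes the ODP is a zip package with its slides in content.xml, and that the authoring tool tagged placeholder frames with the standard presentation:class attribute; frames without a "title" or "subtitle" tag are simply treated as body text. The function name extract_slides is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
NS = {
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "presentation": "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
PRESENTATION_CLASS = "{%s}class" % NS["presentation"]
TEXT_P = "{%s}p" % NS["text"]


def extract_slides(odp_file):
    """Return one dict per slide with 'title', 'subtitle' and 'body' keys.

    Hypothetical helper: odp_file is a path or a file-like object.
    """
    with zipfile.ZipFile(odp_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    slides = []
    for page in root.iterfind(".//draw:page", NS):
        slide = {"title": None, "subtitle": None, "body": []}
        for frame in page.iterfind(".//draw:frame", NS):
            # Collect the text of every paragraph inside the frame.
            lines = []
            for paragraph in frame.iter(TEXT_P):
                line = "".join(paragraph.itertext()).strip()
                if line:
                    lines.append(line)
            if not lines:
                continue
            role = frame.get(PRESENTATION_CLASS)
            if role in ("title", "subtitle"):
                slide[role] = " ".join(lines)
            else:
                # Assumption: everything else counts as body text.
                slide["body"].extend(lines)
        slides.append(slide)
    return slides
```

Calling extract_slides("talk.odp") (a hypothetical file name) would return a list of dictionaries, one per slide, in document order.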

Related

Apache Tika round-trip: Rebuild document using extracted text

Is it possible to use Apache Tika to extract text, modify the extraction and then inject it back into the original document?
I imagine it could be possible by modifying the parser code so that it could be run again to reinsert text instead of extracting it, but is there an existing feature that does this?
Useful for document translation.

DOCBOOK to EPUB File Size

Some eBook reading devices (like older Kindles) perform better with OEBPS/Text file sizes in the 350KB range. When you go over that, page load and scrolling can be a miserable user experience.
Question: If you have a large text, 4 MB for example, will the DocBook to EPUB publishing flow put it into OEBPS/Text as a monolithic 4 MB file, or will it split it into smaller files for you?
If it splits the file, does it repair the anchor IDs to reflect the new file name?
I couldn't find the answer to this at docbook.org.
Question: If you have a large text, 4 MB for example, will the DocBook to EPUB publishing flow put it into OEBPS/Text as a monolithic 4 MB file, or will it split it into smaller files for you?
The "DocBook to EPUB publishing flow" (DocBook XSL) will split the input XML into smaller output files.
This process is called "chunking" and is described in detail here: http://www.sagehill.net/docbookxsl/Chunking.html (this is a section from the book DocBook XSL: The Complete Guide).
If it splits the file, does it repair the anchor IDs to reflect the new file name?
I am not completely sure what you mean by "repair the anchor IDs", but the chunking process does ensure that cross-references and the entries that go into the *.opf and *.ncx files are correct.
EPUB is one of many output formats that can be created from DocBook sources. If you have never used DocBook XSL before, you should read "DocBook XSL: The Complete Guide" (see link above). This book does not cover EPUB output specifically (it was written before the EPUB stylesheets had been developed).
DocBook XSL provides stylesheets for both EPUB 2 and EPUB 3 (most of the effort goes into EPUB 3 these days):
README for EPUB 2: http://sourceforge.net/p/docbook/code/HEAD/tree/trunk/xsl/epub/README.
README for EPUB 3: http://sourceforge.net/p/docbook/code/HEAD/tree/trunk/xsl/epub3/README.
Best practices are to create separate HTML files for chapters (and sometimes sections).
As long as your file separates things into one of these elements, you can use chunking to produce the results you want.
All the anchor IDs will work like a charm. Even the indices will work!

Using UITextView and TextKit to display .docx

I'm writing an app in which I'd like to be able to open, edit, and write .docx files. I've read documentation and tutorials on TextKit and understand how attributes can be applied to text, and I know that a .docx is basically a .zip containing some XML. However, I'm at a loss as to how to read a .docx in Objective-C, place it into a UITextView, and then write it back again. Any ideas on how to go about this?
You could try http://libopc.codeplex.com/ which, per that link, is an ISO/IEC 29500 standard-conformant, cross-platform, open source, standard C99-based implementation of Part II (OPC) and Part III (MCE) of the ISO/IEC 29500 specification (OOXML).
UITextView only supports text, so you probably want to implement your own UIView subclass to display the docx file.
1) Unzip the Open Packaging Conventions package (the .docx file).
2) Implement a view that conforms to the markup languages you want to support that you found in step 1.
The 7,000 page document describing the Office Open XML standard (ECMA-376) is here: http://www.ecma-international.org/publications/standards/Ecma-376.htm
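As a language-neutral illustration of the unzip-and-parse step, here is a minimal Python sketch. It assumes the main document part is at word/document.xml (the usual location, although strictly it is found through the package relationships) and reads only text runs and their bold flag, which is the kind of information that would later map onto text attributes. The function name read_docx_paragraphs is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace from ECMA-376.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = "{%s}p" % W      # paragraph
W_R = "{%s}r" % W      # run of text sharing the same formatting
W_T = "{%s}t" % W      # text inside a run
W_RPR = "{%s}rPr" % W  # run properties
W_B = "{%s}b" % W      # bold toggle
W_VAL = "{%s}val" % W


def read_docx_paragraphs(docx_file):
    """Return a list of paragraphs; each is a list of (text, is_bold) runs.

    Assumption: the main part lives at word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W_P):
        runs = []
        for run in paragraph.iter(W_R):
            text = "".join(t.text or "" for t in run.iter(W_T))
            if not text:
                continue
            bold_el = run.find("%s/%s" % (W_RPR, W_B))
            # <w:b/> turns bold on; w:val="0" or "false" turns it off.
            is_bold = bold_el is not None and bold_el.get(W_VAL) not in ("0", "false")
            runs.append((text, is_bold))
        paragraphs.append(runs)
    return paragraphs
```

Each (text, is_bold) pair corresponds to one run, so a renderer could turn it into one attributed range. Writing the file back would mean regenerating document.xml and re-zipping the package, which this sketch does not attempt.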

Search Words in pdf files

Is it possible to search for words in PDF files with Delphi?
I have code that can search in many other file types (exe, dll, txt), but it doesn't work with PDF files.
It depends on the structure of the specific PDF.
If the PDF is made of images (scanned pages), then you have to OCR each image and build a full-text index inside the PDF. (To see if it's image-based, open it with Notepad and look for obj tags full of random characters.) There are a few utilities and apps that do this kind of work for you; CVision PDF Compressor is one that I have used before.
If the PDF is a standard text-based PDF whose content is not compressed or encrypted, then you should be able to open it like any other text file and search for the words.
Here is a page that details some of the structure of a PDF. This is an SO post on the same topic.
The components/libraries mentioned in the answer to this question should do what you need.
I'm currently working on a project that does this. The method I use is to convert the PDF file to plain text (with pdftotext.exe) and create an index on the resulting text. We do the same with Word and other office files, and it works pretty well!
Searching directly in PDF files from Delphi (without an external app) is more difficult, I think. If you find anything, please update here, as I would also be very interested in that!
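The convert-then-index approach can be sketched in Python as follows. It assumes the pdftotext command-line tool is on the PATH; the function names and the very simple word index are illustrative only, not what the project above uses.

```python
import re
import subprocess
from collections import defaultdict


def pdf_to_text(pdf_path):
    """Convert one PDF to plain text by calling the external pdftotext tool.

    Assumption: pdftotext is installed; "-" asks it to write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_index(texts):
    """Map each lower-cased word to the set of document names containing it.

    texts is a dict of {document name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, word):
    """Return the sorted names of the documents that contain the word."""
    return sorted(index.get(word.lower(), ()))
```

A typical use would be to call pdf_to_text on every file once, pass the results to build_index, and then answer queries with search without touching the PDFs again.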
One option I have used is Microsoft's IFilter technology, which is used by Windows Desktop Search and many other products such as SharePoint and SQL Server full-text search.
It supports almost any office or office-like file format, even dwg, msg, pdf, and files in zip/rar archives.
The easiest way to use it is to run FiltDump.exe on any files you have, and index the text output.
To find out which filters are installed on your PC, you can use IFilter Explorer.
Wikipedia has some links on its ifilters page.
Quick PDF Library's GetPageText function can give you the words from a PDF as well as the page number and the co-ordinates of those words - sometimes useful for highlighting.
PDF is not just a binary representation. Think of it as a tree of objects, where an object node has some metadata and some content information. Some of these objects have string data, some don't. Some of these are even encrypted, and some are compressed. So, there's very little chance your string finder will work on any arbitrary PDF.
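A small Python sketch makes the point about compression concrete. It assumes unencrypted streams compressed with the common FlateDecode (zlib) filter, and it ignores font encodings, so it is a demonstration rather than a real PDF text searcher. The name stream_contains is made up for this example.

```python
import re
import zlib

# Crude pattern for the body of each "stream ... endstream" object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def stream_contains(pdf_bytes, needle):
    """Return True if needle occurs in any (possibly deflated) stream.

    Assumptions: no encryption, and only zlib/FlateDecode compression.
    """
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not a zlib stream: search the bytes as they are
        if needle in data:
            return True
    return False
```

A plain byte search over the file would miss any word that sits inside a compressed stream, while this function finds it after inflating the stream. Real documents add further obstacles (encryption, other filters, custom font encodings), which is why a dedicated library or converter is the safer route.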

Reading ePub format

I am trying to develop an iPhone application to read ePub files. Is there any framework available for this? I have no idea how to read this file format. I tried to parse a sample file with the .epub extension using NSXMLParser, but that fails.
The EPUB format brings together a bunch of different specifications / formats:
one to say what the content of the book should look like (a subset of XHTML 1.1 + CSS)
one to define a "manifest" that lists all of the files that make up that content (OPF, which is an XML file)
one to define how everything is packaged up (OCF: a zip file of everything in the manifest plus a few extra files)
The specs look a bit daunting but actually once you've got the basics (unzipping, parsing XML) down it's not particularly difficult or complex.
You'll need to work out how to download the EPUB, to unzip it somewhere, to parse the manifest and then to display the relevant content.
Some pointers if you're just starting out:
parse xml
unzip
To display content just use a UIWebView for now.
Here's a high level step by step for your code:
1) create a view with a UIWebView
2) download the EPUB file
3) unzip it to a subdirectory in your app's documents folder using the zip library, linked above
4) parse the XML file at META-INF/container.xml (if this file doesn't exist the EPUB is invalid) using TBXML, linked above
5) In this XML, find the first "rootfile" with media-type application/oebps-package+xml. This is the OPF file for the book.
6) parse the OPF file (also XML)
7) now you need to know what the first chapter of the book is.
a) each <item> in the <manifest> element has an id and an href. Store these in an NSDictionary where the key is the id and the object is the href.
b) Look at the first <itemref> in the <spine>. It has an idref attribute which corresponds to one of the ids in (a). Look up that id in the NSDictionary and you'll get an href.
c) this is the file of the first chapter to show the user. Work out what the full path is (hint: it's wherever you unzipped the zip file to in (3) plus the base directory of the OPF file in (6))
8) create an NSURL using fileURLWithPath:, where the path is the full path from (7c). Load this request using the UIWebView you created in (1).
You'll need to implement forward / backward buttons or swipes or something so that users can move from one chapter to another. Use the <spine> to work out which file to show next: the <itemref> elements in the XML are in the order they should appear to the reader.
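The manifest and spine lookup in steps 4 to 7 can be sketched in Python. This version reads straight from the zip instead of unzipping to disk, ignores percent-encoding in hrefs, and uses a hypothetical function name, reading_order; treat it as an outline of the logic rather than a finished reader.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
OPF_MEDIA_TYPE = "application/oebps-package+xml"


def reading_order(epub_file):
    """Return the content files of an EPUB in spine (reading) order.

    Paths are relative to the root of the zip. epub_file is a path or a
    file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        # Steps 4 and 5: container.xml points at the OPF package file.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = None
        for rootfile in container.iterfind(".//c:rootfile", CONTAINER_NS):
            if rootfile.get("media-type") == OPF_MEDIA_TYPE:
                opf_path = rootfile.get("full-path")
                break
        if opf_path is None:
            raise ValueError("no OPF rootfile in META-INF/container.xml")
        # Step 6: parse the OPF file.
        package = ET.fromstring(archive.read(opf_path))

    # Step 7a: manifest id -> href.
    manifest = {
        item.get("id"): item.get("href")
        for item in package.iterfind("opf:manifest/opf:item", OPF_NS)
    }
    # Steps 7b and 7c: resolve each spine itemref against the OPF directory.
    base_dir = posixpath.dirname(opf_path)
    return [
        posixpath.join(base_dir, manifest[itemref.get("idref")])
        for itemref in package.iterfind("opf:spine/opf:itemref", OPF_NS)
    ]
```

The first element of the returned list is the first chapter to display, and stepping through the list gives the next and previous chapters for navigation.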
Apparently EPUB is "just" an XML format, so if you have an xml parser and the spec it should be okay.
Plus a little tutorial? Have fun!
EDIT: you could also read some code here; it is for generating EPUB files, not reading them, but the code may be useful.
EDIT again: Also see the links to related questions in the right sidebar; some of the answers link to free ebook readers that support ePub.
EDIT 3: You should add a comment when you edit your question so that people who answered can continue the discussion (if you don't comment, we're not notified of your edit).
So, the parsing fails because you didn't read the spec or the related questions on Stack Overflow: an *.epub file is a zipped folder containing XML file(s), not plain XML.
I read through this tutorial once (free registration required, sorry) and it gave me a great introduction to ePub. developerWorks tutorial here
I highly suggest you look at some of the XML processing libraries. If you just want to get specific information out of the XML file, then you can pick the right parsing strategy.
There is an open source project, FBReader, which also supports the iPhone:
http://www.fbreader.org/about.php
I'm playing around with creating an EPUB framework for iPhone apps.
At the moment (I really just started) I can generate a title page with links to the chapters.
My approach is:
Use the QuickConnect iPhone framework as a layer (maybe I'll change to PhoneGap), which basically allows JavaScript apps to run as iPhone apps
Add the unzipped EPUB as a resource to the project
Parse the whole thing with a customized version of epub.js (somewhere on Google Code)
Right now I'm looking into pageflip, some kind of GUI, and minor usability issues (such as saving the page currently being viewed).
I hope that gives you an idea of how to start.
Jonathan Wight (schwa) has developed an Objective-C solution for parsing and displaying ePub documents on the iPhone. It's part of his TouchCode open source repository.

Resources