Apache Tika round-trip: Rebuild document using extracted text - apache-tika

Is it possible to use Apache Tika to extract text, modify the extracted text, and then inject it back into the original document?
I imagine it could be done by modifying the parser code so that it can be run again to reinsert text instead of extracting it, but is there already a feature that does this?
This would be useful for document translation.

Related

Extracting structural data from ODP or ODF files

I'm trying to extract the information hierarchy within ODP (OpenDocument Presentation) files: titles, subtitles, body text...
Do you know any tool or technique that would do the job?
Otherwise, is there a way to parse those ODP documents in order to extract styling information,
so that I can later deduce the document structure from its styling?
I'm afraid the structure of the XML file inside the ODP file could depend on the software or version that produced it, so I'd rather find a high-level solution than parse this XML file directly.
As I couldn't find any tool that could extract the outline, titles, text and so on from presentation files, I created Exide, an open source API supporting ODP, PPTX and Beamer files. It enables:
Slide title extraction
Slide body text extraction
Named-entity recognition (inaccurate)
Emphasized text recognition
URL recognition
Structure detection and outline generation
Recognition of the following slide types:
Introduction
Conclusion
Definition
Example
Table of contents
References
Section header
For more information, check out the GitHub page of the project.
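For anyone who only needs slide titles and body text and wants to see what the raw file looks like, here is a minimal sketch in Python using only the standard library. It is not how Exide works. It assumes the ODP stores its slides in content.xml as draw:page elements, and that the authoring tool tagged each placeholder frame with a presentation:class attribute ("title", "subtitle" or "outline"), as the OpenDocument format allows. Frames without that attribute are ignored, so output will vary between authoring tools. The function name extract_slides is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml
NS = {
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "presentation": "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0",
}
PAGE = "{%s}page" % NS["draw"]
FRAME = "{%s}frame" % NS["draw"]
PARA = "{%s}p" % NS["text"]
PRES_CLASS = "{%s}class" % NS["presentation"]


def extract_slides(odp_path):
    """Return one dict per slide: {"title": str, "body": [str, ...]}.

    Assumption: placeholder frames carry presentation:class.
    Free-floating text boxes (no class) are skipped.
    """
    # An ODP file is a zip archive; the slide content lives in content.xml
    with zipfile.ZipFile(odp_path) as odp:
        root = ET.fromstring(odp.read("content.xml"))

    slides = []
    for page in root.iter(PAGE):
        slide = {"title": "", "body": []}
        for frame in page.iter(FRAME):
            role = frame.get(PRES_CLASS)
            # Collect the text of every non-empty paragraph in the frame
            paragraphs = ["".join(p.itertext()).strip() for p in frame.iter(PARA)]
            paragraphs = [p for p in paragraphs if p]
            if role == "title":
                slide["title"] = " ".join(paragraphs)
            elif role in ("subtitle", "outline"):
                slide["body"].extend(paragraphs)
        slides.append(slide)
    return slides
```

Calling extract_slides("talk.odp") (a hypothetical file name) returns the slides in document order. Deducing slide types or a full outline from this, as Exide does, would need heuristics on top.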

document language meta not included by default

I'm playing with Apache Tika (1.13) and noticed that the language tag is not included for any of the documents that I run through tika-app --metadata.
What is the proper way to include/force language detection for all documents? Is it possible to do this through configuration, or do I have to add a new parser that adds this metadata, or override an existing parser in the chain?
Thanks!

Get text from doc/docx file in pages using Apache tika

I am using the Apache Tika command line tool to extract text from doc and docx files. I can get the whole text, but I am unable to get it split into pages so that I can store each page separately. Is there any way to achieve that?
Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
Since POI (fundamentally) cannot read out those page numbers and Tika is not meant to be a document renderer either, the answer is very simply: No, this is not possible.
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.

How to read pdf and extract text from pdf in symfony1.1?

I am working on Symfony-1.1 in an existing project. How can I read pdf files and extract text from them?
It's not a Symfony 1.1 related question, actually. It's a PHP one. There are several libraries to handle PDFs in PHP. Here are some suggestions.
https://github.com/smalot/pdfparser
http://pastebin.com/dvwySU1a
http://www.pdflib.com/
If you just need to parse the PDF in any way and then process the text in PHP, you can also consider using a Java library like the following.
http://pdfbox.apache.org/ (Is there a PDF parser for PHP?)

SVG files in Raphael, can they be used?

I have an SVG file that I would like to display via Raphael (each SVG file is a node in a tree I'm trying to draw; the actual connections of the tree will be drawn by Raphael). I tried something like:
var vector_image = paper.image("test.svg", 50,50,50,50);
but no dice. It seems only "real" image files like PNG or JPEG are accepted? I find this very strange, as Raphael itself uses Scalable Vector Graphics.
Is there any way (short of parsing the SVG files into JavaScript snippets and pasting them into the HTML document) to display existing SVG files using Raphael (or any other vector-based JavaScript graphics engine)?
If it has to be parsing, is there an easy way to do this, short of manually scraping the files? I'm running this code on a Ruby on Rails server, so I'd like to avoid solutions outside this framework, if possible (I've heard of one PHP solution through this site... I'd rather code by hand than add another language to this project).
-Jenny
It's currently not possible to display existing SVG with Raphael, and there are apparently no plans for the implementation of SVG editing (see this forum post).
As for alternative JavaScript libraries, a newer alternative is Snap.svg, which can load external SVG files via its Snap.load() function.
