Apache Tika Office to PDF conversion

I am trying to convert Office files to PDF using POI and iText. I can do a basic conversion by reading the Word file with WordExtractor and writing its contents to a PDF file with a PDF writer.
However, this does not retain the structure (tables, styles, etc.). I read on a forum that Tika can retain the formatting. Are there any working examples of this?

Related

Is it possible to read non-text files into a Google Dataflow pipeline?

I would like to read PDF files into the pipeline. However, I haven't found any Apache Beam example for file formats other than plain text or XML.
There is no pre-existing PDF reader available in Dataflow or Apache Beam libraries. However, you could use the example of this reader for TensorFlow records as a model to write your own using the PDF parsing library of your choice.
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java

How to read PDF files and extract text from them in Symfony 1.1?

I am working with Symfony 1.1 in an existing project. How can I read PDF files and extract text from them?
This isn't really a Symfony 1.1 question; it's a PHP one. There are several libraries for handling PDFs in PHP. Here are some suggestions:
https://github.com/smalot/pdfparser
http://pastebin.com/dvwySU1a
http://www.pdflib.com/
If you just need to parse the PDF in any way and then process the text in PHP, you can also consider a Java library such as the following.
http://pdfbox.apache.org/ (Is there a PDF parser for PHP?)

Apache Tika alternatives for iOS

I know that Apache Tika is a text extractor. It can extract text from doc, pdf, ppt and lots of other file formats. I need this functionality on iOS, so I would like to know whether there is an alternative to Apache Tika for iOS.
If there is no such library for iOS, please suggest tools that can extract text from specific file formats.
Thank you in advance.
libopc for extracting text from docx, xlsx, pptx.
Antiword for older MS formats.
You can also extract strings from a PDF using CoreGraphics or PDFiPhone.
If you also need to extract text from an HTML document, have a look at NSXMLParser.

Grabbing text from various document formats in Ruby on Rails

I'm new to Rails but am developing a web app that needs to take text from a large collection of document files and display it as HTML. The files are in .doc, .docx, .wps, and .pages formats, and are currently just sitting on a hard drive. There are few enough .wps and .pages files that I could convert them to .doc manually, but the question remains: how do I get at the text inside a .doc or .docx file so that I can save it into a SQLite database for later use?
Thanks!
Take a look at Yomu. It's a gem that acts as a wrapper for Apache Tika, and it supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWork Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)
It's a roundabout way, but OpenOffice can convert files, and there are programmatic ways to do that: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/
That may not be the best way yet, but maybe it will grease the wheels a bit.
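For .docx files specifically, the text can be reached without any conversion, because a .docx is a ZIP archive whose body lives in word/document.xml. The following is a minimal sketch in Python, chosen only for illustration; the function name docx_paragraphs is hypothetical, and this is not a description of how Yomu or Tika work internally. It does not handle the older binary .doc format, and it ignores tabs, line breaks, headers and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; each run holds <w:t> nodes.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

The returned list of strings could then be joined and stored in a text column of the database; paragraphs inside table cells are included in document order.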

Opening a Word (.doc) Document from a stream

I have a bunch of Word documents (.doc) stored in my SQL database that I need to open, strip of properties such as Title and Subject, and then save back to the database.
Is it even possible to open a ".doc" file from a stream?
Word is not able to open .doc files from a stream in memory. To open the file you would have to save the document to a temporary location first.
However, Word's little-known RTF converter interface can be used to load documents from streams in RTF format. If using RTF instead of the binary format is an option for you [1], you might want to have a look at the WinWord Converter SDK:
How to Obtain the WinWord Converter SDK (GC1039)
For an import converter you would have to implement the ForeignToRtf method that will be called by Word to receive the RTF input.
[1] Actually, you can still save the files in the .doc format; however, you would have to convert the .doc file to RTF first using the SDK and then open the RTF stream in Word. The conversion from the binary format to RTF and vice versa should be mostly lossless, as the RTF format has been developed in sync with the binary format. However, bear in mind that using the RTF converter interface will not allow you to use any of the new features introduced with OpenXML/Office 2010.
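The temporary-file route mentioned at the start of this answer can be sketched as follows. This is a minimal illustration in Python using only the standard library; the helper name as_temp_file is hypothetical, and the step that actually opens the file in Word (for example through automation) is deliberately left out because it depends on the environment.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_temp_file(data, suffix=".doc"):
    """Write `data` (bytes) to a temporary file and yield its path.

    The file is closed before the path is yielded, so another program
    can open it, and it is deleted when the context exits.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        os.remove(path)
```

Inside the `with as_temp_file(blob) as path:` block, where `blob` stands in for the bytes read from the database column, the document would be opened from `path`, cleaned and saved, and the updated bytes read back from `path` before the block exits.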
I'm pretty sure the Word Document object implements IPersistStream (the COM interface). I know for certain that it implements IPersistFile.
It's not the easiest thing to work with, and since it's COM, it doesn't interoperate well with .NET streams, but I believe it would be doable using IPersistStream.

Resources