I know that Apache Tika is a text extractor that can pull text from doc, pdf, ppt, and many other file formats. I now need this functionality on iOS. Is there an alternative to Apache Tika for iOS?
If no such library exists for iOS, please suggest tools that can extract text from specific file formats.
Thank you in advance.
libopc for extracting text from docx, xlsx, pptx.
Antiword for older MS formats.
You can also extract strings from a PDF using CoreGraphics, or using PDFiPhone.
If you also need to extract text from an HTML document, have a look at NSXMLParser.
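NSXMLParser is an event-driven (SAX-style) parser: it calls you back for each start tag, end tag, and run of character data, and you collect the text yourself. As a rough illustration of that pattern, here is a sketch in Python using the standard library's `html.parser`, not iOS code. The class and function names (`TextExtractor`, `html_to_text`) are made up for the example, and joining chunks with single spaces is a simplification.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data, ignoring the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text pieces in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(markup: str) -> str:
    """Feed the markup through the event-driven parser and return the text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

The same shape carries over to an NSXMLParser delegate: keep a flag for elements you want to skip and append to a buffer whenever character data arrives.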
Related
I am working with Symfony 1.1 on an existing project. How can I read PDF files and extract text from them?
It's not really a Symfony 1.1 question; it's a PHP one. There are several libraries for handling PDFs in PHP. Here are some suggestions.
https://github.com/smalot/pdfparser
http://pastebin.com/dvwySU1a
http://www.pdflib.com/
If you just need to parse a PDF in any way and then process the text in PHP, you could also consider a Java library such as the following.
http://pdfbox.apache.org/ (Is there a PDF parser for PHP?)
Is it possible to convert a .doc file to a PDF file programmatically, without using the Word application or third-party tools, preferably in Delphi XE4? If so, how?
Yes, you can convert .doc/.docx files to .pdf without Word and without third-party controls. The specifications are publicly available - [simply] read and parse the .doc/.docx file according to its specification and generate the output according to the .pdf specification.
Here are the specifications for the MS-DOC (.doc) file format:
MS-DOC Specification (622 pages) -- Word97 through 2007
MS-DOCX Extensions Specification (105 pages) -- Word2010 through 2013
See also - Open Document and OpenXML Format
And the specification for the .pdf format:
PDF Reference (1310 pages)
Really, I think you'll find you probably want to use a third party component...
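To give a feel for the "generate the content according to the .pdf specification" half of that approach, here is a minimal sketch in Python rather than Delphi. It writes a one-page PDF 1.4 file containing a single line of text in the standard Helvetica font. The function name `build_minimal_pdf` is my own, the text is assumed to fit in Latin-1, and nothing here deals with layout, styles, tables, or images, which is where the real effort goes.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Return the bytes of a one-page PDF 1.4 file showing `text` on one line."""
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: select font F1 at 12pt, move to (72, 720), show the text.
    stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # Indirect objects 1..5: catalog, page tree, page, content stream, font.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []  # byte offset of each object, needed for the xref table
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Even this much is only the output side for plain text; reading the binary .doc structures is a far bigger job, which is why a third-party component usually wins.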
I am trying to convert office files to PDF using POI and iText. I am able to do the basic conversion where I read the word file using WordExtractor and write the contents to PDF file using PDF writer.
However, this does not retain the structure (tables, styles, etc.). I read on this forum that the formatting can be retained using Tika. Are there any working examples of this?
I know that I can use ImageMagick's convert tool to turn different image files into PDF documents. However, is there some way to specify what version of PDF document I want to use for the output? Can I convert an image to a PDF v1.4 document?
I am trying to find a way to automate the conversion of image files (probably SVG) to PDF files that need to be sent to a printing service. The printer's service requires the PDF files to meet certain requirements, and one of them is that the PDF file is v1.4. My version of convert is "6.5.7-8 2010-12-02 Q16".
Thanks,
Carl
This question on superuser.com
https://superuser.com/questions/193791/batch-convert-pdf-versions
will give you some hints on how to change the version number in the PDF afterwards.
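For checking the result, a PDF file starts with a header such as `%PDF-1.4`, so the declared version can be read from the first few bytes. The sketch below (Python, with hypothetical helper names) does that, and also builds a Ghostscript command line that rewrites a file through the `pdfwrite` device with `-dCompatibilityLevel`. Whether the linked thread recommends exactly this command is an assumption on my part, and the sketch only builds the argument list without running it. It also assumes the header sits at the very start of the file and ignores any `/Version` entry in the document catalog, which can override the header.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version from a leading '%PDF-x.y' header, or None if absent."""
    match = re.match(rb"%PDF-(\d\.\d)", data[:16])
    return match.group(1).decode("ascii") if match else None


def ghostscript_command(src: str, dst: str, level: str = "1.4"):
    """Build (but do not run) a gs invocation that rewrites src at the given level."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        f"-dCompatibilityLevel={level}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]
```

In a batch job you could pass the list to `subprocess.run(..., check=True)` and then call `pdf_header_version` on the first bytes of each output file before sending it to the printer.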
I'm new to Rails but am developing a web app that requires taking text from a large collection of text files and displaying the text in HTML. The files are in .doc, .docx, .wps, and .pages formats, and are currently just sitting on a hard drive. There are few enough .wps and .pages files that I could convert these to .doc manually, but the question remains: how do I get to the text inside a .doc or .docx file so that I can save it into a SQLite database for later use?
Thanks!
Take a look at Yomu. It's a gem that wraps Apache Tika and supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWork Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)
It's a long, roundabout way, but OpenOffice can convert files, and there are programmatic ways to do that: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/
That may not be the best way yet, but maybe it will grease the wheels a bit.
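For the .docx half of the question, it may help to know that a .docx file is a ZIP archive whose main body lives in `word/document.xml`, with the visible text inside `w:t` elements grouped into `w:p` paragraphs. Here is a minimal sketch of reading that and saving it to SQLite, written in Python rather than Ruby. The function names and the `documents` table layout are assumptions for the example, and it ignores headers, footnotes, and text boxes. The older binary .doc format is a different story and still needs a tool such as the ones above.

```python
import sqlite3
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source) -> str:
    """Return the body text of a .docx (path or file-like), one line per paragraph."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


def store_document(conn, name: str, body: str) -> None:
    """Insert or update one document row in a hypothetical 'documents' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO documents (name, body) VALUES (?, ?)",
        (name, body),
    )
    conn.commit()
```

A one-off import script could loop over the files on disk, call `docx_text` on each, and pass the result to `store_document`; the Rails app would then only need to read from the table.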