Does Apache Tika support text extraction of documents created on Google Drive?

The MIME type of a document created in G Suite (Google Drive) is "application/vnd.google-apps.document".
The MIME type of an uploaded document remains intact as "application/vnd.openxmlformats-officedocument.wordprocessingml.document".
The Tika documentation lists "application/vnd.openxmlformats-officedocument.wordprocessingml.document" as supported, but it does not mention "application/vnd.google-apps.document".
When we tried it on a test setup, Tika was able to extract the text from both documents.
How can we confirm this behaviour? Will it work reliably for all documents created on Google Drive (even though that MIME type is not in Tika's list of supported types), or was this just an exceptional occurrence?
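One thing that may explain the test result: as far as I know, files with an "application/vnd.google-apps.*" type are Google-native and have no downloadable bytes of their own; the Drive API only hands them out through an export to a concrete format. So what Tika parsed in the test was probably an exported file (for example a .docx), not the Google-native type itself. A minimal sketch of that routing decision follows. The function name `tika_input_type` and the choice of export targets are assumptions made for illustration; they are not part of Tika or of any Drive client library.

```python
# Hypothetical helper: decide which MIME type Tika will actually see
# for a file listed by Google Drive.

GOOGLE_NATIVE_PREFIX = "application/vnd.google-apps."

# Assumed export targets (Office Open XML), chosen for illustration.
EXPORT_TARGETS = {
    "application/vnd.google-apps.document":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.google-apps.spreadsheet":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.google-apps.presentation":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def tika_input_type(drive_mime_type):
    """Return (needs_export, mime_type_passed_to_tika).

    Uploaded files keep their own type and can be downloaded as-is.
    Google-native files must be exported first; native types without a
    known export target yield (True, None).
    """
    if drive_mime_type.startswith(GOOGLE_NATIVE_PREFIX):
        return True, EXPORT_TARGETS.get(drive_mime_type)
    return False, drive_mime_type
```

If the pipeline always exports Google-native documents to a format on Tika's supported list before parsing, the missing "application/vnd.google-apps.document" entry should not matter, because Tika never sees that type.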

Related

Get text from doc/docx file in pages using Apache tika

I am using the Apache Tika command-line tool to extract text from .doc and .docx files. I can get the whole text, but I am unable to get it page by page so that I can store each page separately. Is there any way to achieve that?
Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
Since POI (fundamentally) cannot read out those page numbers and Tika is not meant to be a document renderer either, the answer is very simply: No, this is not possible.
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.
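To illustrate the point for the XML-based format: a .docx stores a flow of paragraphs, and the only page information reliably present is a page break the author inserted explicitly. The sketch below uses only the Python standard library (not Tika or POI) and splits on those explicit breaks. The function name is made up for this example, and the chunks it returns are not rendered pages; the old binary .doc format is not covered.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_on_explicit_page_breaks(docx_bytes):
    """Split the text of a .docx at explicit page breaks only.

    Text that simply flows onto the next page carries no marker in the
    file, so the result does not correspond to printed pages.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    chunks = [[]]
    for node in root.iter():  # document order
        if node.tag == W + "p" and chunks[-1]:
            chunks[-1].append("\n")   # new paragraph within the same chunk
        elif node.tag == W + "br" and node.get(W + "type") == "page":
            chunks.append([])         # author-inserted page break
        elif node.tag == W + "t" and node.text:
            chunks[-1].append(node.text)
    return ["".join(chunk).strip() for chunk in chunks]
```

A document with no explicit breaks comes back as a single chunk, however many pages it prints on, which is the limitation the answer describes.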

Which file formats can be previewed on CKAN Data Preview tool?

I am working on CKAN and would like to know which file formats can be previewed in CKAN. I could not find any information on this topic online, so I decided to start this topic in the hope of gathering responses that will be useful to CKAN developers in future. Here is a list of file formats that I have gathered after experimenting with my own CKAN and looking through other CKAN instances, such as those from the UK and Australia.
Can be previewed:
CSV (Comma separated values)
XLS (Microsoft Excel Binary File Format)
HTML (HyperText Markup Language)
JSON (JavaScript Object Notation)
PDF (Portable Document Format)
RSS (Really Simple Syndication)
TXT (Plain Text)
WMS (Web Map Service)
XML (eXtensible Markup Language)
Cannot be previewed:
DOC (Microsoft Word)
RDF (Resource Description Framework)
HTML (HyperText Markup Language)
KML (Keyhole Markup Language)
SHP (Shapefile)
WFS (Web Feature Service)
XLSX (Microsoft Excel Open XML Document)
ZIP (archive)
Help add on to my list and correct me if any of the above is wrong, then I will update the list above. Thanks! ;)
The data viewer's functionality may differ from one CKAN release to the next.
Refer to the DataViewer section in the documentation for the CKAN version that you are using.
http://docs.ckan.org/en/latest/maintaining/data-viewer.html
With some simple tweaks to the config file, XLSX files can be previewed, as can tab-separated text files (TSV format/extension).
Edit the config.ini file to include:
ckan.datapusher.formats = csv xls xlsx tsv application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
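A quick way to check which formats a given config file would hand to the DataPusher is to parse that line. The following is a minimal sketch using Python's standard `configparser`; the `[app:main]` section name and the helper name `datapusher_formats` are assumptions for illustration, so adjust them to match your own file.

```python
import configparser

# Sample config text; the [app:main] section name is an assumption.
SAMPLE_CONFIG = """
[app:main]
ckan.datapusher.formats = csv xls xlsx tsv application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
"""


def datapusher_formats(config_text, section="app:main"):
    """Return the set of formats listed in ckan.datapusher.formats."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    raw = parser.get(section, "ckan.datapusher.formats", fallback="")
    # The option is a whitespace-separated list of extensions and MIME types.
    return {fmt.lower() for fmt in raw.split()}


formats = datapusher_formats(SAMPLE_CONFIG)
print("xlsx" in formats, "tsv" in formats)
```

To check a real file, read its contents and pass the text to `datapusher_formats`.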
HTML and RDF are also previewable:
refer to the CKAN documentation at http://docs.ckan.org/en/latest/maintaining/configuration.html?highlight=preview#ckan-preview-loadable

Apache Tika Office to PDF conversion

I am trying to convert Office files to PDF using POI and iText. I am able to do the basic conversion, where I read the Word file using WordExtractor and write the contents to a PDF file using the PDF writer.
However, this does not retain the structure (tables, styles, etc.). I have read on a forum that the formatting can be retained using Tika. Are there any working examples of this?

Apache Tika alternatives for ios

I know that Apache Tika is a text extractor. It can extract text from doc, pdf, ppt and lots of other file formats. I now need this functionality on iOS, so I want to know whether there is any alternative to Apache Tika for iOS.
If there is no such library for iOS, please point me to tools that can extract text from specific file formats.
Thank you in advance.
libopc for extracting text from docx, xlsx, pptx.
Antiword for older MS formats.
You can also extract strings from a PDF using CoreGraphics, or using PDFiPhone.
If you're also looking for extracting text from a HTML document, have a look at NSXMLParser.

Grabbing text from various document formats in Ruby on Rails

I'm new to Rails but am developing a web app that requires taking text from a large collection of text files and displaying the text in HTML. The files are in .doc, .docx, .wps, and .pages formats, and are currently just sitting on a hard drive. There are few enough .wps and .pages files that I could convert these to .doc manually, but the question remains: how do I get to the text inside a .doc or .docx file so that I can save it into a SQLite database for later use?
Thanks!
Take a look at Yomu. It's a gem that acts as a wrapper for Apache Tika, and it supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWorks Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)
It's a long, roundabout way, but OpenOffice can convert files, and there are programmatic ways to do that: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/
That may not be the best way yet, but maybe it will grease the wheels a bit.

Resources