Is it possible to read non-text files into a Google Dataflow pipeline? - google-cloud-dataflow

I would like to read PDF files into the pipeline. However, I haven't found any Apache Beam example for file formats other than plain text or XML.

There is no pre-existing PDF reader in the Dataflow or Apache Beam libraries. However, you could use the TFRecordIO reader for TensorFlow records (linked below) as a model and write your own reader with the PDF parsing library of your choice.
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
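
Whichever reader you write, the main difference from a text source is that a PDF cannot be split into lines: each file has to reach the parser as one binary record. Here is a minimal sketch of that step in plain Java, with no Beam or PDF library involved. The class and method names (`WholeFileRecord`, `readFully`, `looksLikePdf`) are made up for illustration and are not part of any SDK. The header check only looks for the `%PDF-` marker at the start of the file; it does not validate the document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Beam or Dataflow.
public class WholeFileRecord {
    // A PDF file begins with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Reads the entire stream into memory as one binary record. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns true if the record starts with the "%PDF-" marker. */
    public static boolean looksLikePdf(byte[] record) {
        if (record.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (record[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; a real pipeline would open the file from GCS or disk.
        byte[] fake = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] record = readFully(new ByteArrayInputStream(fake));
        System.out.println(looksLikePdf(record));
    }
}
```

In a real reader, the byte array returned by `readFully` would be handed to your PDF library, and the extracted text would be emitted as the pipeline element. This sketch holds each file fully in memory, so it assumes the individual PDFs are reasonably small.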

Related

Get text from doc/docx file in pages using Apache tika

I am using the Apache Tika command line tool to extract text from doc and docx files. I can get the whole text, but I am unable to get it page by page so that I can store each page separately. Is there any way to achieve that?
Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
Since POI fundamentally cannot read page numbers out of a Word file, and Tika is not meant to be a document renderer either, the answer is simply: no, this is not possible.
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.

How to read PDF and extract text from PDF in Symfony 1.1?

I am working with Symfony 1.1 in an existing project. How can I read PDF files and extract text from them?
This is not really a Symfony 1.1 question; it is a PHP one. There are several libraries for handling PDFs in PHP. Here are some suggestions.
https://github.com/smalot/pdfparser
http://pastebin.com/dvwySU1a
http://www.pdflib.com/
If you just need to parse the PDF in any way and then process the text in PHP, you can also consider using a Java library such as the following.
http://pdfbox.apache.org/ (Is there a PDF parser for PHP?)

Google Cloud Dataflow: read data from compressed files

I'm trying to use Google Cloud Dataflow to read data from GCS and load it into BigQuery tables. However, the files in GCS are compressed (gzip). Is there any class that can be used to read data from compressed/gzipped files?
Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for each file. This even works with globs, where the files matched by the glob may be a mix of .gz, .bz2, and uncompressed.
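
To illustrate the idea behind extension-based detection, here is a rough sketch in plain Java. It is not the SDK's implementation: the class `AutoCompression`, its `Kind` enum, and the case-sensitive suffix match are all assumptions made for this example. Only gzip is actually decoded, because the JDK has no built-in bzip2 decoder.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical illustration of choosing a decoder from the file extension.
public class AutoCompression {
    public enum Kind { GZIP, BZIP2, UNCOMPRESSED }

    /** Picks a compression kind from the file name suffix (case-sensitive here). */
    public static Kind detect(String fileName) {
        if (fileName.endsWith(".gz")) {
            return Kind.GZIP;
        }
        if (fileName.endsWith(".bz2")) {
            return Kind.BZIP2;
        }
        return Kind.UNCOMPRESSED;
    }

    /** Wraps the raw stream in a decoder that matches the detected kind. */
    public static InputStream open(String fileName, InputStream raw) throws IOException {
        switch (detect(fileName)) {
            case GZIP:
                return new GZIPInputStream(raw);
            case BZIP2:
                // The JDK ships no bzip2 decoder; a third-party library would be needed.
                throw new UnsupportedOperationException("bzip2 is not handled in this sketch");
            default:
                return raw;
        }
    }
}
```

Because the decision is made per file name, a glob that matches a mix of compressed and uncompressed files can be handled by calling `open` once for each matched file.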

Apache Tika Office to PDF conversion

I am trying to convert Office files to PDF using POI and iText. I can do the basic conversion: I read the Word file using WordExtractor and write the contents to a PDF file using a PDF writer.
However, this does not retain the structure (tables, styles, etc.). I have read on this forum that you can retain the formatting using Tika. Are there any working examples of this?

Apache Tika alternatives for iOS

I know that Apache Tika is a text extractor. It can extract text from doc, pdf, ppt and many other file formats. I now need this functionality on iOS, so I would like to know whether there is an alternative to Apache Tika for iOS.
If there is no such library for iOS, please suggest tools that can extract text from specific file formats.
Thank you in advance.
libopc for extracting text from docx, xlsx, pptx.
Antiword for older MS formats.
You can also extract strings from a PDF using CoreGraphics, or using PDFiPhone.
If you also need to extract text from an HTML document, have a look at NSXMLParser.
