Apache Tika SourceCode Parser - parsing

I am trying to use the SourceCodeParser in Apache Tika to parse a C++ source file. However, I am unable to retrieve the context. Has anyone worked with such a requirement before who could share their experience or knowledge?
Thanks

Related

Converting an AVDL file into something Apache's avro python package can parse

What I would like to do is take an .avdl file and parse it in Python, so that I can make use of the information from within Python.
According to the documentation, Apache's Python package does not handle .avdl files. I need to use avro-tools to convert the .avdl file into something the package does know how to parse.
According to the documentation at https://avro.apache.org/docs/current/idl.html, I can convert a .avdl file into a .avpr file with the following command:
java -jar avro-tools.jar idl src/test/idl/input/namespaces.avdl /tmp/namespaces.avpr
I ran my .avdl file through avro-tools, and it produced an .avpr file.
What is unclear is how I can use the Python package to interpret this data. I tried something simple...
schema = avro.schema.parse(open("my.avpr", "rb").read())
but that generates the error:
SchemaParseException: No "type" property:
I believe that avro.schema.parse is designed to parse .avsc files (?). However, it is unclear how I can use avro-tools to convert my .avdl into .avsc. Is that possible?
I am guessing there are many pieces I am missing, and I do not quite understand (yet) what the purpose of all of these files is.
It does appear that an .avpr is a JSON file (?), so I could just read and interpret it myself, but I was hoping that there would be a Python package that would assist me in navigating the data.
Can anyone provide some insights into this? Thank you.
The answer is to use the idl2schemata command with avro-tools.jar, providing it with an output directory to which it can write the .avsc files. The .avsc files can then be read by the Avro Python package.
For example:
java -jar avro-tools.jar idl2schemata src/test/idl/input/namespaces.avdl /tmp/
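As the question suspects, an .avpr file is plain JSON. Its top-level object describes a protocol (keys such as "protocol", "types" and "messages") rather than a single schema, which would explain the missing "type" property in the error. If you only need to inspect the definitions, a minimal sketch using nothing but the standard json module could look like this. The sample protocol text and the helper name named_types are made up for illustration and are not part of any Avro package.

```python
import json

# Hypothetical, minimal protocol document standing in for the contents
# of an .avpr file; a real file would be read from disk instead.
SAMPLE_AVPR = """
{
  "protocol": "Example",
  "namespace": "org.example",
  "types": [
    {"type": "record", "name": "Person",
     "fields": [{"name": "name", "type": "string"},
                {"name": "age", "type": "int"}]},
    {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "INACTIVE"]}
  ],
  "messages": {}
}
"""


def named_types(avpr_text):
    """Return {type name: schema dict} for each entry under "types"."""
    protocol = json.loads(avpr_text)
    return {t["name"]: t for t in protocol.get("types", [])}


types = named_types(SAMPLE_AVPR)
for name, schema in types.items():
    # Each entry is a schema-shaped JSON object with its own "type" key.
    print(name, schema["type"])
```

For your own file, pass open("my.avpr").read() to the helper in place of the sample string.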

Is it possible to read non-text files into a google dataflow pipeline?

I would like to read PDF files into the pipeline. However, I haven't found any Apache Beam example covering file formats other than plain text or XML.
There is no pre-existing PDF reader available in Dataflow or Apache Beam libraries. However, you could use the example of this reader for TensorFlow records as a model to write your own using the PDF parsing library of your choice.
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java

Parse or convert .pb files under .sonar folder

I'm using SonarQube 5.6.5, and everything works well. Now I need to parse the issues.pb file generated under the .sonar/batch-report/ folder. I tried using JsonFormat, but it is not working.
The files are in "Protobuf" format, which is a format by Google for serializing data. You can get started here or find for example a tutorial here on how to use it in Java.
What I don't understand is that your question has the tag "protobuf-net", whose GitHub page explains very well how to use it (in .NET).

Grails 2.4.4 How do I export excel file?

I've looked at some plugins but no success.
I tried Export Plugin 1.6 as well but the view doesn't recognize r:.. and export:.. tags.
What is the best way to export rows of data from a PostgreSQL database into an Excel file at the click of a button?
Thank you.
You could create a GSP that renders a .csv file and set the content type of the response to application/vnd.ms-excel within the controller.
That's the easiest way, but you will not be able to control the format of the cells.
Apache POI, as mentioned by Abincepto, is another solution. It is more complex but gives you full control over the generated Excel file.
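The CSV approach from this answer can be sketched in a language-neutral way. The snippet below is Python rather than Grails/GSP code, and the sample rows, the rows_to_csv helper and the Content-Disposition header are assumptions added for illustration. Only the application/vnd.ms-excel content type comes from the answer itself.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header row and data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows standing in for the result of a database query.
body = rows_to_csv(["id", "name"], [[1, "Alice"], [2, "Bob, Jr."]])

# Response headers a controller would set; the filename is a placeholder.
headers = {
    "Content-Type": "application/vnd.ms-excel",
    "Content-Disposition": 'attachment; filename="export.csv"',
}

print(body)
```

Fields containing the delimiter are quoted by the CSV writer, but as the answer notes, nothing here controls cell formatting. That is where a library such as Apache POI comes in.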
Did you try using Apache POI directly?
From the website:
The Apache POI Project's mission is to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
EDIT:
Here is a tutorial: Read / Write Excel file in Java using Apache POI
and a quick guide
EDIT2:
I just found another link using Grails that could help you. The example uses another library: jexcelapi
The export plugin is dependent on the resources plugin. You can add the resources plugin and try again. I use resources 1.2.8. Also you need to add this to your dependencies:
dependencies {
    ............
    // Needed for the export plugin?
    compile 'commons-beanutils:commons-beanutils:1.8.3'
}
plugins {
    ............
    runtime ":resources:1.2.8"
}

Get text from doc/docx file in pages using Apache tika

I am using the Apache Tika command-line tool to extract text from doc and docx files. I can get the whole text, but I am unable to get it split into pages so that I can store each page separately. Is there any way to achieve that?
Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
Since POI (fundamentally) cannot read out those page numbers and Tika is not meant to be a document renderer either, the answer is very simply: No, this is not possible.
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.
