How do I remove NLP from Tika? - apache-tika

I just want to extract data from documents. So I do not think I need OpenNLP.
Is there a way to easily take it out so my Tika is lighter?

I have same need just like you.The way I deal with it is that I delete the module directly.In my program I delete the NLP and NL module and it works normally.

Related

Should I use one XML Parser for every XML feed I have, or should I write a parser for each XML feed I have?

So I help write an app for my university and I'm wondering what's the best way to handle multiple XML feeds (like scores for sports, class information, etc.).
Should I have one XML parser that can handle all feeds? Or should I write a parser for each feed? We're having trouble deciding the best way to implement it.
This is iOS and we use a mix of Swift 3 and Objective-C
I think the right strategy is to write a base class that handles common data types like integers, booleans, strings, etc., and then write derived classes for each type of feed. This is the strategy I use in my XML parser, which is based on the data structures and Apple's XML parser as described here:
https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/NSXML_Concepts/NSXML.html
Personally I prefer to use the XPath data models where you can query the XML tree for a specific node using a path-like string.

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use mahout. I'm not a java programmer however, so I'm trying to stay away from having to use the java library.
I noticed there is a shell tool regexconverter. However, the documentation is sparse and non instructive. Exactly what does specifying a regex option do, and what does the transformer class and formatter class do? The mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list is of using the regexconverter to convert http log requests to sequence files I believe. I have a csv file with slightly altered http log requests that I'm hoping to convert to sequence files. Do I simply change the regex expression to take each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example which seems to be done completely in the shell without need for java coding.
Incidentally, the arff.vector command seems to allow me to convert an arff file directly to vectors. I'm unfamiliar with arff, thought it seems to be something I can easily convert csv log files into. Should I use this method instead, and skip the sequence file step completely?
Thanks for the help.

tool to extract data structures from unclean data

I have unstructured geneally unclean data in a database field. There are common structures which are consistent in the data
namely:
field:
name:value
fieldset:
nombre <FieldSet>
field,
.
.
.
field(n)
table
nombre <table>
head(1)... head(n)
val(1)... val(n)
.
.
.
I was wondering if there was a tool (preferably in Java) that could extract learn/understand these data structures, parse the file and convert to a Map or object which I could run validation checks on?
I am aware of Antlr but understand this is more geared towards tree construction, an not independent bits of data (am I wrong about this?)
Does anyone have any suggestions for the problem as a whole?
I recommend Talend. It is very versatile, open source data integration tool. It is based on java. You can use build in tools/components to extract data from unstructured data sources. You can also write complex custom java code to do what you want.
I used Talend in couple of scientific proof of concept projects of mine. It worked for me. Good part is, it is free!
We ended up using antlr for this, it required us to make multiple lexers where one lexer would manipulated the input for the next lexer.
Another project is pads - wrote in C
You should use "bnflite"
https://github.com/r35382/bnflite
Using this template library you need to develop BNF like gramma for your text by means of classes and overloaded operators directly in C++ code.
The benefit is that such gramma is easily adjustable to your source

Need help to gather data from a txt file and insert in a web page?

could someone advise me on the most efficient way to gather data from one source, select a specific piece of data and insert it in a web page? Specifically, I wish to:
Call up this buoy data text file: http://www.ndbc.noaa.gov/data/realtime2/46237.txt
Find the water temperature and insert that value in my web page.
First big question: What scripting language should I use? (I'm assuming Fortran is not an option :-)
Second not so big question: This same data set is available in graphic and xml format. Would either of these data formats be more useful than the .txt file?
Thanks in advance.
Use Perl.
(Hey, you asked. Normally one programs in whatever language one would normally use.)
The XML format won't be much more useful than the text format.
This text file format is just about as simple as it could ever get. Just about any scripting or general purpose programming language will work. The critical part is to split each line on a regex "\s+". i.e. in Python it would be:
import re
theFileObject = open('/path/to/downloaded/file.txt')
for line in theFileObject.readlines():
columns = re.split(r'\s+', line)
# each column is columns[0] through columns[19]
So basically choose whatever programming language seems easiest to you. Any .NET language would be equally capable, as well as ruby, python, scheme, etc. I personally have a distaste for perl because I find it very difficult to read

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse in a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored to an array. (This is just a sample, the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user and they define the areas via tags. So, it might look something like this.
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.
The basic question is how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the basic structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.
It depends on what control you have over the format of the file you are parsing. If you have no control then a parser-generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control). It's quite a bit of work but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format then create it with as much markup as necessary. I would always create such a file in XML as there are so many tools for processing it (not only the parsing, but also XPath, databases, etc.) In general we use ANTLR to parse semi-structured information into XML.
If you don't need for the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.
Well, if it's "much more complicated", then, yes, a parser generator would be helpful. But, since you don't show the actual format of your file, how could anybody know what might be the right tool for the job?
I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.
If the format of the file is up to the user can you even define a grammar for it?
Seems like you just want a lexer at best. Using ANTLR just for the lexer part is possible, but would seem like overkill.

Resources