How to apply a Hive schema to unstructured text?

I have a space-delimited text file containing some log data. For simplicity, the headers would be:
'date', 'time', 'query', 'host'
And a record would look like:
2001-01-01 01:02:04 irfjrifjWt.f=32&ydeyf myhost
A simple Hive table with space-delimited fields will read this data correctly. However, I want to do further parsing of the query string.
Within this text are tags that I want to parse into Hive columns.
Here's a de-identified example of a couple of query strings:
ofifnmfiWT.s=12&ifmrinfnWT.df=hello'&oirjfirngirngWT.gh=32&iurenfur
ggfWT.gh=12&WT.ll=12&uyfer3d
Tags have the format WT.xx, followed by an =, followed by the value of the tag, followed by an &.
The order of the tags and the presence of each tag vary from record to record. The only thing I can define ahead of time is the set of tags I want to parse. In the example above, let's say I'm interested in the tags [WT.gh, WT.s]. Then (making up date, time, and host), my Hive table would look like:
date       time     host     WT.s  WT.gh
2011-01-01 05:03:03 myhost1  12    32
2011-01-01 05:03:03 myhost1  NULL  12
I could easily parse the query string with Python and a regex, and just create a second .txt file with the original record plus a series of new columns holding the parsed tags, but that seems like a waste of time, and it doesn't look like it follows schema-on-read principles.
I might be wrong in my thinking, since I'm new to this, but I was wondering if there is a way to apply a schema to this data that would inherently do the parsing for me.
If not, what solution would you recommend?
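For what it's worth, Hive's built-in regexp_extract function can do this kind of schema-on-read parsing at query time. A minimal sketch, assuming the raw table is called logs with string columns dt, tm, query, and host (made-up names here; columns literally named date and time would need backtick-quoting, since those collide with Hive keywords):

-- Pull each tag of interest out of the query string at read time.
-- Depending on the Hive version, a missing tag comes back as NULL or
-- as an empty string; the CASE normalises both to NULL.
SELECT
  dt,
  tm,
  host,
  CASE WHEN regexp_extract(query, 'WT\\.s=([^&]*)', 1) = ''
       THEN NULL ELSE regexp_extract(query, 'WT\\.s=([^&]*)', 1)
  END AS wt_s,
  CASE WHEN regexp_extract(query, 'WT\\.gh=([^&]*)', 1) = ''
       THEN NULL ELSE regexp_extract(query, 'WT\\.gh=([^&]*)', 1)
  END AS wt_gh
FROM logs;

Wrapping that SELECT in a view would give downstream queries the parsed columns without materialising a second file.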

Related

Querying lucene index with arbitrary long article text to check for all matches within article (through neo4j)

I'm trying to query the Lucene index I've added to a neo4j field (it's a "name" field that isn't very long: one to ten words at most).
What I do right now is take all the text in a given webpage, sanitize it with a javascript function to keep only words, spaces and alphanumeric characters, and use that to query my index.
.replace(/[^\w\s]/g, " ")                  // <- keep only words, spaces and alphanumerics
.replace(/\b(or|and|not|return)\b/gi, "")  // <- strip Lucene query keywords as whole words
I'm not sure if the length of the search text is limited somehow, but results do seem to disappear after about 1050 words (~6500 characters).
Ideally, I'd like to be able to use a couple thousand words in one query, with the end goal of highlighting the matches found within the webpage itself.
Why is my query not returning any results past a certain number of characters? Am I missing some keyword in my escaping regex?
Is what I'm trying to achieve feasible? Is there a better approach I could use?
Thanks for reading :)
(for anyone finding this, I found a somewhat related question here: Handling large search queries on relatively small index documents in Lucene)

NLP approaches to identify dates/time expressions in text

I need to develop an application which identifies dates inside a given text using some NLP approach. Let's assume I have data in a DB with date columns "from" and "to", and the text is as below:
Get data between 1st August and 15th August
I need to identify the dates and form the query to retrieve the data. I used Natty NLP and was able to identify the dates. But I'm stuck on more complex time expressions like:
Get data uploaded next week
Get data uploaded last week
Here, for the first one, I need to identify next week's Monday and Sunday dates and form the query, and the same for the second one. But Natty just gives me the date one week from today. What other solutions exist? Or do I need to handle these expressions myself in code? I am using Java.
Your question is a bit confusing, but I guess you want to achieve two things:
Identify words that represent a time expression
Map these words to a formal, machine-readable representation
If that is what you need, check out the Duckling framework: it identifies time expressions and normalises them into a single formal date representation.
Note that you need to pass a reference date for ambiguous time expressions.
You can run it as a service and call it from your code.
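For illustration, a minimal sketch of that service call from Java 11+. It assumes a Duckling server is already running on localhost:8000 and uses the endpoint and parameter names from Duckling's README; treat the details as assumptions to verify against your deployment:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DucklingClient {
    public static void main(String[] args) throws Exception {
        String text = "Get data uploaded next week";
        // Reference date (epoch millis) used to resolve relative
        // expressions like "next week".
        long reftime = System.currentTimeMillis();
        String form = "locale=en_US"
                + "&text=" + URLEncoder.encode(text, StandardCharsets.UTF_8)
                + "&reftime=" + reftime;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8000/parse"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON response carries the resolved date(s) plus a grain
        // (e.g. "week"), from which a Monday-to-Sunday range can be built.
        System.out.println(response.body());
    }
}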

how to change the format of a field when using parse to select fields in sumologic

I am totally new to the Sumo Logic platform. I am trying to select fields from the log data. The simple code is:
| parse "transactionNumber=*|" as transactionNumber
| parse "message=*|" as message
My transaction numbers are very long, such as 123456789987654321. So when I use 'Export (Display Fields)' to save the result to a CSV file, they are automatically converted to scientific notation such as 123e+15.
So, how do I change the format, let's say from number to character, so that I can get the real numbers in the CSV?
I think the simple way is to save the file as .txt instead of .csv.
But this is not related to Sumo Logic programming itself, so I don't think it is a very "decent" way.
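The conversion to scientific notation typically happens in the spreadsheet application that opens the CSV, not in the CSV file itself. One workaround sketch, using Sumo Logic's concat operator to prefix a non-numeric character so spreadsheets keep the value as text (the leading apostrophe is a common Excel convention; treat this as an assumption to test against your viewer):

| parse "transactionNumber=*|" as transactionNumber
| parse "message=*|" as message
// force a text interpretation by prepending an apostrophe
| concat("'", transactionNumber) as transactionNumber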

Force field/tag when inserting data on influxdb

I am just starting with InfluxDB as a time series database, and I was trying to create some measurements. However, it seems like Influx automatically determines which parts of a measurement are tags and which are fields. Is there any way to force one or the other at insertion time (the first insertion of a measurement)? Or any other way whatsoever?
No, InfluxDB won't automatically determine which are fields and which are tags. Your "insert string" structure tells InfluxDB which are tags and which are fields.
For example:
cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000
In this string, "cpu_load_short" is the measurement name (observe that there is no = sign forming a key/value pair), followed by a comma.
host=server01,region=us-west are tags (in key=value format).
Right after the tags there is a space, which signals that what follows the space (up to the timestamp) are fields.
You may refer to the InfluxDB line protocol documentation for more information.
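To make the tag/field choice concrete: the same host value can be written either way at first insert, and which side of the space it sits on is what decides. A sketch with made-up values (note that string field values are double-quoted):

# "host" as a tag: indexed, always stored as a string
cpu_load_short,host=server01 value=0.64 1434055562000000000
# "host" as a field instead: it moves after the space and is quoted
cpu_load_short value=0.64,host="server01" 1434055562000000000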

How to handle Dates as long numbers

I am trying to store dates on the nodes in my database. I am loading data using webadmin and the CSV importer. My problem is that the data is saved as a string and I need it to be a long; I have found methods to cast some types, like toInt(), but there is no equivalent for the long type.
I have a node that contains two date fields, ArrivalDate and DepartureDate. Each is a long number in the CSV file, but once the query is executed in neo4j the field is stored as a string. The problem is that I cannot write a query to compare dates, since they are strings. A sample query I want to run looks like this:
MATCH (p:Person)--(s:Stay)
WHERE s.ArrivalDate <= 634924360000000000 AND s.DepartureDate >= 634924360000000000
RETURN p
So I would get all the people staying on that date.
I have done some research, and I also asked about this before; maybe the question was not explained well.
For reference: I am using webadmin to bulk-load the CSV files, but my app is in C# and I am using Neo4jClient to work with the DB.
Neo4j 2.1 (which is about to be released rather soon) has a Cypher command, LOAD CSV. You can use the toInt() function to convert a string to a numeric value during import; Cypher integers are 64-bit, so the result is large enough for long values like the timestamps above. Example:
LOAD CSV WITH HEADERS FROM 'file:/mnt/teamcity-work/42cff4ac2707ec23/target/community/cypher/docs/cypher-docs/target/docs/dev/ql/load-csv/csv-files/file.csv'
AS line
CREATE (:Artist { name: line.Name, year: toInt(line.Year)})
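Adapted to the Stay nodes from the question (the file path and CSV header names here are assumptions), the import might look like:

LOAD CSV WITH HEADERS FROM 'file:/path/to/stays.csv' AS line
CREATE (:Stay { ArrivalDate: toInt(line.ArrivalDate), DepartureDate: toInt(line.DepartureDate) })

Once the properties are stored as integers, the range comparison from the question works as written.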
