In my project I need to aggregate a lot of data into one string and later parse it back out.
The data is about people: it needs to record people_ids by state and age group, along with their counts.
For example, say we have 5 people named John Smith in CA (2 aged 20-29, 2 aged 30-39, 1 aged 40-49) and 2 people named John Smith in NY (1 aged 20-29 and 1 aged 30-39). Then the string would look something like this:
John smith| [CA#5: 20-29#2{pid_1, pid_2};30-39#2{pid_3,pid_4};40-49#1{pid_5}] [NY#2: 20-29#1{pid_6};30-39#1{pid_7}]
It doesn't necessarily have to be this exact format; any format that is easy to parse would do. Is there a good way to do this? How about JSON?
And if it does use the format above, how should I parse out, say, all the John Smiths in CA between ages 30-39?
Thanks a lot!!
From my understanding of your post, this might be a format you're looking for (as represented in JSON).
Keep in mind that there are gems that can generate and parse JSON for you.
{
  "name": "John Smith",
  "states": {
    "CA": {
      "total": 5,
      "ages": {
        "20-29": ["pid_1", "pid_2"],
        "30-39": ["pid_3", "pid_4"],
        "40-49": ["pid_5"]
      }
    },
    "NY": {
      "total": 2,
      "ages": {
        "20-29": ["pid_6"],
        "30-39": ["pid_7"]
      }
    }
  }
}
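If you go with JSON, getting the data back out is straightforward. Here is a minimal Ruby sketch using the standard json library (the JSON string is just a trimmed-down version of the structure above; in practice it would come from wherever you store it):
require 'json'

person = JSON.parse('{
  "name": "John Smith",
  "states": {
    "CA": { "total": 5, "ages": { "30-39": ["pid_3", "pid_4"] } }
  }
}')

# All John Smiths in CA between ages 30 and 39:
pids = person.dig('states', 'CA', 'ages', '30-39') || []
puts pids.inspect   # => ["pid_3", "pid_4"]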
I'm currently working on an NLP project where I'm trying to use NLTK to recognize a PERSON name. The problem, however, is more challenging than just finding parts of speech.
input = "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."
So the challenge is: I just want to get the attorney's name back from the whole document, not the other person's, i.e. "John Smith", part of speech: PERSON, occupation: attorney. The return could look like this, or be just "John Smith".
{
"name": "John Smith",
"type": "PERSON",
"occupation": "attorney"
}
I have tried NLTK part-of-speech tagging and also the Google Cloud Natural Language API, but they only helped me detect the PERSON names. How can I detect whether a person is an attorney? Please guide me to the right approach. Do I have to train my own model or corpus to detect "attorney"? I have thousands of court-document txt files.
The thing with pre-trained machine learning models is that there is not much room for flexibility in what you want to achieve. Tools such as Google Cloud Natural Language offer some really interesting functionality, but you cannot make them do arbitrary extra work for you. In such a case, you would need to train your own models, or try a different approach using tools such as TensorFlow, which requires a high level of expertise to obtain decent results.
However, regarding your precise use case, you can use the analyzeEntities method to find named entities (common nouns and proper names). It turns out that, if the word attorney is next to the name of the person who is actually an attorney (as in "I am John, and my attorney James is working on my case." or your example "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."), it will bind those two entities together.
You can test this using the API Explorer with the call I'm sharing, and you will see that for the request:
{
"document": {
"content": "I am John, and my attorney James is working on my case.",
"type": "PLAIN_TEXT"
},
"encodingType": "UTF8"
}
Some of the resulting entities are:
{
"name": "James",
"type": "PERSON",
"metadata": {
},
"salience": 0.5714066,
"mentions": [
{
"text": {
"content": "attorney",
"beginOffset": 18
},
"type": "COMMON"
},
{
"text": {
"content": "James",
"beginOffset": 27
},
"type": "PROPER"
}
]
},
{
"name": "John",
"type": "PERSON",
"metadata": {
},
"salience": 0.23953272,
"mentions": [
{
"text": {
"content": "John",
"beginOffset": 5
},
"type": "PROPER"
}
]
}
In this case, you will be able to parse the JSON response and see that James is (correctly) connected to the attorney noun, while John is not. However, as per some tests I have been running, this behavior seems to be only reproducible if the word attorney is next to one of the names you are trying to identify.
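If it helps, a rough Ruby sketch of that parsing step could look like the following; it assumes the full analyzeEntities response has been saved to a file called response.json (a placeholder name), and the only heuristic is checking for a COMMON mention whose text is "attorney":
require 'json'

# response.json: the raw analyzeEntities response body (placeholder file name).
response = JSON.parse(File.read('response.json'))

# Keep the PERSON entities that the API bound to the common noun "attorney".
attorneys = response['entities'].select do |entity|
  entity['type'] == 'PERSON' &&
    entity['mentions'].any? { |m| m['type'] == 'COMMON' && m['text']['content'].casecmp('attorney').zero? }
end

attorneys.each { |a| puts a['name'] }   # => James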
I hope this is of help to you, but if your needs are more complex, you will not be able to cover them with an out-of-the-box solution such as the Natural Language API.
In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.
My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have an Invoice entity that I collect from several suppliers, so I have two different schemas for that entity.
So I might end up with 2 docs in Couch that look like this:
{
"type": "Invoice",
"subType": "supplier B",
"total": 22.5,
"date": "10 Jan 2017",
"customerName": "me"
}
{
"type": "Invoice",
"subType": "supplier A",
"InvoiceTotal": 10.2,
"OrderDate": <some other date format>,
"customerName": "me"
}
I also have a doc like this:
{
"type": "Customer",
"name": "me",
"details": "etc..."
}
My intention then is to 'flatten' the Invoice entities, and then join on the reduce function. So, the map function looks like this:
function(doc) {
switch(doc.type) {
case 'Customer':
emit(doc.customerName, { doc information ..., type: "Customer" });
break;
case 'Invoice':
switch (doc.subType) {
case 'supplier B':
emit (doc.customerName, { total: doc.total, date: doc.date, type: "Invoice"});
break;
case 'supplier A':
emit (doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice"});
break;
}
break;
}
}
Then I would use the reduce function to compare docs with the same customerName (i.e. a join).
Is this advisable using CouchDB? If not, why?
First of all, apologies for getting back to you late; I thought I'd look at it directly, but I haven't been on SO since we exchanged the other day.
Reduce functions should only be used to reduce scalar values, not to aggregate data. So you wouldn't use them to achieve things such as joins or removing duplicates, but you would, for example, use them to compute the number of invoices per customer - you see the idea. The reason is that you can only make weak assumptions about the calls made to your reduce functions (the order in which records are passed, the rereduce parameter, etc.), so you can easily end up with serious performance problems.
But this is by design, since the intended usage of reduce functions is to reduce scalar values. An easy way to think about it is to say that no filtering should ever happen in a reduce function; filtering and things such as checking keys should be done in the map.
If you just want to compare docs with the same customer name, you do not need a reduce function at all; you can query your view with the following parameters:
startkey=["customerName"]
endkey=["customerName", {}]
Otherwise you may want to create a separate view to filter on customers first, and return their names and then use these names to query your view in a bulk manner using the keys view parameter. Startkey/endkey is good if you only want to filter one customer at a time, and/or need to match complex keys in a partial way.
If what you are after are the numbers, you may want to do:
if(doc.type == "Invoice") {
emit([doc.customerName, doc.supplierName, doc.date], doc.amount)
}
And then use the _stats built-in reduce function to get statistics on the amount (sum, min, max, ...).
So to get the amount spent with a supplier, you'd just need to make a reduce query to your view and use the parameter group_level=2 to aggregate by the first 2 elements of the key. You can combine this with startkey and endkey to filter specific values of this key:
startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]
You can then build on this example to do things such as:
if(doc.type == "Invoice") {
emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
}
Hope this helps
It is totally OK to "normalize" your different schemas (or subTypes) via a view. You cannot create views based on those normalized schemas, though, and in the long run it might be hard to manage the different schemas.
The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store each document in its original form. This would make working with the data much easier:
{
"type": "Invoice",
"total": 22.5,
"date": "2017-01-10T00:00:00.000Z",
"customerName": "me",
"original": {
"supplier": "supplier B",
"total": 22.5,
"date": "10 Jan 2017",
"customerName": "me"
}
},
{
"type": "Invoice",
"total": 10.2,
"date": "2017-01-12T00:00:00:00.000Z,
"customerName": "me",
"original": {
"subType": "supplier A",
"InvoiceTotal": 10.2,
"OrderDate": <some other date format>,
"customerName": "me"
}
}
I'd also convert the date to ISO format, because it parses well with new Date(), sorts correctly, and is human-readable. You can easily emit invoices grouped by year, month, day and so on with that.
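As an illustration of that normalization step, here is a small Ruby sketch; the field mapping only covers the two supplier schemas from your question, and Date.parse is just one way to handle dates like "10 Jan 2017":
require 'date'
require 'json'

# Map a raw supplier document onto the common schema, keeping the original.
def normalize_invoice(raw)
  {
    'type'         => 'Invoice',
    'total'        => raw['total'] || raw['InvoiceTotal'],
    'date'         => Date.parse(raw['date'] || raw['OrderDate']).iso8601,
    'customerName' => raw['customerName'],
    'original'     => raw
  }
end

supplier_a_doc = { 'type' => 'Invoice', 'subType' => 'supplier A',
                   'InvoiceTotal' => 10.2, 'OrderDate' => '12 Jan 2017',
                   'customerName' => 'me' }
puts JSON.pretty_generate(normalize_invoice(supplier_a_doc))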
Use reduce preferably only with the built-in functions, because reduces have to be re-executed on queries, and executing JavaScript on many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB documentation. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.
I want to gather data and then write a method to generate records based on that data. After running the method, I want to have a series of Movies and MovieRelations (which associate similar movies with each other). Each Movie will have a title, release_date, and several similar Movies through a MovieRelation. Each MovieRelation will have a movie_a_id and a movie_b_id.
The simplest way I've come up with is to write a text document with the movies and their individual data separated by two different special symbols: one marks where the text should be broken up into separate movies, and the other marks where each movie should be broken up into its individual pieces of data, like this:
Title#Release Date#Similar Movie A#Similar Movie B%Title2#Release Date2#Similar Movie 2A#Similar Movie 2B#Similar Movie 2C
Then I could copy and paste the raw text into a method similar to this:
"X Men#11-02-2010#Hulk#Logan%Sing#12-04-2017#Zootopia#Pitch Perfect#Monster U"
.split('%').map.each do |movie_data|
#movie = Movie.create()
movie_data.split('#').map.each_with_index do |individual_data, index|
if index == 1
#movie.name = individual_data
elsif index == 2
#movie.release_date = individual_data
elsif index > 2
MovieRelation.create(movie_a_id: #movie.id, movie_b_id: Movie.find_by(name: individual_data))
end
end
#movie.save
end
So in the end, I should have 2 Movies and 5 MovieRelations.
I think this would work, but it seems pretty hacky. Is there a better way to accomplish this?
Before you start creating your own format, I'd suggest looking at YAML or JSON, which are well established and well supported, are internet standards with defined syntax, and have parsers/serializers for the major languages, so your data won't be locked to just your application.
Here's a starting point:
require 'yaml'
data = {
'title' => 'Raiders of the Lost Ark',
'release_date' => '12 June 1981',
'similar_movies' => [
{
'title' => 'Indiana Jones and the Last Crusade',
'release_date' => '24 May 1989',
'similar_movies' => nil
},
{
'title' => 'Indiana Jones and the Temple of Doom',
'release_date' => '23 May 1984',
'similar_movies' => nil
}
]
}
puts data.to_yaml
That outputs:
---
title: Raiders of the Lost Ark
release_date: 12 June 1981
similar_movies:
- title: Indiana Jones and the Last Crusade
  release_date: 24 May 1989
  similar_movies:
- title: Indiana Jones and the Temple of Doom
  release_date: 23 May 1984
  similar_movies:
YAML is parsed using the Psych class, so see the Psych documentation's load, load_file and maybe load_stream methods to learn how to read that data back and convert it into a Ruby object.
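For example, reading the data back in could be as little as this (movies.yml being whatever file you piped the output into):
require 'yaml'

# Load the serialized YAML back into plain Ruby hashes and arrays.
data = YAML.load_file('movies.yml')
puts data['title']                                        # => Raiders of the Lost Ark
puts data['similar_movies'].map { |m| m['title'] }.inspect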
Similarly you could use JSON:
require 'json'
puts data.to_json
Which outputs:
{"title":"Raiders of the Lost Ark","release_date":"12 June 1981","similar_movies":[{"title":"Indiana Jones and the Last Crusade","release_date":"24 May 1989","similar_movies":null},{"title":"Indiana Jones and the Temple of Doom","release_date":"23 May 1984","similar_movies":null}]}
Or, if you need "pretty":
puts JSON.pretty_generate(data)
{
"title": "Raiders of the Lost Ark",
"release_date": "12 June 1981",
"similar_movies": [
{
"title": "Indiana Jones and the Last Crusade",
"release_date": "24 May 1989",
"similar_movies": null
},
{
"title": "Indiana Jones and the Temple of Doom",
"release_date": "23 May 1984",
"similar_movies": null
}
]
}
JSON lets us use JSON['some JSON as a string'] or JSON[a_ruby_hash_or_array] as a shortcut to parse or serialize respectively:
foo = JSON[{'a' => 1}]
foo # => "{\"a\":1}"
JSON[foo] # => {"a"=>1}
In either case, experiment with using Ruby to build your starting hash and let it emit the serialized version, then pipe that output to a file and begin filling it in.
If you want to use an ID for a related movie instead of the name you'll have to order your records in the file so the related movies occur first, remember what those IDs are after inserting them, then plug them into your data. That's really a pain. Instead, I'd walk through the object that results from parsing the data, extract all the related movies, insert them, then insert the main record. How to do that is left for you to figure out, but it's not too hard.
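One way that walk could look, as a rough sketch assuming ActiveRecord-style Movie and MovieRelation models as in your snippet (your code uses a name column, so the sketch does too):
# Insert the related movies first, then the main movie, then the relations.
def import_movie(data)
  related = (data['similar_movies'] || []).map do |m|
    Movie.find_or_create_by(name: m['title']) do |movie|
      movie.release_date = m['release_date']
    end
  end

  main = Movie.find_or_create_by(name: data['title']) do |movie|
    movie.release_date = data['release_date']
  end

  related.each do |movie|
    MovieRelation.find_or_create_by(movie_a_id: main.id, movie_b_id: movie.id)
  end
end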
Parsing the string
For your code, you don't need an index, if or case; just split and splat:
input = 'X Men#11-02-2010#Hulk#Logan%Sing#12-04-2017#Zootopia#Pitch Perfect#Monster U'
input.split('%').each do |movie_data|
title, date, *related_movies = movie_data.split('#')
puts format('%-10s (%s) Related : %s', title, date, related_movies)
end
It outputs:
X Men (11-02-2010) Related : ["Hulk", "Logan"]
Sing (12-04-2017) Related : ["Zootopia", "Pitch Perfect", "Monster U"]
Saving data
You're trying to solve a problem that has already been solved. MovieRelations belong to, well, a relational database!
You could do all the imports, sorts and filters with a database (e.g. with Rails or Sequel). Once you're done and would like to export the information as plain text, you could dump your data into YAML/SQL/JSON.
With your current format, you'll run into problems as soon as you want to update relations, delete a movie, or insert a movie with % or # in the title.
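For what it's worth, a minimal ActiveRecord sketch of those two models (the association names are only suggestions, and this assumes a Rails app with the matching tables) could look like:
class Movie < ActiveRecord::Base
  # columns: name (or title), release_date
  has_many :movie_relations, foreign_key: :movie_a_id
  # one direction only, for brevity
  has_many :similar_movies, through: :movie_relations, source: :movie_b
end

class MovieRelation < ActiveRecord::Base
  # columns: movie_a_id, movie_b_id
  belongs_to :movie_a, class_name: 'Movie'
  belongs_to :movie_b, class_name: 'Movie'
end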
I have the following Twitter data and I want to design a schema for it. The queries I would need to perform are:
get tweet volume for a time interval, tweets with the corresponding user info, tweets with the corresponding topic info, etc. Based on the data below, can anyone tell me whether this schema design is correct? (Make the row key id + timestamp, make the user a column family, and group the others into a primary column family.) Any suggestions?
{
"created_at":"Tue Feb 19 11:16:34 +0000 2013",
"id":303825398179979265,
"id_str":"303825398179979265",
"text":"Unleashing Innovation Conference Kicks Off - Wall Street Journal (India) http:\/\/t.co\/3bkXJBz1",
"source":"\u003ca href=\"http:\/\/dlvr.it\" rel=\"nofollow\"\u003edlvr.it\u003c\/a\u003e",
"truncated":false,
"in_reply_to_status_id":null,
"in_reply_to_status_id_str":null,
"in_reply_to_user_id":null,
"in_reply_to_user_id_str":null,
"in_reply_to_screen_name":null,
"user":{
"id":948385189,
"id_str":"948385189",
"name":"Innovation Plaza",
"screen_name":"InnovationPlaza",
"location":"",
"url":"http:\/\/tinyurl.com\/ee4jiralp",
"description":"All the latest breaking news about Innovation",
"protected":false,
"followers_count":136,
"friends_count":1489,
"listed_count":1,
"created_at":"Wed Nov 14 19:49:18 +0000 2012",
"favourites_count":0,
"utc_offset":28800,
"time_zone":"Beijing",
"geo_enabled":false,
"verified":false,
"statuses_count":149,
"lang":"en",
"contributors_enabled":false,
"is_translator":false,
"profile_background_color":"131516",
"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg",
"profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/781710342\/17a75bf22d9fdad38eebc1c0cd441527.jpeg",
"profile_background_tile":true,
"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg",
"profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3205718892\/8126617ac6b7a0e80fe219327c573852_normal.jpeg",
"profile_link_color":"009999",
"profile_sidebar_border_color":"FFFFFF",
"profile_sidebar_fill_color":"EFEFEF",
"profile_text_color":"333333",
"profile_use_background_image":true,
"default_profile":false,
"default_profile_image":false,
"following":null,
"follow_request_sent":null,
"notifications":null
},
"geo":null,
"coordinates":null,
"place":null,
"contributors":null,
"retweet_count":0,
"entities":{
"hashtags":[
],
"urls":[
{
"url":"http:\/\/t.co\/3bkXJBz1",
"expanded_url":"http:\/\/dlvr.it\/2yyG5C",
"display_url":"dlvr.it\/2yyG5C",
"indices":[
73,
93
]
}
],
"user_mentions":[
]
},
"favorited":false,
"retweeted":false,
"possibly_sensitive":false
}
If you are 100% sure that the ID is unique, you could use it as your row key to store the bulk of the data:
303825398179979265 -> data_CF
Your column family data_CF would be defined along these lines:
"created_at":"Tue Feb 19 11:16:34 +0000 2013"
"id_str":"303825398179979265"
...
"user_id":948385189 { take note here I'm denormalizing your dictionary }
"user_name":"Innovation Plaza"
It gets a little trickier for the lists. The solution is to put something that will make it unique prefixed by the category:
"entities_hashtags_":"\x00" { Here \x00 is a dummy value }
For the URL, if the ordering is not important, you may prefix it with a UUID. It will guarantee that it is unique.
The advantage with this approach is if you need to update a field in this data, it will be done atomically since HBase guarantees row atomicity.
For the second question, if you need instantaneous aggregated information, you will have to precompute it and store it, as you said, in another table. If you want this data to be generated through M/R, you may use timestamp + row id as the key if it is to be time-based; by topic it would be something like topic + row id. This lets you write prefix scans with start/stop rows that include only the time range or the topic you are interested in.
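Purely as an illustration of that key layout (plain Ruby, no HBase client involved; the topic value is a placeholder), the keys for such precomputed tables could be built like this so that prefix scans stay contiguous:
require 'time'

tweet_id   = '303825398179979265'
created_at = Time.parse('Tue Feb 19 11:16:34 +0000 2013')

# Zero-padded epoch seconds keep lexicographic order equal to time order.
time_key  = format('%010d_%s', created_at.to_i, tweet_id)  # => "1361272594_303825398179979265"
topic_key = format('%s_%s', 'innovation', tweet_id)        # topic + row id (placeholder topic)

puts time_key
puts topic_key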
Have fun!
My application is trying to match an incoming string against documents in my Mongo Database where a field has a list of keywords. The goal is to see if the keywords are present in the string.
Here's an example:
Incoming string:
"John Doe is from Florida and is a fan of American Express"
the field for the documents in the MongoDB has a value such as:
in_words: "georgia,american express"
So the database record has in_words (keywords) separated by commas, and some of them are two or more words.
Currently, my RoR application pulls the documents, splits each one's in_words with split(','), then loops through each keyword and checks whether it is present in the string.
I really want to find a way to push this type of search down into the actual database query in order to speed up the processing. I could change in_words in the database to an array, such as:
in_words: ["georgia", "american express"]
but I'm still not sure how to query this?
To sum up, my goal is to find the person that matches an incoming string by comparing that person's list of in_words/keywords against the incoming string, and to do the query entirely in the database layer.
Thanks in advance for your suggestions
You should definitely split the in_words into an array as a first step.
Your query is still a tricky one.
Next consider using a $regex query against that array field.
Constructing the regex will be a bit hard, since you want to match any single word from your input string, or, it appears, any pair of words (how many words?). You may get some further ideas for how to construct a suitable regex from my blog entry here, where I am matching a substring of the input string against the database (the inverse of a normal LIKE operation).
You can solve this by splitting the long string into separate tokens and putting them into a separate array, then using an $all query to find the matching keywords efficiently.
Check out this sample:
> db.splitter.insert({tags:'John Doe is from Florida and is a fan of American Express'.split(' ')})
> db.splitter.insert({tags:'John Doe is a super man'.split(' ')})
> db.splitter.insert({tags:'John cena is a dummy'.split(' ')})
> db.splitter.insert({tags:'the rock rocks'.split(' ')})
and when you query
> db.splitter.find({tags:{$all:['John','Doe']}})
it would return
> db.splitter.find({tags:{$all:['John','Doe']}})
{ "_id" : ObjectId("4f9435fa3dd9f18b05e6e330"), "tags" : [ "John", "Doe", "is", "from", "Florida", "and", "is", "a", "fan", "of", "American", "Express" ] }
{ "_id" : ObjectId("4f9436083dd9f18b05e6e331"), "tags" : [ "John", "Doe", "is", "a", "super", "man" ] }
And remember, this operation is case-sensitive.
If you are looking for a partial match, use $in instead of $all.
Also, you probably need to remove the noise words ('a', 'the', 'is', ...) before inserting, to get accurate results.
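From the RoR side, assuming in_words is stored as an array as suggested above, a rough sketch with the Ruby mongo driver (the connection details, collection name and stop-word list are placeholders) could be:
require 'mongo'

client     = Mongo::Client.new(['127.0.0.1:27017'], database: 'mydb')
collection = client[:people]

incoming   = 'John Doe is from Florida and is a fan of American Express'
stop_words = %w[a an the is of and from]
tokens     = incoming.downcase.split(/\W+/) - stop_words

# Match documents whose in_words array shares at least one token with the
# incoming string (multi-word keywords would need to be tokenized the same way).
collection.find(in_words: { '$in' => tokens }).each { |doc| puts doc['_id'] }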
I hope it is clear