I currently have an NLP project in which I am trying to use NLTK to recognize a PERSON name. The problem, however, is more challenging than just finding the part of speech.
"input = "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."
So, the challenge is: I just want to get the attorney's name as the return value from the whole document, not the other person's; that is, "John Smith", type: PERSON, occupation: attorney. The return could look like this, or just be "John Smith":
{
"name": "John Smith",
"type": "PERSON",
"occupation": "attorney"
}
I have tried NLTK part-of-speech tagging, as well as the Google Cloud Natural Language API, but they only helped me detect the PERSON names. How can I detect whether a person is an attorney? Please guide me to the right approach. Do I have to train my own model or corpus to detect "attorney"? I have thousands of court documents as txt files.
The thing with pre-trained Machine Learning models is that there is not much room for flexibility in what you want to achieve. Tools such as Google Cloud Natural Language offer some really interesting functionalities, but you cannot make them do other work for you. In such a case, you would need to train your own models, or try a different approach using tools such as TensorFlow, which require a high level of expertise in order to obtain decent results.
However, regarding your precise use case, you can use the analyzeEntities method to find named entities (common nouns and proper names). It turns out that, if the word attorney is next to the name of the person who is actually an attorney (as in "I am John, and my attorney James is working on my case." or your example "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."), it will bind those two entities together.
You can test that using the API Explorer, and you will see that for the following request:
{
"document": {
"content": "I am John, and my attorney James is working on my case.",
"type": "PLAIN_TEXT"
},
"encodingType": "UTF8"
}
Some of the resulting entities are:
{
"name": "James",
"type": "PERSON",
"metadata": {
},
"salience": 0.5714066,
"mentions": [
{
"text": {
"content": "attorney",
"beginOffset": 18
},
"type": "COMMON"
},
{
"text": {
"content": "James",
"beginOffset": 27
},
"type": "PROPER"
}
]
},
{
"name": "John",
"type": "PERSON",
"metadata": {
},
"salience": 0.23953272,
"mentions": [
{
"text": {
"content": "John",
"beginOffset": 5
},
"type": "PROPER"
}
]
}
In this case, you will be able to parse the JSON response and see that James is (correctly) connected to the attorney noun, while John is not. However, as per some tests I have been running, this behavior seems to be only reproducible if the word attorney is next to one of the names you are trying to identify.
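Parsing that response is then a matter of picking out the PERSON entities whose mentions include the common noun attorney. Here is a minimal sketch in JavaScript, assuming the analyzeEntities result has been parsed into an object with an entities array (as in the output above):

// Minimal sketch: find PERSON entities that the API bound to the common
// noun "attorney". `response` is the parsed analyzeEntities result.
function findAttorneys(response) {
  return response.entities
    .filter(function (entity) {
      if (entity.type !== 'PERSON') return false;
      // The binding shows up as a COMMON mention ("attorney") listed
      // alongside the PROPER mention (the name itself).
      return entity.mentions.some(function (mention) {
        return mention.type === 'COMMON' &&
               mention.text.content.toLowerCase() === 'attorney';
      });
    })
    .map(function (entity) {
      return { name: entity.name, type: entity.type, occupation: 'attorney' };
    });
}

// findAttorneys(parsedResponse) -> [{ name: "James", type: "PERSON", occupation: "attorney" }]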
I hope this can be of help to you, but if your needs are more complex, you will not be able to meet them with an out-of-the-box solution such as the Natural Language API.
In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.
My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have the entity of Invoice that I collect from several suppliers. So I have two different schemas for that entity.
So I might end up with 2 docs in Couch that look like this:
{
"type": "Invoice",
"subType": "supplier B",
"total": 22.5,
"date": "10 Jan 2017",
"customerName": "me"
}
{
"type": "Invoice",
"subType": "supplier A",
"InvoiceTotal": 10.2,
"OrderDate": <some other date format>,
"customerName": "me"
}
I also have a doc like this:
{
"type": "Customer",
"name": "me",
"details": "etc..."
}
My intention then is to 'flatten' the Invoice entities, and then join on the reduce function. So, the map function looks like this:
function(doc) {
switch(doc.type) {
case 'Customer':
emit(doc.name, { name: doc.name, details: doc.details, type: "Customer" });
break;
case 'Invoice':
switch (doc.subType) {
case 'supplier B':
emit (doc.customerName, { total: doc.total, date: doc.date, type: "Invoice"});
break;
case 'supplier A':
emit (doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice"});
break;
}
break;
}
}
Then I would use the reduce function to compare docs with the same customerName (i.e. a join).
Is this advisable using CouchDB? If not, why?
First of all, apologies for getting back to you late; I thought I'd look at it directly, but I haven't been on SO since we exchanged the other day.
Reduce functions should only be used to reduce scalar values, not to aggregate data. So you wouldn't use them to achieve things such as doing joins or removing duplicates, but you would, for example, use them to compute the number of invoices per customer - you see the idea. The reason is that you can only make weak assumptions with regard to the calls made to your reduce functions (the order in which records are passed, the rereduce parameter, etc.), so you can easily end up with serious performance problems.
But this is by design since the intended usage of reduce functions is to reduce scalar values. An easy way to think about it is to say that no filtering should ever happen in a reduce function, filtering and things such as checking keys should be done in map.
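For instance, counting invoices per customer only needs a map that emits one row per invoice plus the built-in _count reduce; a sketch, using the field names from the documents above:

// Map: one row per invoice, keyed by customer.
function (doc) {
  if (doc.type === 'Invoice') {
    emit(doc.customerName, null);
  }
}
// Reduce: _count
// Querying the view with ?group=true then returns one row per customer
// holding the invoice count.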
If you just want to compare docs with the same customer name, you do not need a reduce function at all; you can query your view with the following parameters:
startkey=["customerName"]
endkey=["customerName", {}]
Otherwise you may want to create a separate view to filter on customers first, and return their names and then use these names to query your view in a bulk manner using the keys view parameter. Startkey/endkey is good if you only want to filter one customer at a time, and/or need to match complex keys in a partial way.
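As a sketch of that bulk approach (CouchDB also accepts the keys parameter as a POST body; the database and view names here are hypothetical):

fetch('http://localhost:5984/mydb/_design/app/_view/by_customer', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ keys: ['customerA', 'customerB'] })
})
  .then(function (res) { return res.json(); })
  .then(function (body) {
    // One row per emitted entry whose key matched a requested key.
    console.log(body.rows);
  });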
If what you are after are the numbers, you may want to do:
if(doc.type == "Invoice") {
emit([doc.customerName, doc.supplierName, doc.date], doc.amount)
}
And then use the _stats built-in reduce function to get statistics on the amount (sum, min, max, etc.).
Then, to get the amount spent with a supplier, you'd just need to make a reduce query to your view and use the parameter group_level=2 to aggregate by the first 2 elements of the key. You can combine this with startkey and endkey to filter on specific values of this key:
startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]
You can then build from this example to do things such as:
if(doc.type == "Invoice") {
emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
}
Hope this helps
It is totally OK to "normalize" your different schemas (or subTypes) via a view. You cannot create views based on those normalized schemas, though, and in the long run, it might be hard to manage the different schemas.
The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store the documents in their original form. This would make working with the data much easier:
{
"type": "Invoice",
"total": 22.5,
"date": "2017-01-10T00:00:00.000Z",
"customerName": "me",
"original": {
"supplier": "supplier B",
"total": 22.5,
"date": "10 Jan 2017",
"customerName": "me"
}
},
{
"type": "Invoice",
"total": 10.2,
"date": "2017-01-12T00:00:00:00.000Z,
"customerName": "me",
"original": {
"subType": "supplier A",
"InvoiceTotal": 10.2,
"OrderDate": <some other date format>,
"customerName": "me"
}
}
I'd also convert the date to ISO format because it parses well with new Date(), sorts correctly, and is human-readable. You can easily emit invoices grouped by year, month, day, and so on with that.
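For example, a view keyed on the date parts lets you aggregate per year, month, or day simply by varying group_level; a sketch using the normalized fields above:

function (doc) {
  if (doc.type === 'Invoice') {
    var d = new Date(doc.date);
    // [year, month, day] keys: group_level=1 aggregates per year,
    // group_level=2 per month, group_level=3 per day.
    emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()], doc.total);
  }
}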
Use reduce preferably only with built-in functions, because reduces have to be re-executed on queries, and executing JavaScript on many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB documentation. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.
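A minimal sketch of that preprocessing step, with field names following the supplier A example above (how you parse OrderDate depends on that supplier's actual date format):

function normalizeSupplierA(doc) {
  return {
    type: 'Invoice',
    total: doc.InvoiceTotal,
    // Assumes OrderDate is something new Date() can parse; adjust as needed.
    date: new Date(doc.OrderDate).toISOString(),
    customerName: doc.customerName,
    original: doc
  };
}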
Let's say that I'm making a Cloudant database to store all the service records for my fleet of cars (I'm not, but the problem is pretty much the same.) To do this, I have two types of records:
Cars:
{
"type": "Car",
"_id": "VIN 1",
"plateNumber": "ecto-1",
"plateState": "NY",
"make": "Cadillac",
"model": "Professional Chassis",
"year": 1959
}
{
"type": "Car",
"_id": "VIN 2",
"plateNumber": "mntclmbr",
"plateState": "VT",
"make": "Jeep",
"model": "Wrangler",
"year": 2016
}
And service records:
{
"type": "ServiceRecord",
"_id": "service1",
"carServiced": "VIN 1",
"date": [1984, 6, 8],
"item": "Cleaning (Goo)",
"cost": 300
}
{
"type": "ServiceRecord",
"_id": "service2",
"carServiced": "VIN 1",
"date": [1984, 6, 9],
"item": "Cleaning (Marshmellow)",
"cost": 800
}
{
"type": "ServiceRecord",
"_id": "service3",
"carServiced": "VIN 2",
"date": [2016, 4, 2],
"item": "Alignment",
"cost": 150
}
There are a couple of things to note about how this works:
The VIN of a car will never change, and it is used as the document _id.
The service records for a car should not be lost if the car is registered in a new state, or with a new plate number.
Due to the volume of cars, and how often they need repairs, it's not reasonable to edit a car's document if a service record needs to be added, removed, or changed.
Currently, I have a couple of views to look up information.
First, I've got a map from license plate to VIN:
function(doc){
if (doc.type == "Car"){
emit([doc.plateState, doc.plateNumber], doc._id);
}
}
// Results in:
["NY", "ecto-1"] -> "VIN 1"
["VT", "mntclmbr"] -> "VIN 2"
Second, I've got a map from all the cars' VINs to the service records:
function(doc){
if (doc.type == "ServiceRecord"){
emit(doc.carServiced, doc);
}
}
// Results in:
"VIN 1" -> {"_id": "service1", ...}
"VIN 1" -> {"_id": "service2", ...}
"VIN 2" -> {"_id": "service3", ...}
Finally, I've got a map from all the cars' VINs and service dates to the specific service that happened on that date:
function(doc){
if (doc.type == "ServiceRecord"){
var key = [doc.carServiced, doc.date[0], doc.date[1], doc.date[2]];
emit(key, doc);
}
}
// Results in:
["VIN 1", 1984, 6, 8] -> {"_id": "service1", ...}
["VIN 1", 1984, 6, 9] -> {"_id": "service2", ...}
["VIN 2", 2016, 4, 2] -> {"_id": "service3", ...}
With these three maps, I can find three different things:
The VIN of any car by its license plate.
The service records of any car by its VIN.
The service records of any car by its VIN for any particular year, month, or day.
However, I can't find all the service records of a car by its license plate. (At least not in one step.) To do that, I would need a map like this:
["NY", "ecto-1"] -> {"_id": "service1", ...}
["NY", "ecto-1"] -> {"_id": "service2", ...}
["VT", "mntclmbr"] -> {"_id": "service3", ...}
And to make it even MORE complicated, I'd like to be able to look up service records by license plate AND date, with a map like this:
["NY", "ecto-1", 1984, 6, 8] -> {"_id": "service1", ...}
["NY", "ecto-1", 1984, 6, 9] -> {"_id": "service2", ...}
["VT", "mntclmbr", 2016, 4, 2] -> {"_id": "service3", ...}
Unfortunately, I don't know how to generate maps like these because the key requires information from two documents. I can only get plate information from Car documents and I can only get service information (including the document _id for the value of emit) from ServiceRecord documents.
So far, my only thought is to do two queries: one to get the VIN from the plate info, and another to get the service records from the VIN. They'll be fast queries, so it's not a huge problem, but I feel like there's a better way.
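Spelled out, that two-query approach would look something like this sketch (the view names are hypothetical stand-ins for the two maps above):

function recordsByPlate(state, plate) {
  var base = 'https://myaccount.cloudant.com/fleet/_design/app/_view/';
  // Query 1: plate -> VIN.
  var plateKey = encodeURIComponent(JSON.stringify([state, plate]));
  return fetch(base + 'vin_by_plate?key=' + plateKey)
    .then(function (res) { return res.json(); })
    .then(function (body) {
      var vin = body.rows[0].value;
      // Query 2: VIN -> service records.
      return fetch(base + 'services_by_vin?key=' + encodeURIComponent(JSON.stringify(vin)));
    })
    .then(function (res) { return res.json(); })
    .then(function (body) {
      return body.rows.map(function (row) { return row.value; });
    });
}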
Anyone know what that better way might be?
(Bonus: The two-query method does not allow for finding all service records by state in an efficient way. The last map I describe would be able to do that. So bonus internet-points for anyone who can describe a solution that provides that functionality as well.)
Edit: Another question was suggested as a possible duplicate. It is definitely a similar problem; however, the solutions provided do not solve this issue. Specifically, the top solution suggests storing a document's position within the tree. In this case, that would be something like "index": [State, Number, Year, Month, Day] in a ServiceRecord document. However, we can't do that because the plate information can easily change.
Hopefully you are still around. The gist of the answer is: in CouchDB, when you feel a need to do joins, you are 99% of the time doing something wrong. What you need to do is have all the information you need in one document.
You need to get into the habit of thinking about how you are going to query your data when you design what to save. You will find that replacing the "relational normalization" habit with this habit is healthy.
What you can do here is save the license plate number in the service record document. Don't be afraid to denormalize. A service record would therefore look like this:
{
"type": "ServiceRecord",
"_id": "service3",
"carServiced": "VIN 2",
"carPlateNumber": "mntclmbr",
"date": [2016, 4, 2],
"item": "Alignment",
"cost": 150
}
And you can easily do whatever you want from here. That being said, the architect in me can smell that you are likely to invent new ways to query this data every month. For this reason, I'd personally prefer to store the whole car document within the service record:
{
"type": "ServiceRecord",
"_id": "service3",
"carServiced": {
"type": "Car",
"_id": "VIN 2",
"plateNumber": "mntclmbr",
"plateState": "VT",
"make": "Jeep",
"model": "Wrangler",
"year": 2016
},
"date": [2016, 4, 2],
"item": "Alignment",
"cost": 150
}
This is absolutely fine. Especially since a service record is a snapshot in time and you do not need to worry about updating the information. I actually find that this is one of the scenarios where CouchDb particularly shines as storing a snapshot basically is a free lunch (as opposed to managing a cars_snapshot table in a relational system). And we tend to forget it but very often (especially as far as sales are concerned), we are interested in snapshots, not up-to-date relational data (what was the customer name at the time he bought, what was the tax rate at the time he bought, etc.). But relational systems put us in the "most up to date by default" habit because snapshot management involves a significant overhead there.
The bottom line is that this kind of denormalization is absolutely fine in CouchDB. You are within the intended usage and won't be bitten down the road. As CouchDB puts it: just relax ;)
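For what it's worth, with the car document embedded as above, the plate-and-date map from the question becomes a single view over ServiceRecord documents alone; a sketch:

function (doc) {
  if (doc.type == "ServiceRecord") {
    var car = doc.carServiced;
    // [state, plate, year, month, day] -> the full service record.
    emit([car.plateState, car.plateNumber,
          doc.date[0], doc.date[1], doc.date[2]], doc);
  }
}

This also covers the bonus: querying with startkey=["NY"] and endkey=["NY", {}] returns all service records for a state in one request.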
It sounds like chained mapreduce could provide your solution?
https://examples.cloudant.com/sales/_design/sales/index.html
I'm currently using the example data on console.neo4j.org to write a query that outputs hierarchical JSON.
The example data is created with
create (Neo:Crew {name:'Neo'}), (Morpheus:Crew {name: 'Morpheus'}), (Trinity:Crew {name: 'Trinity'}), (Cypher:Crew:Matrix {name: 'Cypher'}), (Smith:Matrix {name: 'Agent Smith'}), (Architect:Matrix {name:'The Architect'}),
(Neo)-[:KNOWS]->(Morpheus), (Neo)-[:LOVES]->(Trinity), (Morpheus)-[:KNOWS]->(Trinity),
(Morpheus)-[:KNOWS]->(Cypher), (Cypher)-[:KNOWS]->(Smith), (Smith)-[:CODED_BY]->(Architect)
The ideal output is as follows
name:"Neo"
children: [
{
name: "Morpheus",
children: [
{name: "Trinity", children: []}
{name: "Cypher", children: [
{name: "Agent Smith", children: []}
]}
]
}
]
}
Right now, I'm using the following query
MATCH p =(:Crew { name: "Neo" })-[q:KNOWS*0..]-m
RETURN extract(n IN nodes(p)| n)
and getting this
[(0:Crew {name:"Neo"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"}), (2:Crew {name:"Trinity"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"}), (3:Crew:Matrix {name:"Cypher"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"}), (3:Crew:Matrix {name:"Cypher"}), (4:Matrix {name:"Agent Smith"})]
Any tips to figure this out? Thanks
In neo4j 3.x, after you install the APOC plugin on the neo4j server, you can call the apoc.convert.toTree procedure to generate similar results.
For example:
MATCH p=(n:Crew {name:'Neo'})-[:KNOWS*]->(m)
WITH COLLECT(p) AS ps
CALL apoc.convert.toTree(ps) yield value
RETURN value;
... would return a result row that looks like this:
{
"_id": 127,
"_type": "Crew",
"name": "Neo",
"knows": [
{
"_id": 128,
"_type": "Crew",
"name": "Morpheus",
"knows": [
{
"_id": 129,
"_type": "Crew",
"name": "Trinity"
},
{
"_id": 130,
"_type": "Crew:Matrix",
"name": "Cypher",
"knows": [
{
"_id": 131,
"_type": "Matrix",
"name": "Agent Smith"
}
]
}
]
}
]
}
This was such a useful thread on this important topic, I thought I'd add a few thoughts after digging into this a bit further.
First off, using the APOC "toTree" proc has some limits, or better said, dependencies. It really matters how "tree-like" your architecture is. For example, the LOVES relationship is missing in the APOC call above, and I understand why: that relationship is hard to include when using "toTree". That simple addition is a bit like adding an attribute in a hierarchy, but as a relationship; not bad to do, but it confounds the simple KNOWS tree. The point being, a good question to ask is "how do I handle such challenges?" This reply is about that.
I do recommend upping one's JSON skills, as this will give you much more granular control. Personally, I found my initial exploration somewhat painful (might be because I'm an XML person :), but once you figure out all the [, {, and ('s, it is really a powerful way to efficiently pull what's best described as a report on your data. And given that the JSON is something that can easily become a class, it allows for a nice way to push that back to your app.
I have found performance to also be a challenge with "toTree" vs. just asking for the JSON. I've added below a very simplistic look into what your RETURN could look like, following the BNF-like format below. I'd love to see this more maturely created, as the possibilities are quite varied, but this was something I'd have found useful, thus I'll post this immature version for now. As they say, "a deeper dive is left up to the readers" 😊
I've obfuscated the values, but this is an actual query on what I'll term a very poor example of a graph architecture, whose many design "mistakes" cause some significant performance headaches when trying to access a holistic report on the graph. As in this example, the initial report query I inherited took many minutes on a server and could not run on my laptop; using this strategy, the updated query now runs in about 5 seconds on my rather wimpy laptop, on a db of about 200K nodes and 0.5M relationships. I added the "persons" grouping alias as a reminder that "persons" will be different in each array element, but the parent construct will be repeated over and over again. Where you put that in your hand-grown tree will matter, but having the ability to do that is powerful.
The bottom line is that a mature use of JSON in the RETURN statement gives you powerful control over the results of a Cypher query.
RETURN STATEMENT CONTENT:
<cypher_alias>
{.<cypher_alias_attribute>,
...,
<grouping_alias>:
(<cypher_alias>
{.<cypher_alias_attribute>,
...
}
)
...
}
MATCH (j:J{uuid:'abcdef'})-[:J_S]->(s:S)<-[:N_S]-(n:N)-[:N_I]->(i:I), (j)-[:J_A]->(a:P)
WHERE i.title IN ['title1', 'title2']
WITH a,j, s, i, collect(n.description) as desc
RETURN j{.title,persons:(a{.email,.name}), s_i_note:
(s{.title, i_notes:(i{.title,desc})})}
If you know how deep your tree is, you can write something like this:
MATCH p =(:Crew { name: "Neo" })-[q:KNOWS*0..]-(m)
WITH nodes(p)[0] AS a, nodes(p)[1] AS b, nodes(p)[2] AS c, nodes(p)[3] AS d, nodes(p)[4] AS e
WITH (a{.name}) AS ab, (b{.name}) AS bb, (c{.name}) AS cb, (d{.name}) AS db, (e{.name}) AS eb
WITH ab, bb, cb, db{.*,children:COLLECT(eb)} AS ra
WITH ab, bb, cb{.*,children:COLLECT(ra)} AS rb
WITH ab, bb{.*,children:COLLECT(rb)} AS rc
WITH ab{.*,children:COLLECT(rc)} AS rd
RETURN rd
Line 1 is your query. You save all paths from Neo to m in p.
In line 2 p is split into a, b, c, d and e.
Line 3 takes just the names of the nodes. If you want all properties, you can write (a{.*}) AS ab. This step is optional; you can also work with the nodes directly if you want to.
In line 4 you replace db and eb with a map containing all properties of db and the new property children containing all entries of eb for the same db.
Lines 5, 6 and 7 are basically the same. You reduce the result list by grouping.
Finally you return the tree. It looks like this:
{
"name": "Neo",
"children": [
{
"name": "Morpheus",
"children": [
{"name": "Trinity", "children": []},
{"name": "Cypher","children": [
{"name": "Agent Smith","children": []}
]
}
]
}
]
}
Unfortunately, this solution only works when you know how deep your tree is, and you have to add a row whenever your tree is one level deeper.
If someone has an idea how to solve this with dynamic tree depth, please comment.
I am building a site with a database of users. I am using arbor.js to build a graph for each user. The graph is a tree-like structure with edges and nodes that looks something like this (I had an image ready to go but apparently don't have enough reputation yet):
vehicle
/ \
/ \
car truck
/
/
sedan
and is represented by the following JSON:
{
"nodes":{
"vehicle":{
"color":"black",
"label":"vehicle"
},
"car":{
"color":"orange",
"label":"car"
},
"truck":{
"color":"red",
"label":"truck"
},
"sedan":{
"color":"red",
"label":"sedan"
}
},
"edges":{
"vehicle":{
"car":{
"weight":5,
"directed":true,
"color":"orange"
},
"truck":{
"weight":5,
"directed":true,
"color":"red"
}
},
"car":{
"sedan":{
"weight":2,
"directed":true,
"color":"orange"
}
}
}
}
Each graph will always have a nodes and edges object with dynamic nodes and edges. Their respective attributes (color, label, weight etc.) will be fixed.
I am trying to figure out how best to model this data for each user. I am using Rails with MongoDB (Mongoid), because I understand that MongoDB can save objects as documents in the database. I'm pretty sure each user will have a graph model which I can define, but beyond that I'm not sure how to handle the nodes and edges.
I'm guessing the solution will involve has_many, embeds_many, or possibly serialize, but I'm unclear on how to use these with a mix of fixed and dynamic data.
Also, it would be nice to retrieve the data exactly the way it looks above so I can easily create the graph when loading it from disk.
Any help would be appreciated!
If all you need is to perform graph operations per user, you can follow this model.
{
"nodes": [{"type": "vehicle", "color":"black", "label": "vehicle"},
{"type": "car", "color":"orange", "label": "car"},
{"type":"truck", "color":"red", "label":"truck"},
{"type": "sedan", "color":"red", "label":"sedan"}
],
"edges": {
"vehicle": [
{"type": "car", "weight": 5, "color": "orange"},
{"type": "truck", "weight": 5, "color": "red"}
],
"car": [
{"type": "sedan", "weight": 2, "color": "orange"}
],
"sedan": [],
"truck":: []
}
It is like you are storing a multimap for the edges. It is also self-evident from the structure whether an edge is bidirectional or not. For an individual user's graph to be processed independently, it is a pretty natural model to go with.
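As a small illustration of why this shape is convenient (plain JavaScript; graph stands for one user's graph document as loaded from the database):

function childrenOf(graph, nodeType) {
  // Outgoing edges are a direct key lookup in the edges multimap.
  return (graph.edges[nodeType] || []).map(function (edge) {
    return edge.type;
  });
}

// childrenOf(graph, 'vehicle') -> ['car', 'truck']
// childrenOf(graph, 'car')     -> ['sedan']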
Tell me if it meets your requirements. Also, until you specify what kind of queries you want to perform over your collection, it's not possible to suggest a definitive model.
Also, if you are just starting your project, you can explore some graph databases as well, like Neo4j.
In my project I need to aggregate a lot of data into one string and later parse it out.
The data is related to people; it needs to record people_ids in different states and age groups, along with their counts.
For example, we have 5 people named John Smith in CA: 2 people between 20-29, 2 between 30-39, and 1 between 40-49; and 2 people named John Smith in NY: 1 between 20-29 and 1 between 30-39. Then the string would look something like this:
John Smith| [CA#5: 20-29#2{pid_1, pid_2};30-39#2{pid_3,pid_4};40-49#1{pid_5}] [NY#2: 20-29#1{pid_6};30-39#1{pid_7}]
It does not necessarily have to be this exact format; whatever format is easy to parse out would work. Is there any good way to do this? How about JSON?
And if it looks like the above format and I want all the John Smiths in CA between ages 30-39, how should I parse out the data?
Thanks a lot!!
From my understanding of your post, this might be a format you're looking for (as represented in JSON).
Keep in mind that there are gems that can generate and parse JSON for you.
{
"name": "John Smith",
"states": {
"CA": {
"total": 5,
"ages": {
"20-29": [pid_1, pid_2],
"30-39": [pid_3, pid_4],
"40-49": [pid_5]
}
},
"NY": {
"total": 2,
"ages": {
"20-29": [pid_6],
"30-39": [pid_7]
}
}
}
}
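With this structure, the "all John Smiths in CA between 30-39" lookup from the question is just a couple of key accesses after parsing; a JavaScript sketch (the same shape works with Ruby's JSON.parse):

var record = JSON.parse(jsonString); // jsonString holds the document above
if (record.name === 'John Smith') {
  var pids = record.states['CA'].ages['30-39']; // ["pid_3", "pid_4"]
}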