I'm training my LUIS app with utterances, and I've added the prebuilt entity "number". I've noticed when you're labeling utterances for training, you can remove custom labels that are automatically guessed but you can't remove prebuilt entity labels. For example, if I have this utterance:
I can click on Person (a custom entity I created) and remove that label, but I can't do the same for number (the prebuilt entity). I'd like to do it for this utterance because the number should actually be "point five" (i.e. "0.5"), and thus I want to label it with another entity that indicates it's a number.
Is this possible?
Unfortunately, prebuilt entity labels cannot be removed, because users have minimal control over them. If you are not satisfied with the quality of the existing prebuilt entity, an alternative is to build your own simple entity or regular expression entity to replace it.
I am trying to extract previous job titles from a CV using spaCy and named entity recognition.
I would like to train spaCy to detect a custom named entity type: 'JOB'. For that I have around 800 job title names from https://www.careerbuilder.com/browse/titles/ that I can use as training data.
In my training data for spaCy, do I need to embed these job titles in sentences that provide context, or not?
In general, in a CV the job title kind of stands on its own and is not really part of a full sentence.
Also, if I need to provide coherent context for each of the 800 titles, it will be too time-consuming for what I'm trying to do, so maybe there are other solutions than NER?
Generally, Named Entity Recognition relies on the context of words; otherwise the model would not be able to detect entities in previously unseen words. Consequently, a bare list of titles would not help you train a model. You could instead run string matching to find any of those 800 titles in CV documents, and you would even be guaranteed to find every occurrence of them, though no titles outside the list.
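The string-matching route could look roughly like the sketch below. It uses only Python's standard `re` module; the helper names `build_title_matcher` and `find_titles`, the three sample titles, and the CV snippet are made up for illustration and are not from the question.

```python
import re

def build_title_matcher(titles):
    """Compile one case-insensitive pattern that matches any known title as a whole phrase."""
    # Try longer titles first so "Senior Software Engineer" wins over "Software Engineer".
    ordered = sorted(set(titles), key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # The lookarounds stop a title from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def find_titles(text, matcher):
    """Return (start, end, matched_text) for every known title found in text."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]

# Hypothetical sample; the real list would hold the ~800 scraped titles.
titles = ["Software Engineer", "Senior Software Engineer", "Data Analyst"]
matcher = build_title_matcher(titles)

cv_text = "2015-2018  Senior Software Engineer, Acme Corp\n2012-2015  Data analyst, Initech"
matches = find_titles(cv_text, matcher)
```

The character offsets returned by `find_titles` are kept on purpose: spans of this shape are what most NER training formats expect, so the same matcher could also be used to pre-label real CVs.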
If you could find 800 (or fewer) real CVs and replace the job names with those in your list (or others!), then you would be all set to train a model capable of NER. This would be the way to go, I suppose. Just download as many freely available CVs from the web as you can and see where this gets you. If it is not enough data, you can augment it, for example by exchanging the job titles in the data with some of the titles in your list.
I have two entities created in LUIS. One entity to identify the AlphaNumeric word and another one to identify a word with a pattern. Both entities are created using a regular expression.
To identify alphanumeric I used - \w+\d+ regular expression.
To identify the word with a pattern I used - ^venid\d+ (words like venid12345, venid32310...)
These two entities are mapped to two different intents. But no matter how much I train the LUIS app, only the first entity is recognized. How can I overcome this?
Add the regex entities from the Entities tab and train the app, then add utterances from the Intents tab for the respective intent. This should enable the model to pick up the regex entities irrespective of the mapping. Here is a screenshot of both these entities getting recognized for some alphanumeric patterns.
numericalpha is the first pattern and vendor is the second pattern that I have added in the LUIS app.
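As a side check, the two patterns from the question can be compared outside LUIS. The sketch below uses Python's `re` module, whose behaviour may differ from the regex engine LUIS uses, so treat it only as an illustration of how the patterns overlap. The utterances and the relaxed `\bvenid\d+\b` variant are assumptions, not part of the original answer.

```python
import re

ALPHANUMERIC = re.compile(r"\w+\d+")   # first entity's pattern, as given in the question
VENDOR_ID = re.compile(r"^venid\d+")   # second entity's pattern, as given in the question

def describe(utterance):
    """Return what each pattern matches in the utterance (None if no match)."""
    a = ALPHANUMERIC.search(utterance)
    v = VENDOR_ID.search(utterance)
    return (a.group(0) if a else None, v.group(0) if v else None)

# Both patterns match when the ID is the whole utterance.
standalone = describe("venid12345")

# With text in front of the ID, the "^" anchor stops the second pattern from matching.
mid_sentence = describe("check status of venid12345")

# Hypothetical variant: word boundaries instead of "^" match the ID anywhere.
relaxed = re.search(r"\bvenid\d+\b", "check status of venid12345")
```

Under these semantics, every string matched by `^venid\d+` is also matched by `\w+\d+`, and the `^` anchor only allows a match at the very start of the input. If LUIS behaves similarly, that would be consistent with only the first entity showing up for utterances where the ID is not the first word.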
I am developing a Neo4j database that will contain genomic and clinical data for cancer patients. A common design issue in developing graph databases is whether a data item should be represented by a Node or by a property within a Node. In my case, patients will have hundreds of clinical and demographic measurements (e.g. sex, medications, tumor size). Some of these data will be constant (e.g. sex) while others will be subject to variation with each patient visit. The responses I've seen to previous node vs property questions have recommended using the anticipated queries against the data to make the decision. I think I can identify some properties that will be common search criteria and should be nodes (e.g. smoking history, sex, cancer type) but that still leaves me with hundreds of other properties. Is there a practical limit in Neo4j for the number of properties that a Node should contain? Also, a hybrid approach, where some data are properties and others are Nodes would seem to make both loading data from source files and subsequent queries more complicated.
The main idea behind "look at your queries to decide" is that how data items relate to each other affects whether a node or a property is better. Really, the main point of a graph database is to make walking relationships easier to query. So the real question you should ask yourself is "Does (a)-->()<--(b) have any significant meaning?" In other words, do I need to be able to find other nodes that share this property?
Here are some quick rule-of-thumb guidelines:
Node
Has its own sub-values or relations
Multiple nodes sharing this value has meaning, and you need to be able to walk along this shared value between them
Changes infrequently
If more than 1 value can apply at the same time
Properties
Has a large range of possible values
Changes over time
If more than 1 value can apply, values are usually updated as a set (rather than individually)
Label
Has a small, finite range of mutually exclusive values
Almost never changes
So let's go through the thought process for a few of your properties...
Sex
Will either be "Male" or "Female", and everyone will be connected to one of the two, so they will both end up being super nodes (overloaded). Also, if you do end up needing to find two people that share the same sex, almost any other method would be more efficient than finding them through the super node. However these are mutually exclusive, immutable, genetic traits so making this a label is also perfectly acceptable (and sometimes preferred).
Address
This is a variable value with sub-properties, won't be shared by very many nodes, and the walk from one person to another at the same address (or, by extension, living in the same area) has valuable meaning. So this should almost definitely be a node.
Height and Weight
These change constantly with time, have no sub-values, and two people sharing this value has little to no meaning. The range of values is far too wide, so labels make no sense either, so this should be a property.
Blood type
While this has more options than Sex does, all the same logic applies, except that the relation does matter now (because people must have compatible blood types to donate). The problem is that this value will be so overloaded that you will need to filter on area first, and then just verify blood type. Could be a property or label. The case for a node is if you include a "Can_Donate_To" or "Can_Accept" relation between the blood types. While you likely won't walk these relations to find a potential donor (because they are too overloaded, and you will have to filter by area first), you can use them to verify that someone can be a donor.
Social Security Number
Is highly sensitive, and a lawsuit waiting to happen. Keep it out of the DB if at all possible. If you absolutely have to store it: this property is immutable but unique to every person, so because of the lack of reuse it is a bad label and would be pointless as a node. Definitely a property. (But it should be salted and hashed if it is used only for verification purposes.)
Mother's maiden name
The possible values are endless, and two nodes sharing this value has no real meaning. Definitely a property.
First born child
Since the child is already their own node, with its own sub-properties, just create a relation between the two. While the value of this info is questionable, any time you need to reference another node, always use a relationship for it. Definitely a node.
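For the Social Security Number case above, "salted and hashed" could be sketched as follows with Python's standard `hashlib`, `hmac`, and `os` modules. The function names, the iteration count, and the sample value are assumptions for illustration; the salt and digest would be what gets stored as node properties instead of the raw number.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your hardware

def hash_identifier(value, salt=None):
    """Return (salt, digest) for a sensitive identifier using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per stored value
    digest = hashlib.pbkdf2_hmac("sha256", value.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_identifier(candidate, salt, expected_digest):
    """Check a candidate value against a stored salt and digest in constant time."""
    _, digest = hash_identifier(candidate, salt)
    return hmac.compare_digest(digest, expected_digest)

# Obviously fake sample value; only salt and digest would be persisted.
salt, stored = hash_identifier("000-00-0000")
```

Because the space of possible numbers is small, a salted hash only slows down guessing rather than preventing it, which is one more reason to follow the advice to keep the value out of the database where possible.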
In early editions of Neo4j, super nodes were typically seen as a bad thing for performance. I have not seen too much about that recently with the 2.X and 3.X releases, so I was wondering if that is still a problem.
The issue I have is that I need to store a finite number of options for a specific node type, for example Person and favorite colors. I can store an array in the Person node that holds the colors the user likes, or I can create a node for each color and then create a relationship from the Person to the Color node. It seems the super node option would be faster to query, but I am worried because super nodes were bad in the past.
If I am trying to look up people who like a specific color, what's the recommended way to store such data in Neo?
I think the major issue here will be that the Color node will become a very connected node.
Maybe you need an Options subgraph that serves as a template of these options, and then:
copy the template option node and link this copy to the main entity node
or
copy only the chosen option into a property of the main entity node, like your array proposition
or, if your option has no properties,
add a label to your main entity node
I think that, even with the performance improvements for highly connected nodes in newer Neo4j versions, the read/write time for such a node will always be higher than for a node with fewer relationships.
I hope this helps a bit.
I have a text corpus of many sentences, with some named entities marked within it.
For example, the sentence:
what is the best restaurant in wichita texas?
which is tagged as:
what is the best restaurant in <location>?
I want to expand this corpus by taking or sampling all the sentences already in it and replacing the named entities with other similar entities of the same type, e.g. replacing "wichita texas" with "new york", so the corpus will be bigger (more sentences) and more complete (number of entities within it). I have lists of similar entities, including ones which don't appear in the corpus, but I would like to have some probability of inserting them in my replacements.
Can you recommend a method or direct me to a paper regarding this?
For your specific question:
This type of work, assuming you have an organized list of named entities (like a separate list for 'places', 'people', etc), generally consists of manually removing potentially ambiguous names (for example, 'jersey' could be removed from your places list to avoid instances where it refers to the garment). Once you're confident you removed the most ambiguous names, simply select an appropriate tag for each group of terms ("location" or "person", for instance). In each sentence containing one of these words, replace the word with the tag. Then you can perform some basic expansion with the programming language of your choice so that each sentence containing 'location' is repeated with every location name, each sentence containing 'person' is repeated with every person name, etc.
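The expansion step described above could be sketched like this in Python. The `<tag>` syntax follows the question's example; the function names, the entity list, and the uniform random sampling are assumptions. `expand_all` repeats a sentence with every entity value, while `sample_expansions` draws a fixed number of random fillings, which is one simple way to give entities that never appeared in the corpus a chance of being inserted.

```python
import itertools
import random
import re

TAG = re.compile(r"<(\w+)>")  # matches placeholders such as <location>

def expand_all(template, entities):
    """Yield the template filled with every combination of entity values."""
    tags = TAG.findall(template)
    for combo in itertools.product(*(entities[t] for t in tags)):
        values = iter(combo)
        # Placeholders are replaced left to right, one value per placeholder.
        yield TAG.sub(lambda m: next(values), template)

def sample_expansions(template, entities, k, rng):
    """Return k random fillings, drawing each placeholder value uniformly from its list."""
    return [
        TAG.sub(lambda m: rng.choice(entities[m.group(1)]), template)
        for _ in range(k)
    ]

# Hypothetical entity list; "boston" stands in for a name not seen in the corpus.
entities = {"location": ["wichita texas", "new york", "boston"]}
template = "what is the best restaurant in <location>?"

expanded = list(expand_all(template, entities))
sampled = sample_expansions(template, entities, k=2, rng=random.Random(0))
```

Full expansion grows multiplicatively with the number of placeholders per sentence, so sampling is usually the more practical choice for large entity lists. Replacing `rng.choice` with a weighted draw would let you control how often unseen entities are inserted.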
For a general overview of clustering using word classes, check out the seminal Brown et al. paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9919&rep=rep1&type=pdf