I want to create a model that detects the gender based on a full name.
I have two dictionaries with male & female names. I want to develop a model to classify previously unseen names.
I need to determine the gender after the NER (named entity recognition) process. This delivers a PERSON entity with any one of these characteristics:
FULL NAME (John Travolta)
NAME only (John)
SURNAME only (Travolta)
I can do male vs female determination on (given) name only. The model needs to handle SURNAME only, classifying it as NO_GENDER.
I know that surnames can be noisy, but I must deal with them, because they could be a part of the input.
First, pre-process the data: in a full-name input, keep only the name (see below). Apply this to unknown input as well.
I suggest that you train a multi-class SVM. You already know the three classes. Make up the following training (labeled) data:
NO_GENDER: names on both the girls' and boys' lists
FEMALE: names on only the girls' list
MALE: names on only the boys' list
NO_GENDER: known surnames
NO_GENDER: non-name strings
Essentially, you train this to recognize FEMALE, MALE, and everything else.
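The labeling scheme above can be sketched as follows. This is only a minimal illustration in plain Python: the function name `build_training_data` and the sample names are made up for the example, and the resulting (string, label) pairs would still have to be turned into features and fed to whatever SVM implementation you choose.

```python
def build_training_data(boys, girls, surnames, non_names):
    """Label strings as MALE, FEMALE, or NO_GENDER following the scheme above."""
    boys, girls = set(boys), set(girls)
    labeled = []
    for name in sorted(boys | girls):
        if name in boys and name in girls:
            label = "NO_GENDER"  # ambiguous: appears on both lists
        elif name in girls:
            label = "FEMALE"     # only on the girls' list
        else:
            label = "MALE"       # only on the boys' list
        labeled.append((name, label))
    # Known surnames and non-name strings both go into the catch-all class.
    for text in list(surnames) + list(non_names):
        labeled.append((text, "NO_GENDER"))
    return labeled


# Hypothetical sample data, for illustration only.
data = build_training_data(
    boys=["John", "Robin"],
    girls=["Mary", "Robin"],
    surnames=["Travolta"],
    non_names=["Monday"],
)
for text, label in data:
    print(text, label)
```

Note that a string appearing both as a given name and as a surname would get two conflicting labels here; you would have to decide how to resolve those cases in your own data.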
PREPROCESS
This will give you some trouble, due to varying name formats. You may have trouble with compound names, such as:
Bobby Jo: male name with female modifier
van der Waal: compound surname with male-looking prefix
St. John: surname with gendered primary
Haley-Christopher: hyphenated surname, gendered
If you pre-process the inputs, you may have some trouble spotting the proper division in, say, Billy Jean St. John or Marie-Therese von Klaus.
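As a starting point, the pre-processing step could be as naive as keeping the first whitespace-separated token. The sketch below assumes exactly that; the helper name `first_given_name` is hypothetical, and the heuristic deliberately does not solve the compound-name cases listed above.

```python
def first_given_name(full_name):
    """Naive heuristic: keep only the first whitespace-separated token."""
    tokens = full_name.split()
    return tokens[0] if tokens else ""


for raw in ["John Travolta", "Billy Jean St. John", "Marie-Therese von Klaus"]:
    print(raw, "->", first_given_name(raw))
```

With this heuristic "Billy Jean St. John" is reduced to "Billy" and "van der Waal" to "van", and a surname-only input such as "Travolta" passes through unchanged, so the classifier still has to map such strings to NO_GENDER.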
We have to extract an entity that is nested inside another entity. Any idea how we can annotate the training data to train an NER model for this task? We are using a Flair model for custom entity training and prediction.
Ex: Text: "Address: 123, ABC Company, 4th floor, xyz street, state, country."
We have a sample like this, where the whole text itself is an entity of type "Address" and the same text contains another entity called "Company Name".
To train a Flair model, we convert the data into BIEO format, but we are not sure how to annotate the data and train the model.
We came up with a solution to handle this scenario by training two models, one for the address and the other for the company name.
Please comment on how we could handle this kind of scenario in a better way.
Say I have a Persons table with attributes {name, pet}. How do I select the names of people who have one of each kind of pet (dog, cat, bird), where a kind of pet only counts if that pet appears in the table?
Example: Bob, Dog and Bob, Cat are the only rows in the table. Therefore, Bob has one of each kind of pet. But the moment Lynda, Bird is added, Bob doesn't have one of each kind of pet anymore.
I think the first step is π(pet). You get a list of all kinds of pets, since relational algebra removes duplicates. I'm not sure what to do after this, but I think I need to join π(pet) and Persons.
I've tried a few things like Natural Join and Cross products but I haven't arrived at a result yet and I'm out of ideas.
The answer to the question can be found with the Division operator:
Persons ÷ π_pet(Persons)
This relational algebra expression returns a relation with only the column name, containing the names of all persons that have every kind of pet currently present in the Persons table itself.
The division is an operator that, in some sense, is the inverse of the product operator (the name is derived exactly from this fact). It is a derived operator that can be defined in terms of projection, set difference and product (see for instance this answer).
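To make the derived definition concrete, here is a small sketch in Python that emulates the division with sets, using only projection, product and set difference. The function name `divide` is just for illustration, and relations are modelled as sets of (name, pet) tuples.

```python
def divide(persons):
    """Emulate Persons ÷ π_pet(Persons) on a set of (name, pet) tuples."""
    names = {name for name, _ in persons}  # π_name(Persons)
    pets = {pet for _, pet in persons}     # π_pet(Persons)
    # Product: every (name, pet) pair that would exist if everyone had every pet.
    all_pairs = {(n, p) for n in names for p in pets}
    # Names that are missing at least one kind of pet.
    disqualified = {n for n, _ in all_pairs - persons}
    return names - disqualified


persons = {("Bob", "Dog"), ("Bob", "Cat")}
print(divide(persons))  # {'Bob'}

persons.add(("Lynda", "Bird"))
print(divide(persons))  # set()
```

This reproduces the example from the question: Bob qualifies while only his two rows exist, and nobody qualifies once Lynda, Bird is added.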
I'm working on an ecommerce application. Most of my products have a category attribute, but some do not (roughly a 70-30% split). I was trying to use Weka to detect the category, but the attributes I have are strings (name, brand, price, description, category), so the classifiers do not work, because they need the attributes to be numeric, nominal, or binary.
Has anyone faced this problem before?
Just discretize the continuous attributes and it will work, because some of the algorithms do not work with continuous values.
Use "StringToWordVector" filter that will convert your string attribute(s) to numeric attributes.
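To show the general idea behind such a filter, here is a rough bag-of-words sketch in Python. It is not Weka's implementation and makes no claim about Weka's options or defaults; the function name and the sample product names are invented for the example.

```python
import re
from collections import Counter


def string_to_word_vector(docs):
    """Turn each string into a vector of word counts over a shared vocabulary."""
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Hypothetical product names, for illustration only.
vocab, vectors = string_to_word_vector(["red cotton shirt", "blue cotton jeans"])
print(vocab)    # ['blue', 'cotton', 'jeans', 'red', 'shirt']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 0]]
```

Each string attribute becomes a row of numeric attributes, one per word, which is the kind of input most classifiers can work with.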
Please help me understand the difference between Named Entity Recognition and Named Entity Extraction.
Named Entity Recognition is recognition of the surface form of an Entity (person, place, organization), i.e. "George Bush" or "Barack Obama" are "PERSON" entities in this text string.
Entity Extraction will extract additional information as attributes from the text string. For example in the sentence "George W. Bush was president before President Obama" recognizing "Obama" as a person with attribute "title=president".
But if you look at software the distinction is often blurred.
There is no such thing as Named Entity Extraction.
To put it better, I would say that Named Entity Extraction is simply the process of concretely extracting previously recognized named entities. So, in a sense, there is no real theoretical knowledge relevant to this task; it is just a matter of defining the mechanical operation.
If we are instead interested in extracting all the specific entities, or additional information about them, from a piece of text, then we have to look at information or knowledge extraction.
For information extraction you could for example ask to extract all the names of cities, or e-mail addresses, that appear in a corpus of documents. For such a task Named Entity Extraction could be used. You could even go much more generic, asking simply to extract general knowledge, for example in the form of relations (relation extraction).
For more details I would suggest the Natural Language Processing chapter of the book Artificial Intelligence: A Modern Approach.
Suppose I have a Rails table containing information chosen from a fixed set of options. For example, a field named sex could be either Male or Female. A field named Bodytype will be either slim, curvy, etc.
My question is, what is better practice, to store those values as integers or strings?
In the first case, of course, the integers will be converted into text (In the controller?).
Thanks for all your help.
If you are not storing millions of records, storing them as strings is fine. What you are losing is the ability to quickly update these categories, but if they are pretty much static and not going to change over time, it shouldn't matter. Sex should probably be called gender and Bodytype body_type.
You should always use an index to identify attributes within your tables.
So, for example, your tables will look like this:
Gender Table
id | sex
1 | Female
2 | Male
Figure Table
id | body_type
1 | slim
2 | curvy
You then reference those values based on the id
http://use-the-index-luke.com/
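Here is a minimal sketch of the lookup-table approach using Python's built-in sqlite3 module rather than Rails. The table names `genders`, `body_types` and `profiles`, and the sample row, are assumptions made for the example; only the id/value pairs come from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genders (id INTEGER PRIMARY KEY, sex TEXT NOT NULL UNIQUE);
    CREATE TABLE body_types (id INTEGER PRIMARY KEY, body_type TEXT NOT NULL UNIQUE);
    -- Hypothetical main table that stores only the integer ids.
    CREATE TABLE profiles (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        gender_id INTEGER NOT NULL REFERENCES genders(id),
        body_type_id INTEGER NOT NULL REFERENCES body_types(id)
    );
    INSERT INTO genders (id, sex) VALUES (1, 'Female'), (2, 'Male');
    INSERT INTO body_types (id, body_type) VALUES (1, 'slim'), (2, 'curvy');
    INSERT INTO profiles (name, gender_id, body_type_id) VALUES ('Alex', 2, 1);
""")

# Resolve the ids back to their text values with joins.
rows = conn.execute("""
    SELECT p.name, g.sex, b.body_type
    FROM profiles p
    JOIN genders g ON g.id = p.gender_id
    JOIN body_types b ON b.id = p.body_type_id
""").fetchall()
print(rows)  # [('Alex', 'Male', 'slim')]
```

Renaming an option then only requires updating one row in the lookup table, instead of rewriting every record that uses it.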