I have a data set that includes job titles, and I would like to cluster them.
Job titles include:
Automotive Service Worker
Community Police Services Aide
DEPUTY SHERIFF
COUNSELOR, JUVENILE HALL
Swimming Instructor
FIREFIGHTER
Porter
Account Clerk
Deputy Sheriff
Assistant Retirement Analyst
POLICE OFFICER III
Patient Care Assistant
Public Service Trainee
PUBLIC RELATIONS OFFICER
SPECIAL NURSE
I'm going to clean the titles (remove unneeded characters, capitalize all the titles, etc.) in order to make things a little easier to work with. Once I vectorize the corpus, the dimensionality is going to be very large. Which clustering algorithms would you recommend for a problem like this? Does KMeans behave well for high-dimensional problems?
Use Brown clustering. An implementation is available here.
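On the KMeans question specifically: TF-IDF vectors over short titles are high-dimensional and sparse, where Euclidean KMeans tends to behave poorly, so a common workaround is to reduce the dimensionality first. Here is a minimal scikit-learn sketch of that baseline (my own illustration, not the linked Brown clustering implementation; the cluster and component counts are arbitrary):

    import re
    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    titles = ["Automotive Service Worker", "DEPUTY SHERIFF", "Deputy Sheriff",
              "COUNSELOR, JUVENILE HALL", "Swimming Instructor", "FIREFIGHTER"]

    # Clean the titles as described: lowercase and strip unneeded characters
    cleaned = [re.sub(r"[^a-z ]+", " ", t.lower()) for t in titles]

    X = TfidfVectorizer().fit_transform(cleaned)           # sparse, high-dimensional
    X_red = TruncatedSVD(n_components=5).fit_transform(X)  # reduce before clustering

    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_red)
    print(list(zip(titles, labels)))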
I've been working on a research project. I have a database of Wikipedia descriptions of a large number of entities, including sportspersons, politicians, actors, etc. The aim is to determine the type of entity from the descriptions. I have access to some data with the predicted type of entity which is quite accurate; this will be my training data. What I would like to do is train a model to predict the dominant type of entity for the rest of the data.
What I've done so far:
Extracted the first paragraph, H1, H2 headers of Wiki description of the entity.
Extracted the category list of the entity on the wiki page (the bottom 'Categories' section present on any page, like here).
Finding the type of entity can be difficult for entities that are associated with two or more concepts, like an actor who later became a politician.
How do I create a model out of the raw data that I have? What variables should I use to train the model?
Also, are there any Natural Language Processing techniques that could be helpful for this purpose? I know POS taggers can be helpful in this case.
My searching on the internet has not been very successful. I've stumbled across research papers and blogs like this one, but none of them have relevant information for this purpose. Any ideas would be appreciated. Thanks in advance!
EDIT 1:
The input data is the first paragraph of the Wikipedia page of the entity. For example, for this page, my input would be:
Alan Stuart Franken (born May 21, 1951) is an American comedian, writer, producer, author, and politician who served as a United States Senator from Minnesota from 2009 to 2018. He became well known in the 1970s and 1980s as a performer on the television comedy show Saturday Night Live (SNL). After decades as a comedic actor and writer, he became a prominent liberal political activist, hosting The Al Franken Show on Air America Radio.
My extracted information is: the first paragraph of the page, the string of all the 'Categories' (the bottom part of the page), and all the headers of the page.
From what I gather, you would like a classifier which takes text input and predicts from a list of predefined categories.
I am not sure what your level of expertise is, so I will give a high-level overview, which may also help others who would like to learn about the subject.
Like all NLP tasks that use ML, you will have to transform your textual domain into a numerical domain by a process of featurization:
1. Process the text and labels
2. Determine the relevant features
3. Create numerical representation of features
4. Train and Test on a Classifier
Process the text and labels
The text might have some strange markers or artifacts that need to be removed to make it more "clean"; this is a standard text normalisation step.
Then keep the related categories as labels for the texts.
It will end up being something like the following (a sketch; wiki_articles and normalise are placeholders):

    training_data = []
    for article in wiki_articles:
        text = normalise(article.text)                    # text normalisation step
        training_data.append((text, article.categories))  # categories become labels
Determine the relevant features
Some features you seem to have mentioned are:
Dominant field (actor, politician)
Header information
Syntactic information (POS tags) is local (token level), but can be used to extract specific features, such as whether words are proper nouns.
Create numerical representation of features
Luckily, there are ways of automatically encoding text, such as doc2vec, which can make a document vector from the text. Then you can add additional bespoke features that seem relevant.
You will then have a vector representation of features relevant to this text as well as the labels (categories).
This will become your training data.
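For illustration, a minimal gensim Doc2Vec sketch of this featurization step (the two-document corpus and all hyperparameters below are made up):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # (normalised first paragraph, category labels) pairs
    corpus = [
        ("alan stuart franken is an american comedian writer and politician",
         ["politician", "actor"]),
        ("some other normalised wiki first paragraph", ["sportsperson"]),
    ]

    docs = [TaggedDocument(words=text.split(), tags=[i])
            for i, (text, _labels) in enumerate(corpus)]
    model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

    # Document vector for a text; bespoke features can be concatenated to it
    vec = model.infer_vector("a comedian who became a senator".split())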
Train and Test on a Classifier
Now train and test on a classifier of your choice.
Your data is one-to-many, as you are trying to predict several labels per document.
Try something simple first, just to see if things work as you expect.
You should test your results with a cross-validation routine such as k-fold cross-validation, using standard metrics (precision, recall, F1).
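A minimal scikit-learn sketch of this step (toy texts and labels; one-vs-rest logistic regression is just one reasonable choice of classifier):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    texts = ["comedian writer politician senator", "footballer world cup striker",
             "film actor oscar winner", "politician actor activist",
             "tennis champion grand slam", "stage and film actor"]
    labels = [["politician"], ["sportsperson"], ["actor"],
              ["politician", "actor"], ["sportsperson"], ["actor"]]

    Y = MultiLabelBinarizer().fit_transform(labels)  # one binary column per category
    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(LogisticRegression()))

    # k-fold cross-validation with a micro-averaged F1 metric
    print(cross_val_score(clf, texts, Y, cv=3, scoring="f1_micro").mean())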
Clarification
Just to help clarify: this task is not really a named-entity recognition task. It is a kind of multi-label classification task, where the labels are the categories defined on the Wikipedia pages.
Named-entity recognition is finding meaningful named entities in a document, such as people and places; usually something noun-like. It is usually done at the token level, whereas your task seems to be at the document level.
I have a large table of used cars.
The header looks like this:
maker | model | year | kilometers | transmission | gas_type | price
I made a prediction model that works like this: every time I want to know the price of a car, I filter the data by maker and model, and then run a quadratic regression using year and kilometers as predictors.
The results are OK, but not for every car.
The problem is that there are different "versions" for the same maker and model.
(A FULL version is not the same as a basic version, or one with 4WD, or leather seats, etc.)
How can I identify the differences? Can I use some kind of clustering to identify different versions among cars with the same maker and model?
Any help will be appreciated
That's not a clustering problem, just a sub-model feature. Also, you might want to differentiate a sub-model (standard, Luxury Edition, hatchback, etc.) from model-independent features (4WD, leather seats, premium sound system, sun roof, etc.). The sub-model would likely be a single feature (text column), while the options would be individual features (Boolean columns).
UPDATE AFTER OP CLARIFICATION
I see: those features are output, not input.
Yes, you can use clustering. However, that may or may not identify sub-models (your "versions"). If you cluster only observations that have very similar use (kilometres) and all other features equal, you will find some useful clustering. However, this will work only to the extent that the version is a major factor in the remaining price variation. You may find that your clustering is also affected by geographic region and other factors.
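A sketch of that idea: fit the quadratic regression per maker/model as in the question, then cluster the residuals; cars priced persistently above or below the fit may correspond to different versions. (The data and the choice of two clusters below are made up.)

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    # Made-up rows for one maker/model: [year, kilometers] and price;
    # every second car is a (hypothetical) FULL version, roughly 3500 dearer
    X = np.array([[2012, 80000], [2012, 82000], [2012, 78000], [2012, 79000],
                  [2013, 60000], [2013, 61000], [2013, 59000], [2013, 63000],
                  [2014, 40000], [2014, 43000], [2014, 41000], [2014, 39000]],
                 dtype=float)
    price = np.array([9000, 12500, 9200, 12800, 10500, 14000, 10800, 13700,
                      12000, 16000, 12300, 15800], dtype=float)

    # Quadratic regression on year and kilometers, as in the question
    X_quad = PolynomialFeatures(degree=2).fit_transform(X)
    residuals = price - LinearRegression().fit(X_quad, price).predict(X_quad)

    # Cluster the residuals: persistent over/under-pricing may mark a version
    versions = KMeans(n_clusters=2, n_init=10).fit_predict(residuals.reshape(-1, 1))
    print(versions)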
I have a market transactions dataset including time stamps and goods, as follows.
John always buys milk and bread at the supermarket. Besides that, he also buys other goods, like the following:
On Monday, John bought milk, bread {beer, chocolate}.
On Tuesday, John bought milk, bread {potato}.
On Wednesday, John bought milk, bread {chocolate, avocado, peanuts}.
Can we answer the question: "What will he buy on Thursdays?".
For example: He will buy {beer, avocado} besides milk and bread on Thursdays.
I think it is a kind of multiple regression. Which model can I use to predict a set of goods in this case?
If I understand your question correctly, then it is multi-label classification.
You have some input features (dayofweek, HasBoughtMilk, HasBoughtBread, etc.), and you want to predict several other labels (Beer, Avocado) based on them. You could do this easily with sklearn; it supports multilabel classification.
If you want to consider what was bought on previous days (since it could affect your labels) you could do this in 2 ways:
1) Add synthetic features, like binaries which show "HasBoughtBread this week already" (see the sketch after this list).
2) Or use RNNs, which are good at handling time series.
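A minimal sklearn sketch of option 1) (the feature encoding and data below are made up; scikit-learn's random forest is one classifier that accepts multilabel targets directly):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Features per shopping day (made up):
    # [dayofweek, HasBoughtBeerThisWeek, HasBoughtChocolateThisWeek]
    X = np.array([[0, 0, 0], [1, 1, 1], [2, 1, 1],
                  [0, 0, 0], [1, 1, 0], [2, 1, 1]])
    # Labels: one binary column per extra good: [beer, chocolate, avocado]
    Y = np.array([[1, 1, 0], [0, 0, 0], [0, 1, 1],
                  [1, 0, 0], [0, 1, 0], [0, 0, 1]])

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, Y)
    print(clf.predict([[3, 1, 1]]))  # predicted extra goods for a Thursday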
The problem you are describing seems to be a textbook case for Random Forests. The inference rules you are trying to express fit really well into decision trees, and Random Forests would provide you with a flexible model that is fast to train.
Of course this is not the only way; you could use SVMs or some deep learning like RNNs, but that feels like using a bazooka to swat a fly to me.
Cheers,
Quentin
This depends on the actual factors you're trying to model. Are some items dependent on one another? Is there an actual time element in the data, or are we just conditioned to infer it?
Assuming that you have a time element, you will definitely want some sort of time-series analysis, a sequencing of purchases, perhaps with actual time lags. For instance, if John doesn't go to the store one day, what happens to his purchases? Do we need to learn how often some things get bought? Does one product purchase hasten or delay another?
These considerations suggest either pre-processing the data (for time lags) or an RNN, LSTM, or Q-net delay of some sort. Naive Bayes or Random Forest might be of some help, but you'd still need to pre-process the time relations first.
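For the pre-processing route, a small pandas sketch of time-lag features (the purchase log and the choice of 1- and 2-day lags are made up):

    import pandas as pd

    # Made-up purchase log, one row per day
    log = pd.DataFrame({
        "day": pd.date_range("2024-01-01", periods=5, freq="D"),
        "beer": [1, 0, 0, 1, 0],
        "avocado": [0, 1, 0, 0, 1],
    })

    # Lag features: what was bought 1 and 2 days earlier
    for item in ["beer", "avocado"]:
        log[item + "_lag1"] = log[item].shift(1)
        log[item + "_lag2"] = log[item].shift(2)

    print(log.dropna())  # rows with full lag history, ready for a classifier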
Objective
To clarify which traits or attributes allow me to say an analysis is inferential or predictive.
Background
I am taking a data science course which touches on inferential and predictive analyses. The explanations (as I understood them) are:
Inferential
Induce a hypothesis from a small sample of a population, and see whether it holds true in the larger/entire population.
It seems to me this is generalisation. I think inducing that smoking causes lung cancer, or that CO2 causes global warming, are inferential analyses.
Predictive
Induce a statement of what can happen by measuring variables of an object.
I think identifying which traits, behaviours, and remarks people react favourably to, and which make a presidential candidate popular enough to become president, is a predictive analysis (this is touched on in the course as well).
Question
I am a bit confused by the two, as it looks to me like there is a grey area or overlap.
Bayesian inference is "inference", but I think it is also used for prediction, such as in a spam filter or fraudulent-transaction identification. For instance, a bank may use previous observations on variables (such as IP address, originator country, beneficiary account type, etc.) to predict whether a transaction is fraudulent.
I suppose the theory of relativity is an inferential analysis that induced a theory/hypothesis from observations and thought experiments, but it also predicted that the direction of light would be bent.
Kindly help me understand the must-have attributes for categorising an analysis as inferential or predictive.
"What is the question?" by Jeffery T. Leek, Roger D. Peng has a nice description of the various types of analysis that go into a typical data science workflow. To address your question specifically:
An inferential data analysis quantifies whether an observed pattern will likely hold beyond the data set in hand. This is the most common statistical analysis in the formal scientific literature. An example is a study of whether air pollution correlates with life expectancy at the state level in the United States (9). In nonrandomized experiments, it is usually only possible to determine the existence of a relationship between two measurements, but not the underlying mechanism or the reason for it.

Going beyond an inferential data analysis, which quantifies the relationships at population scale, a predictive data analysis uses a subset of measurements (the features) to predict another measurement (the outcome) on a single person or unit. Web sites like FiveThirtyEight.com use polling data to predict how people will vote in an election. Predictive data analyses only show that you can predict one measurement from another; they do not necessarily explain why that choice of prediction works.
There is some gray area between the two but we can still make distinctions.
Inferential statistics is when you are trying to understand what causes a certain outcome. In such analyses there is a specific focus on the independent variables, and you want to make sure you have an interpretable model. For instance, your example of a study examining whether smoking causes lung cancer is inferential. Here you are trying to closely examine the factors that lead to lung cancer, and smoking happens to be one of them.
In predictive analytics you are more interested in using a certain dataset to help you predict future variation in the values of the outcome variable. Here you can make your model as complex as possible, to the point that it is not interpretable, as long as it gets the job done. A simplified example is a real estate investment company interested in determining which combination of variables predicts the prime price for a certain property so it can acquire such properties for profit. The potential predictors could be neighborhood income, crime, educational status, distance to a beach, and racial makeup. The primary aim here is to obtain an optimal combination of these variables that provides the best prediction of future house prices.
Here is where it gets murky. Let's say you conduct a study on middle-aged men to determine the risks of heart disease. To do this you measure weight, height, race, income, marital status, cholesterol, education, and a potential serum chemical called "mx34" (just making this up), among others. Let's say you find that the chemical is indeed a good risk factor for heart disease. You have now achieved your inferential objective. Satisfied with your new findings, you start to wonder whether you can use these variables to predict who is likely to get heart disease, so that you can recommend preventive steps against future heart disease.
The same academic paper I was reading that spurred this question for me also gave an answer (from Leo Breiman, a UC Berkeley statistician):
• Prediction. To be able to predict what the responses are going to be to future input variables;
• [Inference]. To [infer] how nature is associating the response variables to the input variables.
Source: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
There are several data sets of automobile manufacturers and models. Each contains several hundred data entries like the following:
Mercedes GLK 350 W2
Prius Plug-in Hybrid Advanced Toyota
General Motors Buick Regal 2012 GS 2.4L
How can the above entries be automatically divided into manufacturers (e.g. Toyota) and models (e.g. Prius Plug-in Hybrid Advanced), using only those files?
Thanks in advance.
Machine Learning (ML) typically relies on training data, which allows the ML logic to produce and validate a model of the underlying data. With this model, it is then in a position to infer the class of new data presented to it (in the classification case, as the one at hand) or to infer the value of some variable (in the regression case; say, an ML application predicting the amount of rain a particular region will receive next month).
The situation presented in the question is a bit puzzling, at several levels.
Firstly, the number of automobile manufacturers is finite and relatively small. It would therefore be easy to manually make a list of these manufacturers and then simply use this lexicon to parse out the manufacturers from the model descriptions, using plain string parsing techniques, i.e. no ML needed or even desired here. (Alas, the requirement that one use "...only those files" seems to preclude this option.)
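A minimal sketch of that lexicon approach (the maker list is a hand-made stub; plain substring matching, no ML):

    # Hand-made manufacturer lexicon (stub)
    MAKERS = ["Mercedes", "Toyota", "General Motors", "Buick"]

    def split_entry(entry):
        # Try longer maker names first, so an entry containing both
        # "General Motors" and "Buick" is split on the former
        for maker in sorted(MAKERS, key=len, reverse=True):
            if maker in entry:
                model = entry.replace(maker, "").strip()
                return maker, model
        return None, entry

    print(split_entry("Prius Plug-in Hybrid Advanced Toyota"))
    # -> ('Toyota', 'Prius Plug-in Hybrid Advanced')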
Secondly, one can think of a few patterns or heuristics that could be used to produce the desired classifier (tentatively a relatively weak one, as the patterns/heuristics that come to mind at the moment seem relatively unreliable). Furthermore, such an approach is also not quite an ML approach in the common understanding of the term.