Custom Features using scikit-learn - machine-learning

I am working on a project to classify short texts.
One requirement is that, along with vectorizing the short text, I would like to add extra features for each input, such as the length of the text, the number of URLs, etc.
Is this supported in scikit-learn?
A link to an example notebook or a video would be very helpful.
Thanks,
Romit.

You can combine features extracted by different transformers (e.g. one that extracts Bag of Words (BoW) features and one that extracts other statistics) by using the FeatureUnion class.
The normalization of those features, and their small number relative to the number of distinct BoW features, could be problematic. Whether or not this is a problem depends on the assumptions made by the models trained downstream and on the specific data and target task.
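
A minimal sketch of the FeatureUnion approach, in case it helps. The TextStats transformer and the sample texts are made up for illustration (they are not part of scikit-learn), and scaling the statistics with StandardScaler is only one possible way to handle the normalization concern mentioned above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler


class TextStats(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: text length and a rough URL count per document."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array(
            [[len(text), text.count("http://") + text.count("https://")] for text in X],
            dtype=float,
        )


features = FeatureUnion([
    ("bow", TfidfVectorizer()),
    # Scale the hand-made statistics so they are comparable to TF-IDF values.
    ("stats", make_pipeline(TextStats(), StandardScaler())),
])

texts = [
    "see https://example.com for details",
    "short text",
    "two links http://a.example and https://b.example here",
]
X = features.fit_transform(texts)
print(X.shape)  # (n_documents, n_bow_terms + 2)
```

The combined object behaves like any other transformer, so it can be placed in front of a classifier inside a Pipeline.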

I haven't used the FeatureUnion class. My approach was simpler and rather straightforward: extract the features from your custom pipeline and append them to what you extracted from the scikit-learn pipeline. This is nothing more than concatenating arrays in numpy/scipy.
Precautions:
a) You must keep track of which feature ids come from your custom pipeline. This will help you append the arrays without mixing things up.
b) You will have to normalize your custom pipeline features yourself (as required).
Solution:
Write a custom feature extractor class and wrap functionality such as feature extraction, normalization, etc. in it.
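
The manual approach could look roughly like this. The two custom features, the max-based normalization and the feature names are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = ["short text", "a much longer piece of text with http://example.com"]

# Features from the scikit-learn side: sparse, one column per term.
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(texts)

# Hypothetical custom features: character length and URL count.
X_custom = np.array([[len(t), t.count("http")] for t in texts], dtype=float)

# Precaution b): normalize the custom columns (here: divide by the column max).
col_max = X_custom.max(axis=0)
col_max[col_max == 0] = 1.0  # avoid division by zero for all-zero columns
X_custom = X_custom / col_max

# Append the custom columns to the right of the BoW columns.
X_all = hstack([X_bow, csr_matrix(X_custom)]).tocsr()

# Precaution a): keep track of which column is which.
bow_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
feature_names = bow_names + ["text_length", "url_count"]
print(X_all.shape, feature_names[-2:])
```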

Related

Get sentence vector for a K-means clustering task

I am working on a project that groups jobs posted on various job portals into clusters, based on the job descriptions, using K-means.
I obtained word vectors using Word2Vec, but I guess this will not serve the purpose, as I need a vector for the whole job description.
I know that I can average the word vectors of a sentence to get a sentence vector, but I am worried about accuracy, as this loses the ordering of the words.
Is there any other way I can get the vectors?
The most commonly used approaches for text vectorization:
Pure TF-IDF, which can still be useful, especially with n-grams.
Word2Vec to get vectors for the words, then the mean of all word vectors to represent the whole text.
A combination of the first two methods: a weighted mean of all word vectors in the text, using the TF-IDF values as weights.
I would suggest trying each one and picking whichever performs best in your case. The results can differ somewhat depending on the nature of the data.
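
A rough sketch of the third option, the TF-IDF weighted mean. The tiny word_vectors dictionary is a made-up stand-in for vectors you would get from a trained Word2Vec model, and the documents are toy examples.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy 3-dimensional vectors standing in for a trained Word2Vec model (assumption).
word_vectors = {
    "python": np.array([1.0, 0.0, 0.0]),
    "developer": np.array([0.8, 0.2, 0.0]),
    "nurse": np.array([0.0, 1.0, 0.0]),
    "hospital": np.array([0.0, 0.9, 0.1]),
}
dim = 3

docs = ["python developer", "nurse hospital nurse", "python hospital"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)  # column -> term


def weighted_doc_vector(doc_index):
    """TF-IDF weighted mean of the word vectors of one document."""
    start, end = weights.indptr[doc_index], weights.indptr[doc_index + 1]
    total = np.zeros(dim)
    weight_sum = 0.0
    for col, w in zip(weights.indices[start:end], weights.data[start:end]):
        vec = word_vectors.get(terms[col])
        if vec is not None:  # skip words without a vector
            total += w * vec
            weight_sum += w
    return total / weight_sum if weight_sum > 0 else total


doc_vectors = np.vstack([weighted_doc_vector(i) for i in range(len(docs))])
print(doc_vectors.shape)  # (3, 3)
```

The resulting doc_vectors array has one row per document and can be passed directly to K-means.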
You can take advantage of transfer learning with sentence embedding methods such as Bert-as-service, SentenceBert or even the Universal Sentence Encoder. All of them are easy to use and well covered by tutorials on the web. They will work better than TF-IDF in most cases.
You can also try doc2vec, an extension of word2vec that builds representations of a whole document. There is an implementation in gensim available:
https://radimrehurek.com/gensim/models/doc2vec.html

Best way to treat (too) many classes in one categorical variable

I'm working on an ML prediction model and I have a dataset with a categorical variable (let's say product id) that has 2k distinct products.
If I convert this variable to dummy variables, e.g. with a one-hot encoder, the dataset may have a size of 2k times the number of examples (millions of examples), which is too much to process.
How is this usually handled?
Should I just use the variable as it is, without the conversion?
Thanks.
High cardinality of categorical features is a well-known problem, and "the best" way to handle it typically depends on the prediction task and requires a trial-and-error approach. Whether you can even find a strategy that is clearly better than the others depends on the case.
Addressing your first question, a good collection of different encoding strategies is provided by the category_encoders library:
A set of scikit-learn-style transformers for encoding categorical variables into numeric
They follow the scikit-learn API for transformers, and a simple example is provided as well. Again, which one will give the best results depends on your dataset and the prediction task. I suggest incorporating them in a pipeline and testing some or all of them.
In regard to your second question, you would then continue to use the encoded features for your predictions and analysis.
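
As one baseline to compare other encoders against: plain one-hot encoding is often less expensive than it looks, because scikit-learn's OneHotEncoder returns a sparse matrix by default. Note that this sketch uses scikit-learn's own encoder, not the category_encoders library mentioned above, and the data is randomly generated for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: 100k rows, product ids drawn from 2k possible values (assumption).
rng = np.random.default_rng(0)
n_samples, n_products = 100_000, 2_000
product_id = rng.integers(0, n_products, size=(n_samples, 1))

# Sparse output is the default, so only the non-zero entries are stored.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(product_id)

print(sparse.issparse(X))  # True
print(X.shape)             # (100000, number of distinct ids seen)
print(X.nnz)               # one stored value per row
```

Many scikit-learn estimators (linear models in particular) accept such sparse input directly; whether this beats the other encodings still has to be tested on your data.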

Can CRFs (Conditional Random Fields) be used to label whole sentences?

I'm trying to use Machine Learning to label sentences
(each sentence with a single label, I assume sentences are independent from each other).
I thought linear CRF model would be ok for this case, but I have some questions.
I tried using CRF++ (the other implementations I saw seem to have analogous formats).
It uses sentences as input, but the output label is assigned to each
token. How to use a single label for the whole sentence?
(The hack I thought of would be to assign a significant
label only to dot in the test data and treat it as the output label
for the whole sentence.)
How can sentences of different length be used?
The training configuration requires specifying which tokens are taken into
consideration when analysing the current token. But a sentence can have
a large or small number of tokens and I want to use all tokens from a sentence
(not more or less), to utilise the whole information.
From this question it seems that what I'm trying to do is possible (a single label for the whole sequence),
but I don't know how to format training data for that.
I think you are using the wrong tool for the job. To classify the entire sentence you could try using something like Facebook's fastText.
https://github.com/facebookresearch/fastText
As Ashemah said, maybe you are using the wrong tool. CRFs are typically used if you want to label sequences, e.g. a sequence of words or even a sequence of sentences. But, as you assume that your sentences are independent of each other, you might want to look at each of them independently. Therefore, your task is not sequence labeling but a simple classification. For that you can use several other models such as SVM, Naive Bayes, kNN, and many more.
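
A minimal sketch of treating this as plain classification, using a bag-of-words representation with Naive Bayes. The sentences and labels are made up. Note that the vectorizer maps every sentence to a vector of the same size, so sentences of different lengths are not a problem and all tokens of a sentence are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: one label per whole sentence (made up for illustration).
sentences = [
    "What time does the train leave?",
    "Where is the nearest station?",
    "The train leaves at noon.",
    "The station is around the corner from here.",
]
labels = ["question", "question", "statement", "statement"]

# Each sentence becomes one fixed-size TF-IDF vector, regardless of its length.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(sentences, labels)

predicted = clf.predict(["When does the next train arrive at the station?"])
print(predicted)
```

Swapping MultinomialNB for an SVM or kNN classifier only requires changing the last step of the pipeline.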

How to prepare feature vectors for text classification when the words in the text are not frequently repeated?

I need to perform text classification on a set of emails. But the words in my text are very sparse, i.e. the frequency of each word across all the documents is very low. Words are not repeated very often. Because of this, I think a document-term matrix with frequency as the weight is not suitable for training the classifiers. Can you please suggest what other methods I could use?
Thanks
The real problem will be that, if your words are that sparse, a learned classifier will not generalise to real-world data. However, there are several solutions to this:
1.) Use more data. This is kind of a no-brainer. However, you can add not only labeled data but also unlabeled data, in a semi-supervised learning setting.
2.) Use more data (part b). You can look into the transfer learning setting. There you build a classifier on a large data set with similar characteristics (Twitter streams, for example) and then adapt this classifier to your domain.
3.) Get your processing pipeline right. Your problem might originate from a suboptimal processing pipeline. Are you doing stemming? In the emails, the word "stemming" should be mapped onto "stem". This can be pushed even further by using synonym matching with a dictionary.
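
To illustrate point 3, here is a sketch of plugging a stemmer into the vectorizer. The crude_stem function is a deliberately simplistic toy written for this example (an assumption, not a real stemming algorithm); in practice you would replace it with a proper stemmer such as a Porter stemmer.

```python
from sklearn.feature_extraction.text import CountVectorizer


def crude_stem(token):
    """Toy suffix stripper, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "stemm" -> "stem".
    if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token


# Reuse the default tokenization and lowercasing, then stem each token.
base_analyzer = CountVectorizer().build_analyzer()


def stemmed_analyzer(doc):
    return [crude_stem(token) for token in base_analyzer(doc)]


docs = ["Stemming the words", "He stemmed a word", "Word stems"]
plain = CountVectorizer().fit(docs)
stemmed = CountVectorizer(analyzer=stemmed_analyzer).fit(docs)

# Fewer distinct terms means each remaining term occurs in more documents.
print(len(plain.vocabulary_), len(stemmed.vocabulary_))
```

Merging word forms this way makes the document-term matrix less sparse, which is the effect the answer is after.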

How to include datetimes and other priority information for clustering?

I want to cluster text. I kinda understand the concept of clustering text-only content from Mahout in Action:
make a mapping (int -> term) of all terms in the input and store into a dictionary
convert all input documents into a normalized sparse vector
do clustering
I want to cluster text as well as other information like date-time, location, people I was with. For example, I want documents made in a 10-day visit to a distant place to be placed into a distinct cluster.
I know I must write my own tool for making vectors from date-time, location, tags and (natural) text. How do I approach this? Should I use built-in tools to vectorize the text and then integrate that output into my own vectors? What about weighting the dimensions?
I can't give you full implementation details, as I'm not sure of them, but I can help you out with a piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as locations, times/dates, person names).
For this, take a look at OpenNLP:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
In particular, look at the POS tagger and the name finder.
Once you have extracted the relevant entities, you may be able to do something with them using Mahout classification (once you have extracted enough entities to train your model), but of this I am not sure.
Good luck.
