How to apply tf-idf on multiple predictors, don't want to concatenate into a single column - vectorization

I have two text predictors and want to vectorize each of them using tf-idf. I don't want to concatenate them into a single column, since each needs its own vocabulary. Should I apply a separate tf-idf vectorizer to each and then join the resulting features?
For example, if applying tf-idf to predictor1 gives 100 features and applying it to predictor2 gives 200, my training data would simply have 300 (100 + 200) features. Am I thinking about this correctly?
I will get two matrices from this (one per predictor). Can I concatenate them using numpy functions and use the result as the feature matrix?

Your suggestion on getting this done is correct. The most common way of using two vectors like this is to concatenate them into a longer vector and then feed it to the model.
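As a minimal sketch of the concatenation described above, assuming scikit-learn's TfidfVectorizer (the toy texts and variable names are made up for illustration): the vectorizer returns SciPy sparse matrices, so scipy.sparse.hstack is used here rather than the NumPy functions mentioned in the question.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: two text predictors per training row.
predictor1 = ["red apple pie", "green apple tart", "red cherry pie"]
predictor2 = ["baked in oven", "served cold", "baked and served warm"]

# One vectorizer per predictor, so each keeps its own vocabulary.
vec1 = TfidfVectorizer()
vec2 = TfidfVectorizer()
X1 = vec1.fit_transform(predictor1)  # shape: (n_rows, n_features_1)
X2 = vec2.fit_transform(predictor2)  # shape: (n_rows, n_features_2)

# Column-wise concatenation; the result stays sparse.
X_train = hstack([X1, X2]).tocsr()

# At prediction time, reuse the fitted vectorizers (transform, not fit).
new1 = ["green cherry tart"]
new2 = ["served warm"]
X_new = hstack([vec1.transform(new1), vec2.transform(new2)]).tocsr()
```

The number of columns in X_train is the sum of the two vocabulary sizes, which matches the 100 + 200 = 300 reasoning in the question.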
If, for some reason, this doesn't work out for you, we can explore alternatives based on what your constraints are.
For example, if your constraint is total dimension size, one way to solve this would be to create a multilayer MLP autoencoder.
We can train it with the combined vectors as both input and target output until it reconstructs them well.
Subsequently, we can use the activations of an intermediate layer (typically the narrowest one) as input to our model.
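The autoencoder idea can be sketched roughly as follows. This is only an illustration under some assumptions: it uses scikit-learn's MLPRegressor as a stand-in autoencoder, random dense data in place of the real concatenated tf-idf matrix, and layer sizes chosen arbitrarily. The encode helper is hypothetical, not a library function.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((40, 30))  # stand-in for the concatenated tf-idf features

# Train the network to reproduce its own input (input == target).
autoencoder = MLPRegressor(
    hidden_layer_sizes=(16, 4, 16),  # 4-unit bottleneck in the middle
    activation="relu",
    max_iter=500,
    random_state=0,
)
autoencoder.fit(X, X)

def encode(model, data, n_layers):
    """Forward pass through the first n_layers hidden layers (ReLU)."""
    activations = data
    for weights, bias in zip(model.coefs_[:n_layers], model.intercepts_[:n_layers]):
        activations = np.maximum(activations @ weights + bias, 0.0)
    return activations

# Bottleneck activations: a 4-dimensional representation of each row.
codes = encode(autoencoder, X, n_layers=2)
```

The codes array can then be fed to the downstream model in place of the full-width feature matrix.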
It would be easier to suggest a solution if you can describe your constraints in the question.

Related

Can I use embedding layer instead of one hot as category input?

I am trying to use FFM to predict binary labels. My dataset is as follows:
sex|age|price|label
0|0|0|0
1|0|1|1
I know that FFM is a model that treats groups of attributes as belonging to the same field. If I use one-hot encoding to transform the dataset, it will look as follows:
sex_0|sex_1|age_0|age_1|price_0|price_1|label
1|0|1|0|1|0|0
0|1|1|0|0|1|1
Thus, sex_0 and sex_1 can be considered as one field. The other attributes are similar.
My question is whether I can use an embedding layer to replace the one-hot encoding step. However, I have some concerns:
I have no other related dataset, so I cannot use any pre-trained embedding model. I can only randomly initialize the embedding weights and then train them on my own dataset. Will this approach work?
If I use an embedding layer instead of one-hot encoding, does it mean that each attribute will belong to one field?
What is the difference between these two methods? Which is better?
Yes you can use embeddings and that approach does work.
A category value will not correspond to a single element of the embedding vector; rather, the combination of elements represents that value. The size of the embedding is something that you will have to select yourself. A good rule of thumb is embedding_size = min(50, (m + 1) // 2), where m is the number of categories, so if you have m = 10 you will have an embedding size of 5.
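A small sketch of that rule of thumb, together with what an embedding lookup amounts to. The NumPy table below is a randomly initialised stand-in for a trainable embedding layer, and the names are illustrative only. It also shows that looking up a row is the same as multiplying the one-hot vector by the table, which is one way to see how the two encodings relate.

```python
import numpy as np

def embedding_size(m: int, cap: int = 50) -> int:
    """Rule of thumb from the answer: min(cap, (m + 1) // 2)."""
    return min(cap, (m + 1) // 2)

m = 10                     # number of categories
dim = embedding_size(m)    # (10 + 1) // 2 == 5

# Randomly initialised table; in a real model these weights are trained.
rng = np.random.default_rng(0)
table = rng.normal(size=(m, dim))

ids = np.array([3, 7, 3])        # category indices for three rows
one_hot = np.eye(m)[ids]         # shape (3, 10)
via_matmul = one_hot @ table     # one-hot followed by a linear layer
via_lookup = table[ids]          # embedding lookup, same result
```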
A higher embedding size means it will capture more details on the relationship between the categorical variables.
In my experience, embeddings help most when a categorical variable has hundreds of distinct values. If it has only a few (e.g. the sex of a person), one-hot encoding is sufficient.
As for which is better, I find that embeddings generally perform better when there are hundreds of unique values in a category. I do not have concrete reasons for this, only some intuitions.
For example, representing categories as 300-dimensional dense vectors (as with word embeddings) requires the classifier to learn far fewer weights than if the categories were represented as 50,000-dimensional one-hot vectors, and the smaller parameter space possibly helps with generalization and avoiding overfitting.

Multi-Class Text Classification (using TFIDF and SVM). How to implement a scenario where one feedback may belong to more than one class?

I have a file of raw feedback texts that need to be labeled (categorized) and then used as training input for an SVM classifier (or any classifier, for that matter).
The catch is that I'm not assigning a whole feedback to a single category. One feedback may belong to more than one category, depending on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedbacks (documents). I've extracted the n-grams using TFIDF and saved their features so that I can train my model on them. The problem is that TFIDF returns a document-term matrix as train_x, whereas train_y holds the labels assigned to each n-gram (not to the whole document). So I've ended up with a document-term matrix with x rows (the number of documents) against a label vector of length y (the number of unique topics extracted).
Below is a sample of what the data look like. Blue marks the n-grams (extracted by TFIDF), while red marks the labels/categories (calculated for each n-gram with a function I wrote myself).
Instead of putting code, this is my strategy in implementing my concept:
The problem lies in the part where TFIDF produces x_train = tf.Transform(feedbacks), which is a document-term matrix. It doesn't make sense to use it as classifier input against y_train, which holds the labels for the terms and not for the documents. I've tried transposing the matrix, which gave me an error. I've also tried passing a 1-D array that holds only the feature values for the terms, which gave me an error as well, because the classifier expects X in (sample, feature) format. I'm using sklearn's SVM and TfidfVectorizer.
Simply put, I want to train an SVM classifier on a list of terms (n-grams) against a list of labels, and then have it predict labels for new data (after cleaning it and extracting its n-grams).
The solution might be something very technical, such as using another classifier that expects a different input format, or not using TFIDF since it is document-focused, or, more broadly, a complete change of approach and concept (if mine is wrong).
I'd very much appreciate it if someone could help.

What does the ranker in Weka PCA tell us about feature selection?

I have a data set of 31,000 rows with 13 attributes. Because most of them are categorical, I had to apply NominalToBinary to those attributes, so the number of attributes grew to 61.
I have sampled the data down to 18,000 rows and applied PCA with the Ranker in Weka. centerData is false, so it should normalise the data for me.
This is my result:
0.945 1 -0.367Marial_Status= Married-civ-spouse-0.365Relationship= Husband+0.298Marial_Status= Never-married+0.244Age=0_23+0.232Gender= Female...
I understand that the ranking is the variance. So is rank 1 94.5%? The issue I have with feature selection is: how do I know which ones to keep? Most of these attributes are categorical and were converted to numeric for the PCA. Given that the original data set has both categorical and numeric attributes, what does this output say about feature selection?
PCA assumes numerical data. If you binary-encode your categorical variables, you basically take a hammer and make your data fit your model's assumptions.
Another way to deal with categorical features is to use non-linear feature transformations, which find a way to represent distances between categories in a suitable way. A quick Google search turned up Categorical Principal Components Analysis (CATPCA) for me. Maybe have a look at this.

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200,000 tweets.
The idea is to manually label a small set of tweets and train classifiers to predict the labels of the rest, i.e. supervised learning.
What I would like to know is whether there is a method for choosing which samples to include in the training set so that it is a good representation of the whole data set and, because of the diversity it contains, the trained classifiers can be applied to the rest of the tweets with reasonable confidence.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
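For the first scenario, stratifying by target class can be sketched with scikit-learn's train_test_split. The texts and the 50-30-20 label distribution below are toy placeholders echoing the example proportions above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 50-30-20 class distribution.
texts = [f"tweet {i}" for i in range(100)]
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20

# stratify=labels keeps the class proportions in both splits.
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

train_counts = Counter(train_y)
test_counts = Counter(test_y)
```

Stratifying on other properties, such as binned tweet length, works the same way by passing a different array to stratify.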
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as a feature, coupled with a dimensionality reduction or clustering approach. For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
Use something like LDA to find n topics, then sample stratified by topic
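The clustering option could look roughly like this. It is a sketch under assumptions: TF-IDF unigrams with English stop-word filtering, scikit-learn's AgglomerativeClustering with a fixed number of clusters (equivalent to cutting the tree so that n bins remain), and a handful of made-up tweets. With 200,000 tweets you would likely cluster a subset, since hierarchical clustering on dense vectors does not scale well.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy tweets.
tweets = [
    "football match tonight great goal",
    "what a goal in the football match",
    "new phone camera is amazing",
    "phone battery and camera review",
    "heavy rain and storm warning today",
    "storm brings heavy rain tonight",
]

# Dense TF-IDF vectors (fine for a small sample).
X = TfidfVectorizer(stop_words="english").fit_transform(tweets).toarray()

# Asking for n_bins clusters corresponds to cutting the tree at that level.
n_bins = 3
cluster_ids = AgglomerativeClustering(n_clusters=n_bins).fit_predict(X)

# Stratify by branch: draw the same number of tweets from every cluster.
rng = np.random.default_rng(0)
per_bin = 1
to_label = []
for cluster in range(n_bins):
    members = np.flatnonzero(cluster_ids == cluster)
    picked = rng.choice(members, size=min(per_bin, len(members)), replace=False)
    to_label.extend(picked.tolist())
```

The indices in to_label are the tweets to hand to the annotators.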
Hope this helps!
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset and sample some tweets for which your classifier is least confident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.
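One round of that loop might look like the sketch below. The classifier (logistic regression on TF-IDF), the toy texts, and the batch size are assumptions for illustration; the answer does not say which model was used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: a small labelled seed and an unlabelled pool.
labeled_texts = ["great game", "awful service", "love this", "terrible delay"]
labeled_y = [1, 0, 1, 0]
unlabeled_texts = [
    "great service",
    "awful game",
    "love the delay",
    "terrible",
    "this is great",
]

vectorizer = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
clf = LogisticRegression().fit(vectorizer.transform(labeled_texts), labeled_y)

# Confidence = probability of the predicted class for each pool item.
proba = clf.predict_proba(vectorizer.transform(unlabeled_texts))
confidence = proba.max(axis=1)

# Pick the least confident items to label next.
batch_size = 2
query_idx = np.argsort(confidence)[:batch_size]
to_label = [unlabeled_texts[i] for i in query_idx]
```

After labelling to_label, those items move from the pool to the labelled set, the classifier is refit, and the round repeats.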

How to combine TFIDF features with other features

I have a classic NLP problem: I have to classify a news item as fake or real.
I have created two sets of features:
A) Bigram Term Frequency-Inverse Document Frequency
B) Approximately 20 features associated with each document, obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as subjectivity of the text, polarity, number of stopwords, number of verbs, number of subjects, grammatical relations, etc.
What is the best way to combine the TFIDF features with the other features for a single prediction?
Thanks a lot to everyone.
I'm not sure whether you're asking how to combine the two objects in code or what to do with them conceptually afterwards, so I will try to answer both.
Technically, your TFIDF is just a matrix where the rows are records and the columns are features. To combine them, you can append your new features as columns at the end of the matrix. If you did this with sklearn, your matrix is probably a sparse matrix (from SciPy), so you will have to make sure your new features are a sparse matrix as well (or make the other one dense).
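A minimal sketch of appending dense columns to a sparse TF-IDF matrix, assuming scikit-learn and SciPy. The documents and the two extra feature columns are invented placeholders for the pattern.en features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents.
docs = [
    "breaking news about the election",
    "scientists discover new planet today",
    "you will not believe this trick",
]

# Bigram TF-IDF features (sparse).
X_tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)

# Hypothetical dense features, e.g. subjectivity and number of stopwords.
extra = np.array([[0.2, 2.0], [0.1, 0.0], [0.9, 3.0]])

# Convert the dense block to sparse and append it as extra columns.
X = hstack([X_tfidf, csr_matrix(extra)]).tocsr()
```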
That gives you your training data. What to do with it next is a little more tricky. The features from a bigram frequency matrix will be sparse (I'm not talking about data structures here, I just mean that there will be a lot of 0s) and close to binary, since a given bigram is usually either absent from a document or present once, whereas your other data is dense and continuous. This will run in most machine learning algorithms as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering, I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). I usually got better results when I could at least bin the dense variables into binary indicators (or into categories that are then one-hot encoded) so that they didn't dominate.
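The binning step could be sketched with scikit-learn's KBinsDiscretizer. That choice, the number of bins, and the toy values are my assumptions; the answer does not say how the binning was done.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical dense features: polarity and number of verbs per document.
dense = np.array([
    [-0.8, 2.0],
    [-0.1, 5.0],
    [0.0, 9.0],
    [0.4, 12.0],
    [0.9, 20.0],
])

# Three equal-width bins per feature, one-hot encoded into binary columns.
binner = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform")
binned = binner.fit_transform(dense)  # sparse, shape (5, 6)
```

Because the output is already sparse and binary, it can be appended to the TF-IDF matrix with scipy.sparse.hstack in the same way as any other sparse block.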
What if you train a classifier on the TFIDF features alone and then add its predicted probabilities as a new feature alongside the other features, to get a better result? Here is a picture from an AutoML blueprint that shows the same idea. The results were > 90 percent for this approach versus 80 percent for the two separate classifiers.
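One way to read that suggestion is as stacking, sketched below. The models, the toy data, and the use of out-of-fold predictions (to keep the text model from seeing the labels it later predicts) are my assumptions, not details given in the answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: texts, binary labels, and two extra features.
docs = [
    "shocking secret they hide", "you will not believe this",
    "miracle cure found today", "shocking miracle trick revealed",
    "council approves new budget", "minister visits local school",
    "budget report published today", "school opens new library",
]
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
extra = np.array([
    [0.9, 3], [0.8, 4], [0.7, 2], [0.9, 3],
    [0.1, 1], [0.2, 2], [0.1, 1], [0.3, 2],
], dtype=float)

# Stage 1: text-only model; out-of-fold probability of class 1 per document.
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_proba = cross_val_predict(
    text_model, docs, y, cv=2, method="predict_proba"
)[:, 1]

# Stage 2: final model on the other features plus the text probability.
X_meta = np.column_stack([extra, text_proba])
final_model = RandomForestClassifier(n_estimators=50, random_state=0)
final_model.fit(X_meta, y)
```

For new documents, the text model would be refit on all training texts and its predict_proba output appended to the extra features before calling final_model.predict.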

Resources