predict text classification using python - machine-learning

I want to predict a text classification based on how strongly terms in the training data set are correlated.
For example, this is my training data:
"Mouse M325",
"Mouse for xyz M325",
"M325 Mouse logitech",
"Logitech mouse number M325"
As is visible, "Mouse" and "M325" have a much higher correlation than, say, "M325" and "Logitech" or the other terms.
I want to use the correlations to predict a classification for the next dataset.
For example, if the next data is "Mouse used by Alex number is M325", it should give me "Mouse M325" as the predicted class, and also note separately that the model predicted this description even though it was not something the machine had seen in the training data.
How can I solve this?
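A minimal sketch of one possible approach, assuming scikit-learn: vectorize the descriptions with TF-IDF, train a simple classifier, and flag any prediction whose nearest training example is too dissimilar as "not seen in training". The second class ("Keyboard K120") and the 0.5 similarity threshold are purely illustrative assumptions, not part of the original question.

```python
# Sketch only, not a definitive solution: assumes scikit-learn is installed.
# The "Keyboard K120" class and the 0.5 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

train_texts = [
    "Mouse M325",
    "Mouse for xyz M325",
    "M325 Mouse logitech",
    "Logitech mouse number M325",
    "Keyboard K120",                # hypothetical second class, needed so the
    "Logitech keyboard K120",       # classifier has more than one label to learn
]
train_labels = ["Mouse M325"] * 4 + ["Keyboard K120"] * 2

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = LogisticRegression().fit(X_train, train_labels)

def predict_with_flag(text, threshold=0.5):
    """Predict a class and flag inputs dissimilar to every training text."""
    x = vectorizer.transform([text])
    label = clf.predict(x)[0]
    max_sim = cosine_similarity(x, X_train).max()
    unseen = max_sim < threshold  # True: model extrapolated beyond training data
    return label, unseen

print(predict_with_flag("Mouse used by Alex number is M325"))
```

The flag plays the role of the "separate tab" notification: the prediction is still returned, but the caller can see it came from a description unlike anything in the training set.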

Related

Average prediction vs. average label of a model with log loss

In a model trained with log loss, if I understand correctly, the average prediction will match the average label on the training data.
My question is, does that also hold after slicing by feature values? E.g., if there's a feature with 2 values, A and B, would the average prediction align with the average label on both (1) examples with feature value A, and (2) examples with feature value B?
If so, what's the intuition behind that?
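For (effectively) unregularized logistic regression with an intercept, the first-order conditions of log loss force the sum of (prediction - label) to be zero for every feature column; if the A/B feature is included as a 0/1 indicator, that condition applies within the A slice (and hence the B slice too). A small sketch of this identity, assuming scikit-learn, with synthetic data:

```python
# Sketch: calibration of log loss overall and per slice, assuming scikit-learn.
# A large C approximates an unregularized fit; data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
is_a = rng.integers(0, 2, n)                 # feature: 1 for value A, 0 for B
noise = rng.normal(size=n)
y = (0.8 * is_a + noise > 0.5).astype(int)   # synthetic labels

X = np.column_stack([is_a, noise])           # A/B indicator included as a feature
clf = LogisticRegression(C=1e6).fit(X, y)    # C=1e6 ~ no regularization
p = clf.predict_proba(X)[:, 1]

for name, mask in [("all", np.ones(n, bool)), ("A", is_a == 1), ("B", is_a == 0)]:
    print(name, round(p[mask].mean(), 4), round(y[mask].mean(), 4))
```

The averages match (up to the tiny residual regularization) on the full training set and on each slice, because the gradient with respect to the indicator's weight and the intercept must both vanish at the optimum.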

Can K-means do dimensionality reduction?

My question is: if we have 10 continuous columns,
can we run k-means to shrink the 10 columns down to 1 column of cluster labels,
and then fit a decision tree or logistic regression on it?
When new data comes in, use the k-means result to determine its cluster label and pass that to the machine learning model.
K-means is absolutely not a dimensionality reduction technique. Dimensionality reduction algorithms map the input space to a lower-dimensional input space, while what you are proposing is mapping the input space directly to an output space consisting of the set of all integer labels.
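For concreteness, here is a sketch of the pipeline the question describes, assuming scikit-learn and synthetic data. Note that `KMeans.transform`, which returns distances to each centroid, is much closer in spirit to a dimensionality reduction than the raw integer labels are.

```python
# Sketch of the proposed pipeline, assuming scikit-learn; data is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # 10 continuous columns
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic target

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = km.labels_.reshape(-1, 1)        # 10 columns "shrunk" to 1 label column

clf = LogisticRegression().fit(labels, y)

# New data: assign a cluster first, then feed the label to the model.
X_new = rng.normal(size=(3, 10))
new_labels = km.predict(X_new).reshape(-1, 1)
print(clf.predict(new_labels))

# Closer to true dimensionality reduction: distances to the 5 centroids,
# which is still an input space, just a smaller one.
X_reduced = km.transform(X)               # shape (500, 5)
```

The label column throws away everything except cluster membership, which is why the answer above objects to calling it dimensionality reduction.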

Why does having too many principal components for handwritten digit classification result in lower accuracy

I'm currently using PCA for handwritten digit recognition on the MNIST database (each digit has about 1000 observations and 784 features). One thing I have found confusing is that accuracy is highest with 40 PCs; as the number of PCs grows beyond this point, the accuracy starts to drop continuously.
From my understanding of PCA, I thought the more components I have, the better I can describe the dataset. Why does the accuracy become lower if I have too many PCs?
In order to identify the optimum number of components, you need to plot the elbow curve:
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
The idea behind PCA is to reduce the dimensionality of the data by finding the principal components.
Lastly, I do not think that PCA can overfit the data, as it is not a learning/fitting algorithm.
You are just projecting the data onto eigenvectors so as to capture most of the variance along each axis.
This video should help: https://www.youtube.com/watch?v=_UVHneBUBW0
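A sketch of the elbow plot for PCA, assuming scikit-learn and matplotlib; sklearn's bundled 8x8 digits dataset stands in for MNIST here:

```python
# Sketch: cumulative explained variance vs. number of PCs, assuming
# scikit-learn and matplotlib are installed. The digits dataset is a
# small stand-in for MNIST.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

cumvar = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cumvar) + 1), cumvar, marker=".")
plt.xlabel("number of principal components")
plt.ylabel("cumulative explained variance")
plt.title("Pick the 'elbow' where the curve flattens")
plt.show()
```

Components past the elbow add little variance and mostly carry noise, which is one plausible reason classifier accuracy degrades as more PCs are kept.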

What is loss_cls and loss_bbox and why are they always zero in training

I'm trying to train on a custom dataset using faster_rcnn with the PyTorch implementation of Detectron here. I have made changes to the dataset and configuration according to the guidelines in the repo.
The training process runs to completion, but the loss_cls and loss_bbox values are 0 from the beginning, and the final output cannot be used for evaluation or inference.
I would like to know what these two losses mean and how to get their values to change during training. The exact model I'm using is e2e_faster_rcnn_R-50-FPN_1x.
Any help regarding this would be appreciated. I'm using Ubuntu 16.04 with Python 3.6 on Anaconda, CUDA 9, cuDNN 7.
What are the two losses?
When training a multi-object detector, you usually have (at least) two types of losses:
loss_bbox: a loss that measures how "tight" the predicted bounding boxes are to the ground-truth object (usually a regression loss: L1, smooth L1, etc.).
loss_cls: a loss that measures the correctness of the classification of each predicted bounding box: each box may contain an object class, or "background". This loss is usually cross-entropy loss.
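A minimal sketch of these two loss types in PyTorch; the tensors below are made-up stand-ins, not Detectron's internal format:

```python
# Illustrative only: made-up tensors standing in for a detector's outputs.
import torch
import torch.nn.functional as F

# loss_bbox: regression loss between predicted and ground-truth coordinates.
pred_boxes = torch.tensor([[48.0, 52.0, 198.0, 205.0]])   # (x1, y1, x2, y2)
gt_boxes   = torch.tensor([[50.0, 50.0, 200.0, 200.0]])
loss_bbox = F.smooth_l1_loss(pred_boxes, gt_boxes)

# loss_cls: cross entropy over class scores; index 0 is "background".
cls_scores = torch.tensor([[0.1, 2.5, 0.3]])  # background, class 1, class 2
gt_classes = torch.tensor([1])
loss_cls = F.cross_entropy(cls_scores, gt_classes)

print(loss_bbox.item(), loss_cls.item())
```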
Why are the losses always zero?
When training a detector, the model predicts quite a few (~1K) candidate boxes per image. Most of them are empty (i.e., belong to the "background" class). The loss function associates each predicted box with the ground-truth box annotations of the image.
If a predicted box has significant overlap with a ground-truth box, then loss_bbox and loss_cls are computed to see how well the model predicts that ground-truth box.
On the other hand, if a predicted box has no overlap with any ground-truth box, then only loss_cls is computed, for the "background" class.
However, if a predicted box has only very partial overlap with the ground truth, it is "discarded" and no loss is computed. I suspect that, for some reason, this is the case in your training session.
I suggest you check the parameters that determine the association between predicted boxes and ground-truth annotations. Also look at the parameters of your "anchors": these determine the scales and aspect ratios of the predicted boxes.
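For intuition, a sketch of the IoU-based assignment rule described above; the 0.5/0.1 thresholds are illustrative assumptions, not Detectron's actual configuration values:

```python
# Illustrative IoU-based matching; thresholds are assumptions, not
# values taken from Detectron's configuration.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def assign(pred, gt_boxes, fg_thresh=0.5, bg_thresh=0.1):
    best = max((iou(pred, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return "foreground: loss_cls + loss_bbox"
    if best < bg_thresh:
        return "background: loss_cls only"
    return "discarded: no loss"  # partial overlap contributes nothing

print(assign((48, 52, 198, 205), [(50, 50, 200, 200)]))
```

If the thresholds or anchor scales are misconfigured so that every box falls into the "discarded" band, both losses stay at zero, which matches the symptom described in the question.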

SMOTE oversampling for anomaly detection using a classifier

I have sensor data and I want to do live anomaly detection: use LOF on the training set to detect anomalies, then feed the labeled data to a classifier to classify new data points. I considered SMOTE because I want more anomaly points in the training data to overcome the class-imbalance problem, but the issue is that SMOTE creates many points that fall inside the normal range.
How can I do oversampling without creating samples inside the normal data range?
(Figure: the data before applying SMOTE.)
(Figure: the data after applying SMOTE.)
SMOTE linearly interpolates synthetic points between a minority-class sample and its k-nearest neighbors, so you end up with points on the segments between a sample and its neighbors (the interpolation step is sketched below). When the samples are scattered all over the place like this, it makes sense that you end up creating synthetic points in the middle of the normal range.
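The interpolation SMOTE performs is just this one step; a sketch with NumPy, where the sample and its neighbor are made-up points:

```python
# SMOTE's core step, sketched with NumPy; the two points are made up.
import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])        # a minority-class sample
x_nn = np.array([5.0, 6.0])       # one of its k nearest minority neighbors

lam = rng.uniform()               # random gap in [0, 1]
x_new = x_i + lam * (x_nn - x_i)  # synthetic point on the segment between them
```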
SMOTE should really be used when you can identify more specific regions of the feature space as the decision region for the minority class. That doesn't seem to be your use case: you want to know which points "don't belong."
This seems like a fairly nice use case for DBSCAN, a density-based clustering algorithm that will identify points beyond some distance, eps, as not belonging to the same neighborhood.
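A minimal DBSCAN sketch, assuming scikit-learn; the data is synthetic and the eps and min_samples values are illustrative, so they would need tuning on the actual sensor data:

```python
# Sketch, assuming scikit-learn; eps/min_samples are illustrative values.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))      # dense "normal" cloud
outliers = rng.uniform(-6, 6, size=(10, 2))   # scattered anomalies
X = np.vstack([normal, outliers])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
anomaly_mask = db.labels_ == -1   # DBSCAN labels noise points as -1
print(f"{anomaly_mask.sum()} points flagged as not belonging to any cluster")
```

Points labeled -1 are exactly the ones that sit farther than eps from any dense neighborhood, which is the "don't belong" notion the question is after.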
