Roboflow dropping Cyrillic-labeled objects when creating a dataset version

I have a Russian license plate symbol classification dataset that I labeled from scratch. While labeling, there was no problem naming classes with Cyrillic letters, and everything shows correctly in the "Health Check" tab. However, when I try to create a new version, all objects of Cyrillic-named classes are implicitly dropped. And when I say implicitly, I mean that the actual number of images in the training set is less than the amount stated in the "training set" tab (4202 actual vs. 7102 stated), and the same holds for validation and test.
I've tried:
Creating a version with augmentation, dropping only the dummy class "res"
The same, but remapping all Cyrillic class names to Latin
No augmentation, dropping only "res"
No augmentation, no dropping
Dropping all classes
And the result (augmentation aside) was always the same.
[Screenshots: the actual last image of the training set; the stated image count in the "training set" tab; Roboflow showing all 23 classes "dropped" while the images remain; the Health Check tab]

We ended up dropping the non-ASCII characters. I went ahead and shared this UX issue with the team, as it's something that can be addressed in the labeling process.
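In the meantime, a workaround on the labeling side is to transliterate the Cyrillic class names to ASCII before creating a version. Below is a minimal sketch, assuming COCO-style JSON annotations; the file names are placeholders, and the mapping covers the twelve Cyrillic letters used on Russian plates (those with Latin look-alikes):

    # Sketch: remap Cyrillic class names to ASCII look-alikes before versioning.
    import json

    # The twelve Cyrillic letters allowed on Russian plates and their Latin twins
    CYR_TO_LAT = {
        "А": "A", "В": "B", "Е": "E", "К": "K", "М": "M", "Н": "H",
        "О": "O", "Р": "P", "С": "C", "Т": "T", "У": "Y", "Х": "X",
    }

    def to_ascii(name: str) -> str:
        """Replace each Cyrillic character with its Latin look-alike."""
        return "".join(CYR_TO_LAT.get(ch, ch) for ch in name)

    # "annotations.json" is a placeholder for your exported annotation file
    with open("annotations.json", encoding="utf-8") as f:
        coco = json.load(f)

    for cat in coco["categories"]:
        cat["name"] = to_ascii(cat["name"])

    with open("annotations_ascii.json", "w", encoding="utf-8") as f:
        json.dump(coco, f, ensure_ascii=False, indent=2)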

Related

Feature selection needed before the train-test split due to the small test set and short instances. What should be done?

I am working on an NLP project where I need to predict the correct classes of short sentences, which are the instances in my case. I am using root words as features. My dataset is not very large (about 6000 instances/sentences). Since there are too many features, I used a mutual-information (MI) based feature-selection method to reduce the number of features to about 1000.
My problem is: if I split the dataset and then do feature selection on the training set only, the model/classifier is built on the features available in the training set only, most of which are absent from the test set. As a result, the model may perform very badly.
What should I do to resolve this issue?
I am currently selecting features first and then doing CV. I know this approach may cause data leakage from the test set to the training set, but I am still doing it because of the aforementioned issue.
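One standard remedy (sketched here, not from the original thread) is to put vectorization and MI-based selection inside a cross-validation pipeline, so the selector is re-fit on each fold's training portion and never sees held-out data; words absent from that portion are simply ignored at scoring time rather than breaking the model. The 20-newsgroups data below stands in for your sentence corpus, and k=1000 mirrors your setup:

    # Sketch: leakage-free feature selection by fitting the selector per CV fold.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ("vect", CountVectorizer()),                           # words -> counts
        ("select", SelectKBest(mutual_info_classif, k=1000)),  # MI-based selection
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # Stand-in corpus; replace with your own sentences and labels.
    data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

    # Vectorizer and selector are re-fit inside every fold, so nothing leaks
    # from the held-out portion into feature selection.
    scores = cross_val_score(pipe, data.data, data.target, cv=5)
    print(scores.mean())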

Machine learning: do unbalanced classes of a non-numeric variable matter?

If I have a non-numeric variable in my data set that contains many of one class but few of another, does this cause the same issues as when the target classes are unbalanced?
For example, suppose one of my variables is title and the aim is to identify whether a person is obese. The obese target class is split 50:50, but there is only one row with the title 'Duke', and this row is in the obese class. Does this mean that an algorithm like logistic regression (after numeric encoding) would start predicting that all Dukes are obese (or give a disproportionate weight to the title 'Duke')? If so, are some algorithms better or worse at handling this case? Is there a way to prevent this issue?
Yes, any vanilla machine learning algorithm will treat categorical data the same way as numerical data in terms of the information it extracts from a specific feature.
Consider this: before applying any machine learning algorithm, you should analyze your input features and identify the explained variance each one contributes to the target. In your case, if the label Duke is always identified as obese, then for that specific dataset it is an extremely high-information feature and will be weighted as such.
I would mitigate this issue by adding a weight to that feature value, minimizing its impact on the target. However, this would be a shame if the feature is otherwise very informative for other instances.
An algorithm that can easily circumvent this problem is a random forest (decision trees): you can eliminate any rule based on this feature being Duke.
Be very careful when mapping this feature to numbers, as the mapping will affect the importance most algorithms attribute to the feature.
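To make the last point concrete, here is a minimal sketch (the data and column names are illustrative) of one-hot encoding the title instead of mapping it to arbitrary integers, with L2 regularization shrinking the weight learned from the single 'Duke' row:

    # Sketch: one-hot encode a categorical feature and damp rare categories
    # with regularization instead of assigning arbitrary integer codes.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.DataFrame({
        "title": ["Mr", "Mrs", "Mr", "Duke", "Ms", "Mr"],
        "obese": [0, 1, 0, 1, 1, 0],
    })

    X = pd.get_dummies(df[["title"]])  # one indicator column per title
    y = df["obese"]

    # Smaller C means a stronger L2 penalty, which shrinks the weight learned
    # from the single 'Duke' row toward zero.
    clf = LogisticRegression(C=0.1).fit(X, y)
    print(dict(zip(X.columns, clf.coef_[0])))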

Sequence-to-sequence learning for language translation: what about unseen words?

Sequence-to-sequence learning is a powerful mechanism for language translation, especially when used locally in a context-specific case.
I am following this PyTorch tutorial for the task.
However, the tutorial does not split the data into training and test sets.
You might think it's not a big deal: just split it up, use one chunk for training and the other for testing. But it is not that simple.
Essentially, the tutorial creates indices for the seen words while loading the dataset. The indices are simply stored in a dictionary. This happens before anything reaches the encoder RNN; it is just a simple conversion from words to numbers.
If the data is split at random, some words may not appear in any training-set sentence and so will have no index at all. If such a word shows up at test time, what should be done?
Extend the dictionary?
Sequence-to-sequence model performance depends strongly on the number of unique words in the vocabulary. Each unique word has to be encountered a number of times in the training set so that the model can learn its correct usage. Words that appear only a few times cannot be used by the model, because it can't learn enough information about them. In practice, the size of the dictionary is usually reduced by replacing rare words with a special "UNK" token. Therefore, if a new word occurs during testing, you can assume it is rare (since it never appears in the training set) and replace it with "UNK".
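A minimal sketch of that scheme (the min_count threshold and sentences are illustrative): words below min_count occurrences in training, and any word never seen at all, fall back to the UNK index.

    # Sketch: vocabulary with an UNK token for rare and unseen words.
    from collections import Counter

    UNK = "UNK"

    def build_vocab(sentences, min_count=2):
        counts = Counter(w for s in sentences for w in s.split())
        vocab = {UNK: 0}
        for word, c in counts.items():
            if c >= min_count:
                vocab[word] = len(vocab)
        return vocab

    def encode(sentence, vocab):
        # Unknown or rare words map to the UNK index.
        return [vocab.get(w, vocab[UNK]) for w in sentence.split()]

    train = ["je suis content", "tu es content"]
    vocab = build_vocab(train, min_count=1)
    print(encode("je suis fatigue", vocab))  # 'fatigue' was never seen -> UNK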

How to prepare feature vectors for text classification when words in the text are rarely repeated?

I need to perform text classification on a set of emails, but the words in my text are very sparse, i.e., the frequency of each word across all documents is very low; words do not repeat often. Because of this, I think a document-term matrix with frequency weighting is not suitable for training classifiers. Can you please suggest what other methods I could use?
Thanks
The real problem is that if your words are that sparse, a learned classifier will not generalise to real-world data. However, there are several solutions:
1.) Use more data. This is kind of a no-brainer. However, you cannot only add labeled data; you can also use unlabelled data in a semi-supervised setting.
2.) Use more data (part b). Look into the transfer-learning setting: build a classifier on a large dataset with similar characteristics (this might be Twitter streams) and then adapt that classifier to your domain.
3.) Get your processing pipeline right. Your problem might originate from a suboptimal processing pipeline. Are you doing stemming? In an email, the word "stemming" should be mapped onto "stem" (a minimal sketch follows below). This can be pushed even further by using synonym matching with a dictionary.
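A minimal stemming sketch (the example sentence is illustrative; NLTK's PorterStemmer is one common choice):

    # Sketch: collapse inflected variants onto a common stem to reduce sparsity
    # before building the document-term matrix.
    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()

    def preprocess(text):
        return [stemmer.stem(tok) for tok in text.lower().split()]

    print(preprocess("Stemming stemmed words reduces sparse vocabularies"))
    # -> ['stem', 'stem', 'word', 'reduc', 'spars', 'vocabulari']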

Handling data present only in the training set in machine learning

So I have a classification problem in which I have to classify crimes into categories based on different features (the SF Crime competition on Kaggle, for those who are familiar). An interesting aspect of this dataset is that there are two extra features, "Descript" and "Resolution", both containing short text, which are present ONLY in the training set and not in the test set. They have small pieces of text as values, such as "UNDER INFLUENCE OF ALCOHOL IN A PUBLIC PLACE", "VIOLATION OF STAY AWAY ORDER", etc.
My question is: how can I use these fields even though they appear only in the training set? Currently I am discarding them, but I want to extract some information from them.
