Different format for training and testing data [closed] - machine-learning

How can I use different formats of data for training and testing?
Currently I am working on a classification problem where the data format is different for training and testing, so the model is not able to classify properly. But my use case is such that I have to use the data in this way. Below is my format.
I have the following structure for training now:
------------------------------------------------------------
| Attribute_Names  | Attribute_Values        | Category    |
------------------------------------------------------------
| Brand            | Samsung, Nokia, OnePlus | Mobile      |
| RAM, Memory      | 2 GB, 4 GB, 3 GB, 6 GB  | Mobile      |
| Color, Colour    | Black, Golden, White    | Mobile      |
------------------------------------------------------------
| Fabric, Material | Cloth, Synthetic, Silk  | Ethnic Wear |
| Pattern, Design  | Digital, floral print   | Ethnic Wear |
------------------------------------------------------------
Testing: 'Samsung Galaxy On Nxt 3 GB RAM 16 GB ROM Expandable Upto 256 GB 5.5 inch Full HD Display'
I have also posted this problem, with an explanation, in the Data Science group:
product-classification-in-e-commerce-using-attribute-keywords
Any help is greatly appreciated.

As a very general principle, the format of the training & test data must be the same. There are no simple workarounds for this. There can of course be cases where the information contained in the attributes of the said datasets is essentially the same, only formatted differently; such cases are handled with appropriate data preprocessing.
If your data gathering process changes at some point in time, collecting different attributes, there are some choices available, highly dependent on the particular case: if you happen to collect more attributes than the ones with which you have trained your initial classifier, you can choose to simply ignore them, with the downside of possibly throwing away useful information. But if you happen to collect fewer attributes, there is not much you can do other than possibly treat them as missing values in your prediction pipeline (provided that your classifier can indeed handle such missing values without a significant decrease in performance); the other choice would be to re-train your classifier, dropping from your initial training set the attributes not collected in your new data gathering process.
Machine learning is not magic; on the contrary, it is subject to very real engineering constraints, the compatibility of the training and test data being one of the most fundamental ones. If you look around for real-world cases of deployed ML models, you'll see that the issue of possible periodic re-training of the deployed models pops up almost everywhere...

So based on your description, I think what you have is a bunch of data where some items are labeled and most are not, which puts you in a semi-supervised kind of setting.
Now, testing on different data than what you trained on is definitely a no-go. What you should do is mix up the data so that the labels are distributed across both the training and the test set, that is, if you want to use a supervised methodology. Once you have done that, you can use techniques for label propagation, where you e.g. cluster similar data and assign the most frequent label inside each group; this can be as easy as simple k-means clustering or as involved as creating your own graph structure and implementing a custom algorithm. Assuming you now have fully labeled data, you can train your model on a mixture of real labels and assigned ones and test it the same way.
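A minimal sketch of this cluster-then-propagate idea, assuming scikit-learn; the toy texts and labels below (with None marking unlabeled items) are hypothetical placeholders:

# Illustrative sketch only: propagate the most frequent known label within each cluster.
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Samsung 4 GB RAM mobile", "silk ethnic wear floral print", "Nokia 2 GB RAM phone"]
labels = ["Mobile", "Ethnic Wear", None]  # None = unlabeled

X = TfidfVectorizer().fit_transform(texts)
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

for c in set(clusters):
    known = [l for l, k in zip(labels, clusters) if k == c and l is not None]
    majority = Counter(known).most_common(1)[0][0] if known else None
    labels = [majority if (k == c and l is None) else l for l, k in zip(labels, clusters)]

print(labels)  # unlabeled items now carry the majority label of their cluster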
This is the only way you get a reliable performance evaluation. You obviously could also try to just train on the completely labeled data and evaluate on a mixture of real labels and assigned ones, but I'd assume this would lead to over-fitting.
Another related approach would be to create some kind of word vectors (e.g. Google's word2vec), use your labeled data to form one cluster per category, and assign new unlabeled data to the most similar cluster (there are many variants and ways to do this).
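A rough sketch of that variant, assuming gensim 4.x is available; the corpus, the categories and all names below are hypothetical:

# Assign unlabeled documents to the most similar category centroid in word-vector space.
import numpy as np
from gensim.models import Word2Vec

labeled = {"Mobile": ["samsung 4 gb ram", "nokia 2 gb ram"],
           "Ethnic Wear": ["silk floral print", "synthetic cloth kurta"]}
unlabeled = ["oneplus 6 gb ram mobile"]

sentences = [s.split() for docs in labeled.values() for s in docs] + [s.split() for s in unlabeled]
w2v = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, seed=0)

def doc_vec(text):
    # Average the word vectors of the tokens that are in the vocabulary.
    return np.mean([w2v.wv[w] for w in text.split() if w in w2v.wv], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

centroids = {cat: np.mean([doc_vec(d) for d in docs], axis=0) for cat, docs in labeled.items()}

for doc in unlabeled:
    v = doc_vec(doc)
    print(doc, "->", max(centroids, key=lambda c: cosine(v, centroids[c])))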
There are many ways to tackle such a problem; they are not standard textbook algorithms, but rather things you have to implement yourself, or at least you need to do some serious preprocessing to make your data fit a standard algorithm.
I hope this helps a bit.

Related

Can different summary metrics of a single feature be used as a features for k-means clustering?

I have a scenario where I want to understand customers' behavior patterns and group them into different segments/clusters for an e-commerce platform. I chose an unsupervised machine learning algorithm, k-means clustering, to accomplish this task.
I have purchase_orders data available to me.
In the process of preparing my data set, I had a question: can different summary metrics (sum, average, min, max, standard deviation) of a feature be used as separate features, or should I take only one summary metric of a feature (say, the sum of a customer's transaction amounts over multiple orders)?
Will this affect how the k-means algorithm works?
Which of the two data formats mentioned below would be optimal to feed to my algorithm to derive good results?
Format-1:
Customer ID | Total.TransactionAmount | Min.TransactionAmount | Max.TransactionAmount | Avg.TransactionAmount | StdDev.TransactionAmount | TotalNo.ofTransactions and so on...
Format-2:
Customer ID | Total.TransactionAmount | TotalNo.ofTransactions and so on...
(Note: Consider "|" as feature separator)
(Note: Customer ID is not fed as input to the algo)
Yes, you can, but whether this is a good idea is far from clear.
These values will be correlated, and hence this will distort the results. It will likely make all the problems you already have worse (such as the values not being linear, not being of equal importance and hence needing weighting, and not being of similar magnitude).
With features such as "transaction amount" and "number of transactions" you already have some pretty bad scaling issues to solve, so why add more?
It's straightforward to write down your objective function. Put your features into the equation, and try to understand what you are optimizing - is this really what you need? Or do you just want some random result?
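If you do keep several summary metrics, at minimum put them on a comparable scale before clustering. A minimal sketch, assuming scikit-learn and pandas; the column names follow the hypothetical format from the question:

# Standardize the summary features so no single feature dominates the distance computation.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Total.TransactionAmount": [1200.0, 85.0, 40000.0],
    "TotalNo.ofTransactions": [12, 3, 180],
})

X = StandardScaler().fit_transform(df)  # zero mean, unit variance per feature
clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(clusters)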

Handling high cardinality features with supervised ratio and weight of evidence [closed]

Say a data set has a categorical feature with high cardinality, such as zip codes or cities. Encoding this feature would give hundreds of feature columns. Different approaches such as the supervised ratio and weight of evidence (WOE) seem to give better performance.
The question is, the supervised ratio and WOE are to be calculated on the training set, right? So I take the training set, calculate the SR and WOE, update the training set with the new values, and keep the calculated values to be used on the test set as well. But what happens if the test set has zip codes that were not in the training set, so there is no SR or WOE value to use? (Practically this is possible if the training set does not cover all possible zip codes, or if there are only one or two records for certain zip codes, which might fall into either the training set or the test set.)
(The same will happen with the encoding approach.)
I am most interested in whether SR and/or WOE is the recommended way to handle a feature with high cardinality, and if so, what to do when there are values in the test set that were not in the training set.
If not, what are the recommended ways of handling high cardinality features, and which algorithms are more robust to them? Thank you.
This is a well-known problem when applying value-wise transformations to a categorical feature. The most common workaround is to have a set of rules to translate unseen values into values known by your training set.
This can be just a single 'NA' value (or 'others', as another answer is suggesting), or something more elaborate (e.g. in your example, you can map unseen zip codes to the closest known one in the training set).
Another possible solution in some scenarios is to have the model refuse to make a prediction in those cases and just return an error.
For your second question, there is not really a recommended way of encoding high cardinality features (there are many methods, and some may work better than others depending on the other features, the target variable, etc.); what we can recommend is to implement a few and experiment to see which one is more effective for your problem. You can consider the preprocessing method as just another parameter of your learning algorithm.
That's a great question, thanks for asking!
When approaching this kind of problem of handling a feature with high cardinality, like zip codes, I keep only the most frequent values in my training set and put all the others in a new category 'others'; then I calculate their WOE or any other metric.
If an unseen zip code is found in the test set, it falls into the 'others' category. In general, this approach works well in practice.
I hope this naive solution can help you!
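A naive sketch of this 'others' bucket plus WOE encoding for a binary target, assuming pandas; the column names, the frequency threshold and the smoothing constant are all hypothetical choices:

# Bucket infrequent zip codes into 'others', then compute WOE per bucket.
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "zipcode": ["10001", "10001", "94105", "94105", "60601", "73301"],
    "target":  [1, 0, 1, 1, 0, 0],
})

counts = train["zipcode"].value_counts()
keep = set(counts[counts >= 2].index)  # keep only the frequent values
train["zip_bucket"] = train["zipcode"].where(train["zipcode"].isin(keep), "others")

# WOE per bucket: ln( P(bucket | target=1) / P(bucket | target=0) ), with 0.5 smoothing.
events = train.groupby("zip_bucket")["target"].sum() + 0.5
nonevents = train.groupby("zip_bucket")["target"].apply(lambda s: (s == 0).sum()) + 0.5
woe = np.log((events / events.sum()) / (nonevents / nonevents.sum()))

def encode(zipcode):
    bucket = zipcode if zipcode in keep else "others"  # unseen test zips fall into 'others'
    return woe[bucket]

print(encode("10001"), encode("99999"))  # "99999" was never seen during training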

Improving Article Classifier Accuracy

I've built an article classifier based on Wikipedia data that I fetch, covering 5 classifications in total.
They are:
Finance (15 articles) [1,0,0,0,0]
Sports (15 articles) [0,1,0,0,0]
Politics (15 articles) [0,0,1,0,0]
Science (15 articles) [0,0,0,1,0]
None (15 random articles not pertaining to the others) [0,0,0,0,1]
I went to Wikipedia and grabbed about 15 pretty lengthy articles from each of these categories to build the corpus I could use to train my network.
After building a lexicon of about 1,000 words gathered from all of the articles, I converted each article to a word vector, along with the correct classifier label.
The word vector is a multi-hot (bag-of-words) array, while the label is a one-hot array.
For example, here is the representation of one article:
[
[0,0,0,1,0,0,0,1,0,0,... > 1000], [1,0,0,0,0] # this maps to Finance
]
So, in essence, I have this randomized list of word vectors mapped to their correct classifiers.
My network is a 3-layer deep neural net with 500 nodes in each layer. I train the network for 30 epochs and then just display how accurate my model is at the end.
Right now, I'm getting about 53% to 55% accuracy. My question is, what can I do to get this up into the 90's? Is it even possible, or am I going to go crazy trying to train this thing?
Perhaps additionally, what is my main bottleneck so to speak?
Neural networks aren't really designed to run best on single machines; they work much better if you have a cluster, or at least a production-grade machine. It's very common to eliminate the "long tail" of a corpus - if a term only appears in one document one time, then you may want to eliminate it. You may also want to apply some stemming so that you don't capture multiples of the same word. I strongly advise you to try applying a TF-IDF transformation to your corpus before pruning.
Network size optimization is a field unto itself. Basically, you try adding more/less nodes and see where that gets you. See the following for a technical discussion.
https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
It is impossible to know without seeing the data.
Things to try:
Transform your word vector to TF-IDF. Are you removing stop words? You can add bi-grams/tri-grams to your word vector (see the sketch below these suggestions).
Add more articles - it could be difficult to separate them in such a small corpus. The length of a specific document doesn't necessarily help; you want to have more articles.
30 epochs feels very low to me.
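A minimal sketch combining these preprocessing suggestions (TF-IDF weighting, English stop-word removal, bi-grams), assuming scikit-learn; the documents and labels are hypothetical stand-ins for the Wikipedia articles, and the linear model is just a baseline to compare the neural net against:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["stocks and bonds rallied after the earnings report",
        "the striker scored twice in the final match",
        "the senate passed a new spending bill",
        "researchers published the results of the physics experiment"]
labels = ["Finance", "Sports", "Politics", "Science"]

# min_df prunes terms appearing in fewer than min_df documents (the "long tail");
# it is left at 1 here only because this toy corpus is tiny.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1, sublinear_tf=True)
X = vectorizer.fit_transform(docs)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(vectorizer.transform(["the goalkeeper saved a penalty"])))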

Classification with no classifiers in the training data

I have news data spanning a period of years, and I want to train a model such that whenever I give it a test news item it returns the industry to which the news is related, for example 'manufacturing' or 'finance'. This could be done with a classification algorithm, but unfortunately I do not have the class labels to train on. My data looks like this:
ID | News
1 | News1
2 | News2
3 | News3
If the data would have been in the following form then I could apply classification algorithms to classify the industry:
ID | News | Industry Related to
1 | News1 | Manufacturing
2 | News2 | Finance
3 | News3 | e-commerce
But, as you know, news APIs do not provide the industry related to the news. How can I train my model in this case?
There are different ways to achieve this, and each has advantages and disadvantages. The problem you describe isn't an easy one.
I can't give a general and correct answer to this question, as it depends heavily on what you are trying to achieve.
What you are trying to do is called unsupervised learning. Generally, the Google term you could use is "classify unlabeled data".
The Wikipedia article on this topic has a very good overview of techniques that you may use. Since machine-learning problems often aren't clear-cut and the algorithms chosen vary a lot per project (size of the dataset, processing power, cost of misclassification, ...), no one will be able to give you a general perfect answer without actually knowing your data and problem in detail.
Personally, from just reading your post, my first approach would be to use a clustering algorithm (like k-means clustering, using the cosine similarity of the texts) to generate different clusters of news, then look through these clusters, label them manually, and use the result as training data - or automatically generate labels using tf-idf.
However, the results of this may be very good, very bad or anything in between.
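A rough sketch of that clustering approach, assuming scikit-learn 1.0+ (for get_feature_names_out); the news snippets are hypothetical and the number of clusters would need to be chosen by experimentation. TF-IDF vectors are L2-normalized by default, so k-means with Euclidean distance behaves much like clustering by cosine similarity:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

news = ["factory output rose as manufacturers expanded capacity",
        "the bank reported higher quarterly profits",
        "online retailer launches new checkout service"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(news)

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Print the top terms per cluster to help with manual labeling.
terms = np.array(vectorizer.get_feature_names_out())
for c, center in enumerate(km.cluster_centers_):
    print(f"cluster {c}:", terms[center.argsort()[::-1][:3]])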
Recent advances in zero-shot and few-shot learning can let you build your classifier with little training data (100-200 examples) or none at all. Your classifier still retains all the benefits of a supervised classifier and gives you the control to decide your categories.
I have built one such system and you can try out the demo on your own categories and data to see the system in action.

What should I do when the training set contains some erroneous data in supervised classification?

I am working on a project that performs automatic text classification, and I have a large data set like the one below:
Text | CategoryName
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA
Then I will use the above data set to train a classifier; once new text comes in, the classifier can label it with the correct CategoryName.
(The text is natural language, with a size between 10 and 10,000.)
Now, the problem is that the original data set contains some incorrect data (e.g. a text that should be labeled as category AA is accidentally labeled as category BB), because these data were classified manually. And I don't know which labels are wrong or what percentage is wrong, because I can't review all the data manually...
So my question is, what should I do?
Can I find the wrong labels in some automatic way?
How can I increase precision and recall when new data comes in?
How to evaluate the impact of wrong data? (since I don't know how many percentage data is wrong)
Any other suggestions?
Obviously, there is no easy way to solve your problem - after all, why build a classifier if you already have a system that can detect wrong classifications?
Do you know how much the erroneous classifications affect your learning? If there are only a small percentage of them, they should not hurt the performance much. (Edit. Ah, apparently you don't. Anyway, I suggest you try it out - at least if you can identify a false result when you see one.)
Of course, you could always first train your system and then have it suggest classifications for the training data. This might help you identify (and correct) your faulty training data. This obviously depends on how much training data you have, and if it is sufficiently broad to allow your system to learn correct classification despite the faulty data.
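A small sketch of that "let the model second-guess its own training labels" idea, assuming scikit-learn; the toy texts and labels are hypothetical. Out-of-fold predictions that disagree with the given label become candidates for manual review:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

texts = ["invoice for steel pipes", "quarterly revenue grew", "new assembly line opened",
         "stock buyback announced", "robotic welding cells installed", "bond yields fell"]
labels = ["AA", "BB", "AA", "BB", "AA", "AA"]  # suppose the last label is suspect

X = TfidfVectorizer().fit_transform(texts)
predicted = cross_val_predict(LogisticRegression(max_iter=1000), X, labels, cv=2)

# Items where the out-of-fold prediction disagrees with the given label.
suspects = [(t, given, pred) for t, given, pred in zip(texts, labels, predicted) if given != pred]
print(suspects)  # review these by hand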
Can you review any of the data manually to find some mislabeled examples? If so, you might be able to train a second classifier to identify mislabeled data, assuming there is some kind of pattern to the mislabeling. It would be useful for you to know if mislabeling is a purely random process (it is just noise in the training data) or if mislabeling correlates with particular features of the data.
You can't evaluate the impact of mislabeled data on your specific data set if you have no estimate regarding what fraction of your training set is actually mislabeled. You mention in a comment that you have ~5M records. If you can correctly manually label a few hundred, you could train your classifier on that data set, then see how the classifier performs after introducing random mislabeling. You could do this multiple times with varying percentages of mislabeled data to see the impact on your classifier.
Qualitatively, having a significant quantity of mislabeled samples will increase the impact of overfitting, so it is even more important that you do not overfit your classifier to the data set. If you have a test data set (assuming it also suffers from mislabeling), then you might consider training your classifier to less-than-maximal classification accuracy on the test data set.
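A sketch of the experiment described above, on synthetic data: start from a cleanly labeled set, flip a varying fraction of labels at random, and watch the cross-validated accuracy degrade. Everything here (data, model, noise levels) is a hypothetical stand-in:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

for noise in (0.0, 0.05, 0.1, 0.2, 0.4):
    y_noisy = y.copy()
    flip = rng.random(len(y)) < noise
    y_noisy[flip] = 1 - y_noisy[flip]  # flip roughly a 'noise' fraction of the binary labels
    score = cross_val_score(LogisticRegression(max_iter=1000), X, y_noisy, cv=5).mean()
    print(f"label noise {noise:.0%}: accuracy {score:.3f}")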
People usually deal with the problem you are describing by having multiple annotators and computing their agreement (e.g. Fleiss' kappa). This is often seen as the upper bound on the performance of any classifier. If three people give you three different answers, you know the task is quite hard and your classifier stands no chance.
As a side note:
If you do not know how many of your records have been labelled incorrectly, you do not understand one of the key properties of the problem. Select 1000 records at random and spend the day reviewing their labels to get an idea. It really is time well spent. For example, I found I can easily review 500 labelled tweets per hour. Health warning: it is very tedious, but a morning spent reviewing gives me a good idea of how distracted my annotators were. If 5% of the records are incorrect, it is not such a problem. If 50% are incorrect, you should go back to your boss and tell them it can't be done.
As another side note:
Someone mentioned active learning. I think it is worth looking into options from the literature, keeping in mind that labels might have to change. You said yourself that it is hard.
