Difference Between Datasets - machine-learning


Here is the problem statement:
I have two datasets from different years (a 2013 dataset and a 2014 dataset). The data is multivariate, with each dataset containing 38 attributes. I want to find any difference/delta that might have occurred between the two datasets over these consecutive years, and this difference should be a numerical value.
So far I have applied the following techniques:
1) ANOVA (this tells me that a difference is there, but it doesn't tell me how big the difference is)
2) Wilcoxon-Mann-Whitney U test (same problem as ANOVA)
3) Finding the mean squared error between the means of the datasets.
Questions:
1) Is there any other method/test that would give me a numerical value for the difference between the datasets?
2) If I label the 2013 dataset as "1" and the 2014 dataset as "2", can the weights of a neural network trained to classify these datasets somehow be used to find the difference between them?
Note: Due to confidentiality agreement I cannot share the data here.

Don't know if you have found an answer or not.
Have you tried using RMSE? You can compute a score for every column of the dataset and then combine them to get an average score for the whole data.
It's not a perfect method, but it should give a scale of difference when comparing multiple datasets to each other.
If you did find a better answer than what I suggested, please do let me know, as I would be interested in it.
All the best.
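One possible reading of the "score per column, then combine" idea is sketched below. It is only an illustration: the answer does not say how each column should be scored, so the standardized difference of column means used here, and the names `column_scores` and `overall_score`, are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def column_scores(data_a, data_b):
    """Standardized difference of column means between two datasets.

    data_a and data_b are lists of rows with the same number of columns
    and at least two rows each.
    """
    scores = []
    for col_a, col_b in zip(zip(*data_a), zip(*data_b)):
        pooled_sd = sqrt((variance(col_a) + variance(col_b)) / 2)
        diff = mean(col_b) - mean(col_a)
        # Assumption: a constant column contributes its raw mean difference.
        scores.append(diff / pooled_sd if pooled_sd > 0 else diff)
    return scores


def overall_score(data_a, data_b):
    """Root mean square of the per-column scores (one number per dataset pair)."""
    scores = column_scores(data_a, data_b)
    return sqrt(mean(s * s for s in scores))


# Toy data standing in for the confidential 2013 and 2014 datasets.
data_2013 = [[0, 0], [2, 2]]
data_2014 = [[1, 1], [3, 3]]
delta = overall_score(data_2013, data_2014)
```

Scaling each column by its spread keeps attributes with large units from dominating the combined value. Identical datasets give a score of 0.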

Related

Unsupervised or Supervised Machine Learning

I don't quite get what kind of machine learning problem this is.
I have data consisting of a time and a specific count.
_time | count
7:15 | 190
7:20 | 240
and so on.
With this data I would like to create a model and "predict" the count value at specific times. The new data looks like this:
_time | count
7:30 |
7:35 |
For this data I use the trained model and get a valid count out of it. Now I am wondering whether this is supervised (because the model is trained on known true counts and then applied to other times with unknown counts) or unsupervised.

I will quote an explanation from a blog, since I think it is well done, and then answer your question.
"There are two main types of learning: supervised and unsupervised. The main difference between the two types is that supervised learning is truth-based. In other words, we have prior knowledge of what the output values of our samples should be. Therefore, the goal of supervised learning is to learn a function that, given a sample of data and the desired results, best approximates the relationship between input and output observable in the data. In contrast, unsupervised learning has no labeled outcomes. Its goal is therefore to infer the natural structure present in a set of data points."
So your problem is supervised, because you already have part of the answer: the counts you already know, from which you will infer the other counts.

In the dataset, _time is a feature and count is the target (also known as the "label"). When there is labelled data, it is a supervised machine learning problem.
You can read this article for more details.
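To make the feature/target split concrete, here is a minimal sketch that fits a straight line to the two labelled rows from the question and predicts the unlabelled times. The linear model and the helper names (`to_minutes`, `fit_line`, `predict`) are assumptions chosen for illustration; real data would need more rows and possibly a different model.

```python
def to_minutes(clock):
    """Convert an 'H:MM' string into minutes since midnight (the feature)."""
    hours, minutes = clock.split(":")
    return int(hours) * 60 + int(minutes)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Labelled training data: _time is the feature, count is the target.
train = [("7:15", 190), ("7:20", 240)]
slope, intercept = fit_line([to_minutes(t) for t, _ in train],
                            [count for _, count in train])


def predict(clock):
    """Predict the count for a time whose true count is unknown."""
    return slope * to_minutes(clock) + intercept


predictions = {t: predict(t) for t in ("7:30", "7:35")}
```

The model only ever learns from rows where the count is known, which is what makes the setup supervised.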

Continuous or categorical data in data science

I am building an automated cleaning process that cleans null values from a dataset. I discovered a few functions, like mode, median and mean, which could be used to fill NaN values in given data. But which one should I select? If the data is categorical it has to be either mode or median, while for continuous data it has to be mean or median. So, to determine whether data is categorical or continuous, I decided to build a machine learning classification model.
I took a few features, such as:
1) standard deviation of data
2) Number of unique values in data
3) total number of rows of data
4) ratio of the number of unique values to the total number of rows
5) minimum value of data
6) maximum value of data
7) number of data between median and 75th percentile
8) number of data between median and 25th percentile
9) number of data between 75th percentile and upper whiskers
10) number of data between 25th percentile and lower whiskers
11) number of data above upper whisker
12) number of data below lower whisker
First, with these 12 features and around 55 training examples, I used a logistic regression model on the normalized features to predict label 1 (continuous) or 0 (categorical).
The fun part is, it worked!!
But did I do it the right way? Is this a correct method to predict the nature of data? Please advise me on how I could improve it further.

The data analysis seems awesome. As for the part
"But which one I should select?"
In my tests so far, the mean has always been the winner. For every dataset, I try out all the cases and compare accuracy.
There is a better approach, but it is a bit time consuming. If you want to take this system further, it can help.
For each column with missing data, find the nearest neighbor of every row that has a missing value and fill in that neighbor's value. Suppose you have N columns excluding the target: treat each column in turn as the dependent variable and the remaining N-1 columns as independent variables. Find the row's nearest neighbor using those N-1 columns; the neighbor's value of the dependent variable is the desired value for the missing attribute.
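A rough sketch of this nearest-neighbor fill is given below, under a few assumptions: missing values are marked with `None`, all columns are numeric and on comparable scales, and distance is Euclidean over the columns both rows have. The function name `nearest_neighbor_impute` is made up for the example.

```python
from math import sqrt


def nearest_neighbor_impute(rows):
    """Fill None entries with the value from the closest row that has one."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            best_value, best_dist = None, float("inf")
            for k, other in enumerate(rows):
                if k == i or other[j] is None:
                    continue
                # Compare on the other columns that are present in both rows.
                shared = [c for c in range(len(row))
                          if c != j and row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = sqrt(sum((row[c] - other[c]) ** 2 for c in shared))
                if dist < best_dist:
                    best_value, best_dist = other[j], dist
            if best_value is not None:
                filled[i][j] = best_value
    return filled


rows = [[1.0, 10.0], [2.0, 20.0], [1.1, None]]
completed = nearest_neighbor_impute(rows)
```

Here the third row is closest to the first one in the observed column, so it borrows that row's value. In practice the features should be scaled first, otherwise the column with the largest range decides who the neighbor is.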

"But which one I should select? if data is categorical it has to be either mode or median while for continuous it has to be mean or median."
Usually the mode is used for categorical data and the mean for continuous data. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns with NaN, you can include columns with mean replacement and with median replacement, and also a boolean column 'index is nan' marking the rows where the value was missing. But it is better not to use linear models in this case, since these columns can be strongly correlated.
Besides, there are many other methods to replace NaN, for example the MICE algorithm.
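As a small illustration of simple replacement plus a missing-value indicator, consider the sketch below. It assumes missing entries are stored as `None`; the `impute` helper and its `strategy` argument are hypothetical names.

```python
from statistics import mean, median, mode


def impute(column, strategy):
    """Return (filled_column, was_missing) for a column that uses None for NaN."""
    observed = [v for v in column if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    filled = [fill if v is None else v for v in column]
    # Boolean indicator column: True where the original value was missing.
    was_missing = [v is None for v in column]
    return filled, was_missing


ages, ages_missing = impute([1, None, 3], "mean")
colors, colors_missing = impute(["a", "b", None, "a"], "mode")
```

The indicator column lets a downstream model learn whether "was missing" is itself informative, which a filled-in value alone cannot convey.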
Regarding the features you use: they are OK, but I'd advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to Gaussian Distribution (and other distributions)
the number of 1D Gaussian distributions needed to fit your column (GMM; won't perform well for 55 rows)
You can compute all of these items on both the original data and transformed data (log, exp).
To explain: you can have a column with many categories in it. With the old approach it may simply look like a numerical column even though it is not numerical. A distribution-matching algorithm may help here.
You can also use a different normalization. RobustScaler from sklearn may work well (it may help in cases where categories have levels very similar to outlying values).
And one last piece of advice: you can use a random forest model for this and get the feature importances. That list may give some direction for feature engineering/generation.
And, of course, looking at the misclassification (confusion) matrix and at which features the errors happen for is also a good thing!
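The first two distribution features from the list, skewness and kurtosis, can be computed per column as in the sketch below. It uses the population (biased) moment definitions, and returning 0.0 for a constant column is an assumption made to avoid a division by zero.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    m = sum(values) / len(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0 for a symmetric column."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return 0.0  # assumption: a constant column has no skew
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return 0.0  # assumption: a constant column has no tails
    return central_moment(values, 4) / m2 ** 2 - 3


column = [1, 2, 3, 4, 5]
features = {"skewness": skewness(column),
            "excess_kurtosis": excess_kurtosis(column)}
```

These two numbers can be appended to the existing 12 features before training the categorical-vs-continuous classifier.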

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
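For reference, the naive baseline described above (multiply the components, take the highest, and give up when the winner is too weak or too close to the runner-up) could look like the sketch below. The thresholds `min_score` and `min_margin` are hypothetical parameters that would have to be tuned on the training data.

```python
from math import prod


def pick_candidate(score_vectors, min_score=1.0, min_margin=0.0):
    """Return the index of the best candidate, or None to answer 'none'."""
    if not score_vectors:
        return None
    # Meta-score: product of the individual black-box scores.
    meta = [prod(vector) for vector in score_vectors]
    ranked = sorted(range(len(meta)), key=meta.__getitem__, reverse=True)
    best = ranked[0]
    if meta[best] < min_score:
        return None  # highest score is too low
    if len(ranked) > 1 and meta[best] - meta[ranked[1]] < min_margin:
        return None  # two highest scores are too close
    return best


pool = [[0.2, 0.45, 1.37], [5.9, 0.02, 2]]
choice = pick_candidate(pool, min_score=0.2)
```

Raising either threshold trades missed correct answers for fewer false positives, so the two can be swept to find an acceptable balance.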

Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML problems, and more specifically classification problems, need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification, then I think your problem fits into the category of One-vs-All classification. You will need a network (or just a single output layer) with a number of cells/sigmoid units equal to your number of candidates (each unit representing one candidate). Note that here your number of candidates will be fixed.
You can feed your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding, i.e. 00100 if candidate no. 3 was the actual correct candidate; if there is no correct candidate, the output will be 00000.
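The target encoding described here, and the matching decoding of the network outputs, can be written down as follows. This is just a sketch: the 0-based `correct_index`, the `threshold` of 0.5, and the function names are assumptions.

```python
def one_hot_target(correct_index, num_candidates):
    """Encode the correct candidate as a one-hot list; None gives all zeros."""
    target = [0] * num_candidates
    if correct_index is not None:
        target[correct_index] = 1
    return target


def decode(outputs, threshold=0.5):
    """Pick the unit with the highest output, or None if it is below threshold."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return best if outputs[best] >= threshold else None


third_is_correct = one_hot_target(2, 5)     # candidate no. 3 -> index 2
nobody_is_correct = one_hot_target(None, 5)
```

Decoding with a threshold is what allows the "none" answer: if no sigmoid unit is confident enough, no candidate is selected.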
For this to work, you will need a big dataset containing your candidate vectors and the corresponding actual correct candidate. For this data you will either need a function (again, like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data the same way you did. This way it will maximize the number of correct outputs, but the definition of "correct" here will be how you classified the training data.
You can also use a different type of output, where each cell of the output layer corresponds to one of your scoring functions and 00001 means that the candidate selected by your 5th scoring function was the right one. This way your number of candidates does not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn from.
One-vs-All is a classification technique where there are multiple cells in the output layer and each performs binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest 0.
Once your system has learned how you classify data from your training data, you can feed your new data in and it will give you output in the same way, i.e. 01000 etc.
I hope my answer was able to help you. :)

scikitlearn - how to model a single features composed of multiple independant values

My dataset is composed of millions of rows and a few tens of features.
One feature is a label with 1000 different values (imagine each row is a user and this feature is the user's first name):
Firstname,Feature1,Feature2,....
Quentin,1,2
Marc,0,2
Gaby,1,0
Quentin,1,0
What would be the best representation for this feature (to perform clustering)?
1) I could convert the data to integers using a LabelEncoder, but that doesn't make sense here since there is no logical "order" between two different labels:
Firstname,F1,F2,....
0,1,2
1,0,2
2,1,0
0,1,0
2) I could split the feature into 1000 features (one for each label), with 1 when the label matches and 0 otherwise. However, this would result in a very big matrix (too big if I can't use a sparse matrix in my classifier):
Quentin,Marc,Gaby,F1,F2,....
1,0,0,1,2
0,1,0,0,2
0,0,1,1,0
1,0,0,1,0
3) I could represent the LabelEncoder value in binary across N columns. This would reduce the dimension of the final matrix compared to the previous idea, but I'm not sure about the result:
LabelEncoder(Quentin) = 0 = 0,0
LabelEncoder(Marc) = 1 = 0,1
LabelEncoder(Gaby) = 2 = 1,0
A,B,F1,F2,....
0,0,1,2
0,1,0,2
1,0,1,0
0,0,1,0
4) ... Any other ideas?
What do you think about solution 3?
Edit for some extra explanations
I should have mentioned this in my first post, but in the real dataset the feature is more like the final leaf of a classification tree (Aa1, Aa2 etc. in the example; it's not a binary tree).
A B C
Aa Ab Ba Bb Ca Cb
Aa1 Aa2 Ab1 Ab2 Ab3 Ba1 Ba2 Bb1 Bb2 Ca1 Ca2 Cb1 Cb2
So there is a similarity between terms under the same level (Aa1, Aa2 and Aa3 are quite similar, and Aa1 is as different from Ba1 as it is from Cb2).
The final goal is to find similar entities from a smaller dataset: we train a OneClassSVM on the smaller dataset and then get a distance for each term of the entire dataset.

This problem is largely one of one-hot encoding. How do we represent multiple categorical values in a way that lets us use clustering algorithms without screwing up the distance calculation that your algorithm needs to do (you could be using some sort of probabilistic finite mixture model, but I digress)? As user3914041's answer says, there really is no definite answer, but I'll go through each solution you presented and give my impression:
Solution 1
If you convert the categorical column to a numerical one like you mentioned, then you face the pretty big issue you mentioned: you basically lose the meaning of that column. What does it really even mean if Quentin is 0, Marc 1, and Gaby 2? At that point, why even include that column in the clustering? As user3914041's answer says, this is the easiest way to change your categorical values into numerical ones, but they just aren't useful, and could perhaps be detrimental to the results of the clustering.
Solution 2
In my opinion, depending upon how you implement all of this and your goals with the clustering, this would be your best bet. Since I'm assuming you plan to use sklearn and something like k-Means, you should be able to use sparse matrices fine. However, like imaluengo suggests, you should consider using a different distance metric. What you can consider doing is scaling all of your numeric features to the same range as the categorical features, and then use something like cosine distance. Or a mix of distance metrics, like I mention below. But all in all this will likely be the most useful representation of your categorical data for your clustering algorithm.
Solution 3
I agree with user3914041 in that this is not useful, and introduces some of the same problems as mentioned with #1 -- you lose meaning when two (probably) totally different names share a column value.
Solution 4
An additional solution is to follow the advice of the answer here. You can consider rolling your own version of a k-means-like algorithm that takes a mix of distance metrics (hamming distance for the one-hot encoded categorical data, and euclidean for the rest). There seems to be some work in developing k-means like algorithms for mixed categorical and numerical data, like here.
I guess it's also important to consider whether or not you need to cluster on this categorical data. What are you hoping to see?
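A minimal sketch of the mixed-metric idea from Solution 4 follows. It counts mismatches on the categorical columns (for one-hot encoded columns this is the Hamming distance up to a factor of 2) and uses Euclidean distance on the rest. The `weight` parameter and the function name are assumptions; how to balance the two parts is a modelling choice.

```python
from math import sqrt


def mixed_distance(a, b, categorical_idx, weight=1.0):
    """Euclidean distance on numeric columns plus weighted category mismatches."""
    categorical_idx = set(categorical_idx)
    mismatches = sum(1 for i in categorical_idx if a[i] != b[i])
    numeric = sqrt(sum((a[i] - b[i]) ** 2
                       for i in range(len(a)) if i not in categorical_idx))
    return numeric + weight * mismatches


# Rows from the question: (Firstname, Feature1, Feature2)
quentin_1 = ("Quentin", 1, 2)
marc = ("Marc", 0, 2)
quentin_2 = ("Quentin", 1, 0)

d_names_differ = mixed_distance(quentin_1, marc, categorical_idx=[0])
d_same_name = mixed_distance(quentin_1, quentin_2, categorical_idx=[0])
```

Because the category is compared directly, this avoids building the 1000-column one-hot matrix while giving the same ordering of distances on that feature.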

Solution 3:
I'd say it has the same kind of drawback as using a 1..N encoding (solution 1), in a less obvious fashion. You'll have names that both give a 1 in some column, for no other reason than the order of the encoding...
So I'd recommend against this.
Solution 1:
The 1..N solution is the "easy way" to solve the format issue, but as you noted it's probably not the best.
Solution 2:
This looks like the best way to do it, but it is a bit cumbersome, and from my experience the classifier does not always perform very well with a high number of categories.
Solution 4+:
I think the encoding depends on what you want: if you think that names that are similar (like John and Johnny) should be close, you could use character n-grams to represent them. I doubt this is the case in your application though.
Another approach is to encode the name with its frequency in the (training) dataset. In this way what you're saying is: "Mainstream people should be close, whether they're Sophia or Jackson does not matter".
Hope the suggestions help. There's no definite answer to this, so I'm looking forward to seeing what other people do.
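Both ideas from Solution 4+ are easy to try out. The sketch below uses character bigrams compared with Jaccard similarity for the "similar names should be close" case, and a relative-frequency table for the frequency encoding. The bigram size, the Jaccard measure, and all function names are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(name, n=2):
    """Set of lowercase character n-grams in a name."""
    text = name.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Share of n-grams that two names have in common."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def frequency_table(train_values):
    """Map each label to its relative frequency in the training data."""
    counts = Counter(train_values)
    return {value: count / len(train_values) for value, count in counts.items()}


names = ["Quentin", "Marc", "Gaby", "Quentin"]
table = frequency_table(names)
# Labels never seen in training get frequency 0.0.
encoded = [table.get(name, 0.0) for name in names]
similarity = jaccard("John", "Johnny")
```

Frequency encoding turns the 1000-value label into a single numeric column, at the cost of making two unrelated labels with the same frequency indistinguishable.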

what should I do when training set contains some error data in supervised classification?

I am working on a project that performs automatic text classification. I have a lot of data like the following:
Text | CategoryName
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA
Then I will use the above dataset to train a classifier, so that when new text comes in, the classifier can label it with the correct CategoryName.
(The text is natural language, size between 10-10000.)
Now, the problem is that the original dataset contains some incorrect data (e.g. AAA should be labeled as category AA, but it was accidentally labeled as category BB), because the data was classified manually. And I don't know which labels are wrong or what percentage is wrong, because I can't review all the data manually...
So my question is, what should I do?
Can I find the wrong labels in some automatic way?
How can I increase precision and recall when new data comes in?
How can I evaluate the impact of the wrong data (since I don't know what percentage of the data is wrong)?
Any other suggestions?

Obviously, there is no easy way to solve your problem. After all, why build a classifier if you already have a system that can detect wrong classifications?
Do you know how much the erroneous classifications affect your learning? If there is only a small percentage of them, they should not hurt the performance much. (Edit: Ah, apparently you don't. Anyway, I suggest you try it out, at least if you can identify a false result when you see one.)
Of course, you could always first train your system and then have it suggest classifications for the training data. This might help you identify (and correct) your faulty training data. This obviously depends on how much training data you have, and if it is sufficiently broad to allow your system to learn correct classification despite the faulty data.

Can you review any of the data manually to find some mislabeled examples? If so, you might be able to train a second classifier to identify mislabeled data, assuming there is some kind of pattern to the mislabeling. It would be useful for you to know if mislabeling is a purely random process (it is just noise in the training data) or if mislabeling correlates with particular features of the data.
You can't evaluate the impact of mislabeled data on your specific data set if you have no estimate regarding what fraction of your training set is actually mislabeled. You mention in a comment that you have ~5M records. If you can correctly manually label a few hundred, you could train your classifier on that data set, then see how the classifier performs after introducing random mislabeling. You could do this multiple times with varying percentages of mislabeled data to see the impact on your classifier.
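The "introduce random mislabeling" step of that experiment might be sketched as below. The helper name `flip_labels`, the fixed `seed`, and the toy labels are assumptions; training and scoring the classifier on each noisy copy is left out.

```python
import random


def flip_labels(labels, fraction, classes, seed=0):
    """Return a copy of labels with a given fraction changed to a wrong class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(fraction * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        # Replace the label with a different class chosen at random.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


clean = ["AA", "BB"] * 50  # stand-in for a few hundred hand-verified labels
noisy_sets = {fraction: flip_labels(clean, fraction, ["AA", "BB"])
              for fraction in (0.05, 0.1, 0.2)}
```

Training on each noisy copy and scoring against the clean labels shows how quickly performance degrades as the mislabeling rate grows. This models purely random noise, not mislabeling that correlates with features.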
Qualitatively, having a significant quantity of mislabeled samples will increase the impact of overfitting, so it is even more important that you do not overfit your classifier to the dataset. If you have a test dataset (assuming it also suffers from mislabeling), then you might consider training your classifier to less-than-maximal classification accuracy on the test dataset.

People usually deal with the problem you are describing by having multiple annotators and computing their agreement (e.g. Fleiss' kappa). This is often seen as the upper bound on the performance of any classifier. If three people give you three different answers, you know the task is quite hard and your classifier stands no chance.
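For reference, Fleiss' kappa can be computed from the raw annotations as in the sketch below. It assumes every item was labelled by the same number of annotators; returning 1.0 when all annotators always use a single category is an assumption made to avoid a division by zero.

```python
from collections import Counter


def fleiss_kappa(annotations):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater)."""
    n_items = len(annotations)
    n_raters = len(annotations[0])
    counts = [Counter(item) for item in annotations]
    categories = set().union(*counts)

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    observed = sum(
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ) / n_items

    # Chance agreement from the overall share of each category.
    expected = sum(
        (sum(item[cat] for item in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    if expected == 1:
        return 1.0  # assumption: only one category was ever used
    return (observed - expected) / (1 - expected)


perfect = fleiss_kappa([["AA", "AA", "AA"], ["BB", "BB", "BB"]])
split = fleiss_kappa([["AA", "AA", "BB"], ["AA", "BB", "BB"]])
```

A value near 1 means the annotators agree far more than chance would predict. Values near or below 0 suggest the labels are too unreliable to train on.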
As a side note:
If you do not know how many of your records have been labelled incorrectly, you do not understand one of the key properties of the problem. Select 1000 records at random and spend the day reviewing their labels to get an idea. It really is time well spent. For example, I found I can easily review 500 labelled tweets per hour. Health warning: it is very tedious, but a morning spent reviewing gives me a good idea of how distracted my annotators were. If 5% of the records are incorrect, it is not such a problem. If 50% are incorrect, you should go back to your boss and tell them it can't be done.
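Once the random sample has been reviewed, the observed error count can be turned into a range for the error rate of the whole dataset. The sketch below uses a Wilson score interval at roughly 95% confidence (z = 1.96). That choice of interval is mine, not something the answer prescribes, and the count of 50 errors is only an example.

```python
from math import sqrt


def wilson_interval(errors, sample_size, z=1.96):
    """Approximate confidence interval for the true mislabeling rate."""
    p = errors / sample_size
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half_width = z * sqrt(p * (1 - p) / sample_size
                          + z * z / (4 * sample_size ** 2)) / denom
    return center - half_width, center + half_width


# Example: 50 wrong labels found among 1000 randomly selected records.
low, high = wilson_interval(50, 1000)
```

With 1000 reviewed records the interval is only a few percentage points wide, which is usually enough to tell "about 5%" apart from "about 50%".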
As another side note:
Someone mentioned active learning. I think it is worth looking into options from the literature, keeping in mind that labels might have to change. You said that is hard.
