Should I keep/remove identical training examples that represent different objects? - machine-learning

I have prepared a dataset to recognise a certain type of object (about 2240 negative object examples and only about 90 positive ones). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 and 30, respectively.
Since the identical training instances actually represent different objects, can I say that this duplication holds relevant information (e.g. the distribution of object feature values), which may be useful in one way or another?

If you omit the duplicates, that will skew the base rate of each distinct object. If the training data are a representative sample of the real world, then you don't want that, because you will actually be training for a slightly different world (one with different base rates).
To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 of object A and 1 of object B. After throwing out duplicates, you have 1 object A and 1 object B. A classifier trained on the de-duplicated data will be substantially different from one trained on the original data.
My advice is to leave the duplicates in the data.
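To make the base-rate effect concrete, here is a toy sketch in plain Python using the 99-to-1 example from above:

```python
from collections import Counter

# Toy version of the scenario above: 99 copies of object A, 1 of object B.
original = ["A"] * 99 + ["B"]
deduplicated = sorted(set(original))

def priors(labels):
    """Empirical class priors a classifier would learn from these labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

print(priors(original))       # {'A': 0.99, 'B': 0.01}
print(priors(deduplicated))   # {'A': 0.5, 'B': 0.5}
```

The de-duplicated data implies a world where A and B are equally likely, which is not the world the original sample came from.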

Related

Proper evaluation of a Recommender System

Suppose you are given two Recommender Systems to evaluate, A and B. Model A is trained with large data, and model B with small data (this implies that A would have a larger pool of items to pick for recommendations).
How would you compare the two models? One strategy would be to evaluate each model on its own data set ('A' with the big data, 'B' with the small data) using an 80/20 train/test split, and then calculate precision and recall for each. However, I'm not sure the precision and recall results are comparable in this case. What do you think?
Another approach would be to train A with the big data and B with the small data, but fix the test set (meaning the test set would be the same for both A and B). But isn't this "unfair", given that model A is based on big data and therefore has a larger pool of items to recommend from?
How would you compare the two models?
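One way to make the fixed-test-set comparison concrete is to compute per-user precision@k and recall@k against the same held-out interactions for both models. The item ids and recommendation lists below are made up for illustration:

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of item ids produced by a model
    relevant:    set of held-out item ids the user actually interacted with
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The same held-out interactions are used to score both models.
relevant = {"i1", "i4", "i7"}
recs_a = ["i1", "i2", "i4", "i9", "i7"]   # from model A (big data)
recs_b = ["i3", "i1", "i8", "i5", "i6"]   # from model B (small data)

print(precision_recall_at_k(recs_a, relevant, k=5))   # (0.6, 1.0)
print(precision_recall_at_k(recs_b, relevant, k=5))   # (0.2, 0.333...)
```

This makes the numbers directly comparable, although the "unfairness" concern remains: model A simply has more items it can place in its top-k.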

Is there a way to quickly decide which variables to use for model fitting and selection?

I loaded a dataset with 156 variables for a project. The goal is to build a model to predict a test data set. I am confused about where to start. Normally I would start with a basic linear regression model, but with 156 columns/variables, how should one begin model building? Thank you!
The question here is pretty open ended.
You need to confirm whether you are solving for regression or classification.
You need to go through some descriptive statistics of your data set to find out the type of values you have. Are there outliers, missing values, columns whose values are in billions as against columns whose values are in small fractions?
If you have categorical data, what types of categories do you have? What is the frequency count of the categorical values?
Accordingly, you clean the data (if required).
Post this, you may want to understand the correlation (via Pearson's or chi-square, depending on the data types of the variables you have) among these 156 variables and see how correlated they are.
You may then choose to get rid of certain variables after looking at the correlation, or perform a PCA (which retains the directions of highest variance in the dataset) to bring the variables down to fewer dimensions.
You may then fit regression or classification models (depending on your need), starting with a simpler model and adjusting as you work on improving accuracy (or minimizing the loss).
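The correlation-screening and PCA steps above might look like the following sketch. The data is synthetic (standing in for the 156 columns), and the 0.95 correlation threshold and 95% variance cutoff are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the 156-column dataset (10 columns, 200 rows here);
# column 2 is made a near-duplicate of column 1 to create strong correlation.
X = rng.normal(size=(200, 10))
X[:, 2] = X[:, 1] + 0.01 * rng.normal(size=200)

# 1) Screen for highly correlated numeric columns (Pearson).
corr = np.corrcoef(X, rowvar=False)
upper = np.triu(np.abs(corr), k=1)             # each pair counted once
to_drop = sorted({int(j) for _, j in zip(*np.where(upper > 0.95))})
X_screened = np.delete(X, to_drop, axis=1)

# 2) Reduce the remaining columns with PCA, keeping 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_screened)

print("dropped columns:", to_drop)             # here: [2]
print("shape after PCA:", X_reduced.shape)
```

Note that PCA only applies to numeric columns and is sensitive to scale, so standardizing first is usually advisable.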

Machine Learning model generalisation

I'm new to Machine Learning, and I'd like to make a question regarding the model generalization. In my case, I'm going to produce some mechanical parts, and I'm interested in the control of the input parameters to obtain certain properties on the final part.
More particularly, I'm interested in 8 parameters (say, P1, P2, ..., P8). To limit the number of pieces I need to produce while maximizing the combinations of parameters explored, I've divided the problem into 2 sets. For the first set of pieces, I'll vary the first 4 parameters (P1 ... P4) while holding the others constant. For the second set, I'll do the opposite (vary P5 ... P8 and hold P1 ... P4 constant).
So I'd like to know if it's possible to make a single model that has the eight parameters as inputs to predict the properties of the final part. I ask because as I'm not varying all the 8 variables at once, I thought that maybe I would have to do 1 model for each set of parameters, and the predictions of the 2 different models couldn't be related one to the other.
Thanks in advance.
In most cases, having two different models will give better accuracy than one big model. The reason is that a local model looks at only 4 features and can identify patterns among them to make predictions.
But this particular approach will most certainly fail to scale. Right now you only have two sets of data, but what if that grows to 20 sets? It will not be practical to create and maintain 20 ML models in production.
What works best for your case will need some experimentation. Take a random sample from the data and train both setups: one big model and two local models, then evaluate their performance. Look not just at accuracy, but also at the F1 score, AUC-PR, and the ROC curve, to find out what works best for you. If you do not see a major performance drop, then one big model for the entire dataset is the better option. If you know that your data will always be divided into these two sets and you don't care about scalability, then go with the two local models.
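The suggested experiment can be sketched on synthetic data. Everything below (the linear ground truth, sample sizes, R^2 as the metric) is a made-up stand-in for the real parts data, not the questioner's setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical stand-in for the two experiment sets: 8 parameters, one target.
# Set 1 varies P1..P4 with P5..P8 held at 0.5; set 2 does the opposite.
n = 200
X1 = np.hstack([rng.uniform(size=(n, 4)), np.full((n, 4), 0.5)])
X2 = np.hstack([np.full((n, 4), 0.5), rng.uniform(size=(n, 4))])
X = np.vstack([X1, X2])
w_true = rng.normal(size=8)                     # made-up "true" relationship
y = X @ w_true + 0.05 * rng.normal(size=2 * n)  # plus a little noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One big model sees all 8 parameters at once.
big = LinearRegression().fit(X_tr, y_tr)
r2_big = big.score(X_te, y_te)

# Two local models, each seeing only the 4 parameters its set varies.
set1_tr = X_tr[:, 4:].std(axis=1) == 0          # P5..P8 constant -> set 1
local1 = LinearRegression().fit(X_tr[set1_tr][:, :4], y_tr[set1_tr])
local2 = LinearRegression().fit(X_tr[~set1_tr][:, 4:], y_tr[~set1_tr])

set1_te = X_te[:, 4:].std(axis=1) == 0
r2_local1 = local1.score(X_te[set1_te][:, :4], y_te[set1_te])
r2_local2 = local2.score(X_te[~set1_te][:, 4:], y_te[~set1_te])
print(r2_big, r2_local1, r2_local2)
```

Comparing the held-out scores of both setups on the same data is the experiment the answer recommends; with real data you would also look at F1, AUC-PR, and the ROC curve as noted above.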

Optimizing Neural Network Input for Convergence

I'm building a neural network for image classification/recognition. There are 1000 images (30x30 greyscale) for each of the 10 classes. Images of different classes are placed in different folders. I'm planning to use the back-propagation algorithm to train the net.
Does the order in which I feed training examples into the net affect its convergence?
Should I feed training examples in random order?
First, I will answer your questions:
Yes, it will affect its convergence.
Yes, it's encouraged to do that; it's called shuffling (a randomized arrangement of the training examples).
But why?
referenced from here
A common example in most ANN software is the IRIS data, where you have 150 instances comprising your dataset. These cover three different types of Iris flowers (Versicolor, Virginica, and Setosa). The data set contains measurements of four variables (sepal length and width, and petal length and width). The cases are arranged so that the first 50 cases belong to Setosa, cases 51-100 to Versicolor, and the rest to Virginica. Now, what you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 instances of the Versicolor class, then all 50 of the Virginica class, then all 50 of the Setosa class. Without randomization, each stretch of training won't represent all the classes; hence the training will not converge, and the network will fail to generalize.
Another example: in the past I had 100 images for each letter of the alphabet (26 classes).
When I trained on them in order (letter by letter), training failed to converge, but after I randomized the order it converged easily, because the neural network could generalize across the letters.
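A minimal sketch of per-epoch shuffling, assuming the images have been loaded class-by-class into arrays (the array shapes and batch size here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the dataset in the question: 10 classes x 1000 flattened
# 30x30 images, loaded folder-by-folder, i.e. exactly the class-ordered
# layout the answer warns against.
X = np.zeros((10_000, 900), dtype=np.float32)
y = np.repeat(np.arange(10), 1000)            # 1000 zeros, then 1000 ones, ...

for epoch in range(3):
    # Reshuffle once per epoch so every mini-batch mixes the classes.
    order = rng.permutation(len(X))
    X_shuf, y_shuf = X[order], y[order]
    for start in range(0, len(X_shuf), 32):
        batch_X = X_shuf[start:start + 32]
        batch_y = y_shuf[start:start + 32]
        # ... run the back-propagation update on (batch_X, batch_y) here ...
```

Reshuffling every epoch (rather than once up front) also prevents the network from seeing the examples in the same sequence each pass.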

objects classification, mutually related features

Looking for some inspirations on how to address the following problem:
there is a collection of multiple worlds,
each world has a collection of objects,
a single object, or a group of objects, may have a maximum of one category assigned,
some categories are mutually related - i.e., the fact that object1 in group1 belongs to categoryA increases the chance that some other group containing the same object1 belongs to categoryB.
Given a dataset with multiple worlds fully described, the target is to take a completely new world and correctly categorize its objects and groups.
I would appreciate some ideas on how to address it.
My approach was to write classifiers that learn different characteristics of objects and groups from the training data, and then assign scores (a number between 0 and 1) to different combinations of objects in the unknown world. The problem I'm facing, though, is how to produce the final answer. With around 20 classifiers, each assigning scores to multiple groups, it's difficult to decide. For example, sometimes multiple classifiers return scores with very small values that sum up to a big number, and that overshadows the fact that one very rare classifier returned 1.
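To illustrate the aggregation issue in the last paragraph: summing raw scores lets many weak votes drown out one confident classifier, while an aggregation such as noisy-OR (treating each score as an independent probability; one possible idea, not something from the question) keeps the strong signal visible. A toy sketch with made-up scores:

```python
import math

# Made-up scores from 20 classifiers for two hypothetical candidate answers.
many_weak = [0.10] * 20             # lots of tiny, noisy votes
one_strong = [0.0] * 19 + [1.0]     # one rare classifier is certain

def noisy_or(scores):
    """P(at least one classifier is right), treating scores as probabilities."""
    return 1.0 - math.prod(1.0 - s for s in scores)

# Plain summing ranks the pile of weak votes above the certain one...
print(sum(many_weak), sum(one_strong))
# ...while noisy-OR keeps the certain classifier on top.
print(noisy_or(many_weak), noisy_or(one_strong))
```

This assumes the scores are calibrated probabilities and the classifiers are roughly independent; neither may hold in practice, but it shows why raw summing misbehaves.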
