Are duplicates useful in data sets? - machine-learning

I downloaded the Skin Segmentation Data Set and found that it contains a lot of duplicates.
For example, the row 0 128 0 2 occurs 199 times.
Please give a few examples of when duplicates are good and when they are harmful.

Yes, of course. If the data set is a random sample that represents the underlying distribution, the duplicates tell you that this particular value has a higher probability. Removing them would distort that distribution and make the data set much less useful.

It is important.
For example: if row 'a' appears 5 times in your data and another row, 'b', appears only once, then you will want to classify row 'a' better than 'b', because when you calculate the cost function, row 'a' appears more times and has a bigger influence on the cost.
And if your training data represents the test data well, then there is a high probability that row 'a' will also appear more often than row 'b' there.
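To make the point about the cost function concrete, here is a minimal sketch. The rows, targets and the constant prediction are made up for illustration; it only shows that keeping duplicates is the same as weighting each unique row by its count, and that dropping them changes the loss.

```python
from collections import Counter

# Hypothetical training set: row 'a' (target 1.0) appears 5 times, row 'b' (target 0.0) once.
rows = [("a", 1.0)] * 5 + [("b", 0.0)]
prediction = 0.6  # one constant prediction, purely for illustration


def squared_error(y):
    return (prediction - y) ** 2


# Mean squared error over the data set with duplicates kept.
loss_full = sum(squared_error(y) for _, y in rows) / len(rows)

# The same loss, computed over unique rows weighted by their counts.
counts = Counter(rows)
loss_weighted = sum(n * squared_error(y) for (_, y), n in counts.items()) / sum(counts.values())

# After dropping duplicates, 'a' and 'b' count equally and the loss changes.
unique_rows = list(counts)
loss_dedup = sum(squared_error(y) for _, y in unique_rows) / len(unique_rows)

print(loss_full, loss_weighted, loss_dedup)
```

With duplicates removed, a model minimizing the loss has less incentive to get row 'a' right, even though 'a' is the more frequent case.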

Related

Is it possible to solve a fractional knapsack including negative values using greedy algorithm?

I have a problem which I think can be converted to a variant of
fractional knapsack problem.
The objective function is in the form of:
$\sum_{i} x_iv_i$
However, my problem differs in that it allows $v_i$ and $x_i$ to be negative.
I want to prove that this problem can be solved using the greedy algorithm (explained in the link).
I have tested this on many test cases and the greedy algorithm seems to solve it, but I want a definite
proof that the greedy algorithm is still applicable given the extra constraint.
In the fractional knapsack problem, you compute the value/weight ratio (V/W) of every item that you may put in the knapsack, and sort the items from the best ratio to the worst. You then start with the best ratio and fill the knapsack until either it is full or you run out of that item. If you run out, you move on to the next item in the list and fill the knapsack with it. This pattern continues until the knapsack is full. The approach is greedy because, once the list is sorted, we can add the items fractionally in this order and be confident that we end with the greatest possible value in the bag.
By allowing the values and "weights" to be negative, as in this problem, however, the greedy algorithm is no longer guaranteed to be optimal. It breaks down because an item can have a negative "weight" and a negative value, resulting in a positive V/W ratio. For example, take the following list of items (values and weights per unit):
V=-1, W=-1 -> V/W = 1.0
V=.9, W=1 -> V/W = 0.9
V=.8, W=1 -> V/W = 0.8
Following the greedy algorithm, we would want to add as much of item 1 as exists, because it has the best V/W ratio. However, adding item 1 really hurts us in the long run, because we are losing more value per weight than we can add later on. For example, let's assume 10 units of each item are available (|W| = 10 in total for each), and the max weight of the knapsack is 10. By adding all of item 1, we will have a weight of -10 and a value of -10. Then we add all of item 2, which results in a weight of 0 and a value of -1. Then we add all of item 3, which results in a weight of 10 and a value of 7.
If instead we just added all of item 2 from the start, we would have a weight of 10 and a value of 9. Therefore, by counterexample, if weights and values can be negative, the greedy algorithm is not guaranteed to find the optimum.
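The counterexample above can be checked with a short script. This is only a sketch: the function name and the (total value, total weight) item representation are my own, and it assumes no item has zero weight.

```python
def greedy_fractional(items, capacity):
    """Fill the knapsack by descending value/weight ratio.

    items: list of (total_value, total_weight) pairs; weights must be nonzero.
    Returns (value, weight) of the resulting knapsack.
    """
    total_value, total_weight = 0.0, 0.0
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        room = capacity - total_weight
        # A negative-weight item always "fits", so greedy takes all of it.
        fraction = 1.0 if weight < 0 else min(1.0, room / weight)
        if fraction <= 0:
            break
        total_value += fraction * value
        total_weight += fraction * weight
    return total_value, total_weight


# 10 units of each item from the example: per-unit (V, W) times 10.
items = [(-10.0, -10.0), (9.0, 10.0), (8.0, 10.0)]

greedy_value, greedy_weight = greedy_fractional(items, capacity=10)
only_item_2_value, only_item_2_weight = greedy_fractional([items[1]], capacity=10)

print(greedy_value, greedy_weight)            # 7.0 10.0
print(only_item_2_value, only_item_2_weight)  # 9.0 10.0
```

Greedy ends with a value of 7, while simply taking item 2 gives 9 at the same weight.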

Are data dependencies relevant when preparing data for neural network?

Data: I have N rows of data of the form (x, y, z), where logically f(x, y) = z, that is, z depends on x and y. In my case the rows are (setting1, setting2, signal). Different x's and y's can lead to the same z, but the z's wouldn't mean the same thing.
There are 30 unique setting1, 30 setting2 and 1 signal for each (setting1, setting2)-pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make them all into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Question:
Is it correct to keep all the duplicate setting1 and setting2 values in the data set? Or should I remove them and include each unique value only once, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900,3].
You are flattening it and getting an input layer of 3*900.
Some of those values are the result of a function of others.
It matters which function: if it is a linear function, the NN might not work well:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables, you risk biasing the NN towards those variables.
E.g., if you run LMS on [x1, x2, x3, average(x1, x2)] to predict y, you effectively assign a higher weight to x1 and x2.
Unless you have a reason to believe that those weights should be higher, don't include the derived value.
I was not able to find a reference to support this, but my intuition is that, in addition to omitting the dependent values, you might want to shrink your input layer:
From Professor A. Ng's ML course, I remember that the input should be the minimum set of values that are 'reasonable' for making the prediction.
'Reasonable' is vague, but I understand it this way: if you try to predict the price of a house, include square footage, area quality and distance from a major hub; do not include average sunspot activity during the open-home day, even if you have that data.
I would remove the duplicates. I would also look for any other data that can be omitted, and maybe run PCA over the full set of N x [900,3].
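For reference, the flattening options from the question can be sketched as follows. The setting values and the random signal are placeholders, and the signal-only variant assumes the settings grid is identical in every sample, which the question does not state.

```python
import random

random.seed(0)
setting1 = list(range(30))  # placeholder setting values
setting2 = list(range(30))

# One data set: 900 rows of (setting1, setting2, signal); the signal is random filler.
data_set = [(s1, s2, random.random()) for s1 in setting1 for s2 in setting2]

# Naive flattening: 3 * 900 = 2700 columns, every setting value repeated 30 times.
flat_full = [value for row in data_set for value in row]

# Compact flattening: 30 + 30 + 900 = 960 columns. The signals keep a fixed
# (setting1-major) order, so position i still identifies its (setting1, setting2) pair.
signals = [signal for _, _, signal in data_set]
flat_compact = setting1 + setting2 + signals

# If the settings grid is the same in every sample (an assumption), the settings
# are constant columns and the 900 signals alone carry all the information.
flat_signal_only = signals

print(len(flat_full), len(flat_compact), len(flat_signal_only))  # 2700 960 900
```

As long as the signal order is fixed, the dependency on the settings is encoded by column position rather than by repeating the setting values.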

Association Rule - Non-Binary Items

I have studied association rules and know how to implement the algorithm on the classic basket of goods problem, such as:
Transaction ID  Potatoes  Eggs  Milk
A               1         0     1
B               0         1     1
In this problem each item has a binary identifier. 1 indicates the basket contains the good, 0 indicates it does not.
But what would be the best way to model a basket which can contain many of the same good? E.g., take the below, very unrealistic example.
Transaction ID  Potatoes  Eggs  Milk
A               5         0     178
B               0         35    7
Using binary indicators in this case would obviously be losing a lot of information and I am seeking a model which takes into account not only the presence of items in the basket, but also the frequency that the items occur.
What would be a suitable algorithm for this problem?
In my actual data there are over one hundred items and, based on the profile of a user's basket, I would like to calculate the probabilities of the customer consuming the other available items.
An alternative is to use binary indicators but constructing them in a more clever way.
The idea is to set the indicator only when an amount is above the central value, so that a set flag is actually significant. If everyone buys 3 breads on average, does it make sense to flag someone as a "bread-lover" for buying two or three?
The central value can be a plain arithmetic mean, a mean with outliers removed, or the median.
Instead of:
binarize(x)  = 0 if x = 0
               1 otherwise
you can use
binarize*(x) = 0 if x <= central(X)
               1 otherwise
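A minimal sketch of binarize* in Python, using the median as the central value. The function name and the bread quantities are hypothetical.

```python
from statistics import median


def binarize_by_central(column, central=median):
    """Flag only the quantities strictly above the column's central value."""
    threshold = central(column)
    return [1 if x > threshold else 0 for x in column]


# Hypothetical quantities of bread bought in five baskets.
bread = [2, 3, 3, 4, 9]
flags = binarize_by_central(bread)
print(flags)  # [0, 0, 0, 1, 1]
```

Passing a different function as central (for example statistics.mean) switches the threshold without changing the rest.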
I think that if you really want probabilities, you should encode your data in a probabilistic way. Bayesian or Markov networks might be a feasible option. Nevertheless, without a reasonable structure this will be computationally extremely expensive. For three item types, however, it seems feasible.
I would try a neural network autoencoder if you have many more item types. If there is some dependency in the data, it will discover it.
For the above example you could use a network with three input, two hidden and three output neurons.
A slightly fancier option would be to use 3 fully connected layers with dropout in the middle layer.

Case of No examples left while constructing a Decision Tree

I was reading about decision trees (page 720) in the book Artificial Intelligence: A Modern Approach, 3rd edition. The book describes some cases that may occur after we split the training set (the examples) by choosing an attribute. One of the cases mentioned is:
If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the plurality classification of all the examples that were used in constructing the node’s parent.
I understand that by plurality classification they mean majority rule. But I am unable to understand when the above case could occur. Could someone give an example of a decision tree where this case arises?
Think of the problem as constructing a 2D table of occurrence counts where the column represents some feature or class to be considered and the rows represent particular configurations of other variables.
For example:
X Y Z | class counts
------+-------------
1 1 1 | ...
1 1 2 | ...
1 1 3 | ...
The table represents the joint distribution of the training set.
A particular combination of X, Y and Z (say 1,3,1) may not have been seen during training. The more variables you have, the more likely you will encounter unseen combinations. If you have 10 variables each with two states then there are 1024 possible configurations of those variables. If there are three states for each then the number of configurations would be 3 ^ 10, etc.
Frankly, I would use 1/numberCols for any particular column with a missing row as you don't really have any information regarding it. You could use 1/Sum(rows) for each column but this may unnecessarily bias the result. Depends on the data.
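To see how easily a combination goes unobserved, here is a small sketch with three variables of three states each. The training examples and labels are made up; the fallback follows the book's rule of taking the plurality class of the parent's examples.

```python
from collections import Counter
from itertools import product

states = (1, 2, 3)

# Hypothetical training examples: ((X, Y, Z), class label).
examples = [
    ((1, 1, 1), "yes"),
    ((1, 1, 2), "yes"),
    ((1, 1, 3), "no"),
    ((2, 3, 1), "yes"),
]

all_configs = set(product(states, repeat=3))  # 3 ^ 3 = 27 possible combinations
seen = {config for config, _ in examples}
unseen = all_configs - seen                   # e.g. (1, 3, 1) was never observed


def plurality_value(parent_examples):
    """Most common class label among the examples that reached the parent node."""
    labels = Counter(label for _, label in parent_examples)
    return labels.most_common(1)[0][0]


# A branch with no examples falls back to the parent's plurality class.
default_label = plurality_value(examples)
print(len(all_configs), len(unseen), default_label)  # 27 23 yes
```

Even in this tiny case 23 of the 27 combinations have no examples, so a split that reaches one of them has to use the default.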

Numerically representing Nominal Data whilst retaining data semantics

I have a dataset of nominal and numerical features. I want to be able to represent this dataset entirely numerically if possible.
Ideally I would be able to do this for an n-ary nominal feature. I realize that in the binary case, one could represent the two nominal values with integers. However, when a nominal feature can take many distinct values, how would this be possible, if at all?
There are a number of techniques to "embed" categorical attributes as numbers.
For example, given a categorical variable that can take the values red, green and blue, we can trivially encode this as three attributes isRed={0,1}, isGreen={0,1} and isBlue={0,1}.
While this is popular, and will obviously "work", many people fall for the fallacy of assuming that afterwards numerical processing techniques will produce sensible results.
If you run e.g. k-means on a dataset encoded this way, the result will likely not be too meaningful afterwards. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5 - you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0.
I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited and the data will not give you all the mathematical assumptions that you need to benefit from this view (e.g. metric spaces).
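The problem with means of one-hot vectors is easy to reproduce. The sketch below uses a made-up list of ten colors and computes the centroid that a method like k-means would report for them.

```python
# Hypothetical data: 3 red, 2 green and 5 blue points in one cluster.
colors = ["red"] * 3 + ["green"] * 2 + ["blue"] * 5
categories = ["red", "green", "blue"]

# One-hot encode: one (isRed, isGreen, isBlue) row per data point.
one_hot = [[1 if color == category else 0 for category in categories] for color in colors]

# Column-wise mean, i.e. the cluster centroid.
centroid = [sum(column) / len(one_hot) for column in zip(*one_hot)]
print(centroid)  # [0.3, 0.2, 0.5]
```

The centroid is a valid vector, but it does not correspond to any color in the original data.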
Don't encode nominal attributes as integers.
The exception is a nominal feature with only two possible values: there it is OK to use any two different integers (for example 1 and 3).
But if there are more than two values, integers cannot be used. Let's say we assigned 1, 2 and 3 to three values. The differences then imply that 1 and 2, and 2 and 3, are more closely related than 1 and 3, an ordering the original values do not have.
Rather, use a separate binary feature for each value of each nominal attribute. So the answer to your question is: it is not possible, or at least not wise.
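A quick check of the distance argument, with red, green and blue as stand-in values (they are not from the question). Integer codes make some pairs closer than others, while binary features keep every pair equally far apart.

```python
from itertools import combinations
from math import dist, isclose, sqrt

integer_code = {"red": (1,), "green": (2,), "blue": (3,)}
binary_code = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1)}

pairs = list(combinations(integer_code, 2))

# Euclidean distance between every pair of values under each encoding.
integer_distances = {pair: dist(integer_code[pair[0]], integer_code[pair[1]]) for pair in pairs}
binary_distances = {pair: dist(binary_code[pair[0]], binary_code[pair[1]]) for pair in pairs}

print(integer_distances)  # red-blue is twice as far apart as red-green
print(binary_distances)   # every pair is sqrt(2) apart
```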
If you use pandas, you can call pd.get_dummies() on your nominal value column. This turns a column with N unique values into N new columns (or N-1 if you pass drop_first=True), each indicating with a 1 or a 0 whether that value is present.
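Example:
import pandas as pd
s = pd.Series(list('abca'))
# dtype=int gives 1/0 columns; recent pandas versions default to True/False
pd.get_dummies(s, dtype=int)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0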
