I have a classification problem where my data points are a set of blocks. One of the attributes I can use for block classification is a tag, which is essentially the block number of another block. The blocks also have other attributes (such as size) that can be used for classification. The "tag" attribute in my data set can be used as follows: if two blocks have tags (block numbers) that belong to the same cluster, the blocks (data points) should be clustered together. However, I do not know beforehand which cluster the tagged block number will belong to.
Block 1 [Tag 4] size 10
Block 2 [Tag 3] size 20
Block 3 [Tag 1] size 100
Block 4 [Tag 2] size 110
Here, based on the Tag attribute, Blocks 1 and 2 tag Blocks 4 and 3 respectively, and Blocks 3 and 4 tag Blocks 1 and 2 respectively. Hence, Blocks 1 and 2 can belong to cluster id 1, and Blocks 3 and 4 can belong to cluster id 2. Also, the sizes of Blocks 1 and 2 are more similar to each other than to the sizes of Blocks 3 and 4. The end result of classification should be:
cluster id 1: Block 1 , Block 2
cluster id 2: Block 3 , Block 4
Is there a way to classify such data points? As I understand it, a Naive Bayes classifier considers each attribute to be independent of the others. Here, the attribute (tag) depends on a future outcome: the cluster id to which the tagged block number will belong. What form/class of clustering algorithms should I look for to solve this problem?
One approach I can think of is running k-means using the other attributes, such as size, and then, once I approximately know the cluster ids, replacing each tag with the cluster id of the tagged block and using that as an attribute for classification (see the sketch below). Are there better approaches for building classifiers where attributes depend on the resulting clusters themselves?
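A rough, untested sketch of this two-stage idea, assuming scikit-learn's KMeans and using the toy blocks above (the second pass simply reruns k-means on the combined features):

import numpy as np
from sklearn.cluster import KMeans  # assumed available

# block id -> (tagged block id, size), from the example above
blocks = {1: (4, 10), 2: (3, 20), 3: (1, 100), 4: (2, 110)}

# Step 1: cluster on size alone to get approximate cluster ids.
sizes = np.array([[size] for _, size in blocks.values()])
size_cluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(sizes)
cluster_of = dict(zip(blocks.keys(), size_cluster))

# Step 2: replace each block's tag by the cluster id of the tagged block
# and use (size, tag cluster) as the feature vector for a second pass.
features = np.array([[size, cluster_of[tag]] for tag, size in blocks.values()])
final = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(dict(zip(blocks.keys(), final)))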
Any help would be appreciated.
This objective does not make sense.
Your four blocks and tags form a cycle:
1 -> 4 -> 2 -> 3 -> 1
Why would it make sense to break this into two groups, 1+2 and 3+4?
k-means and other algorithms will not be of much help here. You need to find some formal property of what makes a good solution, and then find an algorithm that optimizes that property. k-means minimizes squared deviations - how is that going to help your problem?
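To see the cycle concretely, here is a small sketch (assuming the networkx library; not part of your pipeline) that builds the tag graph and lists its connected components:

import networkx as nx  # assumed available; any graph library would do

# block -> tagged block, from the example in the question
tags = {1: 4, 2: 3, 3: 1, 4: 2}

g = nx.DiGraph()
g.add_edges_from(tags.items())

# Ignoring edge direction, all four blocks end up in a single component,
# so the tag structure alone gives no reason to split them into 1+2 and 3+4.
print(list(nx.weakly_connected_components(g)))  # -> [{1, 2, 3, 4}]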
Related
I have a problem. I would like to use a classification algorithm. For this I have a column materialNumber; as the name suggests, the column represents the material number.
How could I use that as a feature for my Machine Learning algorithm?
I cannot use it, e.g., as a one-hot encoding matrix, because there are too many different material numbers (~4500 unique material numbers).
How can I use this column in a classification algorithm? Do I need to standardize/normalize it? I would like to use a RandomForest classifier.
   customerId  materialNumber
0           1          1234.0
1           1          4562.0
2           2          1234.0
3           2          4562.0
4           3          1547.0
5           3          1547.0
Here you can group material numbers by categorizing them. If you want to use a categorical variable in a machine learning algorithm, as you mentioned, you have to use the "one-hot encoding" method. But here, as the number of unique material number values grows, the number of columns in your data will also grow.
For example, you have a material number like this:
material_num_list=[1,2,3,4,5,6,7,8,9,10]
Suppose some of the numbers are similar to each other, for example:
[1,5,6,7], [2,3,8], [4,9,10]
We ourselves can assign values to these numbers:
[1,5,6,7] --> A
[2,3,8] --> B
[4,9,10] --> C
As you can see, the number of categories has decreased, and we can do "one-hot encoding" with far fewer columns.
But here, the data set needs to be examined well and this grouping process needs to be done in a reasonable way. It might work if you can categorize the material numbers as I mentioned.
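A rough sketch of this idea in pandas (the group assignments below are made up purely for illustration; in practice they must come from domain knowledge about the materials):

import pandas as pd

df = pd.DataFrame({
    "customerId":     [1, 1, 2, 2, 3, 3],
    "materialNumber": [1234.0, 4562.0, 1234.0, 4562.0, 1547.0, 1547.0],
})

# Hypothetical grouping of material numbers into a few coarse categories.
material_group = {1234.0: "A", 4562.0: "B", 1547.0: "C"}
df["materialGroup"] = df["materialNumber"].map(material_group)

# One-hot encode the (much smaller) set of groups instead of ~4500 raw numbers.
df = pd.get_dummies(df, columns=["materialGroup"])
print(df)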
I have studied association rules and know how to implement the algorithm on the classic basket of goods problem, such as:
Transaction ID   Potatoes   Eggs   Milk
A                1          0      1
B                0          1      1
In this problem each item has a binary identifier. 1 indicates the basket contains the good, 0 indicates it does not.
But what would be the best way to model a basket which can contain many of the same good? E.g., take the below, very unrealistic example.
Transaction ID   Potatoes   Eggs   Milk
A                5          0      178
B                0          35     7
Using binary indicators in this case would obviously lose a lot of information, and I am seeking a model that takes into account not only the presence of items in the basket but also the frequency with which the items occur.
What would be a suitable algorithm for this problem?
In my actual data there are over one hundred items and, based on the profile of a user's basket, I would like to calculate the probabilities of the customer consuming the other available items.
An alternative is to use binary indicators but constructing them in a more clever way.
The idea is to set the indicator only when the amount is above the central value, i.e. when it is actually significant. If everyone buys 3 loaves of bread on average, does it make sense to flag someone as a "bread-lover" for buying two or three?
The central value can be a plain arithmetic mean, a mean with outliers removed, or the median.
Instead of:
binarize(x)  = 0 if x = 0,
               1 otherwise
you can use
binarize*(x) = 0 if x <= central(X),
               1 otherwise
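A minimal sketch of this thresholded binarization (using the median as the central value; the data and variable names are made up):

import numpy as np

# Toy quantity matrix: rows = transactions, columns = Potatoes, Eggs, Milk.
X = np.array([[5, 0, 178],
              [0, 35, 7],
              [2, 12, 20]])

# Plain presence/absence indicator: binarize(x).
binary = (X > 0).astype(int)

# Thresholded indicator binarize*(x): flag an item only when the amount
# exceeds the per-item central value (here the median over transactions).
central = np.median(X, axis=0)
binary_star = (X > central).astype(int)

print(binary_star)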
If you really want to have probabilities, I think the way to go is to encode your data in a probabilistic way. Bayesian or Markov networks might be a feasible choice. Nevertheless, without a reasonable structure this will be computationally extremely expensive. For three item types, however, it seems feasible.
If you have many more item types, I would try a neural network autoencoder. If there is some dependency in the data, it will discover it.
For the above example you could use a network with three input, two hidden and three output neurons.
A little fancier would be to use 3 fully connected layers with dropout in the middle layer.
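For the three-item example above, a minimal autoencoder sketch (assuming TensorFlow/Keras is available; the 3-2-3 layer sizes follow the suggestion above, everything else is made up):

import numpy as np
import tensorflow as tf  # assumed available

# Toy basket counts from the example above: columns = Potatoes, Eggs, Milk.
X = np.array([[5.0, 0.0, 178.0],
              [0.0, 35.0, 7.0]])
X = X / X.max(axis=0)  # scale each item to [0, 1] so sigmoid outputs can match

# Three inputs, a two-neuron bottleneck, three outputs.
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(2, activation="relu"),
    tf.keras.layers.Dense(3, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=500, verbose=0)  # learn to reconstruct the inputs
print(autoencoder.predict(X))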
I was reading the topic of decision trees (page 720) in the book Artificial Intelligence: A Modern Approach, 3rd edition. The book describes some cases that may occur after we split the training set (examples) by choosing an attribute. One of the cases mentioned is:
If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the plurality classification of all the examples that were used in constructing the node’s parent.
I understand that by plurality classification they mean majority rule. But I am unable to understand the above case, i.e. when it could occur. Could someone give an example of a decision tree where the above case becomes true?
Think of the problem as constructing a 2D table of occurrence counts where the column represents some feature or class to be considered and the rows represent particular configurations of other variables.
for example,
X Y Z | class counts
------+-------------
1 1 1 | ...
1 1 2 | ...
1 1 3 | ...
The table represents the joint distribution of the training set.
A particular combination of X, Y and Z (say 1, 3, 1) may not have been seen during training. The more variables you have, the more likely you are to encounter unseen combinations. If you have 10 variables, each with two states, then there are 2^10 = 1024 possible configurations of those variables. If there are three states for each, the number of configurations is 3^10, and so on.
Frankly, I would use 1/numberCols for any particular column with a missing row, as you don't really have any information regarding it. You could use 1/Sum(rows) for each column, but this may unnecessarily bias the result. It depends on the data.
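As a toy illustration of the empty-branch case (a sketch of my own, not from the book): suppose the tree has already split on X and we now split the X=1 node on Y, but no training example has X=1 and Y=2.

from collections import Counter

def plurality(labels):
    """Majority (plurality) class among a list of labels."""
    return Counter(labels).most_common(1)[0][0]

# Toy training set: attribute values and a class label.
# Note that the combination X=1, Y=2 never occurs.
examples = [
    ({"X": 1, "Y": 1}, "yes"),
    ({"X": 1, "Y": 1}, "yes"),
    ({"X": 2, "Y": 2}, "no"),
]

# Examples reaching the node for X=1 (the parent of the new split).
parent = [(e, c) for e, c in examples if e["X"] == 1]

# Splitting that node on Y leaves the Y=2 branch with no examples at all.
branch_y2 = [(e, c) for e, c in parent if e["Y"] == 2]

if not branch_y2:
    # No example was observed for X=1, Y=2, so the leaf gets the plurality
    # classification of the parent's examples ("yes" here).
    print(plurality([c for _, c in parent]))  # -> yes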
I have a dataset of nominal and numerical features. I want to be able to represent this dataset entirely numerically if possible.
Ideally I would be able to do this for an n-ary nominal feature. I realize that in the binary case, one could represent the two nominal values with integers. However, when a nominal feature can have many permutations, how would this be possible, if at all?
There are a number of techniques to "embed" categorical attributes as numbers.
For example, given a categorical variable that can take the values red, green and blue, we can trivially encode this as three attributes isRed={0,1}, isGreen={0,1} and isBlue={0,1}.
While this is popular, and will obviously "work", many people fall for the fallacy of assuming that afterwards numerical processing techniques will produce sensible results.
If you run e.g. k-means on a dataset encoded this way, the result will likely not be too meaningful afterwards. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5 - you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0.
I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited and the data will not give you all the mathematical assumptions that you need to benefit from this view (e.g. metric spaces).
Don't do this: "I'm trying to encode certain nominal attributes as integers."
The exception is when a nominal feature has only two permutations; then it is OK to use any two different integers (for example 1 and 3) for them.
But if there are more than two permutations, integers cannot be used. Let's say we assigned 1, 2 and 3 to three permutations. The encoding then implies a stronger relation between 1-2 and 2-3 than between 1-3, purely because of the numeric differences.
Rather, use a separate binary feature for each value of each nominal attribute. Thus, the answer to your question: it is not possible (or at least not wise).
If you use pandas, you can use a function called .get_dummies() on your nominal value column. This will turn a column with N unique values into N new columns (or N-1 if you pass drop_first=True), each indicating with a 1 or a 0 whether that value is present.
Example:
import pandas as pd

s = pd.Series(list('abca'))
pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
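A minimal sketch of this one-vs-all scheme with scikit-learn SVMs (the data and class labels below are made up for illustration):

import numpy as np
from sklearn.svm import SVC  # assumed available

# Toy data: 2-D points labelled 1, 2 or 3.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
              [1.0, 1.0], [1.2, 1.1], [1.1, 0.9], [0.9, 1.2],
              [2.0, 2.0], [2.2, 2.1], [1.9, 2.2], [2.1, 1.8]])
y = np.array([1] * 4 + [2] * 4 + [3] * 4)

# One binary "class k vs. not class k" SVM per label.
classifiers = {k: SVC(kernel="linear", probability=True).fit(X, (y == k).astype(int))
               for k in (1, 2, 3)}

x_new = np.array([[1.9, 2.0]])
# Probability of "belongs to class k" from each independent classifier;
# these need not sum to 1 because the classifiers don't know about each other.
scores = {k: clf.predict_proba(x_new)[0, 1] for k, clf in classifiers.items()}
print(scores)
print(max(scores, key=scores.get))  # pick the class with the highest score -> 3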
If your output labels are ordered (with some kind of meaningful, rather than arbitrary, ordering), another option opens up. For example, in finance you might want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
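A toy sketch of that trick (the feature and the choice of regressor are made up; the ordinal-to-integer mapping is the point):

import numpy as np
from sklearn.linear_model import LinearRegression  # assumed available

# Map ordinal labels onto integers so a regressor can be (ab)used.
label_to_value = {"SELL": -1, "HOLD": 0, "BUY": 1}
value_to_label = {v: k for k, v in label_to_value.items()}

X = np.array([[0.1], [0.4], [0.5], [0.9]])  # a single made-up feature
y = np.array([label_to_value[l] for l in ["SELL", "HOLD", "HOLD", "BUY"]])

reg = LinearRegression().fit(X, y)
raw = reg.predict([[0.8]])[0]                    # continuous prediction, ~0.79 here
pred = int(np.clip(round(float(raw)), -1, 1))    # round back to the nearest ordinal value
print(value_to_label[pred])                      # -> BUY for this toy data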
Another approach is the Crammer-Singer method ("On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite ordered set (such as the set of integers) is called ordinal regression. Usually this is done by mapping ranges of a continuous value onto elements of the set (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a).