In the part where we create the trees (iTrees), I don't understand why we use the following classification code (much like what you'd see in decision tree classification):
import numpy as np

def classify_data(data):
    # data is a pandas DataFrame; the label is assumed to be the last column
    label_column = data.values[:, -1]
    unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)
    # Return the most frequent unique value (the majority "class")
    index = counts_unique_classes.argmax()
    classification = unique_classes[index]
    return classification
We take the last column and return the unique value with the largest count, i.e. the majority class? That makes sense for decision trees, but I don't understand why we use it in an isolation forest.
And the whole iTree code looks like the following:
def isolation_tree(data, counter=0, max_depth=50, random_subspace=False):
    # End the recursion if max depth is reached or the point is isolated
    if (counter == max_depth) or data.shape[0] <= 1:
        classification = classify_data(data)
        return classification
    else:
        # Counter
        counter += 1
        # Select random feature
        split_column = select_feature(data)
        # Select random split value
        split_value = select_value(data, split_column)
        # Split data
        data_below, data_above = split_data(data, split_column, split_value)
        # Instantiate sub-tree
        question = "{} <= {}".format(split_column, split_value)
        sub_tree = {question: []}
        # Recursive part
        below_answer = isolation_tree(data_below, counter, max_depth=max_depth)
        above_answer = isolation_tree(data_above, counter, max_depth=max_depth)
        if below_answer == above_answer:
            sub_tree = below_answer
        else:
            sub_tree[question].append(below_answer)
            sub_tree[question].append(above_answer)
        return sub_tree
Edit: Here is an example of the data and the output of running classify_data:
      feat1     feat2
0  3.300000  3.300000
1 -0.519349  0.353008
2 -0.269108 -0.909188
3 -1.887810 -0.555841
4 -0.711432  0.927116
label column: [ 3.3         0.3530081  -0.90918776 -0.55584138  0.92711613]
unique_classes, counts_unique_classes: [-0.90918776 -0.55584138  0.3530081   0.92711613  3.3] [1 1 1 1 1]
classification: -0.9091877609469025
So I later found out that the classification part was only there for testing purposes; it is useless here. If you use this code (which is popular on Medium), please remove the classify_data function, as it serves no purpose in an isolation forest.
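If you want the leaves to return something genuinely useful instead, one option (my own sketch, not part of the Medium code) is to return the number of points left at the leaf and credit unsplit leaves with the standard path-length normalizer from the isolation forest paper:

import math

def leaf_size(data):
    # An isolation tree leaf only needs to know how many points were
    # still not isolated when the tree stopped growing.
    return data.shape[0]

def c(n):
    # Average path length of an unsuccessful BST search over n points;
    # a leaf holding n > 1 points is credited with c(n) extra depth.
    if n <= 1:
        return 0.0
    euler_gamma = 0.5772156649
    return 2.0 * (math.log(n - 1) + euler_gamma) - 2.0 * (n - 1) / n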
I'm writing a game using the Corona SDK in Lua. I'm having a hard time coming up with the logic for a system like this:
I have different items. I want some items to have a 1/1000 chance of being chosen (a unique item), some to have 1/10, some 2/10, etc.
I was thinking of populating a table and picking a random item from it. For example, I'd add 100 copies of item "X" to the table and then 1 copy of item "Y"; choosing a random index from those 101 entries roughly achieves what I want, but I was wondering if there are other ways of doing it.
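For concreteness, here's roughly what I had in mind (a quick sketch; "X" and "Y" are placeholder items):

-- Build a pool where frequency encodes probability:
-- 100 copies of the common item, 1 copy of the rare one.
local pool = {}
for i = 1, 100 do
    pool[#pool + 1] = "X"
end
pool[#pool + 1] = "Y"

-- Draw uniformly; "Y" comes up with probability 1/101.
local picked = pool[math.random(#pool)]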
items = {
    Cat = { probability = 100/1000 }, -- i.e. 1/10
    Dog = { probability = 200/1000 }, -- i.e. 2/10
    Ant = { probability = 699/1000 },
    Unicorn = { probability = 1/1000 },
}
function getRandomItem()
    local p = math.random()
    local cumulativeProbability = 0
    for name, item in pairs(items) do
        cumulativeProbability = cumulativeProbability + item.probability
        if p <= cumulativeProbability then
            return name, item
        end
    end
end
You want the probabilities to add up to 1. So if you increase the probability of an item (or add an item), you'll want to subtract from other items. That's why I wrote 1/10 as 100/1000: it's easier to see how things are distributed and to update them when you have a common denominator.
You can confirm you're getting the distribution you expect like this:
local count = {}
local iterations = 1000000
for i = 1, iterations do
    local name = getRandomItem()
    count[name] = (count[name] or 0) + 1
end
for name, n in pairs(count) do
    print(name, n / iterations)
end
I believe this answer is a lot easier to work with, albeit slightly slower in execution.
local chancesTbl = {
    -- You can fill these with any non-negative integers you want;
    -- no need to make them sum to anything specific.
    ["a"] = 2,
    ["b"] = 1,
    ["c"] = 3,
}

local function GetWeightedRandomKey()
    local sum = 0
    for _, chance in pairs(chancesTbl) do
        sum = sum + chance
    end
    local rand = math.random(sum)
    local winningKey
    for key, chance in pairs(chancesTbl) do
        winningKey = key
        rand = rand - chance
        if rand <= 0 then break end
    end
    return winningKey
end
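You can verify the weights the same way as in the previous answer; a quick sketch:

-- Tally many draws and compare against chance/sum (here sum = 6).
local tally = {}
local rolls = 100000
for i = 1, rolls do
    local key = GetWeightedRandomKey()
    tally[key] = (tally[key] or 0) + 1
end
for key, n in pairs(tally) do
    print(key, n / rolls)  -- expect "a" near 2/6, "b" near 1/6, "c" near 3/6
end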
I'm doing a project comparing the effectiveness of various classification algorithms, but I'm stuck on a frustrating point. The data may be found at http://archive.ics.uci.edu/ml/datasets/Adult and the classification problem is whether or not a person makes over 50k a year based on their census data.
Two example entries are as follows:
45, Private, 98092, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
50, Self-emp-not-inc, 386397, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
I'm familiar with using Euclidean distance to calculate the difference between vectors, but I'm not sure how to work with a mix of continuous and discrete attributes. Are there any effective methods for representing the difference between two vectors in a meaningful way? I'm having a hard time wrapping my head around how differences in a large-valued attribute like the third one (a weight calculated by the people who extracted the data set, such that records with similar weights should have similar attributes) can preserve meaning alongside a discrete feature like male or female, which contributes a Euclidean distance of at most 1 if I understand the method correctly. I'm sure some attributes could be removed, but I don't want to remove something that factors significantly into classification. I'm tackling k-NN first once I get this figured out, then a Bayesian classifier, and finally a decision tree model like C4.5 or ID3 if I have the time.
Sure, you can extend Euclidean distance in any number of ways. The simplest extension would be the following rule:
distance = 0 in that coordinate if there's a match, 1 otherwise
The challenge will be making the concept of distance "relevant" for the k-NN follow-up. In some cases (e.g. education), I think it will be best to map the discrete variable into a continuous one, such as years of education. So you'll need to write a function which maps e.g. "HS-grad" to 12, "Bachelors" to 16, something like that.
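For instance, a small Python sketch of that rule together with the education mapping (the helper names and exact year values are illustrative):

education_years = {"HS-grad": 12, "Some-college": 14, "Bachelors": 16, "Masters": 18}

def feature_distance(a, b, is_numeric):
    # Numeric features: absolute difference (scale the columns first).
    # Categorical features: 0 on a match, 1 otherwise.
    if is_numeric:
        return abs(a - b)
    return 0.0 if a == b else 1.0

def mixed_distance(x, y, numeric_mask):
    # Euclidean distance with the 0/1 rule substituted per coordinate.
    total = sum(feature_distance(a, b, m) ** 2
                for a, b, m in zip(x, y, numeric_mask))
    return total ** 0.5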
Beyond that, using k-NN directly isn't going to work because the idea of "distance" across multiple dissimilar dimensions isn't well defined. I think you'll be better off throwing some of these dimensions away or weighting them differently. I don't know what the third number in your dataset (e.g. 98092) means, but if you use naive Euclidean distance it would be extremely overweighted compared to other dimensions such as age.
I'm not a machine learning expert, but I would personally be tempted to start k-NN on a reduced dimensionality dataset where you just pick some broad demographics (e.g. age, education, marital status) and ignore the trickier/"noisier" categories.
You need to code your categorical variables as 1-of-n binary variables (n choices for the variable, with one and only one active at a time). Then standardise your features: for each feature, subtract its mean and divide by its standard deviation. Or normalise into the range 0-1. It's not perfect, but this will at least make the dimensions comparable.
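A minimal numpy sketch of both steps (the function names are mine):

import numpy as np

def one_hot(values):
    # 1-of-n coding: one 0/1 column per distinct level of the variable.
    levels = sorted(set(values))
    return np.array([[1.0 if v == level else 0.0 for level in levels]
                     for v in values])

def standardize(column):
    # Subtract the mean, divide by the standard deviation.
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()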
Create an individual Map for each categorical column and use the map to convert each value to a double.
def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}
def getLablelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}
val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct
val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)
val featureVector = census.map { line =>
  val fields = line.split(", ")
  // fields(14) is the label itself, so it is left out of the feature vector
  LabeledPoint(getLablelValue(fields(14)), Vectors.dense(
    fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
    gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
    jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
    genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
    fields(12).toDouble, countryMap(fields(13))))
}
Being fairly new to Stata, I'm having difficulty figuring out how to do the following:
I have time-series data on selling price (p) and quantity sold (q) for 10 products in a single data file (i.e., 20 variables, p01-p10 and q01-q10). I am struggling to find the appropriate Stata command that computes a sales revenue (pq) time series for each of these 10 products (i.e., pq01-pq10).
Many thanks for your help.
forval i = 1/10 {
    local j : display %02.0f `i'
    gen pq`j' = p`j' * q`j'
}
A standard loop over 1/10 won't get you the leading zero in 01/09. For that we need to use an appropriate format. See also
@article{pr0051,
  author    = "Cox, N. J.",
  title     = "Stata tip 85: Looping over nonintegers",
  journal   = "Stata Journal",
  publisher = "Stata Press",
  address   = "College Station, TX",
  volume    = "10",
  number    = "1",
  year      = "2010",
  pages     = "160-163(4)",
  url       = "http://www.stata-journal.com/article.html?article=pr0051"
}
(added later) Another way to do it is
local j = string(`i', "%02.0f")
That makes it a bit more explicit that you are mapping from numbers 1,...,10 to strings "01",...,"10".
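Putting that together, the full loop with this second approach would be:

forval i = 1/10 {
    local j = string(`i', "%02.0f")
    gen pq`j' = p`j' * q`j'
}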
I have this data set:
Instance num 0 : 300,24,'Social worker','Computer sciences',Music,10,5,5,1,5,''
Instance num 1 : 1000,20,Student,'Computer engineering',Education,10,5,5,5,5,Sony
Instance num 2 : 450,28,'Computer support specialist',Business,Programming,10,4,1,0,4,Lenovo
Instance num 3 : 1000,20,Student,'Computer engineering','3d Design',1,1,2,1,3,Toshiba
Instance num 4 : 1000,20,Student,'Computer engineering',Programming,2,5,1,5,4,Dell
Instance num 5 : 800,16,Student,'Computer sciences',Education,8,4,3,4,4,Toshiba
and I want to classify it using SMO and other multi-class classifiers, so I convert all the nominal values to numeric using this code:
int[] indices = {2, 3, 4, 10}; // indices of the nominal columns
for (int i = 0; i < indices.length; i++) {
    int attInd = indices[i];
    Attribute att = data.attribute(attInd);
    for (int n = 0; n < att.numValues(); n++) {
        data.renameAttributeValue(att, att.value(n), "" + n);
    }
}
and the result is:
Instance num 0 : 300,24,0,0,0,10,5,5,1,5,0
Instance num 1 : 1000,20,1,1,1,10,5,5,5,5,1
Instance num 2 : 450,28,2,2,2,10,4,1,0,4,2
Instance num 3 : 1000,20,1,1,3,1,1,2,1,3,3
Instance num 4 : 1000,20,1,1,2,2,5,1,5,4,4
Instance num 5 : 800,16,1,0,1,8,4,3,4,4,3
after applying the "Normalize" filter the result will be like this:
Instance num 0 : 0,0.666667,0,0,0,1,1,1,0.2,1,0
Instance num 1 : 1,0.333333,1,1,1,1,1,1,1,1,1
Instance num 2 : 0.214286,1,2,2,2,1,0.75,0,0,0.5,2
Instance num 3 : 1,0.333333,1,1,3,0,0,0.25,0.2,0,3
Instance num 4 : 1,0.333333,1,1,2,0.111111,1,0,1,0.5,4
Instance num 5 : 0.714286,0,1,0,1,0.777778,0.75,0.5,0.8,0.5,3
The problem is that the converted columns are still nominal (strings), so the "Normalize" filter will not normalize them...
Any ideas?
And my second question: what should I use as a multi-class classifier besides SMO?
Don't convert nominals/categoricals into floats (or integers) and then normalize them. It's meaningless: garbage in, garbage out. Treating them as continuous numbers or numeric vectors gives nonsense results like "the average of 'Engineering' and 'Nursing' is 'Architecture'".
The right way to treat nominals/categoricals is to convert each one into dummy variables (also known as 'dummy coding' or 'dichotomizing'). Say the Occupation column (or Major, or Elective, or whatever) has K levels; then you create either K or (K-1) binary variables which are 0 everywhere except for the single column corresponding to the value, which contains a 1.
Look up Weka documentation to find the right function call.
cf. e.g. SO: Dummy Coding of Nominal Attributes (for Logistic Regression)
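For example, a minimal sketch using Weka's NominalToBinary filter (assuming data holds your loaded Instances; the filter calls can throw Exception, and you should check the docs for your Weka version):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;

// Expand each nominal attribute into 0/1 indicator attributes.
NominalToBinary toBinary = new NominalToBinary();
toBinary.setInputFormat(data);                    // data: your Instances
Instances binaryData = Filter.useFilter(data, toBinary);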
I believe the best way to convert a string attribute into numeric form is the filter weka.filters.unsupervised.attribute.StringToWordVector.
After doing so, you can apply the "Normalize" filter and then a classifier such as weka.classifiers.functions.LibSVM.