Dask regex extract comparison failing with NotImplementedError

I have a Dask dataframe that looks like this:
class1                                  statement              class2               value
<geoentity_Pic_de_Font_Blanca_2986043>  <hasLatitude>          42.64991^^<degrees>  42.64991
<geoentity_Pic_de_Font_Blanca_2986043>  <hasLongitude>         1.53335^^<degrees>   1.53335
<geoentity_Pic_de_Font_Blanca_2986043>  <hasGeonamesEntityId>  2986043              NaN
<geoentity_Pic_de_Font_Blanca_2986043>  rdfs:label             Pic de Font Blanca   NaN
I'm trying to check whether the number in class1 matches the one in class2 for all the <hasGeonamesEntityId> rows, so that I can get rid of those rows, since they would then carry unnecessarily duplicated data.
I tried:
df[(df['statement'] == '<hasGeonamesEntityId>') & (df['class1'].str.extract(r'_(\d+)>$') == df['class2'])].head()
but this gives me the following error:
E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\dask\dataframe\core.py in __getitem__(self, key)
3347 graph = HighLevelGraph.from_collections(name, dsk, dependencies=[self, key])
3348 return new_dd_object(graph, name, self, self.divisions)
-> 3349 raise NotImplementedError(key)
3350
3351 def __setitem__(self, key, value):
NotImplementedError: Dask DataFrame Structure:
                      0     1
npartitions=442
                   bool  bool
                    ...   ...
...                 ...   ...
                    ...   ...
Dask Name: and_, 3978 tasks
My dtypes are:
class1 category
statement category
class2 object
value category
I'm not sure why this is failing, since the extract on its own seems to return the correct substring. Does anybody know what I'm doing wrong?

It's hard to say without a reproducible example, but it looks like you're trying to index a Dask DataFrame with another Dask DataFrame, which isn't supported and probably isn't what you want.
Using just pandas
In [18]: df = pd.DataFrame({"A": ['a1', 'b2', 'c3']})
In [19]: df[df.A.str.extract(r'(\d)') == '1']
Out[19]:
A
0 NaN
1 NaN
2 NaN
That's because .str.extract returns a DataFrame by default. Set expand=False to get a 1-D Series:
In [20]: df[df.A.str.extract(r'(\d)', expand=False) == '1']
Out[20]:
A
0 a1
This works for Dask as well:
In [21]: df = dd.from_pandas(df, 2)
In [22]: df[df.A.str.extract(r'(\d)', expand=False) == '1']
Out[22]:
Dask DataFrame Structure:
A
npartitions=1
0 object
2 ...
Dask Name: getitem, 5 tasks
In [23]: _.compute()
Out[23]:
A
0 a1
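
Applied back to the original question, a minimal sketch (assuming df is the Dask DataFrame from the question, and that class2 stores the id as a string; otherwise add an .astype(str)):

# expand=False makes .str.extract return a Series, so the comparison
# yields a Boolean Series that Dask can use as an indexer
mask = (df['statement'] == '<hasGeonamesEntityId>') & (
    df['class1'].str.extract(r'_(\d+)>$', expand=False) == df['class2']
)
print(df[mask].head())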

Related

Isolation Tree algorithm question about classification

In the part where we create the trees (iTrees), I don't understand why we are using the following classification code (much like what is done in decision tree classification):
def classify_data(data):
    label_column = data.values[:, -1]
    unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)
    index = counts_unique_classes.argmax()
    classification = unique_classes[index]
    return classification
We are choosing the last column and the unique value with the largest count? That might make sense for decision trees, but I don't understand why we use it in an isolation forest?
And the whole iTree code is looking like the following:
def isolation_tree(data, counter=0, max_depth=50, random_subspace=False):
    # End loop if max depth or if isolated
    if (counter == max_depth) or data.shape[0] <= 1:
        classification = classify_data(data)
        return classification
    else:
        # Counter
        counter += 1
        # Select random feature
        split_column = select_feature(data)
        # Select random value
        split_value = select_value(data, split_column)
        # Split data
        data_below, data_above = split_data(data, split_column, split_value)
        # Instantiate sub-tree
        question = "{} <= {}".format(split_column, split_value)
        sub_tree = {question: []}
        # Recursive part
        below_answer = isolation_tree(data_below, counter, max_depth=max_depth)
        above_answer = isolation_tree(data_above, counter, max_depth=max_depth)
        if below_answer == above_answer:
            sub_tree = below_answer
        else:
            sub_tree[question].append(below_answer)
            sub_tree[question].append(above_answer)
        return sub_tree
Edit: Here is an example of the data and running classify_data:
feat1 feat2
0 3.300000 3.300000
1 -0.519349 0.353008
2 -0.269108 -0.909188
3 -1.887810 -0.555841
4 -0.711432 0.927116
label column: [ 3.3         0.3530081  -0.90918776 -0.55584138  0.92711613]
unique_classes, counts_unique_classes: [-0.90918776 -0.55584138  0.3530081   0.92711613  3.3       ] [1 1 1 1 1]
classification: -0.9091877609469025
I later found out that the classification part was only there for testing purposes; it is worthless. If you use this code (popular on Medium), please remove the classification function, as it serves no purpose.
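
If you want the tree to be useful for isolation-forest-style anomaly scoring, a common alternative (a sketch, not part of the original code) is to have the leaf record the depth reached and the number of points left, rather than a majority class:

def isolation_leaf(data, counter):
    # An isolation tree needs no class label at its leaves; anomaly
    # scores are derived from how quickly a point gets isolated, so
    # record the depth reached and how many points remain unseparated.
    return {"depth": counter, "size": data.shape[0]}

The base case in isolation_tree would then return isolation_leaf(data, counter) instead of classify_data(data).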

How to extract decision rules of a random forest in Python

I have one question though. I heard that in R you can use extra packages to extract the decision rules implemented in a random forest; I tried to google the same thing for Python, but without luck. Any help on how to achieve that would be appreciated.
Thanks in advance!
Assuming that you use the sklearn RandomForestClassifier, you can find the individual decision trees as .estimators_. Each tree stores the decision nodes as a number of NumPy arrays under tree_.
Here is some example code which just prints each node in order of the array. In a typical application one would instead traverse by following the children.
import numpy
from sklearn.model_selection import train_test_split
from sklearn import metrics, datasets, ensemble

def print_decision_rules(rf):
    for tree_idx, est in enumerate(rf.estimators_):
        tree = est.tree_
        assert tree.value.shape[1] == 1  # no support for multi-output
        print('TREE: {}'.format(tree_idx))
        iterator = enumerate(zip(tree.children_left, tree.children_right,
                                 tree.feature, tree.threshold, tree.value))
        for node_idx, data in iterator:
            left, right, feature, th, value = data
            # left: index of left child (if any)
            # right: index of right child (if any)
            # feature: index of the feature to check
            # th: the threshold to compare against
            # value: values associated with classes
            # for a classifier, value is 0 except at the index of the class to return
            class_idx = numpy.argmax(value[0])
            if left == -1 and right == -1:
                print('{} LEAF: return class={}'.format(node_idx, class_idx))
            else:
                print('{} NODE: if feature[{}] <= {} then next={} else next={}'.format(node_idx, feature, th, left, right))

digits = datasets.load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target)
estimator = ensemble.RandomForestClassifier(n_estimators=3, max_depth=2)
estimator.fit(Xtrain, ytrain)
print_decision_rules(estimator)
Example output:
TREE: 0
0 NODE: if feature[33] <= 2.5 then next=1 else next=4
1 NODE: if feature[38] <= 0.5 then next=2 else next=3
2 LEAF: return class=2
3 LEAF: return class=9
4 NODE: if feature[50] <= 8.5 then next=5 else next=6
5 LEAF: return class=4
6 LEAF: return class=0
...
We use something similar in emlearn to compile a Random Forest to C code.
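
If you only need readable rules rather than the raw arrays, recent scikit-learn versions (0.21+) also ship a built-in helper; a minimal sketch reusing the estimator fitted above:

from sklearn.tree import export_text

# print an indented, human-readable rule listing for each tree in the forest
for tree_idx, est in enumerate(estimator.estimators_):
    print('TREE: {}'.format(tree_idx))
    print(export_text(est))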

Unable to get pipeline.fit() to work using Sklearn and Keras Wrappers

I am getting a ValueError for parameters (not enough values to unpack: expected 2, got 1). I have a network I want to train:
def build(self):
    numpy.random.seed(self.seed)
    self.estimators.append(('standardize', StandardScaler))
    self.estimators.append(('mlp', KerasClassifier(build_fn=self.build_fn, epochs=50, batch_size=5, verbose=0)))
    self.pipeline = Pipeline(self.estimators)
Now if I want to fit the pipeline to some data, say self.X and self.Y:
self.model = self.pipeline.fit(self.X, self.Y, verbose=1)
I get
Traceback (most recent call last):
  File "C:/Users/jaehan/PycharmProjects/cerebro/cerebro.py", line 257, in <module>
    model.run()
  File "C:/Users/jaehan/PycharmProjects/cerebro/cerebro.py", line 138, in run
    self.model = self.pipeline.fit(self.X, self.Y, verbose=1)
  File "C:\Users\jaehan\AppData\Local\Continuum\anaconda3\envs\py36\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\jaehan\AppData\Local\Continuum\anaconda3\envs\py36\lib\site-packages\sklearn\pipeline.py", line 197, in _fit
    step, param = pname.split('__', 1)
ValueError: not enough values to unpack (expected 2, got 1)
Am I doing something wrong here? I was under the impression I could just run a fit and it would return a history object, which I could save and load at any time.
I even tried...
self.pipeline.fit(self.X, self.Y)
Which throws...
AttributeError: 'numpy.ndarray' object has no attribute 'fit'
I have no idea what is going on here.
Full Code
import pickle
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

class Cerebro:
    def __init__(self):
        self.model = None
        self.build_fn = None
        self.data = None
        self.X = None
        self.Y = None
        # these three are for encoding string values to integer encodings / one-hot encodings
        self.encoder = LabelEncoder()
        self.encodings = {}
        self.one_hot_encodings = {}
        self.seed = numpy.random.seed(7)  # this is to ensure we have reproducible results.
        self.estimators = []
        self.pipeline = None
        self.kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=self.seed)
        self.cross_validation_score = 0.0

    def preprocess(self):
        """
        This method will preprocess the dataset we want to train our network on.
        Example:
            import preprocessing
            ...
            dataset, X, Y = preprocessing.main()
        """
        self.data = pandas.read_csv('src_examples/hwtxn_final_for_influx.txt', sep='\t').values
        self.X = numpy.delete(self.data, 13, axis=1)
        self.Y = self.data[:, 13].astype(numpy.float16)

    def build(self):
        self.build_fn = self.base_model()
        self.preprocess()
        numpy.random.seed(self.seed)
        self.estimators.append(('standardize', StandardScaler()))
        self.estimators.append(('mlp', KerasClassifier(build_fn=self.build_fn, epochs=50, batch_size=5, verbose=0)))
        self.pipeline = Pipeline(self.estimators)

    def run(self):
        """This will actually take the pipeline (preprocessing standardization, model)
        and fit it to our dataset (X, Y). (We don't need test/train since we are using
        stratified k-fold cross validation.)
        Args:
            None
        Returns:
            None
        """
        # this is the 'model'
        # self.pipeline
        print(type(self.pipeline))
        print(self.X.shape)
        self.model = self.pipeline.fit(self.X, self.Y)

    def load(self, fn):
        """This will load a saved model (history object)
        Args:
            fn (filename): represents saved model file
        Returns:
            model (pkl object): represents model
        """
        return pickle.load(open(fn, 'rb'))

    def save(self, fn):
        """This will save a model (history object)
        Args:
            fn (filename): represents a filename to save the model as
        Returns:
            None
        """
        pickle.dump(self.model, open(fn, 'wb'))

    def encode(self, vals, key):
        """This method will encode a list of values and take a key (representing column
        name, or index) to save in the class object (self.encodings).
        This will help us keep track of encodings we have for values we need to translate/decipher.
        Args:
            vals (np.array): array of values to encode
            key (str): key used to encode this particular set of values
        Returns:
            transformed values (np.array) representing the encoded versions of values
        """
        # int encoding for non-int values
        self.encodings[key] = self.encoder.fit_transform(vals)
        return self.encoder.fit_transform(vals)

    def decoder(self, vals, key):
        """This method will decode the integer encodings for class variables. It takes vals,
        a list of values to decode (i.e. [1, 2, 3] -> [apple, pear, orange]).
        It also takes a key (since every decoding has a corresponding encoding) to find
        which encoding scheme to map to.
        Args:
            vals (np.array): array of values to decode
            key (str): key used for encoding the values (for decoding them)
        Returns:
            inverse transform of encoded values (np.array)
        """
        # translate int encodings to original values (encoder.classes_)
        return self.encodings[key].inverse_transform(vals)

    def cross_validate(self):
        """
        This will perform a cross validation score using a stratified k-fold method.
        (Think traditional k-fold but with the values evenly distributed for each subsample.)
        Args:
            None
        Returns:
            None
        """
        self.cross_validation_score = cross_val_score(self.pipeline, self.X, self.Y, cv=self.kfold)
        return self.cross_validation_score

    @staticmethod
    def base_model():
        """
        This will return a base model for us to try. The good thing about this implementation
        is that when we decide we want something more complex, all we have to do is define a
        class function and replace the values in the build f(x).
        Args:
            None
        Returns:
            model (keras.models.Sequential): Keras-based DNN model
        """
        # create model
        model = Sequential()
        model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
        model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    @staticmethod
    def one_hot_encoder(int_encoding):
        """
        This will take an integer encoding of string variables (traditional preprocessing step;
        will probably move this to the preprocessing package).
        Essentially it returns a binary 'one hot' encoding of the values we wish to encode.
        Example:
            # Dataset values
            [apple, orange, pear]
            # Integer encoding
            [1, 2, 3]
            # One-hot encoding
            [[1, 0, 0]
             [0, 1, 0]
             [0, 0, 1]]
        Args:
            None
        Returns:
            Matrix (np.array): matrix representing one-hot vectors for a class of values
        """
        # we might not need this... so for now we will keep it static
        return OneHotEncoder(sparse=False).fit_transform(int_encoding.reshape(len(int_encoding), 1))

if __name__ == '__main__':
    # Step 1 is to initialize the class (with seed == 7)
    model = Cerebro()
    model.build()
    model.cross_validate()
    print("Here are our estimators:\n {}".format(model.estimators))
    print("Here is our pipeline:\n {}".format(model.pipeline))
    model.run()
EDIT
The answer is that KerasClassifier's build_fn argument requires a function reference, not the model itself.
IMHO an error should be thrown for specifically that case.
This is due to the following line:
self.build_fn = self.base_model()
This should actually be:
self.build_fn = self.base_model
KerasClassifier requires a reference to the function which creates the model, but by appending () at the end, you are assigning the actual model to build_fn, which is wrong.
Now, in addition to the above error, I would recommend checking the following lines in your code; if not corrected, they will give errors later when you use the code.
1) self.encodings[key] = self.encoder.fit_transform(vals)
Here you are assigning the transformed data to encodings[key], not the fitted encoder. So when you do this:
self.encodings[key].inverse_transform(vals)
It makes no sense to call inverse_transform() on the transformed data.
inverse_transform() is a method of scikit-learn transformers, but self.encodings[key] holds an ndarray, because you saved the output array from fit_transform().
2) Something similar to 1) is also happening with one_hot_encoder().
The error "AttributeError: 'numpy.ndarray' object has no attribute 'fit'" seems related to 1) and 2).

weka - normalize nominal values

I have this data set:
Instance num 0 : 300,24,'Social worker','Computer sciences',Music,10,5,5,1,5,''
Instance num 1 : 1000,20,Student,'Computer engineering',Education,10,5,5,5,5,Sony
Instance num 2 : 450,28,'Computer support specialist',Business,Programming,10,4,1,0,4,Lenovo
Instance num 3 : 1000,20,Student,'Computer engineering','3d Design',1,1,2,1,3,Toshiba
Instance num 4 : 1000,20,Student,'Computer engineering',Programming,2,5,1,5,4,Dell
Instance num 5 : 800,16,Student,'Computer sciences',Education,8,4,3,4,4,Toshiba
and I want to classify using SMO and other multi-class classifiers, so I convert all the nominal values to numeric using this code:
int[] indices = {2, 3, 4, 10}; // indices of nominal columns
for (int i = 0; i < indices.length; i++) {
    int attInd = indices[i];
    Attribute att = data.attribute(attInd);
    for (int n = 0; n < att.numValues(); n++) {
        data.renameAttributeValue(att, att.value(n), "" + n);
    }
}
and the result is:
Instance num 0 : 300,24,0,0,0,10,5,5,1,5,0
Instance num 1 : 1000,20,1,1,1,10,5,5,5,5,1
Instance num 2 : 450,28,2,2,2,10,4,1,0,4,2
Instance num 3 : 1000,20,1,1,3,1,1,2,1,3,3
Instance num 4 : 1000,20,1,1,2,2,5,1,5,4,4
Instance num 5 : 800,16,1,0,1,8,4,3,4,4,3
after applying the "Normalize" filter the result will be like this:
Instance num 0 : 0,0.666667,0,0,0,1,1,1,0.2,1,0
Instance num 1 : 1,0.333333,1,1,1,1,1,1,1,1,1
Instance num 2 : 0.214286,1,2,2,2,1,0.75,0,0,0.5,2
Instance num 3 : 1,0.333333,1,1,3,0,0,0.25,0.2,0,3
Instance num 4 : 1,0.333333,1,1,2,0.111111,1,0,1,0.5,4
Instance num 5 : 0.714286,0,1,0,1,0.777778,0.75,0.5,0.8,0.5,3
The problem is that the converted columns are still strings, so the "Normalize" filter will not normalize them...
Any ideas?
And my second question: what should I use as a multi-class classifier besides SMO?
Don't convert nominals/categoricals into floats (or integers) and then normalize them. It's meaningless: garbage in, garbage out. Treating them as continuous numbers or numeric vectors gives nonsense results like "the average of 'Engineering' and 'Nursing' is 'Architecture'".
The right way to treat nominals/categoricals is to convert each one into dummy variables (also known as 'dummy coding' or 'dichotomizing'). If the Occupation column (or Major, or Elective, or whatever) has K levels, you create either K or (K-1) binary variables which are 0 everywhere except for the one corresponding column containing a 1.
Look up Weka documentation to find the right function call.
cf. e.g. SO: Dummy Coding of Nominal Attributes (for Logistic Regression)
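In Weka this can be done with the unsupervised NominalToBinary filter; a minimal sketch (assuming data is the Instances object from the question; the index range is an assumption based on the question's 0-based {2,3,4,10}, since Weka ranges are 1-based):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;

NominalToBinary ntb = new NominalToBinary();
ntb.setAttributeIndices("3,4,5,11"); // 1-based counterparts of the question's {2,3,4,10}
ntb.setInputFormat(data);            // must be called before filtering
Instances binarized = Filter.useFilter(data, ntb);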
I believe the best way to convert a string attribute into a numeric one is the filter weka.filters.unsupervised.attribute.StringToWordVector.
After doing so, you can apply the "Normalize" filter and then use a classifier such as weka.classifiers.functions.LibSVM.

Strange table error in Lua

Okay, so I've got a strange problem with the following Lua code:
function quantizeNumber(i, step)
    local d = i / step
    d = round(d, 0)  -- round() is assumed to be defined elsewhere; it is not a Lua builtin
    return d * step
end
bar = {1, 2, 3, 4, 5}
local objects = {}
local foo = #bar * 3
for i = 1, foo do
    objects[i] = bar[quantizeNumber(i, 3)]
end
print(#objects)
After this code has been run, the length of objects should be 15, right? But no, it's 4. How is this working and what am I missing?
Thanks, Elliot Bonneville.
Yes it is 4.
From the Lua reference manual:
The length of a table t is defined to be any integer index n such that t[n] is not nil and t[n+1] is nil; moreover, if t[1] is nil, n can be zero. For a regular array, with non-nil values from 1 to a given n, its length is exactly that n, the index of its last value. If the array has "holes" (that is, nil values between other non-nil values), then #t can be any of the indices that directly precedes a nil value (that is, it may consider any such nil value as the end of the array).
Let's modify the code to see what is in the table:
local objects = {}
local foo = #bar * 3
for i = 1, foo do
    objects[i] = bar[quantizeNumber(i, 3)]
    print("At " .. i .. " the value is " .. (objects[i] and objects[i] or "nil"))
end
print(objects)
print(#objects)
When you run this you see that objects[4] is 3 but objects[5] is nil. Here is the output:
$ lua quantize.lua
At 1 the value is nil
At 2 the value is 3
At 3 the value is 3
At 4 the value is 3
At 5 the value is nil
At 6 the value is nil
At 7 the value is nil
At 8 the value is nil
At 9 the value is nil
At 10 the value is nil
At 11 the value is nil
At 12 the value is nil
At 13 the value is nil
At 14 the value is nil
At 15 the value is nil
table: 0x1001065f0
4
It is true that you filled in 15 slots of the table. However the # operator on tables, as defined by the reference manual, does not care about this. It simply looks for an index where the value is not nil, and whose following index is nil.
In this case, the index that satisfies this condition is 4.
That is why the answer is 4. It's just the way Lua is.
The nil can be seen as representing the end of an array. It's kind of like in C how a zero byte in the middle of a character array is actually the end of a string and the "string" is only those characters before it.
If your intent was to produce the table 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5 then you will need to rewrite your quantize function as follows:
function quantizeNumber(i, step)
    return math.ceil(i / step)
end
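
A quick check of the rewritten function, assuming step 3 and 15 iterations as in the question:

for i = 1, 15 do
    io.write(math.ceil(i / 3), " ")
end
-- prints: 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5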
The function quantizeNumber is wrong. The function you're looking for is math.fmod:
objects[i] = bar[math.fmod(i, 3)]
