Create dummies from column with multiple values in dask - dask

My question is similar to this thread Create dummies from column with multiple values in pandas
Objective: I would like to produce similar result below but using dask
In Pandas
import pandas as pd
df = pd.DataFrame({'fruit': ['Banana, , Apple, Dragon Fruit,,,', 'Kiwi,', 'Lemon, Apple, Banana', ',']})
df['fruit'].str.get_dummies(sep=',')
Which will output the following:
Apple Banana Dragon Fruit Banana Kiwi Lemon
0 1 1 0 1 1 0 0
1 0 0 0 0 0 1 0
2 0 1 1 0 0 0 1
3 0 0 0 0 0 0 0
get_dummies() above is of type <pandas.core.strings.StringMethods>
Now the problem is there is no get_dummies() for dask equivalent <dask.dataframe.accessor.StringAccessor>
How can I solve my problem using dask?

Apparently this is not possible in dask as we wouldn't know the output columns before hand. See https://github.com/dask/dask/issues/4403.

Related

Why is my yolo model has one extra output?

I have trained yolo model to detect 24 different classes and now when I'm trying to extract outputs of it, it returns 29 numbers for each prediction. Here they are:
0.605734 0.0720678 0.0147335 0.0434446 0.999661 0 0 0 0.999577 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I suppose that last 24 numbers are scores for each class, first 4 are parameters of bbox, but what is 5th? It is always bigger than 0.9. I'm confused. Please, help me.
It's the probability that the specific box has an object

How to use lightGBM in multi-classification problem whose target has multi-labels

I get a multi-class classification problem that the samples can have more than one labels. So I want to know how to use lightGBM in such multi-class classification problems.
For examples, the target is as follows:
id label1 label2 label3 label4 label5
0 0 1 1 0 0 0
1 1 0 1 0 0 0
2 2 1 0 0 0 0
3 3 0 0 0 0 0
And I have checked the documentation of lightGBM but it doesn't offer what I want.
So I am wondering how to use lightGBM in such multi-class classification problems.

Feature engineering, handling missing data

Consider this data table
NumberOfAccidents MeanDistance
1 5
3 0
0 NA
0 NA
6 1.2
2 0
the first feature is the number of accidents and the second is the average distance of these accidents to a certain point. It is obvious for a record with zero accident, there won't be a value for MeanDistance. However, imputing these missing values are not logical!
MY SOLUTION: I have decided to discretize the MeanDistance with NAs being a level (bin) and the rest of the data being in bins like: [0,1), [1,2.5), [2.5, Inf). the final table will look like this:
NumberOfAccidents NAs first_bin sec_bin third_bin
1 0 0 0 1
3 0 1 0 0
0 1 0 0 0
0 1 0 0 0
6 0 0 1 0
2 0 1 0 0
What is your idea with these types of missing values that cannot be imputed?
what is your solution to this problem?
It really depends on the domain and what you are trying to predict. Even though your solution is fine, I wouldn't bin the rest of the data as you did. Giving that the NumberOfAccidents feature already tells what MeanDistance have NA values, I would probably just impute 0 into the NA values (for computations) and leave the rest of the data as it is.
Nevertheless, there is no need to limit yourself, just try different approaches and keep the one that boost your KPI (Key Performance Indicator).

Where are class names stored in a machine learning dataset in Python?

I'm learning machine learning using the iris dataset on Python 3.6 with sklearn, and I don't understand where the class names that are being retrieved are stored. In Iris, there are 3 classes, and each class contains 50 observations. You can use several commands to print the classes, and their associated numerical values:
print(iris.target)
print(iris.target_names)
This will result in the output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']
So as can be seen, the classes are Setosa, Versicolor, and Virginica. What I don't understand is where these class names are being stored, or how they're called upon within the model. If you use the shape command on the data, or target, the result is (150,4) and (150,) meaning there is 150 observations and 4 rows in the data, and 150 rows in the target. I am just not able to bridge the gap with my mind as to where this is coming from, however.
What I don't understand is where the class names are supposed to be stored. If I made a brand new dataset for pokemon types and had ice, fire, water, flying, where could I store these types? Would they be required to be numerical as well, like iris, with 0,1,2,3?
Sklearn uses a custom type of object to store its datasets, exactly so that they can store metadata along with the raw data.
If you load the iris dataset
In [2]: from sklearn import datasets
In [3]: iris = datasets.load_iris()
You can inspect the type of object with type:
In [4]: type(iris)
Out[4]: sklearn.utils.Bunch
You can look at the attributes inside the object with dir:
In [5]: dir(iris)
Out[5]: ['DESCR', 'data', 'feature_names', 'target', 'target_names']
And then use . notation to take a look at the attributes themselves:
In [6]: type(iris.data)
Out[6]: numpy.ndarray
In [7]: type(iris.target)
Out[7]: numpy.ndarray
In [8]: type(iris.feature_names)
Out[8]: list
If you want to mimic this for your own datasets, you will have to define your own custom object type to mimic this structure. That would involve defining your own class.

McCulloch-Pitts neuron NAND

A MP neuron of NAND can be constructed using the truth table below:
P Q P(and not)Q
1 1 0
1 0 1
0 1 0
0 0 0
The neuron that shows this:
Inputs:
P +2
Q -1
If the threshold is 2
This will give an output of Y=1
My professor seemed confused and didn't clarify why this isn't correct when it is (to the best of my knowledge). Did he make a mistake or have i got this wrong?
A solution would be great.
Side note: I have sketched out this neuron but cannot draw on this page (new to SO).
First of all NAND is not "and not" but "not and", the logical table is
P Q NAND(P,Q)
1 1 0
1 0 1
0 1 1
0 0 1
second of all, there is nothing hard about NAND nor your gate. The "only" problematic one is XOR (and nXOR).
P Q XOR(P,Q)
1 1 0
1 0 1
0 1 1
0 0 0
So:
single perceptron can easily represent both NAND(p,q) = NOT(AND(p,q)) as well as AND(p, NOT(q)) (which you call NAND).
the impossible to represent gate is XOR and its negation.

Resources