Feature engineering, handling missing data - machine-learning

Consider this data table
NumberOfAccidents  MeanDistance
1                  5
3                  0
0                  NA
0                  NA
6                  1.2
2                  0
The first feature is the number of accidents and the second is the average distance of these accidents from a certain point. Obviously, for a record with zero accidents there cannot be a value for MeanDistance. However, imputing these missing values is not logical!
MY SOLUTION: I have decided to discretize MeanDistance, with NA being its own level (bin) and the rest of the data falling into bins like [0, 1), [1, 2.5), [2.5, Inf). The final table will look like this:
NumberOfAccidents  NAs  first_bin  sec_bin  third_bin
1                  0    0          0        1
3                  0    1          0        0
0                  1    0          0        0
0                  1    0          0        0
6                  0    0          1        0
2                  0    1          0        0
What is your take on these types of missing values, which cannot be imputed?
What is your solution to this problem?

It really depends on the domain and what you are trying to predict. Even though your solution is fine, I wouldn't bin the rest of the data as you did. Given that the NumberOfAccidents feature already tells you which MeanDistance values are NA, I would probably just impute 0 for the NA values (so computations work) and leave the rest of the data as it is.
Nevertheless, there is no need to limit yourself: just try different approaches and keep the one that boosts your KPI (Key Performance Indicator).
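Both approaches can be sketched in pandas (the bin edges follow the question; the `MeanDistanceFilled` column and the `bin` prefix are made-up names for illustration):

```python
import numpy as np
import pandas as pd

# The table from the question.
df = pd.DataFrame({
    'NumberOfAccidents': [1, 3, 0, 0, 6, 2],
    'MeanDistance': [5, 0, np.nan, np.nan, 1.2, 0],
})

# Answer's suggestion: NumberOfAccidents == 0 already flags the NAs,
# so just fill 0 in and keep the column numeric.
df['MeanDistanceFilled'] = df['MeanDistance'].fillna(0)

# Question's approach: bin the distances and make NA its own indicator.
bins = pd.cut(df['MeanDistance'], [0, 1, 2.5, np.inf], right=False)
dummies = pd.get_dummies(bins, prefix='bin')
dummies['NAs'] = df['MeanDistance'].isna().astype(int)
```

The `right=False` makes the intervals half-open on the right, matching [0, 1), [1, 2.5), [2.5, Inf) from the question; NaN rows get all-zero bin indicators plus NAs = 1.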


How Standard CAN win the bus access in Arbitration with EXT CAN?

I read some documents, and all of them say that Std CAN has higher priority than Ext CAN because the SRR bit is always recessive in Ext CAN when they have the same ID, but from my understanding it depends.
https://copperhilltech.com/blog/controller-area-network-can-bus-tutorial-extended-can-protocol/
To simplify, let's say we have message ID 0x1(Std CAN) and 0x1(Ext CAN) sending simultaneously on the same bus.
The arbitration field of the Std CAN compared to the Ext CAN should look like this:
Std CAN: 0 0 0 0 0 0 0 0 0 0 1 **0** (the bold bit is RTR)
Ext CAN: 0 0 0 0 0 0 0 0 0 0 0 **1 1** 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 **0** (the bold bits are SRR, IDE and RTR)
At the 11th bit, the node sending Std CAN transmits 1 (recessive) while the node sending Ext CAN transmits 0 (dominant), so the Ext CAN frame wins bus access; the Std CAN node switches to listen mode and sends nothing after that, so the SRR and IDE bits are never reached to decide whether the message is Ext CAN or Std CAN.
Is my above understanding correct?
Thank you in advance,
Yes, your understanding is correct: in your example the extended frame wins arbitration at the 11th bit, because the first 11 bits of its 29-bit identifier are all dominant zeros while the standard frame's 11-bit identifier ends in a recessive 1. A standard frame only beats an extended frame when the first 11 identifier bits are identical, since the standard frame then sends a dominant RTR or IDE bit where the extended frame sends a recessive SRR or IDE. So saying that standard frames have higher priority than extended ones is a simplification.
RTR frames are a bit of an oddball case overall, as they may also have a varied length in the DLC field even though there is no data in the frame at all.
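The bit-by-bit comparison above can be sketched as a toy model of wired-AND arbitration (the bit strings are the two arbitration fields from the question; the `arbitrate` helper is made up for illustration):

```python
# Toy model of wired-AND arbitration: both nodes transmit in lockstep; the
# first bit position where the frames differ is won by the dominant (0) bit.
def arbitrate(frame_a, frame_b):
    for i, (a, b) in enumerate(zip(frame_a, frame_b)):
        if a != b:
            return ('a' if a == '0' else 'b'), i
    return 'tie', min(len(frame_a), len(frame_b))

# Arbitration fields from the question: 11-bit ID 0x1 + RTR for Std CAN;
# 11 ID bits + SRR + IDE + 18 ID bits + RTR for Ext CAN (29-bit ID 0x1).
std_frame = "00000000001" + "0"
ext_frame = "00000000000" + "11" + "0" * 17 + "1" + "0"

winner, bit = arbitrate(std_frame, ext_frame)  # Ext CAN wins at bit index 10
```

The extended frame wins because its 29-bit identifier 0x1 starts with 11 dominant zeros, before the SRR/IDE bits are ever compared.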

Why does my YOLO model have one extra output?

I have trained a YOLO model to detect 24 different classes, and now when I try to extract its outputs, it returns 29 numbers for each prediction. Here they are:
0.605734 0.0720678 0.0147335 0.0434446 0.999661 0 0 0 0.999577 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I suppose the last 24 numbers are scores for each class and the first 4 are the parameters of the bbox, but what is the 5th? It is always bigger than 0.9. I'm confused. Please help me.
It's the objectness score: the probability that this specific box contains an object.
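A prediction vector with that layout can be unpacked like so (a sketch assuming the common x, y, w, h, objectness, class-scores ordering; the exact order can vary between YOLO implementations):

```python
import numpy as np

# One prediction vector from the question: 4 bbox values, 1 objectness
# score, then 24 class scores.
pred = np.array([0.605734, 0.0720678, 0.0147335, 0.0434446, 0.999661,
                 0, 0, 0, 0.999577] + [0.0] * 20)

x, y, w, h = pred[:4]                   # bounding-box parameters
objectness = pred[4]                    # P(this box contains an object)
class_scores = pred[5:]                 # one score per class
confidence = objectness * class_scores  # common way to combine the two
best_class = int(np.argmax(class_scores))
```

Here the detector is confident a box exists (objectness ≈ 0.9997) and assigns nearly all class probability to one class.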

Create dummies from column with multiple values in dask

My question is similar to this thread: Create dummies from column with multiple values in pandas.
Objective: I would like to produce similar result below but using dask
In Pandas
import pandas as pd
df = pd.DataFrame({'fruit': ['Banana, , Apple, Dragon Fruit,,,', 'Kiwi,', 'Lemon, Apple, Banana', ',']})
df['fruit'].str.get_dummies(sep=',')
Which will output the following:
   ' '  ' Apple'  ' Banana'  ' Dragon Fruit'  Banana  Kiwi  Lemon
0    1         1          0                1       1     0     0
1    0         0          0                0       0     1     0
2    0         1          1                0       0     0     1
3    0         0          0                0       0     0     0
(The seemingly duplicated columns come from tokens with leading spaces, e.g. ' Banana' vs 'Banana'; quotes added here to make that visible.)
The get_dummies() above comes from <pandas.core.strings.StringMethods>.
Now the problem is that there is no get_dummies() equivalent on dask's <dask.dataframe.accessor.StringAccessor>.
How can I solve my problem using dask?
Apparently this is not possible in dask, as the output columns wouldn't be known beforehand. See https://github.com/dask/dask/issues/4403.

The Difference between One Hot Encoding and LabelEncoder?

I am working on an ML problem to predict house prices, and Zip Code is one feature which will be useful. I am also trying to use a Random Forest Regressor to predict the log of the price.
However, should I use One Hot Encoding or LabelEncoder for Zip Code? I have about 2000 Zip Codes in my dataset, so One Hot Encoding will expand the columns significantly.
https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
To rephrase: does it make sense to use LabelEncoder instead of One Hot Encoding on Zip Codes?
Like the link says:
LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,2,1,3,2], but then the imposed ordinality means that the average of dog and mouse is cat. Still, there are algorithms like decision trees and random forests that can work with categorical variables just fine, and LabelEncoder can be used to store values using less disk space.
And yes, you are right: with 2000 zip-code categories, one-hot encoding may blow up your feature set massively. In many cases where I had such problems I opted for binary encoding, and it worked out fine most of the time, so it is worth a shot for you perhaps.
Imagine you have 9 categories, marked 1 to 9. Binary-encode those labels and you get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
There you go: you overcome the LabelEncoder problem, and you also get 4 feature columns instead of the 9 (or 8 with one column dropped) that one-hot encoding would produce. This is the basic intuition behind the Binary Encoder.
**PS:** Since 2^11 = 2048 and you have 2000 zip-code categories, you can reduce your feature columns to 11, instead of the ~2000 you would get from one-hot encoding!
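That binary encoding can be sketched by hand in a few lines (in practice the category_encoders package provides a BinaryEncoder that does this; the zip values and `zip_bit` column names below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: the real case would have ~2000 distinct zip codes.
zips = pd.Series(['10001', '94103', '60601', '10001'])

codes = pd.factorize(zips)[0] + 1                # label-encode, starting at 1
n_bits = int(np.ceil(np.log2(codes.max() + 1)))  # bits needed; 11 covers 2000
bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1
encoded = pd.DataFrame(bits, columns=[f'zip_bit{i}' for i in range(n_bits)])
```

Identical zip codes get identical bit patterns, and the column count grows only logarithmically with the number of categories.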

McCulloch-Pitts neuron NAND

An MP neuron for NAND can be constructed using the truth table below:
P  Q  P(AND NOT)Q
1  1  0
1  0  1
0  1  0
0  0  0
The neuron that shows this:
Inputs:
P: weight +2
Q: weight -1
If the threshold is 2, this will give an output of Y = 1.
My professor seemed confused and didn't clarify why this isn't correct when, to the best of my knowledge, it is. Did he make a mistake, or have I got this wrong?
A solution would be great.
Side note: I have sketched out this neuron but cannot draw on this page (new to SO).
First of all, NAND is not "and not" but "not and"; the truth table is
P  Q  NAND(P,Q)
1  1  0
1  0  1
0  1  1
0  0  1
Second of all, there is nothing hard about NAND nor about your gate. The "only" problematic ones are XOR (and XNOR).
P  Q  XOR(P,Q)
1  1  0
1  0  1
0  1  1
0  0  0
So:
- a single perceptron can easily represent both NAND(p,q) = NOT(AND(p,q)) as well as AND(p, NOT(q)) (which you call NAND);
- the impossible-to-represent gate is XOR, and its negation.
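Both gates can be checked with a tiny threshold unit (a sketch; the NAND weights below are one valid choice among many, and the weights for your gate are the ones from the question):

```python
# McCulloch-Pitts unit: fire iff the weighted input sum reaches the threshold.
def mp_neuron(weights, threshold, inputs):
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

# AND(P, NOT Q) with the question's parameters: P -> +2, Q -> -1, threshold 2.
for p in (0, 1):
    for q in (0, 1):
        assert mp_neuron([2, -1], 2, [p, q]) == int(p == 1 and q == 0)

# NAND(P, Q) with weights -1, -1 and threshold -1.
for p in (0, 1):
    for q in (0, 1):
        assert mp_neuron([-1, -1], -1, [p, q]) == int(not (p and q))
```

Both loops pass, confirming that a single threshold unit handles either gate; only XOR/XNOR need more than one unit.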
