Do features in scikit-learn have to be the same length? - machine-learning

I have several features taken from PCAP files; these are network flow features.
The problem I have is that the features are not the same length.
For example, here is a sample of my dataframe:
TotBytes  Dur       Afr                  DNS_interval  NTP_interval
250       0.030967  8073.110084929118    300.0         301.0
262       0.113429  2309.8149503213463   1.0           300.0
1960      0.062134  31544.725914957988   300.0         52.0
379       0.020444  18538.446487967132   10.0          300.0
1389      0.154713  8977.913943883192    40.0          1.0
End of the dataframe:
TotBytes  Dur       Afr                  DNS_interval  NTP_interval
262       0.099459  2634.25129953046     0.0           0.0
250       0.029093  8593.132368611006    0.0           0.0
250       0.024784  10087.153001936733   0.0           0.0
250       0.035297  7082.75490834915     0.0           0.0
262       0.112134  2336.46943           0.0           0.0
250       0.024445  10227.04029453876    0.0           0.0
As you can see, the features DNS_interval and NTP_interval are not the same length as the other three features (TotBytes, Dur and Afr); toward the end of the dataframe they contain only zeros.
I am using Random Forest as the classifier. Do the features need to be the same length, and if so, what should I do?
Do I fill in the missing figures with the mean? That is a lot of zeros, and when I tried, it filled the same mean figure down the whole column wherever the zeros were.

The features need to be the same length, i.e. you should have no missing values in the dataset. Some models handle missing values internally, but it is better to handle them yourself.
There are a number of options; let's go through each of them (a combined sketch follows the list).
1. If the number of missing values in a column is very small compared to the size of the DataFrame, you can drop the rows that contain missing values.
df.dropna(axis=0, inplace=True)
This drops every row with any missing value. Check the size before and after dropping to make sure you haven't lost a substantial part of the data.
2. If a column contains very few non-missing values, you can drop that entire feature/column.
df.drop(feature_list_to_drop, axis=1, inplace=True)
3. When the number of missing values is comparable to the number of values present, there are various techniques to fill them in, and each has its place. You need to find out which one works best for your dataset.
a. Fill with the mean:
df['feature'] = df['feature'].fillna(df['feature'].mean())
b. Fill with the mode (in case one value dominates, or the feature is categorical):
df['feature'] = df['feature'].fillna(df['feature'].mode()[0])
c. Build a model to predict the missing column's values (tough to implement and adds time overhead, but often the best method).
4. When nothing else works out, just fill the missing values with some negative sentinel like -99. The model may be able to extract some signal from the fact that these values were missing.
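For concreteness, here is a minimal pandas sketch of options 1, 3a and 4. The file name is hypothetical, and the key assumption is that the zeros in DNS_interval and NTP_interval are placeholders for missing values, so they are converted to NaN first:
import numpy as np
import pandas as pd

df = pd.read_csv('flows.csv')  # hypothetical input file

# Assumption: treat the placeholder zeros in the interval columns as missing
cols = ['DNS_interval', 'NTP_interval']
df[cols] = df[cols].replace(0.0, np.nan)

# Option 1: drop every row that has a missing value
df_dropped = df.dropna(axis=0)

# Option 3a: fill each column's missing values with that column's mean
df_mean = df.fillna(df[cols].mean())

# Option 4: fill missing values with a sentinel
df_sentinel = df.fillna(-99)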

Your data seems fine; all the values are floats (except TotBytes).
If by "length of features" you meant the precision of the floating-point values, that should not matter to the classifier.

Related

Get dependent probabilities in multiclassification

After training my CatBoostClassifier model, I call its predict_proba function, which returns a list of probabilities. The problem starts at another point... I transfer that data into a dataframe, then to Excel, after which I sum all the floats in my list and get numbers approximately equal to 2.
(Example: 0.980831511, 0.99695788, 2.99173E-13, 1.63919E-15, 7.35072E-14, 4.82846E-16; their sum is equal to 1.977789391.)
Parameters which were used:
'loss_function': 'MultiClassOneVsAll',
'eval_metric': 'ZeroOneLoss',
The problem is that I need the dependent type of probabilities, so I should get something more like 0.2, 0.5, 0.1, 0.2, where the sum equals 1 and the highest probability (which might be obvious) is in the second category (0.5).
I've completed several tests with different objectives (aka loss functions) and eval metrics. If you need to get "dependent" probabilities, you may use any loss function (correct me if I'm wrong) except multiclassova (in other words, MultiClassOneVsAll); using multiclassova only as the eval metric seemed to work fine.
With any other loss function, the per-class probabilities sum to 1, whereas with OneVsAll (multiclassova) each class probability is estimated independently, and in my tests the sum varied from roughly 0.5 to 2.0.
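A minimal sketch with random, hypothetical data illustrating the point: with the MultiClass objective, each row returned by predict_proba sums to 1.
import numpy as np
from catboost import CatBoostClassifier

X = np.random.rand(200, 5)            # hypothetical features
y = np.random.randint(0, 4, 200)      # four hypothetical classes

model = CatBoostClassifier(loss_function='MultiClass', iterations=50, verbose=False)
model.fit(X, y)

proba = model.predict_proba(X)
print(proba.sum(axis=1))              # every row sums to 1.0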

Normalize data with outlier inside interval

I have a dataset with some outliers that are 10 or 100 times greater than the normal values. I cannot throw these rows out, and I want to normalize this data to the interval [0, 1].
First of all, here is what I thought of doing (a sketch of two of these strategies follows the list):
1. Simply rank my dataset's rows and use the ranked positions as the variable to normalize. Since the ranks are uniformly distributed, this is easy. The problem is that the differences between values are not measured, so values with a large difference could get similar normalized values if there are no intermediate examples in the dataset.
2. Use the sklearn.preprocessing.RobustScaler method. But I got normalized values between -0.4 and 300; it is still not good to normalize something on this scale.
3. Distribute normalized values between 0 and 0.8 linearly for all values at or below the 0.8 quantile, and distribute the remaining values between 0.8 and 1.0 in a way similar to the ranking strategy mentioned above.
4. Run a 1-D k-means to group nearby values and obtain a cluster of non-outlier values. For those values, distribute normalized values between 0 and the quantile they represent, simply by doing (value - min) / (max - min), and for the remaining outlier values, distribute the range between that quantile and 1 with the ranking strategy.
5. Create a filter function, like a sigmoid, and apply it to the values. Smaller values remain almost unchanged, but the outliers are pulled toward the non-outlier values; then I normalize. But how can I design this sigmoid's parameters?
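A minimal numpy/scipy sketch of strategies 1 and 5 (for strategy 5 the values are passed through a sigmoid directly rather than multiplied by it, and the scale parameter is a hypothetical choice taken from the 0.8 quantile):
import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 2.0, 3.0, 5.0, 400.0])   # hypothetical data with one outlier

# Strategy 1: rank-based normalization -> roughly uniform values in [0, 1]
ranks = rankdata(x)                           # 1..n, ties get the average rank
rank_norm = (ranks - 1) / (len(x) - 1)

# Strategy 5: squash with a sigmoid before min-max scaling; the scale
# parameter decides where the compression kicks in
scale = np.quantile(x, 0.8)
squashed = 1.0 / (1.0 + np.exp(-x / scale))   # values in (0.5, 1) for positive x
sig_norm = (squashed - squashed.min()) / (squashed.max() - squashed.min())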
First of all, I would like some feedback about these strategies: what do you think of them?
Also, how is this problem normally solved? Are there any references you can recommend?
Thank you =)

Replace missing values in categorical data

Let's suppose I have a column with categorical data "red", "green", "blue" and empty cells:
red
green
red
blue
NaN
I'm sure that the NaN belongs to {red, green, blue}. Should I replace the NaN with the average of the colors, or is that too strong an assumption? After one-hot encoding it would be:
col1 | col2 | col3
1    | 0    | 0
0    | 1    | 0
1    | 0    | 0
0    | 0    | 1
0.5  | 0.25 | 0.25
Or should I even scale the last row down, keeping the ratio, so that these values have less influence? What is the usual best practice?
0.25 | 0.125 | 0.125
The simplest strategy for handling missing data is to remove the records that contain a missing value.
The scikit-learn library provides an imputation pre-processing class that can be used to replace missing values. Since this is categorical data, using the mean as the replacement value is not recommended; use the most frequent value instead. In older scikit-learn versions this class was Imputer:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
The imputer operates directly on a NumPy array rather than on the DataFrame.
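Note that Imputer was removed in scikit-learn 0.22; the modern equivalent is SimpleImputer. A minimal sketch with hypothetical data:
import numpy as np
from sklearn.impute import SimpleImputer

colors = np.array([['red'], ['green'], ['red'], ['blue'], [np.nan]], dtype=object)

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
print(imp.fit_transform(colors))  # the NaN becomes 'red', the most frequent value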
Last but not least, not all ML algorithms can handle missing values, and different implementations also differ in this respect.
It depends on what you want to do with the data.
Is the average of these colors useful for your purpose?
By averaging you are creating a new possible value, which is probably not wanted, especially since you are talking about categorical data and would be handling it as if it were numeric.
In machine learning you would replace the missing values with the most common categorical value with respect to a target attribute (what you want to predict).
Example: you want to predict whether a person is male or female by looking at their car, and the color feature has some missing values. If most cars of male drivers are blue and most cars of female drivers are red, you would fill a missing color with blue for a male driver and red for a female driver.
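A minimal pandas sketch of that idea (the column names and data are hypothetical):
import pandas as pd

df = pd.DataFrame({'gender': ['m', 'm', 'f', 'm', 'f'],
                   'color': ['blue', 'blue', 'red', None, None]})

# Fill each missing color with the most frequent color among the same gender
df['color'] = df.groupby('gender')['color'].transform(lambda s: s.fillna(s.mode()[0]))
print(df)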
In addition to the approach in Lan's answer, which seems to be the most commonly used, you can use something based on matrix factorization. For example, there is a variant of Generalized Low Rank Models (GLRM) that can impute such data, just as probabilistic matrix factorization is used to impute continuous data.
GLRMs can be used from H2O, which provides bindings for both Python and R.
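A rough sketch of how this might look in H2O's Python API; the parameter choices (rank k, the categorical loss) are illustrative assumptions, not a recipe:
import h2o
from h2o.estimators import H2OGeneralizedLowRankEstimator

h2o.init()
frame = h2o.H2OFrame({'color': ['red', 'green', 'red', 'blue', None]})

# Fit a low-rank model; predict() returns the reconstructed (imputed) frame
glrm = H2OGeneralizedLowRankEstimator(k=2, multi_loss='Categorical')
glrm.train(training_frame=frame)
reconstructed = glrm.predict(frame)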

Precision and Recall computation for different group sizes

I didn't find an answer to this question anywhere, so I hope someone here can help me, and also other people with the same problem.
Suppose that I have 1000 Positive samples and 1500 Negative samples.
Now, suppose that there are 950 True Positives (positive samples correctly classified as positive) and 100 False Positives (negative samples incorrectly classified as positive).
Should I use these raw numbers to compute the Precision, or should I consider the different group sizes?
In other words, should my precision be:
TruePositive / (TruePositive + FalsePositive) = 950 / (950 + 100) = 90.476%
OR should it be:
(TruePositive / 1000) / [(TruePositive / 1000) + (FalsePositive / 1500)] = 0.95 / (0.95 + 0.067) = 93.44%
In the first calculation, I took the raw numbers without any consideration of the number of samples in each group, while in the second calculation, I used the proportion of each measure relative to its corresponding group, to remove the bias caused by the groups' different sizes.
Answering the stated question: by definition, precision is computed with the first formula, TP / (TP + FP).
However, that doesn't mean you have to use this formula, i.e. the precision measure. There are many other measures; look at the table on this wiki page and choose the one best suited to your task.
For example, the positive likelihood ratio seems to be the most similar to your second formula.
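For concreteness, a small sketch reproducing both numbers from the question and the positive likelihood ratio:
TP, FP = 950, 100
P, N = 1000, 1500

precision = TP / (TP + FP)          # 950 / 1050 = 0.90476 -> 90.476%
tpr = TP / P                        # true positive rate (recall), 0.95
fpr = FP / N                        # false positive rate, ~0.0667
positive_lr = tpr / fpr             # positive likelihood ratio, 14.25
second_formula = tpr / (tpr + fpr)  # the question's second formula, ~0.9344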

Understanding FFT in aurioTouch2

I've been looking at aurioTouch 2 from Apple's sample code (found here). At the end of the day I want to analyze the frequencies myself; for now I'm trying to understand some of what's going on here. My apologies if this is trivial, I'm just trying to understand some of the uncommented magic numbers floating around in the source. My main points of confusion right now are:
Why do they zero out the Nyquist value in FFTBufferManager::ComputeFFT? Can this value really just be thrown away? (~line 112 of FFTBufferManager.cpp)
They scale everything down by -128 dB, so I'm assuming the results are in the range (-128, 0). However, later in aurioTouchAppDelegate.mm (~line 807), they convert this to a value between 0 and 1 by adding 80 and dividing by 64, then clamping to [0, 1]. Why the fuzziness? Also, am I right in assuming values will be in the vicinity of (-128, 0)?
Well, it's not trivial for me either, but this is how I understand it. If I've oversimplified, it is purely for my own benefit; I don't mean to be patronising.
Zeroing the result corresponding to the Nyquist frequency:
I'm going to suppose we are computing the forward FFT of 1024 input samples. At a 44100 Hz input rate this is usually true in my case (it isn't what aurioTouch does, which I find a bit weird, but I'm no expert). It's easier for me to understand with specific values.
Given 1024 (n) input samples, arranged as needed, even indices first and then odd indices, { in[0], in[2], in[4], …, in[1], in[3], in[5], … } (use vDSP_ctoz() to order your input):
The output of the FFT of 1024 (n) real input samples is 513 ((n/2)+1) complex values, i.e. 513 real components and 513 imaginary components, a total of 1026 values.
However, imaginary[0] and imaginary[512] (n/2) are always, necessarily, zero. So by placing real[512] (the real component of the Nyquist frequency bin) in imaginary[0] and forgetting imaginary[512], which is always zero and can be inferred, the results are packed into a 1024 (n) length buffer.
So, for the returned results to be valid, you must at least set imaginary[0] back to zero. If you require all 513 ((n/2)+1) frequency bins, you need to append another complex value to the result and set it like this:
unpackedVal = imaginary[0]
real[512] = unpackedVal
imaginary[512] = 0
imaginary[0] = 0
In aurioTouch I always supposed they just don't bother: n/2 results is obviously more convenient to work with, and you can hardly tell from the visualizer that one magnitude at the Nyquist frequency is missing.
The UsingFourierTransforms docs explain the packing.
NB: the specific values 1024, 513, 512, etc. are examples, not the actual values of n, (n/2)+1 and n/2 in aurioTouch.
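To make the packing concrete, here is a minimal numpy sketch (numpy's rfft rather than vDSP, but the layout trick is the same):
import numpy as np

n = 1024
x = np.random.randn(n)                 # hypothetical real input

spec = np.fft.rfft(x)                  # (n/2)+1 = 513 complex bins
assert abs(spec[0].imag) < 1e-9        # DC bin is purely real
assert abs(spec[-1].imag) < 1e-9       # Nyquist bin is purely real

# vDSP-style packing: stash the Nyquist real part in imaginary[0]
packed_re = spec.real[:n // 2].copy()
packed_im = spec.imag[:n // 2].copy()
packed_im[0] = spec.real[-1]

# Unpacking: recover the Nyquist bin and restore imaginary[0] to zero
nyquist_re = packed_im[0]
packed_im[0] = 0.0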
"They scale everything down by -128 dB":
Not quite. The range of the output values is relative to the number of input samples, so it has to be normalised; the scale factor is 1.0 / (2 * inNumberFrames).
After scaling, the range is -1.0 to +1.0. The magnitude of each complex value is then taken (the phase is ignored), giving a scalar value between 0 and 1.0 for each frequency bin.
That value is then interpreted as a decibel value between -128 and 0.
The drawing stuff, the +80 / 64 and the *120: I'm not sure. I may be completely wrong, or it may be artistic license?
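For what it's worth, here is a hypothetical numpy sketch of that chain as I understand it. The 20*log10 amplitude-to-dB conversion and the clamp floor are my assumptions; the +80 / 64 mapping is taken straight from the question:
import numpy as np

inNumberFrames = 1024
frame = np.random.randn(inNumberFrames)            # hypothetical audio frame

spec = np.fft.rfft(frame) * (1.0 / (2 * inNumberFrames))
mag = np.abs(spec)                                  # magnitudes, roughly 0..1

# Interpret the magnitude as dB, clamped to the -128..0 range described above
db = 20.0 * np.log10(np.maximum(mag, 10.0 ** (-128 / 20)))

drawn = np.clip((db + 80.0) / 64.0, 0.0, 1.0)       # the drawing-side mapping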
