How to impute missing values in a string column using SimpleImputer()? - machine-learning

I have a dataset that I read into a pandas dataframe.
Most of its columns are string columns.
Column structure of my dataframe:
['id', 'currently working', column3, column4, ....]
The column with missing data is 'currently working'. It contains only two values, YES and NO, and there are null values as well.
I applied SimpleImputer() in a previous exercise, but that was on an integer column containing salaries, where I passed strategy='mean' to preprocess the dataset and replace nulls, like below.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
But in my current scenario the column is of string type, so I certainly can't apply any numeric strategy.
Could anyone tell me how to preprocess the existing data and replace nulls in a string column of a pandas dataframe?
What preprocessing method should I follow when working on string columns?

You can use the most_frequent strategy: SimpleImputer will replace missing values with the most frequent value in the column. It may also be useful to set add_indicator=True. In that case, the output of the imputer's transform will stack on an additional column with the value from the MissingIndicator, so your model will have a clue that the value was missing before.
Code example:
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan,
                    strategy='most_frequent',
                    add_indicator=True)
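A quick usage sketch (the tiny dataframe below is hypothetical, standing in for the 'currently working' column; fit_transform expects 2D input, hence the double brackets):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical frame mirroring the real data
df = pd.DataFrame({'currently working': ['YES', np.nan, 'NO', 'YES', np.nan]})

imp = SimpleImputer(missing_values=np.nan,
                    strategy='most_frequent',
                    add_indicator=True)

# first column: imputed YES/NO values; second column: missing-value indicator
out = imp.fit_transform(df[['currently working']])
print(out)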

Related

Writing variable-length sequence to a compound array

I am using compound datatypes with h5py, with some elements being variable-length arrays. I can't find a way to set an item. The following MWE shows 6 different ways to do that (sequential indexing, which would not work in h5py anyway; fused indexing; read-modify-commit for columns/rows), none of which works.
What is the correct way? Why does h5py say Cannot change data-type for object array when writing an integer list to an int32 list?
import h5py
import numpy as np

with h5py.File('/tmp/test-vla.h5','w') as h5:
    dt = np.dtype([('a', h5py.vlen_dtype(np.dtype('int32')))])
    dset = h5.create_dataset('test', (5,), dtype=dt)
    dset['a'][2] = [1,2,3]  # does not write the value back
    dset[2]['a'] = [1,2,3]  # does not write the value back
    dset['a',2] = [1,2,3]   # Cannot change data-type for object array
    dset[2,'a'] = [1,2,3]   # Cannot change data-type for object array
    tmp = dset['a']; tmp[2] = [1,2,3]; dset['a'] = tmp  # Cannot change data-type for object array
    tmp = dset[2]; tmp['a'] = [1,2,3]; dset[2] = tmp    # 'list' object has no attribute 'dtype'
When working with compound datasets, I've found it's best to add all row data in a single statement. I tweaked your code to show how to add 3 rows of data (each of a different length). Note how I: 1) define the row of data with a tuple; 2) define the list of integers with np.array(); and 3) don't reference the field name ['a'].
import h5py
import numpy as np

with h5py.File('test-vla.h5','w') as h5:
    dt = np.dtype([('a', h5py.vlen_dtype(np.dtype('int32')))])
    dset = h5.create_dataset('test', (5,), dtype=dt)
    print(dset.dtype, dset.shape)
    dset[0] = ( np.array([0,1,2]), )
    dset[1] = ( np.array([1,2,3,4]), )
    dset[2] = ( np.array([0,1,2,3,4]), )
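For completeness, a quick read-back check (a sketch; both field-then-index and index-then-field access work fine for reading):
with h5py.File('test-vla.h5','r') as h5:
    dset = h5['test']
    print(dset['a'][0])   # expected: [0 1 2]
    print(dset[2]['a'])   # expected: [0 1 2 3 4]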
For more info, take a look at this post on the HDF Group Forum under HDF5 Ancillary Tools / h5py:
Compound datatype with int, float and array of floats

One measurement - three datatypes

I have a Line Protocol like this:
Measurement1,Valuetype=Act_value,metric=Max,dt=Int value=200i 1553537228984000000
Measurement1,Valuetype=Act_value,metric=Lower_bound,dt=Int value=25i 1553537228987000000
Measurement1,Valuetype=Act_value,metric=Min,dt=Int value=10i 1553537228994000000
Measurement1,Valuetype=Act_value,metric=Upper_limit,dt=Int value=222i 1553537228997000000
Measurement1,Valuetype=Act_value,metric=Lower_limit,dt=Int value=0i 1553537229004000000
Measurement1,Valuetype=Act_value,metric=Simulation,dt=bool value=False 1553537229007000000
Measurement1,Valuetype=Act_value,metric=Value,dt=Int value=69i 1553537229014000000
Measurement1,Valuetype=Act_value,metric=Percentage,dt=Int value=31i 1553537229017000000
Measurement1,Valuetype=Set_value,metric=Upper_limit,dt=Int value=222i 1553537229024000000
Measurement1,Valuetype=Set_value,metric=Lower_limit,dt=Int value=0i 1553537229028000000
Measurement1,Valuetype=Set_value,metric=Unit,dt=string value="Kelvin" 1553537229035000000
Measurement1,Valuetype=Set_value,metric=Value,dt=Int value=222i 1553537229038000000
Measurement1,Valuetype=Set_value,metric=Percentage,dt=Int value=0i 1553537229045000000
I need to insert multiple lines at once. The issue is likely that I insert integers, booleans and strings into the same table. It worked when I created separate measurements, e.g. Measurement1_Int, Measurement1_bool, Measurement1_string. With the above configuration I get an error.
I have the following questions:
Is there any way to save values of different (data-)types to one
table/measurement?
If yes how do I need to adjust my Line Protocol?
Would it work if I wrote the three datatypes separately but still into the same table?
If you can afford to assign the same timestamp to all metrics within a measurement datapoint, the best variant would be to use the metric name as the field name in the InfluxDB record:
Measurement1,Valuetype=Act_value Max=200i,Lower_bound=25i,Min=10i,Upper_limit=222i,Lower_limit=0i,Simulation=False,Value=69i,Percentage=31i 1553537228984000000
Otherwise you can still use the metric name as the field name, but the missing fields at each timestamp will have null values:
Measurement1,Valuetype=Set_value Upper_limit=222i 1553537229024000000
Measurement1,Valuetype=Set_value Lower_limit=0i 1553537229028000000
Measurement1,Valuetype=Set_value Unit="Kelvin" 1553537229035000000
Measurement1,Valuetype=Set_value Value=222i 1553537229038000000
Measurement1,Valuetype=Set_value Percentage=0i 1553537229045000000
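If you are writing these points from Python, a minimal sketch of batching raw line protocol with the influxdb-python 1.x client (the host and database name here are placeholders):
from influxdb import InfluxDBClient

lines = [
    'Measurement1,Valuetype=Act_value Max=200i,Lower_bound=25i,Min=10i,Upper_limit=222i,Lower_limit=0i,Simulation=False,Value=69i,Percentage=31i 1553537228984000000',
    'Measurement1,Valuetype=Set_value Upper_limit=222i 1553537229024000000',
    'Measurement1,Valuetype=Set_value Lower_limit=0i 1553537229028000000',
    'Measurement1,Valuetype=Set_value Unit="Kelvin" 1553537229035000000',
]

client = InfluxDBClient(host='localhost', port=8086, database='mydb')
# protocol='line' lets write_points accept raw line protocol strings
client.write_points(lines, protocol='line')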

Dask groupby and apply : Value error Expected axis has 6 elements, new values have 5 elements

I am trying to collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error, so I am currently trying to use dask. I am attaching a snippet of the code here.
def f(x):
    p = x.groupby(id).agg(''.join).reset_index()
    return p

metadf = pd.DataFrame(columns=['c1','p1','pd1','d1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf; both dataframes have the same shape.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task at hand is to do the following (sample shown) in a dask dataframe.
Input csv file :
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output csv file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp
In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.
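A minimal sketch of both routes (the tiny frame and column names are illustrative, not from the question; the point is that meta must describe the result of groupby(...).apply(f), which is easiest to obtain by running the same pipeline on a small pandas sample):
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'key': [1, 2, 1, 2],
                    'c1':  ['car', 'abc', 'yes', 'hello'],
                    'c2':  ['phone', 'def', 'no', 'yes']})
ddf = dd.from_pandas(pdf, npartitions=2)

def f(x):
    return x.groupby('key').agg(''.join).reset_index()

# Route 1: omit meta and let dask infer it (it warns, but runs sample data through f)
out = ddf.groupby('key').apply(f).reset_index(drop=True).compute()

# Route 2: build meta from the full pandas groupby-apply, not from f alone
meta = pdf.groupby('key').apply(f)
out = ddf.groupby('key').apply(f, meta=meta).reset_index(drop=True).compute()
print(out)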

H2O randomForest column/feature selection

In h2o.randomForest, let's say I have 5 input features x=c("A","B","C","D","E"). Is there any way to force the algorithm to always choose A, B AND one of the remaining features?
In this case h2o.randomForest is just asking you to pass a correct x (the list of columns to use for prediction) and y (the name of the column to predict), so whatever you pass will be used as input.
What you are asking is really a Python question: you need to write your own logic for building the list of columns. You can define the following as a function and use it as needed.
import random

myframe = ["a", "b", "c", "d", "e"]
# You can also set myframe from the column name list
# myframe.remove(_use_response_column_name)  # this would make it generic
selectedkeys = ["a", "b"]
for item in selectedkeys:
    if item in myframe:
        myframe.remove(item)
selectedkeys.append(random.choice(myframe))
print(selectedkeys)
print(myframe)
You just need to pass selectedkeys as x; a version wrapped up as a reusable function is sketched below.
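A possible way to wrap the same logic into a function (the column names, the response name "y", and the commented-out estimator call are illustrative assumptions, not taken from the question):
import random

def select_features(all_cols, forced=("a", "b"), response="y"):
    # always keep the forced predictors, exclude the response column,
    # and add one randomly chosen remaining predictor
    pool = [c for c in all_cols if c not in forced and c != response]
    return list(forced) + [random.choice(pool)]

x = select_features(["a", "b", "c", "d", "e"])
# from h2o.estimators import H2ORandomForestEstimator
# model = H2ORandomForestEstimator(ntrees=50)
# model.train(x=x, y="y", training_frame=train)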

How to transform to Entity Attribute Value (EAV) using Spoon Normalise

I am trying to use Spoon (Pentaho Data Integration) to change data that is in typical row format to Entity Attribute Value format.
My source data is as follows:
My Normaliser is setup as follows:
And here are the results:
Why are the values for CONDITION_START_DATE and CONDITION_STOP_DATE in the string_value column instead of the date_value column?
According to this documentation
Fieldname: Name of the fields to normalize
Type: Give a string to classify the field.
New field: You can give one or more fields where the new value should be transferred to.
Please check the Normalizing multiple rows in a single step section at http://wiki.pentaho.com/display/EAI/Row+Normaliser. According to it, you should have a group of fields with the same Type (pr_sl -> Product1, pr1_nr -> Product1); only in this case can you get multiple fields in the output (pr_sl -> Product Sales, pr1_nr -> Product Number).
In your case you can convert the dates to strings, then use the Row Normaliser with a single new field and a formula step, for example:
And then convert date_value back to a date.
