My train data looks something like this:
(screenshot of the train DataFrame)
To extract the categorical features from it I ran the following code:
categorial = [c for c in train.columns if train.columns[c].dtype in ['object']]
But I am getting this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-31-31eb7ac47e21> in <module>
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
<ipython-input-31-31eb7ac47e21> in <listcomp>(.0)
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
4295 if is_scalar(key):
4296 key = com.cast_scalar_indexer(key, warn_float=True)
-> 4297 return getitem(key)
4298
4299 if isinstance(key, slice):
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
What is the possible solution?
Use this to select the 'object'-type variables:
categorical = train.select_dtypes('object')
If you just want the column names:
categorical_cols = train.select_dtypes('object').columns.tolist()
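For reference, the comprehension from the question can also be fixed directly. A minimal sketch, assuming train is the DataFrame above (the bug is that train.columns[c] indexes the Index object with a column name, hence the IndexError; index the column itself instead):

categorical = [c for c in train.columns if train[c].dtype == 'object']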
I downloaded ERA5 U and V wind components and I am using xarray to read the .nc files and select several lat/lon points from the data. Afterwards I need to calculate wind speed and direction using metpy.calc functions:
import xarray as xr
import metpy.calc as mpcalc

ds = xr.open_mfdataset('ERA5_u10_*.nc', combine='by_coords').metpy.parse_cf()
dv = xr.open_mfdataset('ERA5_v10_*.nc', combine='by_coords').metpy.parse_cf()
#direction = mpcalc.wind_direction(ds['u10'], dv['v10'])
#dd = mpcalc.wind_direction(ds.u10, dv.v10)
locations = []
k = 0
u10 = []
v10 = []
speed = []
direction = []
u100 = []
v100 = []
# lats, lons: lists of station coordinates (defined elsewhere)
for i, j in zip(lats, lons):
    stations_u = ds.sel(longitude=j, latitude=i, method='nearest')
    stations_v = dv.sel(longitude=j, latitude=i, method='nearest')
    u10.append(stations_u.u10)  ### change variable
    v10.append(stations_v.v10)  ### change variable
    speed.append(mpcalc.wind_speed(stations_u.u10, stations_v.v10))
    direction.append(mpcalc.wind_direction(stations_u.u10, stations_v.v10))
Wind speed works with no problem, but wind direction raises an error:
runfile('/mnt/data1/ERA5/wind/untitled0.py', wdir='/mnt/data1/ERA5/wind')
Found valid latitude/longitude coordinates, assuming latitude_longitude for projection grid_mapping variable
Found valid latitude/longitude coordinates, assuming latitude_longitude for projection grid_mapping variable
Traceback (most recent call last):
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 1615, in __setitem__
y = where(key, value, self)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/routines.py", line 1382, in where
return elemwise(np.where, condition, x, y)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 4204, in elemwise
broadcast_shapes(*shapes)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 4165, in broadcast_shapes
"shapes {0}".format(" ".join(map(str, shapes)))
ValueError: operands could not be broadcast together with shapes (262968,) (nan,) (262968,)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/data1/ERA5/wind/untitled0.py", line 40, in <module>
direction.append(mpcalc.wind_direction(stations_u.u10,stations_v.v10))
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/metpy/xarray.py", line 1206, in wrapper
result = func(*bound_args.args, **bound_args.kwargs)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/metpy/units.py", line 246, in wrapper
return func(*args, **kwargs)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/metpy/calc/basic.py", line 104, in wind_direction
wdir[mask] += 360. * units.deg
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/pint/quantity.py", line 1868, in __setitem__
self._magnitude[key] = factor.magnitude
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 1622, in __setitem__
) from e
ValueError: Boolean index assignment in Dask expects equally shaped arrays.
Example: da1[da2] = da3 where da1.shape == (4,), da2.shape == (4,) and da3.shape == (4,).
If I try to calculate wind direction with only one element in u and v, surprisingly it works:
mpcalc.wind_direction(stations_u.u10[0],stations_v.v10[0]).values
Out[12]: array(173.41553, dtype=float32)
metpy.calc.wind_direction is unfortunately known not to work with Dask arrays; in fact, many places in MetPy don't currently work well with Dask, though we definitely want them to. For now, to use wind_direction you'll need to turn stations_u, etc. into standard numpy arrays using e.g. .compute().
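A minimal sketch of that suggestion, assuming the stations_u/stations_v selections from the question (note this loads the selected series into memory):

u = stations_u.u10.compute()  # materializes the Dask-backed DataArray as numpy
v = stations_v.v10.compute()
wdir = mpcalc.wind_direction(u, v)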
Thanks for your reply! Yeah, it makes sense...
The DataArrays I am working with are very big, so I need to keep them as DataArrays; copying them to plain numpy arrays would use more memory than I have xD
I eventually tried the calculation with stations_u as both u and v, and surprisingly it works:
mpcalc.wind_direction(stations_u.u10, stations_u.u10)
Out[7]:
<xarray.DataArray 'sub-404afe706bcfce1f36810c70766c0647' (time: 262968)>
<Quantity(dask.array<sub, shape=(262968,), dtype=float32, chunksize=(87672,), chunktype=numpy.ndarray>, 'degree')>
Coordinates:
    longitude  float32 -7.1
    latitude   float32 43.8
  * time       (time) datetime64[ns] 1990-01-01 ... 2019-12-31T23:00:00
Why does it work in this case? Isn't metpy.calc.wind_direction using a Dask array here too?
I feel that there's something wrong with the attribute names, units, and pint, but I can't understand what it is...
I have a dataframe of shape (25M, 79) and I'm trying to parallelise an sklearn pipeline prediction on it.
When I run it for just one partition, it works as expected:
n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
grid_searcher.best_estimator_.predict_proba(ddf.get_partition(0))
But if I apply it to every partition, then it fails:
n_partitions = 1000
ddf = dd.from_pandas(df_x_selection, npartitions=n_partitions)
def _f(_df, _pipeline, _predicted_class) -> np.array:
    return _pipeline.predict_proba(_df)[:, _predicted_class]

ddf.map_partitions(_f, grid_searcher.best_estimator_, 1, meta=(None, 'f8')).compute()
The error is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
130 raise ValueError(
--> 131 f"Wrong number of items passed {len(self.values)}, "
132 f"placement implies {len(self.mgr_locs)}"
ValueError: Wrong number of items passed 79, placement implies 100
What am I doing wrong?
Thanks
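For what it's worth, a hedged sketch of one common map_partitions pattern (an assumption on my part, not a confirmed fix for this exact traceback): return a pandas Series aligned with each partition's index, and declare the meta as a named float64 Series, so Dask can validate the output against the partition shape.

import pandas as pd

def _f(_df, _pipeline, _predicted_class):
    # return a Series (not a bare numpy array) aligned with the partition's index
    proba = _pipeline.predict_proba(_df)[:, _predicted_class]
    return pd.Series(proba, index=_df.index)

ddf.map_partitions(_f, grid_searcher.best_estimator_, 1, meta=('proba', 'f8')).compute()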
When I am running this script:
tokenizer.fit_on_texts(df['text'].values)
sequences = tokenizer.texts_to_sequences(df['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
I am getting this error:
AttributeError                            Traceback (most recent call last)
<ipython-input-4-7c08b89b116a> in <module>()
----> 1 tokenizer.fit_on_texts(df['text'].values)
2 sequences = tokenizer.texts_to_sequences(df['text'].values)
3 word_index = tokenizer.word_index
4 print('Found %s unique tokens.' % len(word_index))
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in fit_on_texts(self, texts)
220 self.filters,
221 self.lower,
--> 222 self.split)
223 for w in seq:
224 if w in self.word_counts:
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'float' object has no attribute 'lower'
The size of my CSV file is 6,970,963; when I reduce the size, it works. Is there any size limit for the Keras Tokenizer, or am I doing something wrong?
I guess file size is not the issue. Try using a try block and look at the data you are passing. Use something like this instead of that line:
# instead of this
tokenizer.fit_on_texts(df['text'].values)

# use this to look at the data when it causes the error
try:
    tokenizer.fit_on_texts(df['text'].values)
except Exception as e:
    print("exception is", e)
    print("data passed in:", df['text'].values)
Then you can fix the error you are getting accordingly.
Check the datatype of the text you are fitting the tokenizer on. It sees a float instead of a string. You need to convert the data to strings before fitting a tokenizer on it.
Try something like this:
train_x = [str(x[1]) for x in train_x]
Although it is an old thread, the following could still be the answer.
Your data may contain NaN values, which are interpreted as floats rather than strings. Either force the type with str(word) or remove the NaNs using data.fillna('empty').
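A minimal sketch of that fix, assuming df['text'] is the column from the question and may contain NaN entries:

df['text'] = df['text'].fillna('empty').astype(str)  # make every entry a string
tokenizer.fit_on_texts(df['text'].values)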
I am learning machine learning using Titanic dataset from Kaggle. I am using LabelEncoder of sklearn to transform text data to numeric labels. The following code works fine for "Sex" but not for "Embarked".
encoder = preprocessing.LabelEncoder()
features["Sex"] = encoder.fit_transform(features["Sex"])
features["Embarked"] = encoder.fit_transform(features["Embarked"])
This is the error I got:
Traceback (most recent call last):
File "../src/script.py", line 20, in <module>
features["Embarked"] = encoder.fit_transform(features["Embarked"])
File "/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 131, in fit_transform
self.classes_, y = np.unique(y, return_inverse=True)
File "/opt/conda/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 211, in unique
perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: '>' not supported between instances of 'str' and 'float'
I solved it myself. The problem was that this particular feature had NaN values. Replacing them with a numerical value would still throw an error, since the column would then contain mixed datatypes, so I replaced them with a character value:
features["Embarked"] = encoder.fit_transform(features["Embarked"].fillna('0'))
Try this function; you'll need to pass it a Pandas DataFrame. It will look at the type of each column and encode it, so you won't even need to check the types yourself.
def encoder(data):
    '''Map the categorical variables to numbers to work with scikit learn'''
    for col in data.columns:
        if data.dtypes[col] == "object":
            le = preprocessing.LabelEncoder()
            le.fit(data[col])
            data[col] = le.transform(data[col])
    return data
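Hypothetical usage, assuming features is the DataFrame from the question (note that LabelEncoder still fails on NaN, so fill those first as described above):

features = encoder(features.fillna('0'))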
I have a csv dataset that I'm trying to use with sklearn. The goal is to predict future web traffic. However, my dataset contains zeros on days when there were no visitors, and I'd like to keep those values. There are more days with zero visitors than with visitors (it's a tiny, tiny site). Here's a look at the data.
Col1 is the date:
10/1/11
10/2/11
10/3/11
etc....
Col2 is the # of visitors:
12
1
0
0
1
5
0
0
etc....
sklearn seems to interpret the zero values as NaN values, which is understandable. How can I use those zero values in a logistic function (is that even possible)?
Update:
The estimator is https://github.com/facebookincubator/prophet and when I run the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet

df = pd.read_csv('~/tmp/datafile.csv')
df['y'] = np.log(df['y'])
df.head()
m = Prophet()
m.fit(df);
future = m.make_future_dataframe(periods=365)
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
m.plot(forecast);
m.plot_components(forecast);
plt.show()
I get the following:
growthprediction.py:7: RuntimeWarning: divide by zero encountered in log
df['y'] = np.log(df['y'])
/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py:307: RuntimeWarning: invalid value encountered in double_scalars
k = (df['y_scaled'].ix[i1] - df['y_scaled'].ix[i0]) / T
Traceback (most recent call last):
File "growthprediction.py", line 11, in <module>
m.fit(df);
File "/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py", line 387, in fit
params = model.optimizing(dat, init=stan_init, iter=1e4)
File "/usr/local/lib/python3.6/site-packages/pystan/model.py", line 508, in optimizing
ret, sample = fit._call_sampler(stan_args)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 804, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.StanFit4Model._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:16585)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 398, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:8818)
RuntimeError: k initialized to invalid value (nan)
In this line of your code:
df['y'] = np.log(df['y'])
you are taking the logarithm of 0 wherever your df['y'] is zero, which produces the warnings and the NaNs in your resulting dataset, because the logarithm of 0 is not defined.
sklearn itself does NOT interpret zero values as NaNs unless you replace them with NaNs in your preprocessing.
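A minimal sketch of one common workaround (my suggestion, not part of the answer above): use np.log1p, which maps 0 to 0, so zero-visitor days stay defined, and invert the predictions with np.expm1.

df['y'] = np.log1p(df['y'])  # log(1 + y), defined at y == 0
m = Prophet()
m.fit(df)
# later: np.expm1(forecast['yhat']) to undo the transform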