I have a CSV dataset that I'm trying to use with sklearn. The goal is to predict future web traffic. However, my dataset contains zeros on days when there were no visitors, and I'd like to keep those values. There are more days with zero visitors than with visitors (it's a tiny, tiny site). Here's a look at the data.
Col1 is the date:
10/1/11
10/2/11
10/3/11
etc....
Col2 is the # of visitors:
12
1
0
0
1
5
0
0
etc....
sklearn seems to interpret the zero values as NaN values, which is understandable. How can I use those zero values in a logistic function (is that even possible)?
Update:
The estimator is https://github.com/facebookincubator/prophet and when I run the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet

df = pd.read_csv('~/tmp/datafile.csv')
df['y'] = np.log(df['y'])
df.head()
m = Prophet()
m.fit(df);
future = m.make_future_dataframe(periods=365)
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
m.plot(forecast);
m.plot_components(forecast);
plt.show()
I get the following:
growthprediction.py:7: RuntimeWarning: divide by zero encountered in log
df['y'] = np.log(df['y'])
/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py:307: RuntimeWarning: invalid value encountered in double_scalars
k = (df['y_scaled'].ix[i1] - df['y_scaled'].ix[i0]) / T
Traceback (most recent call last):
File "growthprediction.py", line 11, in <module>
m.fit(df);
File "/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py", line 387, in fit
params = model.optimizing(dat, init=stan_init, iter=1e4)
File "/usr/local/lib/python3.6/site-packages/pystan/model.py", line 508, in optimizing
ret, sample = fit._call_sampler(stan_args)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 804, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.StanFit4Model._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:16585)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 398, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:8818)
RuntimeError: k initialized to invalid value (nan)
In this line of your code:
df['y'] = np.log(df['y'])
you are taking the logarithm of 0 wherever df['y'] is zero. The logarithm of 0 is not defined, so numpy emits the divide-by-zero warning and produces -inf values, which then propagate through Prophet's preprocessing as NaNs; that is what the "k initialized to invalid value (nan)" error is complaining about.
sklearn itself does NOT interpret zero values as NaNs unless you replace them with NaNs in your preprocessing.
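A common workaround (a sketch, not from the original post) is to transform with np.log1p instead, which maps 0 to 0 and avoids the divide-by-zero; the forecast can be mapped back with np.expm1:

import numpy as np
import pandas as pd

df = pd.read_csv('~/tmp/datafile.csv')  # same file as in the question
df['y'] = np.log1p(df['y'])             # log(1 + y) is defined at y = 0
# ... fit Prophet and predict as above, then invert the transform:
# forecast['yhat'] = np.expm1(forecast['yhat'])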
The code of the partial derivatives of the mean square error:
w_grad = -(2 / n_samples)*(X.T.dot(y_true - y_pred))
b_grad = -(2 / n_samples)*np.sum(y_true - y_pred)
where n is n_samples, the number of samples, y_true the observations, and y_pred the predictions.
My question is: why do we use an explicit sum in the gradient for b (b_grad), but not in the code for w_grad?
The original equation is:

MSE = (1/n) * sum_i (y_true_i - y_pred_i)^2, with y_pred_i = w . x_i + b

dMSE/dw = -(2/n) * sum_i x_i * (y_true_i - y_pred_i)
dMSE/db = -(2/n) * sum_i (y_true_i - y_pred_i)

Both gradients sum over the samples; in the code, the sum for w_grad is performed implicitly by the matrix product X.T.dot(y_true - y_pred), while for the scalar b it has to be written out explicitly with np.sum.
If you have ten features, then you have ten w's and ten b's, and the total number of variables is twenty.
But we can just sum all b_i into one variable b, and the total number of variables becomes 10 + 1 = 11. This is done by adding one more dimension and fixing the last x to 1. The calculation becomes:

y_pred = w_1*x_1 + ... + w_10*x_10 + b*1 = [x_1, ..., x_10, 1] . [w_1, ..., w_10, b]
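A minimal numpy sketch of this bias trick (all names here are illustrative, not from the original post):

import numpy as np

n_samples, n_features = 5, 10
X = np.random.rand(n_samples, n_features)
X_aug = np.hstack([X, np.ones((n_samples, 1))])  # fix the last "feature" to 1
w_aug = np.random.rand(n_features + 1)           # 10 weights + 1 bias = 11 variables
y_pred = X_aug.dot(w_aug)                        # the bias enters as w_aug[-1] * 1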
I downloaded ERA5 U and V wind components, and I am using xarray to read the .nc files and select several lat/lon points from the data. Afterwards I need to calculate wind speed and direction using the metpy.calc functions:
import xarray as xr
import metpy.calc as mpcalc

ds = xr.open_mfdataset('ERA5_u10_*.nc', combine='by_coords').metpy.parse_cf()
dv = xr.open_mfdataset('ERA5_v10_*.nc', combine='by_coords').metpy.parse_cf()
#direction =mpcalc.wind_direction(ds['u10'],dv['v10'])
#dd=mpcalc.wind_direction(ds.u10,dv.v10)
locations=[]
k=0
u10 = []
v10=[]
speed=[]
direction=[]
u100 = []
v100=[]
for i, j in zip(lats, lons):  # lats, lons: station coordinates defined elsewhere
    stations_u = ds.sel(longitude=j, latitude=i, method='nearest')
    stations_v = dv.sel(longitude=j, latitude=i, method='nearest')
    u10.append(stations_u.u10)  # change variable as needed
    v10.append(stations_v.v10)  # change variable as needed
    speed.append(mpcalc.wind_speed(stations_u.u10, stations_v.v10))
    direction.append(mpcalc.wind_direction(stations_u.u10, stations_v.v10))
Wind speed works with no problem, but wind direction raises an error:
runfile('/mnt/data1/ERA5/wind/untitled0.py', wdir='/mnt/data1/ERA5/wind')
Found valid latitude/longitude coordinates, assuming latitude_longitude for projection grid_mapping variable
Found valid latitude/longitude coordinates, assuming latitude_longitude for projection grid_mapping variable
Traceback (most recent call last):
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 1615, in __setitem__
y = where(key, value, self)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/routines.py", line 1382, in where
return elemwise(np.where, condition, x, y)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 4204, in elemwise
broadcast_shapes(*shapes)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 4165, in broadcast_shapes
"shapes {0}".format(" ".join(map(str, shapes)))
ValueError: operands could not be broadcast together with shapes (262968,) (nan,) (262968,)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/data1/ERA5/wind/untitled0.py", line 40, in <module>
direction.append(mpcalc.wind_direction(stations_u.u10,stations_v.v10))
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/metpy/xarray.py", line 1206, in wrapper
result = func(*bound_args.args, **bound_args.kwargs)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/metpy/units.py", line 246, in wrapper
return func(*args, **kwargs)
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/metpy/calc/basic.py", line 104, in wind_direction
wdir[mask] += 360. * units.deg
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/pint/quantity.py", line 1868, in __setitem__
self._magnitude[key] = factor.magnitude
File "/usr/local/python/anaconda3/envs/my_env/lib/python3.6/site-packages/dask/array/core.py", line 1622, in __setitem__
) from e
ValueError: Boolean index assignment in Dask expects equally shaped arrays.
Example: da1[da2] = da3 where da1.shape == (4,), da2.shape == (4,) and da3.shape == (4,).
If I try to calculate the wind direction with only one element in U and in V, surprisingly it works:
mpcalc.wind_direction(stations_u.u10[0],stations_v.v10[0]).values
Out[12]: array(173.41553, dtype=float32)
metpy.calc.wind_direction is unfortunately known not to work with Dask arrays, and in fact many places in MetPy don't work well with Dask, though we definitely want them to. For now, to use wind_direction you'll need to turn stations_u, etc. into standard numpy arrays using e.g. .compute().
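A minimal sketch of that suggestion, reusing the variable names from the loop above: materialize the Dask-backed DataArrays before calling wind_direction, since it performs a boolean-mask assignment that Dask does not support.

u = stations_u.u10.compute()  # Dask array -> in-memory numpy-backed DataArray
v = stations_v.v10.compute()
direction.append(mpcalc.wind_direction(u, v))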
Thanks for your reply! Yeah, it makes sense...
The DataArrays I am working with are very big, so I need to keep them as Dask-backed DataArrays; copying them into plain numpy arrays would use more memory than I have.
I eventually tried the calculation with stations_u as both u and v, and surprisingly it works:
mpcalc.wind_direction(stations_u.u10, stations_u.u10)
Out[7]:
<xarray.DataArray 'sub-404afe706bcfce1f36810c70766c0647' (time: 262968)>
<Quantity(dask.array<sub, shape=(262968,), dtype=float32, chunksize=(87672,), chunktype=numpy.ndarray>, 'degree')>
Coordinates:
longitude float32 -7.1
latitude float32 43.8
* time (time) datetime64[ns] 1990-01-01 ... 2019-12-31T23:00:00
Why does it work in this case? Isn't metpy.calc.wind_direction using a Dask array here too?
I feel that there's something wrong with the attribute names, units and pint, but I can't understand what it is...
When I run this script:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'].values)
sequences = tokenizer.texts_to_sequences(df['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
I am getting this error
AttributeError Traceback (most recent call
last)
<ipython-input-4-7c08b89b116a> in <module>()
----> 1 tokenizer.fit_on_texts(df['text'].values)
2 sequences = tokenizer.texts_to_sequences(df['text'].values)
3 word_index = tokenizer.word_index
4 print('Found %s unique tokens.' % len(word_index))
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in
fit_on_texts(self, texts)
220 self.filters,
221 self.lower,
--> 222 self.split)
223 for w in seq:
224 if w in self.word_counts:
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in
text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'float' object has no attribute 'lower'
The size of my CSV file is 6970963; when I reduce the size it works. Is there a size limit for the Keras Tokenizer, or am I doing something wrong?
I guess file size is not the issue; try using a try block and look at the data you are passing. Use something like this instead of that line:
# Instead of this:
# tokenizer.fit_on_texts(df['text'].values)
# use this to look at the data when it causes the error:
try:
    tokenizer.fit_on_texts(df['text'].values)
except Exception as e:
    print("exception is", e)
    print("data passed in", df['text'].values)
Then you can fix the error you are getting accordingly.
Check the datatype of the text you are fitting the tokenizer on. It sees it as a float instead of a string. You need to convert it to a string before fitting a tokenizer on it.
Try something like this:
train_x = [str(x[1]) for x in train_x]
Although it is an old thread, the following could still be the answer.
Your data may contain NaN values, which pandas stores as floats rather than strings. Either force the type with str(word) or remove the NaNs using data.fillna('empty').
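A short sketch of that cleanup (the column name 'text' follows the question): coerce the NaN floats to strings before fitting the tokenizer.

df['text'] = df['text'].fillna('').astype(str)  # NaNs become empty strings
tokenizer.fit_on_texts(df['text'].values)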
I am learning machine learning using the Titanic dataset from Kaggle. I am using sklearn's LabelEncoder to transform text data into numeric labels. The following code works fine for "Sex" but not for "Embarked".
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
features["Sex"] = encoder.fit_transform(features["Sex"])
features["Embarked"] = encoder.fit_transform(features["Embarked"])
This is the error I got
Traceback (most recent call last):
File "../src/script.py", line 20, in <module>
features["Embarked"] = encoder.fit_transform(features["Embarked"])
File "/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 131, in fit_transform
self.classes_, y = np.unique(y, return_inverse=True)
File "/opt/conda/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 211, in unique
perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: '>' not supported between instances of 'str' and 'float'
I solved it myself. The problem was that this particular feature had NaN values. Replacing them with a numeric value would still throw an error, since the column would then mix datatypes. So I replaced them with a character value:
features["Embarked"] = encoder.fit_transform(features["Embarked"].fillna('0'))
Try this function; you'll need to pass a Pandas DataFrame. It will look at the type of each column and encode it, so you won't even need to check the types yourself.
def encoder(data):
    '''Map the categorical variables to numbers to work with scikit-learn.'''
    for col in data.columns:
        if data.dtypes[col] == "object":        # only encode string/object columns
            le = preprocessing.LabelEncoder()   # fresh encoder per column
            le.fit(data[col])
            data[col] = le.transform(data[col])
    return data
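Example usage (hypothetical, reusing the features DataFrame from the question); NaNs in a string column should still be filled first, as pointed out above, or le.fit will hit the same mixed-type error:

features['Embarked'] = features['Embarked'].fillna('0')
features = encoder(features)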
Suppose you want to evaluate a simple GLM to forecast an economic data series.
Consider the following code:
library(caret)
library(ggplot2)
data(economics)
h <- 7
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 24*h,
                              horizon = 12,
                              fixedWindow = TRUE)
fit.glm <- train(unemploy ~ pce + pop + psavert,
                 data = economics,
                 method = "glm",
                 preProc = c("center", "scale", "BoxCox"),
                 trControl = myTimeControl)
Suppose that the covariates used in the train formula are predicted values obtained from some other model.
This simple model gives the following results:
Generalized Linear Model
574 samples
3 predictor
Pre-processing: centered (3), scaled (3), Box-Cox transformation (3)
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed
window)
Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...
Resampling results:
RMSE Rsquared
1446.335 0.2958317
Apart from the bad results obtained (this is only an example), I wonder:
Is it correct to consider the above results as obtained, on the entire dataset, by a GLM trained using only 24*h = 24*7 = 168 samples and retrained after every horizon = 12 samples?
How can I evaluate the RMSE as the horizon grows from 1 to 12 (as reported at http://robjhyndman.com/hyndsight/tscvexample/)?
If I show the fit.glm summary, I obtain:
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-5090.0 -1025.5 -208.1 833.4 4948.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7771.56 64.93 119.688 < 2e-16 ***
pce 5750.27 1153.03 4.987 8.15e-07 ***
pop -1483.01 1117.06 -1.328 0.185
psavert 2932.38 144.56 20.286 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 2420081)
Null deviance: 3999514594 on 573 degrees of freedom
Residual deviance: 1379446256 on 570 degrees of freedom
AIC: 10072
Number of Fisher Scoring iterations: 2
Do the parameters shown refer to the last trained GLM, or are they "average" parameters?
I hope I've been clear enough.
This resampling method is like any other: the RMSE is estimated using different subsets of the training data. Note that it says "Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...". The final model uses the entire training data set.
The difference between Rob's results and these is primarily due to the difference between Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
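For reference, the standard definitions (not part of caret's output): MAE = (1/n) * sum_i |y_i - yhat_i|, while RMSE = sqrt((1/n) * sum_i (y_i - yhat_i)^2). Because the errors are squared before averaging, RMSE weights large errors more heavily, so the two measures are not directly comparable.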