Keras Tokenization (fit on text) - machine-learning

When I run this script:
tokenizer.fit_on_texts(df['text'].values)
sequences = tokenizer.texts_to_sequences(df['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
I get this error:
AttributeError Traceback (most recent call last)
<ipython-input-4-7c08b89b116a> in <module>()
----> 1 tokenizer.fit_on_texts(df['text'].values)
2 sequences = tokenizer.texts_to_sequences(df['text'].values)
3 word_index = tokenizer.word_index
4 print('Found %s unique tokens.' % len(word_index))
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in
fit_on_texts(self, texts)
220 self.filters,
221 self.lower,
--> 222 self.split)
223 for w in seq:
224 if w in self.word_counts:
/opt/conda/lib/python3.6/site-packages/keras_preprocessing/text.py in
text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'float' object has no attribute 'lower'
My CSV file has a size of 6,970,963; when I reduce the size, it works. Is there any size limit for the Keras Tokenizer, or am I doing something wrong?

I guess file size is not the issue. Try wrapping the call in a try block and look at the data you are passing. Use something like this instead of the single line:
#instead of this
tokenizer.fit_on_texts(df['text'].values)
#use this to look at the data when it is causing that error.
try:
    tokenizer.fit_on_texts(df['text'].values)
except Exception as e:
    print("exception is", e)
    print("data passed in", df['text'].values)
Then you can fix the error you are getting accordingly.

Check the datatype of the text you are fitting the tokenizer on. It sees it as a float instead of a string. You need to convert the text to strings before fitting the tokenizer on it.
Try something like this:
train_x = [str(x[1]) for x in train_x]

Although this is an old thread, the following could still be the answer.
Your data may contain NaN values, which pandas represents as floats rather than strings. Either force the type with str(word) or remove the NaNs using data.fillna('empty').
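For example, a minimal sketch (not from the original answers; it assumes the column from the question is df['text'] and uses a hypothetical file name) that cleans the column before fitting:
import pandas as pd
from keras.preprocessing.text import Tokenizer

df = pd.read_csv('data.csv')  # hypothetical path
df['text'] = df['text'].fillna('').astype(str)  # NaN rows become empty strings, floats become text

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'].values)  # every entry is now a str, so .lower() works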

Related

Ran into "TypeError: '<' not supported between instances of 'Tensor' and 'list'" when going through dataset

I am replicating ResNet (source: https://arxiv.org/abs/1512.03385).
I ran into the error "TypeError: '<' not supported between instances of 'Tensor' and 'list'" when trying to iterate over several different datasets in different sections of my code.
I tried different fixes, but none worked: (i) I removed enumerate because I worried it might be causing the problem; (ii) I tried iterating over the dataloader rather than the dataset, but that didn't work either.
1st time: When I tried to view images:
for images, _ in train_loader:
    print('images.shape:', images.shape)
    plt.figure(figsize=(16,8))
    plt.axis('off')
    plt.imshow(torchvision.utils.make_grid(images, nrow=16).permute((1, 2, 0)))
    break
2nd/3rd time: when I tried to validate/test the resnet:
with torch.no_grad():
    for j, inputs, labels in enumerate(test_loader, start=0):
        outputs = resnet_models[i](inputs)
        _, prediction = torch.max(outputs, dim=1)
You may notice that I didn't run into this error when training the resnet, and the code is quite similar:
for batch, data in enumerate(train_dataloader, start=0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
Error message (taking the first error as an example; the rest are pretty much the same):
TypeError Traceback (most recent call last)
Input In [38], in <cell line: 8>()
6 print("Images AFTER NORMALIZATION")
7 print("--------------------------")
----> 8 for images, _ in training_data:
9 sort=False
10 print('images.shape:', images.shape)
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torch/utils/data/dataset.py:471, in Subset.__getitem__(self, idx)
469 if isinstance(idx, list):
470 return self.dataset[[self.indices[i] for i in idx]]
--> 471 return self.dataset[self.indices[idx]]
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/datasets/cifar.py:118, in CIFAR10.__getitem__(self, index)
115 img = Image.fromarray(img)
117 if self.transform is not None:
--> 118 img = self.transform(img)
120 if self.target_transform is not None:
121 target = self.target_transform(target)
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/transforms/transforms.py:95, in Compose.__call__(self, img)
93 def __call__(self, img):
94 for t in self.transforms:
---> 95 img = t(img)
96 return img
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torch/nn/modules/module.py:1110, in Module._call_impl(self, *input, **kwargs)
1106 # If we don't have any hooks, we want to skip the rest of the logic in
1107 # this function, and just call forward.
1108 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1109 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110 return forward_call(*input, **kwargs)
1111 # Do not call functions when jit is used
1112 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/resnet/lib/python3.9/site-packages/torchvision/transforms/transforms.py:707, in RandomHorizontalFlip.forward(self, img)
699 def forward(self, img):
700 """
701 Args:
702 img (PIL Image or Tensor): Image to be flipped.
(...)
705 PIL Image or Tensor: Randomly flipped image.
706 """
--> 707 if torch.rand(1) < self.p:
708 return F.hflip(img)
709 return img
TypeError: '<' not supported between instances of 'Tensor' and 'list'
I was having the same error message, probably under different circumstances, but I just found my own bug and figured I would share it anyway for other readers. I was using a torchvision transformation in the dataset that my dataloader was loading from. The transformation was
torchvision.transforms.RandomHorizontalFlip([0.5]),
and the bug is that the argument to this transform should not be a list; it should be
torchvision.transforms.RandomHorizontalFlip(0.5),
So if there is anything I can recommend, it's to check whether a list argument is being passed to a transformation (or somewhere else) where a scalar is expected.
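For context, a minimal sketch of a transform pipeline with the corrected argument (the surrounding transforms are placeholders, not from the original post):
import torchvision.transforms as T

transform = T.Compose([
    T.RandomHorizontalFlip(0.5),  # correct: p is a scalar probability
    # T.RandomHorizontalFlip([0.5]) would make `torch.rand(1) < self.p` compare a Tensor to a list
    T.ToTensor(),
])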

error in getting categorical features from train dataset

My train data looks something like this:
[image: train data]
To extract categorical features out of it, I ran the following code:
categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
But I am getting error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-31-31eb7ac47e21> in <module>
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
<ipython-input-31-31eb7ac47e21> in <listcomp>(.0)
----> 1 categorial=[c for c in train.columns if train.columns[c].dtype in ['object'] ]
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
4295 if is_scalar(key):
4296 key = com.cast_scalar_indexer(key, warn_float=True)
-> 4297 return getitem(key)
4298
4299 if isinstance(key, slice):
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
What is the possible solution?
Use this to select the 'object' type variables-
categorical = train.select_dtypes('object')
If you just want the variable names -
categorical_cols = train.select_dtypes('object').columns.tolist()
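Equivalently, if you prefer something close to the original list comprehension, iterate over the column names and check each column's dtype (a small sketch assuming train is a pandas DataFrame):
categorical_cols = [c for c in train.columns if train[c].dtype == 'object']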

CVXPY trying to formulate big problem: ValueError: negative dimensions are not allowed. Is RAM usage the problem here?

I am facing what is, in my opinion, a fairly large optimization problem. You can see the code below. The variable b has size 500 x 96. What I am trying to do is match a sum of time-series profiles (35,136 15-minute timesteps) to a bigger profile by minimizing their difference. With the same formulation and a much smaller problem (672 timesteps and a b variable of size 10 x 5), the problem is solved in under 2 seconds without an issue. But when I run the full-scale problem I get the error you see below.
I am running this on Jupyter Lab with Python 3.7.4. The Python installation is done with conda.
I would expect the problem to solve as the much smaller one did. But when I run this one, RAM usage explodes to about 100 GB (about 99% of the available RAM on the server). After a while the RAM usage goes down, and then a periodic swinging begins (RAM goes up and down between 50% and 100% every few minutes). From the error, and after a lot of googling, my suspicion is that the problem is too big for memory and that at some point the data gets broken down into smaller pieces. I do not think it even reaches the point where the solver does its work. I tried to optimize the code by vectorizing everything (current version) and avoiding loops etc. in the formulation, but this did not change anything. Do you have any clue whether this is a bug or a limitation? Or do you maybe have an idea how to solve this?
X_opt = cp.Constant(np.asarray(X.iloc[:,:500])) # the array size is (35136,500)
K_opt = cp.Constant(np.asarray(K.YearlyDemand)) # the vector size is 96
b = cp.Variable((500,96),boolean = True, value = np.zeros((500,96)))
Y_opt = cp.Constant(np.asarray(y)) # the vector size is 35136
constraints = []
constraints.append( cp.sum(b, axis = 0) == 1 ) # the sum of the elements of every column of b must be equal to 1
constraints.append( cp.sum(b, axis = 1) <= 1 ) # the sum of the elements of every row of b must be smaller or equal to 1
objective = cp.Minimize(cp.sum(cp.abs(Y_opt-cp.sum((cp.diag(K_opt)*((X_opt@b).T)).T, axis = 1))))
prob = cp.Problem(objective, constraints)
prob.solve(solver = cp.GLPK_MI, verbose = True)
ValueError Traceback (most recent call last)
in
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\problems\problem.py in solve(self, *args, **kwargs)
287 else:
288 solve_func = Problem._solve
--> 289 return solve_func(self, *args, **kwargs)
290
291 @classmethod
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\problems\problem.py in _solve(self, solver, warm_start, verbose, parallel, gp, qcp, **kwargs)
567 self._construct_chains(solver=solver, gp=gp)
568 data, solving_inverse_data = self._solving_chain.apply(
--> 569 self._intermediate_problem)
570 solution = self._solving_chain.solve_via_data(
571 self, data, warm_start, verbose, kwargs)
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\reductions\chain.py in apply(self, problem)
63 inverse_data = []
64 for r in self.reductions:
---> 65 problem, inv = r.apply(problem)
66 inverse_data.append(inv)
67 return problem, inverse_data
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\reductions\matrix_stuffing.py in apply(self, problem)
98 # Batch expressions together, then split apart.
99 expr_list = [arg for c in cons for arg in c.args]
--> 100 Afull, bfull = extractor.affine(expr_list)
101 if 0 not in Afull.shape and 0 not in bfull.shape:
102 Afull = cvxtypes.constant()(Afull)
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\utilities\coeff_extractor.py in affine(self, expr)
76 size = sum([e.size for e in expr_list])
77 op_list = [e.canonical_form[0] for e in expr_list]
---> 78 V, I, J, b = canonInterface.get_problem_matrix(op_list, self.id_map)
79 A = sp.csr_matrix((V, (I, J)), shape=(size, self.N))
80 return A, b.flatten()
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\canonInterface.py in get_problem_matrix(linOps, id_to_col, constr_offsets)
65
66 # Unpacking
---> 67 V = problemData.getV(len(problemData.V))
68 I = problemData.getI(len(problemData.I))
69 J = problemData.getJ(len(problemData.J))
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\cvxcore.py in getV(self, values)
320
321 def getV(self, values):
--> 322 return _cvxcore.ProblemData_getV(self, values)
323
324 def getI(self, values):
ValueError: negative dimensions are not allowed
This problem is solved here:
https://github.com/cvxgrp/cvxpy/issues/826#issuecomment-648618636
Note that the general problem is that large problems create underlying matrices too large for the numpy.int32 type that CVXPY uses. You can modify the code in CVXPY fairly easily to continue using the SCS solver.
You will have to modify the file canonInterface.py here:
D:\Anaconda3\envs\py37DuAL\lib\site-packages\cvxpy\cvxcore\python\
If you have trouble finding the second file to modify, just modify the first one, and use the traceback to find the second file.

Dask error - min() arg is an empty sequence

I'm trying to use Dask to handle a reasonably large dataset, but I keep getting
ValueError: min() arg is an empty sequence
when I try to run .describe().compute().
I have confirmed that describe() works in normal Pandas with the same dataset, so it must be Dask-related.
Here is the line I'm using:
inpFile = dd.read_csv(fPath, sep='\t', error_bad_lines= False,quoting=csv.QUOTE_NONE)
and the full error is:
ValueError Traceback (most recent call last)
in ()
----> 1 inpFile.describe().compute()
2 #inpFile2.describe()
/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in describe(self, split_every)
1306 num = self._get_numeric_data()
1307
-> 1308 stats = [num.count(split_every=split_every),
1309 num.mean(split_every=split_every),
1310 num.std(split_every=split_every),
/home/badrul/anaconda3/lib/python3.6/site-packages/dask/dataframe/core.py in count(self, axis, split_every)
1191 token=token, split_every=split_every)
1192 if isinstance(self, DataFrame):
-> 1193 result.divisions = (min(self.columns), max(self.columns))
1194 return result
1195
ValueError: min() arg is an empty sequence
It doesn't run for very long, so I suspect the data isn't actually loading.
The error then comes when I run inpFile.describe().compute().
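One hedged diagnostic sketch (not from the original post): describe() only summarizes the numeric columns, so if Dask inferred every column as object/string, the internal min()/max() over an empty column list fails, as the traceback shows. Checking the inferred dtypes makes that visible:
import csv
import dask.dataframe as dd

inpFile = dd.read_csv(fPath, sep='\t', error_bad_lines=False, quoting=csv.QUOTE_NONE)
# If no column shows a numeric dtype here, describe() has nothing to summarize.
print(inpFile.dtypes)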

Keeping zeros in data with sklearn

I have a csv dataset that I'm trying to use with sklearn. The goal is to predict future web traffic. However, my dataset contains zeros on days when there were no visitors, and I'd like to keep those values. There are more days with zero visitors than there are with visitors (it's a tiny, tiny site). Here's a look at the data
Col1 is the date:
10/1/11
10/2/11
10/3/11
etc....
Col2 is the # of visitors:
12
1
0
0
1
5
0
0
etc....
sklearn seems to interpret the zero values as NaN values, which is understandable. How can I use those zero values in a logistic function (is that even possible)?
Update:
The estimator is https://github.com/facebookincubator/prophet and when I run the following:
df = pd.read_csv('~/tmp/datafile.csv')
df['y'] = np.log(df['y'])
df.head()
m = Prophet()
m.fit(df);
future = m.make_future_dataframe(periods=365)
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
m.plot(forecast);
m.plot_components(forecast);
plt.show()
I get the following:
growthprediction.py:7: RuntimeWarning: divide by zero encountered in log
df['y'] = np.log(df['y'])
/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py:307: RuntimeWarning: invalid value encountered in double_scalars
k = (df['y_scaled'].ix[i1] - df['y_scaled'].ix[i0]) / T
Traceback (most recent call last):
File "growthprediction.py", line 11, in <module>
m.fit(df);
File "/usr/local/lib/python3.6/site-packages/fbprophet/forecaster.py", line 387, in fit
params = model.optimizing(dat, init=stan_init, iter=1e4)
File "/usr/local/lib/python3.6/site-packages/pystan/model.py", line 508, in optimizing
ret, sample = fit._call_sampler(stan_args)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 804, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.StanFit4Model._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:16585)
File "stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.pyx", line 398, in stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666._call_sampler (/var/folders/ym/m6j7kw0d3kj_0frscrtp58800000gn/T/tmp5wq7qltr/stanfit4anon_model_35bf14a7f93814266f16b4cf48b40a5a_4758371668158283666.cpp:8818)
RuntimeError: k initialized to invalid value (nan)
In this line of your code:
df['y'] = np.log(df['y'])
you are taking the logarithm of 0 wherever df['y'] is zero, which results in warnings and NaNs in your resulting dataset, because the logarithm of 0 is not defined.
sklearn itself does NOT interpret zero values as NaNs unless you replace them with NaNs in your preprocessing.
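If you want to keep the zero-visitor days while still log-transforming, one common workaround (a sketch, not from the original answer) is np.log1p, which maps 0 to 0 and can be inverted with np.expm1 on the forecast:
import numpy as np
import pandas as pd

df = pd.read_csv('~/tmp/datafile.csv')  # same file as in the question
df['y'] = np.log1p(df['y'])             # log(1 + y): zero visitors stay finite, log1p(0) == 0
# ... fit Prophet on df as before, then invert the transform on the predictions:
# forecast['yhat'] = np.expm1(forecast['yhat'])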
