Keras Tokenizer num_words doesn't seem to work - machine-learning

>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}
I'd have expected t.word_index to have just the top 3 words. What am I doing wrong?

There is nothing wrong with what you are doing. word_index is computed the same way regardless of how many of the most frequent words you use later (as you may see here). When you call any transformation method, the Tokenizer uses only the num_words - 1 most common words (two here, since index 0 is reserved), while still keeping the counts for all words, even when it is obvious that it will not use them later.

Just to add to Marcin's answer ("it will keep the counter of all words - even when it's obvious that it will not use it later.").
The reason it keeps counts for all words is that you can call fit_on_texts multiple times. Each call updates the internal counters, and when a transformation is called, it uses the top words based on the updated counters.
Hope it helps.
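The counting behaviour described above can be illustrated with a small pure-Python stand-in. This is only a toy sketch under stated assumptions: ToyTokenizer and its simplified word splitting are invented for illustration and are not the Keras implementation. It keeps counts for every word, ranks them on demand, and applies num_words only when texts are transformed.

```python
import re
from collections import Counter


class ToyTokenizer:
    """Toy model of the behaviour described above; NOT the Keras implementation."""

    def __init__(self, num_words=None):
        self.num_words = num_words
        self.word_counts = Counter()  # counts for every word ever seen

    @staticmethod
    def _split(text):
        # Simplified tokenisation (assumption): lowercase, keep letters and digits.
        return re.findall(r"[a-z0-9]+", text.lower())

    def fit_on_texts(self, texts):
        # Each call adds to the existing counts instead of resetting them.
        for text in texts:
            self.word_counts.update(self._split(text))

    @property
    def word_index(self):
        # Rank all words by frequency; num_words plays no role here.
        ranked = sorted(self.word_counts, key=lambda w: -self.word_counts[w])
        return {word: rank for rank, word in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        # num_words is applied only now: keep indices strictly below it.
        index = self.word_index

        def keep(i):
            return self.num_words is None or i < self.num_words

        return [[index[w] for w in self._split(t) if w in index and keep(index[w])]
                for t in texts]


tok = ToyTokenizer(num_words=4)
tok.fit_on_texts(["I love my dog", "I, love my cat", "You love my dog!"])
first = tok.texts_to_sequences(["I love my cat"])

tok.fit_on_texts(["cat cat cat cat"])  # second fit: 'cat' becomes the top word
second = tok.texts_to_sequences(["I love my cat"])
print(first, second)
```

After the first fit, 'cat' has rank 5 and is dropped; after the second fit it has rank 1 and is kept, while 'i' falls to rank 4 and is dropped instead. That reordering is only possible because the counts for all words were retained.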

Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts, and word_docs. It does have an effect on texts_to_matrix: the resulting matrix will have num_words (3) columns.
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> print(t.word_index)
{'world': 1, 'this': 2, 'is': 3, 'hello': 4, 'so': 5, 'fantastic': 6, 'there': 7, 'no': 8, 'other': 9, 'like': 10, 'one': 11}
>>> t.texts_to_matrix(l, mode='count')
array([[0., 1., 1.],
       [0., 1., 1.]])

Just to add a little to farid khafizov's answer:
words whose index is num_words or greater are removed from the results of texts_to_sequences (below, index 4 disappears from the 1st sentence, 5 from the 2nd, and 6 and 4 from the 3rd).
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
print(tf.__version__) # 2.4.1, in my case
sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!'
]
tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
seq = tokenizer.texts_to_sequences(sentences)
print(word_index) # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
print(seq) # [[3, 1, 2], [3, 1, 2], [1, 2]]

Related

How to generate conditions within constraints in Z3py

Let us assume there are 5 time slots, and at each time slot I have 4 options to choose from, each with a known reward, e.g. rewards = [5, 2, 1, -3]. At every time step, exactly one of the four options must be selected, with the condition that if option 3 (with reward -3) is chosen at time t, then none of the options may be selected for the remaining time steps. As an example, with options indexed from 0, both [2, 1, 1, 0, 3] and [2, 1, 1, 3, 99] are valid solutions; the second has option 3 selected at time step 3 (counting from 0), and 99 is an arbitrary value meaning that no option was chosen.
The Z3py code I tried is here:
T = 6 #Total time slots
s = Solver()
pick = [[Bool('t%d_ch%d' %(j, i)) for i in range(4)] for j in range(T)]
# Rewards of each option
Rewards = [5, 2, 1, -3]
# Select at most one of the 4 options as True
for i in range(T):
    s.add(Or(Not(Or(pick[i][0], pick[i][1], pick[i][2], pick[i][3])),
             And(Xor(pick[i][0],pick[i][1]), Not(Or(pick[i][2], pick[i][3]))),
             And(Xor(pick[i][2],pick[i][3]), Not(Or(pick[i][0], pick[i][1])))))
# If option 3 is picked, then none of the 4 options should be selected for the future time slots
# else, exactly one should be selected.
for i in range(len(pick)-1):
    for j in range(4):
        s.add(If(And(j==3,pick[i][j]),
                 Not(Or(pick[i+1][0], pick[i+1][1], pick[i+1][2], pick[i+1][3])),
                 Or(And(Xor(pick[i+1][0],pick[i+1][1]), Not(Or(pick[i+1][2], pick[i+1][3]))),
                    And(Xor(pick[i+1][2],pick[i+1][3]), Not(Or(pick[i+1][0], pick[i+1][1]))))))
if s.check()==False:
    print("unsat")
m=s.model()
print(m)
With this implementation, I am not getting solutions such as [2, 1, 1, 3, 99]. All of them either do not have option 3 or have it in the last time slot.
I know there is an error inside the If part but I'm unable to figure it out. Is there a better way to achieve such solutions?
It's hard to decipher what you're trying to do. From a basic reading of your description, I think this might be an instance of the XY problem. See https://xyproblem.info/ for details on that, and try to cast your question in terms of what your original goal is, instead of a particular solution you're trying to implement. (It seems to me that the solution you came up with is unnecessarily complicated.)
Having said that, you can solve your problem as stated if you get rid of the 99 requirement and simply indicate -3 as the terminator. Once you pick -3, then all the following picks should be -3. This can be coded as follows:
from z3 import *
T = 6
s = Solver()
Rewards = [5, 2, 1, -3]
picks = [Int('pick_%d' % i) for i in range(T)]
def pickReward(p):
    return Or([p == r for r in Rewards])
for i in range(T):
    if i == 0:
        s.add(pickReward(picks[i]))
    else:
        s.add(If(picks[i-1] == -3, picks[i] == -3, pickReward(picks[i])))
while s.check() == sat:
    m = s.model()
    picked = []
    for i in picks:
        picked += [m[i]]
    print(picked)
    s.add(Or([p != v for p, v in zip(picks, picked)]))
When run, this prints:
[5, -3, -3, -3, -3, -3]
[1, 5, 5, 5, 5, 1]
[1, 2, 5, 5, 5, 1]
[2, 2, 5, 5, 5, 1]
[2, 5, 5, 5, 5, 1]
[2, 1, 5, 5, 5, 1]
[1, 1, 5, 5, 5, 1]
[2, 1, 5, 5, 5, 2]
[2, 5, 5, 5, 5, 2]
[2, 5, 5, 5, 5, 5]
[2, 5, 5, 5, 5, -3]
[2, 1, 5, 5, 5, 5]
...
I interrupted the above as it keeps enumerating all the possible picks. There are a total of 1093 of them in this particular case.
(You can get different answers depending on your version of z3.)
Hope this gets you started. Stating what your original goal is directly is usually much more helpful, should you have further questions.
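As a cross-check on the count quoted above, the same rule (once -3 is picked, every later pick is -3) can be enumerated by brute force in plain Python, without z3. This is only a sanity-check sketch; the helper name is_valid is made up for the example.

```python
from itertools import product

REWARDS = [5, 2, 1, -3]
T = 6


def is_valid(picks):
    # Every pick that follows a -3 must itself be -3.
    return all(cur == -3 for prev, cur in zip(picks, picks[1:]) if prev == -3)


# 4**6 = 4096 candidate sequences, so exhaustive enumeration is cheap here.
valid = [p for p in product(REWARDS, repeat=T) if is_valid(p)]
print(len(valid))
```

The count agrees with a direct calculation: 3^6 = 729 sequences never use -3, and 3^k sequences have their first -3 at position k for k = 0..5, which adds 364, giving 1093 in total.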

Catboost hyperparams search

I want to use default hyperparams in randomized search, how can I do it? (per_float_feature_quantization param here)
grid = {'learning_rate': [0.1, 0.16, 0.2],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9],
        'iterations': [800, 1000, 1500, 2000],
        'bagging_temperature': [1, 2, 3, 4, 5],
        'border_count': [128, 256, 512],
        'grow_policy': ['SymmetricTree', 'Depthwise'],
        'per_float_feature_quantization':[None, '3:border_count=1024']}
model = CatBoostClassifier(loss_function='MultiClass',
                           custom_metric='Accuracy',
                           eval_metric='TotalF1',
                           od_type='Iter',
                           od_wait=40,
                           task_type="GPU",
                           devices='0:1',
                           random_seed=42,
                           cat_features=cat_features)
randomized_search_result = model.randomized_search(grid,
                                                   X=X,
                                                   y=y
                                                   )
And I've got
CatBoostError: library/cpp/json/writer/json_value.cpp:499: Not a map
There is an error in one or more of the parameters of your grid. Commenting them out one-by-one should help you identify it.
As a side note, Optuna recently released support for CatBoost, in case you want to try that instead of a grid search. Optuna’s documentation of a CatBoost example.
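The "comment them out one by one" advice can be automated with a small helper. This is a generic sketch, not CatBoost-specific: suspect_keys and try_search are hypothetical names, and try_search stands for a callable you would write yourself (for instance one that runs the randomized search on the reduced grid with few iterations). The fake search below exists only to make the example runnable and is not a diagnosis of the error above.

```python
def suspect_keys(grid, try_search):
    """Return the keys whose removal makes try_search stop raising."""
    suspects = []
    for key in grid:
        # Same grid with exactly one parameter left out.
        reduced = {k: v for k, v in grid.items() if k != key}
        try:
            try_search(reduced)
        except Exception:
            continue  # still failing without this key, so it is not the sole cause
        suspects.append(key)
    return suspects


def fake_search(grid):
    # Stand-in for a real search call (assumption): fails on one made-up key.
    if "bad_param" in grid:
        raise ValueError("Not a map")


demo_grid = {"a": [1, 2], "b": [3], "bad_param": [0]}
found = suspect_keys(demo_grid, fake_search)
print(found)
```

If more than one parameter is at fault, removing a single key never succeeds and the list comes back empty; in that case the grid has to be built up from an empty dict instead.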

Dask equivalent of numpy (convolve + hstack)?

I currently have a function that computes a sliding sum across a 1-D numpy array (vector) using convolve and hstack. I would like to create an equivalent function using dask, but the various ways I've tried so far have not worked out.
What I'm trying to do is compute a "sliding sum" over n consecutive numbers of an array, unless any of those numbers is NaN, in which case the sum should also be NaN. The first (n - 1) elements of the result should also be NaN, since no wrap-around or reach-behind is assumed.
For example:
input vector: [3, 4, 6, 2, 1, 3, 5, np.nan, 8, 5, 6]
n: 3
result: [NaN, NaN, 13, 12, 9, 6, 9, NaN, NaN, NaN, 19]
or
input vector: [1, 5, 7, 2, 3, 4, 9, 6, 3, 8]
n: 4
result: [NaN, NaN, NaN, 15, 17, 16, 18, 22, 22, 26]
The function I currently have for this using numpy functions:
import numpy as np

def sum_to_scale(values, scale):
    # don't bother if the number of values to sum is 1 (will result in duplicate array)
    if scale == 1:
        return values
    # get the valid sliding summations with 1D convolution
    sliding_sums = np.convolve(values, np.ones(scale), mode="valid")
    # pad the first (n - 1) elements of the array with NaN values
    # (np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0)
    return np.hstack(([np.nan] * (scale - 1), sliding_sums))
How can I do the above using the dask array API (and/or dask_image.ndfilters) to achieve the same functionality?
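Before porting, it may help to pin down the expected behaviour with an independent NumPy reference that any dask version should reproduce. The sketch below does not answer the dask part of the question; it uses sliding_window_view (available in NumPy 1.20 and later) instead of convolve, and the function name sliding_sum_reference is made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sliding_sum_reference(values, scale):
    values = np.asarray(values, dtype=float)
    # One row per window of `scale` consecutive elements; a NaN inside a
    # window makes that window's sum NaN, which is the desired behaviour.
    window_sums = sliding_window_view(values, scale).sum(axis=1)
    # Pad the first (scale - 1) positions with NaN.
    return np.hstack(([np.nan] * (scale - 1), window_sums))


nan = np.nan
result_a = sliding_sum_reference([3, 4, 6, 2, 1, 3, 5, nan, 8, 5, 6], 3)
result_b = sliding_sum_reference([1, 5, 7, 2, 3, 4, 9, 6, 3, 8], 4)
print(result_a)
print(result_b)
```

Both calls reproduce the two expected results listed in the question, so the same inputs can serve as regression tests for a chunked implementation.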

Dividing elements of a ruby array into an exact number of (nearly) equal-sized sub-arrays [duplicate]

This question already has answers here:
How to chunk an array in Ruby
(2 answers)
Closed 4 years ago.
I need a way to split an array into an exact number of smaller arrays of roughly equal size. Does anyone have a method for doing this?
For instance
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
groups = a.method_i_need(3)
groups.inspect
=> [[1,2,3,4,5], [6,7,8,9], [10,11,12,13]]
Note that this is an entirely separate problem from dividing an array into chunks, because a.each_slice(3).to_a would produce 5 groups (not 3, like we desire) and the final group may be a completely different size than the others:
[[1,2,3], [4,5,6], [7,8,9], [10,11,12], [13]] # this is NOT desired here.
In this problem, the desired number of chunks is specified in advance, and the sizes of each chunk will differ by 1 at most.
You're looking for Enumerable#each_slice
a = [0, 1, 2, 3, 4, 5, 6, 7]
a.each_slice(3) # => #<Enumerator: [0, 1, 2, 3, 4, 5, 6, 7]:each_slice(3)>
a.each_slice(3).to_a # => [[0, 1, 2], [3, 4, 5], [6, 7]]
Perhaps I'm misreading the question since the other answer is already accepted, but it sounded like you wanted to split the array into 3 equal groups, regardless of the size of each group, rather than split it into N groups of 3 as the previous answers do. If that's what you're looking for, Rails (ActiveSupport) also has a method called in_groups:
a = [0,1,2,3,4,5,6]
a.in_groups(2) # => [[0,1,2,3],[4,5,6,nil]]
a.in_groups(3, false) # => [[0,1,2],[3,4], [5,6]]
I don't think there is a plain Ruby equivalent; however, you can get roughly the same results by adding this simple method:
class Array
  def in_groups(num_groups)
    return [] if num_groups == 0
    slice_size = (self.size/Float(num_groups)).ceil
    groups = self.each_slice(slice_size).to_a
  end
end
a.in_groups(3) # => [[0,1,2], [3,4,5], [6]]
The only difference (as you can see) is that this won't spread the "empty space" across all the groups; every group but the last is equal in size, and the last group always holds the remainder plus all the "empty space".
Update:
As @rimsky astutely pointed out, the above method will not always result in the correct number of groups (sometimes it will create multiple "empty groups" at the end, and leave them out). Here's an updated version, pared down from ActiveSupport's definition, which spreads the extras out to fill the requested number of groups.
def in_groups(number)
  group_size = size / number
  leftovers = size % number
  groups = []
  start = 0
  number.times do |index|
    length = group_size + (leftovers > 0 && leftovers > index ? 1 : 0)
    groups << slice(start, length)
    start += length
  end
  groups
end
Try
a.in_groups_of(3,false)
It will do your job
As mltsy wrote, in_groups(n, false) should do the job.
I just wanted to add a small trick to get the right balance
my_array.in_groups(my_array.size.quo(max_size).ceil, false).
Here is an example to illustrate that trick:
a = (0..8).to_a
a.in_groups(4, false) => [[0, 1, 2], [3, 4], [5, 6], [7, 8]]
a.in_groups(a.size.quo(4).ceil, false) => [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
This needs some better cleverness to smear out the extra pieces, but it's a reasonable start.
def i_need(bits, r)
  c = r.count
  # integer division gives the first (c % bits) groups one extra element
  (1..bits - 1).map { |i| r.shift((c + bits - i) / bits) } + [r]
end
> i_need(2, [1, 3, 5, 7, 2, 4, 6, 8])
=> [[1, 3, 5, 7], [2, 4, 6, 8]]
> i_need(3, [1, 3, 5, 7, 2, 4, 6, 8])
=> [[1, 3, 5], [7, 2, 4], [6, 8]]
> i_need(5, [1, 3, 5, 7, 2, 4, 6, 8])
=> [[1, 3], [5, 7], [2, 4], [6], [8]]

Calculate differences between array elements

Given a sorted array of n integers, like the following:
ary = [3, 5, 6, 9, 14]
I need to calculate the difference between each element and the next element in the array. Using the example above, I would end up with:
[2, 1, 3, 5]
The beginning array may have 0, 1 or many elements in it, and the numbers I'll be handling will be much larger (I'll be using epoch timestamps). I've tried the following:
times = @messages.map{|m| m.created_at.to_i}
left = times[1..times.length-1]
right = times[0..times.length-2]
differences = left.zip(right).map { |x| x[0]-x[1]}
But my solution above is neither optimal nor ideal. Can anyone give me a hand?
>> ary = [3, 5, 6, 9, 14] #=> [3, 5, 6, 9, 14]
>> ary.each_cons(2).map { |a,b| b-a } #=> [2, 1, 3, 5]
Edit:
Replaced inject with map.
Similar but more concise:
[3, 5, 6, 9, 14].each_cons(2).collect { |a,b| b-a }
An alternative:
a.map.with_index{ |v,i| (a[i+1] || 0) - v }[0..-2]
Does not work in Ruby 1.8 where map requires a block instead of returning an Enumerator.
