Memory_size and memory_counter in DeepQNetwork - machine-learning

What are memory_size and memory_counter in the DeepQNetwork class?
class DeepQNetwork:
    def __init__(
            self,
            n_actions,
            n_features,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,
            batch_size=32,
            e_greedy_increment=None,
            output_graph=True,
            memory_counter=48
    ):

memory_size is the stored memory of all experiences, and memory_counter is a random small batch of that memory that is used to learn. P.S.: look at line 144 of the code.
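Since the class body is not shown here, the following is only a sketch of how a replay memory with these two names is often arranged. It assumes that memory_size caps the number of stored transitions, that memory_counter counts how many transitions have been written so far, and that batch_size is the size of the random sample used to learn. The class ReplayMemory and its methods are hypothetical, and the code in the question may use memory_counter differently (the answer above reads it as the sampled batch).

```python
import random


class ReplayMemory:
    """Hypothetical fixed-size experience store (not the questioner's class)."""

    def __init__(self, memory_size=500, batch_size=32):
        self.memory_size = memory_size      # assumed: maximum number of stored transitions
        self.batch_size = batch_size        # assumed: size of the random sample used to learn
        self.memory = [None] * memory_size
        self.memory_counter = 0             # assumed: total transitions written so far

    def store_transition(self, state, action, reward, next_state):
        # Overwrite the oldest slot once the memory is full.
        index = self.memory_counter % self.memory_size
        self.memory[index] = (state, action, reward, next_state)
        self.memory_counter += 1

    def sample(self):
        # Only sample from slots that have actually been filled.
        filled = min(self.memory_counter, self.memory_size)
        k = min(self.batch_size, filled)
        return random.sample(self.memory[:filled], k)


mem = ReplayMemory(memory_size=5, batch_size=3)
for i in range(7):
    mem.store_transition(i, 0, 0.0, i + 1)
batch = mem.sample()
```

Under these assumptions the counter keeps growing past memory_size, while the stored list never holds more than memory_size transitions.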

Related

TFF: every client runs a pretrain function instead of build_federated_averaging_process

I would like every client to train its model with the pretrain function that I wrote below:
def pretrain(model):
    resnet_output = model.output
    layer1 = tf.keras.layers.GlobalAveragePooling2D()(resnet_output)
    layer2 = tf.keras.layers.Dense(units=zdim*2, activation='relu')(layer1)
    model_output = tf.keras.layers.Dense(units=zdim)(layer2)
    model = tf.keras.Model(model.input, model_output)
    iterations_per_epoch = determine_iterations_per_epoch()
    total_iterations = iterations_per_epoch*num_epochs
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)
    checkpoint = tf.train.Checkpoint(step=tf.Variable(1), optimizer=optimizer, net=model)
    manager = tf.train.CheckpointManager(checkpoint, pretrain_save_path, max_to_keep=10)
    current_epoch = tf.cast(tf.floor(optimizer.iterations/iterations_per_epoch), tf.int64)
    batch = client_data(0)
    batch = client_data(0).batch(2)
    epoch_loss = []
    for (image1, image2) in batch:
        loss, gradients = train_step(model, image1, image2)
        epoch_loss.append(loss)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        # if tf.reduce_all(tf.equal(epoch, current_epoch+1)):
        print("Loss after epoch {}: {}".format(current_epoch, sum(epoch_loss)/len(epoch_loss)))
        #print("Learning rate: {}".format(learning_rate(optimizer.iterations)))
        epoch_loss = []
        current_epoch += 1
        if current_epoch % 50 == 0:
            save_path = manager.save()
            print("Saved model for epoch {}: {}".format(current_epoch, save_path))
    save_path = manager.save()
    model.save("model.h5")
    model.save_weights("saved_weights.h5")
But as we know, TFF has a predefined function:
iterative_process = tff.learning.build_federated_averaging_process(...)
So how can I proceed? Thanks.
There are a few ways that one could proceed along similar lines.
First, it is important to note that TFF is functional. One can write to and read from files to manage state (TF allows this), but that is not part of the interface TFF exposes to users. Anything that manipulates state without passing it through function parameters and results should at best be considered an implementation detail, and it is something that TFF does not encourage.
By slightly refactoring your code above, however, I think this kind of application can fit quite nicely in TFF's programming model. We will want to define something like:
@tff.tf_computation
@tf.function
def pretrain_client_model(model, client_dataset):
    # perhaps do dataset processing you want...
    for batch in client_dataset:
        # do model training
        ...
    return model.weights()  # or some tensor structure representing the trained model weights
Once your implementation looks something like this, you will be able to wire it in to a custom iterative process. The canned function you mention (build_federated_averaging_process) really just constructs an instance of tff.templates.IterativeProcess; you are always, however, free to write your own instance of this class.
Several tutorials take us through this process, this probably being the simplest. For a finished code example of a standalone iterative process implementation, see simple_fedavg.py.
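To make the shape of such a process concrete, here is a minimal sketch in plain Python. It deliberately does not use the TFF API: it only mirrors the initialize/next structure described above, with a toy one-feature linear model standing in for the real network. The names initialize, next_round, and the toy client datasets are assumptions made for illustration; pretrain_client_model here is a stand-in for the refactored function, not the real one.

```python
def initialize():
    """Return the initial server state: [slope, intercept] of a toy linear model."""
    return [0.0, 0.0]


def pretrain_client_model(weights, client_dataset, learning_rate=0.1):
    """Run one pass of plain SGD over a client's (x, y) pairs, starting from the server weights."""
    w, b = weights
    for x, y in client_dataset:
        error = (w * x + b) - y
        w -= learning_rate * error * x
        b -= learning_rate * error
    return [w, b]


def next_round(server_weights, client_datasets):
    """One round: every client trains locally, then the server averages the results."""
    client_weights = [
        pretrain_client_model(server_weights, dataset) for dataset in client_datasets
    ]
    n_clients = len(client_weights)
    return [
        sum(ws[i] for ws in client_weights) / n_clients
        for i in range(len(server_weights))
    ]


# Two hypothetical clients whose data all lie on the line y = 2x + 1.
client_datasets = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (-1.0, -1.0)],
]

state = initialize()
for _ in range(300):
    state = next_round(state, client_datasets)
print(state)  # approaches slope 2, intercept 1
```

All state travels through function arguments and return values, which is the functional style the answer recommends; nothing is written to or read from files between rounds.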

Can I get extra information to a custom scorer function in sklearn?

I am performing a classification task which is essentially doing algorithm configuration, i.e. trying to pick a configuration (or 'mode') which is likely to make the problem-solving algorithm finish in the quickest time.
I am learning to classify the "best" configuration based on features of problem instances. I see that scikit-learn enables you to create your own scoring function to use in tuning the models. However the score_func only takes the true label and the predicted label as input.
Is it possible to identify which row in the dataset a prediction came from (when passing to this custom scorer)? That way I could figure out the performance hit of a predicted ("wrong") config and score the model accordingly. Basically sometimes a "wrong" selection can still be very good and close to the best, but a naive classification has no way of knowing this when the classification labels are purely based on the best config.
Here's a contrived example to illustrate what I'm trying to do:
import random as rnd
import pandas as pd
rnd.seed('hello')
probs = [f'instance_{i}' for i in range(6)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_alltimes = pd.DataFrame(times, columns=('problem', 'config', 'time'))
print(df_alltimes)
bestrows = df_alltimes.groupby(['problem'])['time'].idxmin()
dataset = df_alltimes.loc[bestrows,['config']].\
    rename(columns={'config':'best_config'})
feats = [[rnd.random() for p in range(len(probs))] for f in range(5) ]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
print(dataset)
df_alltimes:
       problem      config       time
0   instance_0    analytic  15.307044
1   instance_0  bruteforce  36.742846
2   instance_0      hybrid  35.053416
3   instance_1    analytic  57.781358
4   instance_1  bruteforce  31.723275
5   instance_1      hybrid   8.080238
6   instance_2    analytic   4.211297
7   instance_2  bruteforce  24.034830
8   instance_2      hybrid  39.073023
9   instance_3    analytic  36.325485
10  instance_3  bruteforce  14.717841
11  instance_3      hybrid  57.103908
12  instance_4    analytic   7.358539
13  instance_4  bruteforce  10.805536
14  instance_4      hybrid   2.605044
15  instance_5    analytic   0.489870
16  instance_5  bruteforce  42.888858
17  instance_5      hybrid  58.634073
dataset:
    best_config  feature_0  feature_1  feature_2  feature_3  feature_4
0      analytic   0.645388   0.641626   0.975619   0.680713   0.209235
5        hybrid   0.993443   0.221038   0.893763   0.408532   0.254791
6      analytic   0.263872   0.142887   0.264538   0.166985   0.800054
10   bruteforce   0.155023   0.601300   0.258767   0.614732   0.850529
14       hybrid   0.766183   0.993692   0.597047   0.401482   0.275133
15     analytic   0.386327   0.065699   0.349115   0.370136   0.357329
I am using sklearn with the dataset where the X would be the feature columns and the y would be the best_config column. In this example, the "bad" choices for instance_0 are both almost equally bad, but for instance_1, the two wrong choices are not equally bad. So I'd like my custom scorer to be able to reflect this somehow. Is that possible?
In the end I did find a way to get the information I was after in the original question. If you're passing a pandas.Series as your target labels, the index attribute is available, so you can look up whatever you want in the full dataset.
In the solution below, the first part is pretty much the same as the original minimal working example - i.e. generating a fake dataset.
In the second part, a custom scorer function is defined, which is then passed to the cross-validating hyperparameter tuner, RandomizedSearchCV. Please bear in mind the data is garbage, so the "results" are meaningless; this is just a demo of how to refer back to a fuller set of results so that you can evaluate the quality of predictions made during hyperparameter tuning based on more specialised information rather than just "match / fail" when doing a classification.
import numpy as np
import pandas as pd
import random as rnd

INSTANCES = 200
FEATURES = 5
HP_ITER = 10
SEED = 1984

# invent timings for some problems run with different configurations
rnd.seed(SEED)
probs = [f'p_{i:03d}' for i in range(INSTANCES)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_times = pd.DataFrame(times, columns=('problem', 'config', 'time'))

# pick out the fastest config for each problem
bestrows = df_times.groupby(['problem'])['time'].idxmin()
dataset = df_times.loc[bestrows,['config','problem']]\
    .rename(columns={'config':'target'})\
    .reset_index(drop=True)

# invent some features for each problem
feats = [[rnd.random() for _ in probs] for f in range(FEATURES) ]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# split our data into training and test sets
df_trn = dataset.sample(frac=0.8, replace=False, random_state=SEED)
df_tst = dataset.loc[~dataset.index.isin(df_trn.index)]

def _vb_loss(xvals, yvals, validation=False):
    """A custom scorer for cross-validation which uses distance to Virtual Best"""
    # xvals holds the true labels (a pandas Series, so it carries the row
    # index); yvals holds the predicted labels
    # use the .index attribute to access the relevant rows in the
    # timing data frame
    source = df_tst if validation else df_trn
    data = source.loc[xvals.index].reindex(columns=['problem','target'])
    data['truevals'] = xvals
    data['predvals'] = yvals
    # what's the best time available for each problem?
    data = data.merge(
        df_times, left_on=['problem','truevals'], right_on=['problem', 'config']
    ).rename(columns={'time' : 'best_time'}).drop(columns=['config'])
    # what's the time for our predicted choices?
    data = data.merge(
        df_times, left_on=['problem','predvals'], right_on=['problem','config']
    ).rename(columns={'time' : 'pred_time'}).drop(columns=['config'])
    # how far away were the predictions in total?
    residual_seconds = np.sum( data['pred_time'] - data['best_time'] )
    return residual_seconds

def fitAndPredict(use_custom_scorer=False):
    """Fit a model and make some predictions """
    our_scorer = make_scorer(_vb_loss, greater_is_better=False)
    hyperparameters = {'criterion' : ['gini', 'entropy'],
                       'n_estimators' : list(range(50,250)),
                       'max_depth' : list(range(2,32))
                       }
    model = RandomizedSearchCV(
        RandomForestClassifier(random_state=SEED),
        hyperparameters,
        n_iter = HP_ITER,
        scoring = our_scorer if use_custom_scorer else None,
        verbose = 1,
        random_state = SEED,
    )
    model.fit(
        df_trn.drop(columns=['target','problem']),
        df_trn['target']
    )
    preds = model.predict(df_tst.drop(columns=['target','problem']))
    return _vb_loss(df_tst['target'], preds, validation=True)

print("Timings for all configs:", df_times, "", sep="\n")
print("Labelled dataset:", dataset, "", sep="\n")
print("Test loss with default CV scorer :", fitAndPredict(False))
print("Test loss with custom CV scorer :", fitAndPredict(True))
Here's the output:
** Timings for all configs **
     problem      config       time
0      p_000    analytic  21.811701
1      p_000  bruteforce  29.652341
2      p_000      hybrid  20.376605
3      p_001    analytic  12.989269
4      p_001  bruteforce  51.759137
..       ...         ...        ...
595    p_198  bruteforce  10.874092
596    p_198      hybrid  14.723661
597    p_199    analytic  24.984775
598    p_199  bruteforce   4.899111
599    p_199      hybrid  36.188729
[600 rows x 3 columns]
** Labelled dataset **
         target  problem  feature_0  feature_1  feature_2  feature_3  feature_4
0        hybrid    p_000   0.864952   0.487293   0.946654   0.863503   0.310866
1      analytic    p_001   0.514093   0.007643   0.948784   0.582419   0.258159
2    bruteforce    p_002   0.319059   0.872320   0.321495   0.807644   0.158471
3      analytic    p_003   0.421063   0.955742   0.114808   0.980013   0.900057
4        hybrid    p_004   0.325935   0.125824   0.697967   0.037196   0.923626
..          ...      ...        ...        ...        ...        ...        ...
195      hybrid    p_195   0.179126   0.578338   0.391535   0.632501   0.442677
196  bruteforce    p_196   0.827637   0.641567   0.710201   0.833341   0.215357
197      hybrid    p_197   0.116661   0.480170   0.253893   0.623913   0.465419
198  bruteforce    p_198   0.670555   0.037084   0.954332   0.408546   0.935973
199  bruteforce    p_199   0.371541   0.463060   0.549176   0.581093   0.391114
[200 rows x 7 columns]
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 8.8s finished
Test loss with default CV scorer : 542.5191014477357
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done 50 out of 50 | elapsed: 9.1s finished
Test loss with custom CV scorer : 522.3236277796698
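The quantity the custom scorer sums up, the gap between the predicted configuration's time and the best available time, can be illustrated in isolation. The sketch below uses the timings for instance_0 and instance_1 from the small table in the question; the helper names regret and total_regret are made up for this illustration and are not part of scikit-learn.

```python
# Timings copied from the question's df_alltimes table (two problems only).
times = {
    ("instance_0", "analytic"): 15.307044,
    ("instance_0", "bruteforce"): 36.742846,
    ("instance_0", "hybrid"): 35.053416,
    ("instance_1", "analytic"): 57.781358,
    ("instance_1", "bruteforce"): 31.723275,
    ("instance_1", "hybrid"): 8.080238,
}


def regret(problem, predicted_config):
    """Seconds lost by running predicted_config instead of the fastest config."""
    best_time = min(t for (p, _), t in times.items() if p == problem)
    return times[(problem, predicted_config)] - best_time


def total_regret(problems, predictions):
    """Sum of per-problem regrets: lower is better, 0 means every pick was optimal."""
    return sum(regret(p, c) for p, c in zip(problems, predictions))


# For instance_1 the two wrong choices are far from equally bad:
print(regret("instance_1", "bruteforce"))  # about 23.6 seconds slower than hybrid
print(regret("instance_1", "analytic"))    # about 49.7 seconds slower than hybrid
```

A plain accuracy score would count both wrong predictions for instance_1 as the same single miss, whereas the regret separates them by roughly 26 seconds.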

Issue with multilabel classification

I followed this tutorial: https://medium.com/@vijayabhaskar96/multi-label-image-classification-tutorial-with-keras-imagedatagenerator-cd541f8eaf24
and wrote some of my code for multilabel classification. I had it working with one-hot encoding on a small scale but I had to move to option 2 mentioned in the article because I have 6000 classes and therefore one hot was not viable. I managed to train the network and it said 99% accuracy and 83% f1 score. However, when I'm trying to test the network, for every image it's outputting some combination of only 3 labels when there are 6000 possible labels. I wondered if maybe the code to test the model was incorrect. I tried using the code mentioned in the post and it doesn't work:
test_generator.reset()
pred = model.predict_generator(test_generator, steps=STEP_SIZE_TEST, verbose=1);
pred_bool = (pred > 0.5)
This fails with the error:
TypeError: unorderable types: list() > float()
I've tried hard to fix this and not figured it out and I can't find any examples online of anyone doing something similar. Does anyone have an idea of how to get this prediction part working using this code block (I had it with another 2 options and was getting that issue printing one or several labels) or why the model might be failing in training with this behavior?
EDIT: for more context on the training issue, here is all the training code:
import json
input_file = open ('class_names_6000.json')
json_array = json.load(input_file)
#print(str(json_array))
args = parser.parse_args()
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
print('Loading Data...')
df = pd.read_csv('dataset_train.csv')
df["labels"]=df["labels"].apply(lambda x:x.split(","))
datagen=ImageDataGenerator(rescale=1./255.)
test_datagen=ImageDataGenerator(rescale=1./255.)
train_generator=datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100,100))
df = pd.read_csv('dataset_test.csv')
df["labels"]=df["labels"].apply(lambda x:x.split(","))
test_generator=test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100,100))
df = pd.read_csv('dataset_validation.csv')
df["labels"]=df["labels"].apply(lambda x:x.split(","))
valid_generator=test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100,100))
print('Data Loaded.')
f1_score_callback = ComputeF1()
model = build_model('train', numclasses=len(json_array), model_name = args.model)
ImageFile.LOAD_TRUNCATED_IMAGES = True
Also, an important detail: when training, it reports an accuracy of 99% and an f1 score of 84%, with a validation f1 score of 84% as well.

Sklearn NotFittedError for CountVectorizer in pipeline

I am trying to learn how to work with text data through sklearn and am running into an issue that I cannot solve.
The tutorial I'm following is: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The input is a pandas df with two columns. One with text, one with a binary class.
Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = traindf['text']
y_train = traindf['class']
y_test = testdf['class']
# CV
count_vect = CountVectorizer(stop_words='english')
x_train_modified = count_vect.fit_transform(x_train)
x_test_modified = count_vect.transform(x_test)
# TF-IDF
idf = TfidfTransformer()
fit = idf.fit(x_train_modified)
x_train_mod2 = fit.transform(x_train_modified)
# MNB
mnb = MultinomialNB()
x_train_data = mnb.fit(x_train_mod2, y_train)
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
predicted = text_clf.predict(x_test_modified)
When I try to run the last line:
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-64-8815003b4713> in <module>()
----> 1 predicted = text_clf.predict(x_test_modified)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
113
114 # lambda, but not partial, allows help() to work with update_wrapper
--> 115 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
116 # update the docstring of the returned function
117 update_wrapper(out, self.fn)
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X)
304 for name, transform in self.steps[:-1]:
305 if transform is not None:
--> 306 Xt = transform.transform(Xt)
307 return self.steps[-1][-1].predict(Xt)
308
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
918 self._validate_vocabulary()
919
--> 920 self._check_vocabulary()
921
922 # use the same matrix-building strategy as fit_transform
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
304
305 if len(self.vocabulary_) == 0:
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Any suggestions on how to fix this error? I am properly transforming the CV model on the test data. I even checked if the vocabulary list was empty and it isn't (count_vect.vocabulary_)
Thank you!
There are several issues with your question.
For starters, you don't actually fit the pipeline, hence the error. Looking more closely in the linked tutorial, you'll see that there is a step text_clf.fit (where text_clf is indeed the pipeline).
Second, you don't use the notion of the pipeline correctly, which is exactly to fit the whole thing end to end; instead, you fit its individual components one by one. If you check the tutorial again, you'll see that the code for the pipeline fit:
text_clf.fit(twenty_train.data, twenty_train.target)
uses the data in their initial form, not their intermediate transformations, as you do; the point of the tutorial is to demonstrate how the individual transformations can be wrapped-up in (and replaced by) a pipeline, not to use the pipeline on top of these transformations...
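Third, you should avoid naming a variable fit: it is not a reserved keyword in Python, but it is the name of the method that every scikit-learn estimator exposes, so a variable called fit is easy to confuse with it. Similarly, we don't use CV to abbreviate Count Vectorizer (in ML lingo, CV stands for cross validation).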
That said, here is the correct way for using your pipeline:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = testdf['text']  # take the test text from testdf, not traindf
y_train = traindf['class']
y_test = testdf['class']

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
# fit the whole pipeline on the raw text, then predict on raw text
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
As you can see, the purpose of the pipelines is to make things simpler (compared to using the components one by one sequentially), not to complicate them further...

How to create feature columns for TensorFlow classifier

I have a very simple dataset for binary classification in csv file which looks like this:
"feature1","feature2","label"
1,0,1
0,1,0
...
where the "label" column indicates the class (1 is positive, 0 is negative). The number of features is actually pretty big, but that doesn't matter for this question.
Here is how I read the data:
train = pandas.read_csv(TRAINING_FILE)
y_train, X_train = train['label'], train[['feature1', 'feature2']].fillna(0)
test = pandas.read_csv(TEST_FILE)
y_test, X_test = test['label'], test[['feature1', 'feature2']].fillna(0)
I want to run tensorflow.contrib.learn.LinearClassifier and tensorflow.contrib.learn.DNNClassifier on that data. For instance, I initialize DNN like this:
classifier = DNNClassifier(hidden_units=[3, 5, 3],
                           n_classes=2,
                           feature_columns=feature_columns, # ???
                           activation_fn=nn.relu,
                           enable_centered_bias=False,
                           model_dir=MODEL_DIR_DNN)
So how exactly should I create the feature_columns when all the features are also binary (0 or 1 are the only possible values)?
Here is the model training:
classifier.fit(X_train.values,
               y_train.values,
               batch_size=dnn_batch_size,
               steps=dnn_steps)
A solution that replaces the fit() parameters with an input function would also be great.
Thanks!
P.S. I'm using TensorFlow version 1.0.1
You can directly use tf.feature_column.numeric_column:
feature_columns = [tf.feature_column.numeric_column(key = key) for key in X_train.columns]
I've just found the solution and it's pretty simple:
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
Apparently infer_real_valued_columns_from_input() works well with categorical variables.
