My Double DQN algorithm for the 2048 game never learns

I am trying to get a Double DQN algorithm to learn to play the 2048 game. My implementation is available on GitHub if you want to check the code (https://github.com/codetiger/MachineLearning-2048).
My code stops learning after a basic level: it is not able to achieve more than the 256 tile. Some of my guesses are below.
I am using a random player to train the code. I guess RL algorithms learn this way: they try all possible moves and learn from failures. My wild guess is that, since I am training it using random moves, the code learns very little.
The maximum number of episodes I tried is 4000. How do I calculate the optimal number of episodes?
There is a problem with my code.
I am not able to identify the issue with my approach and would like to get some views on this.
My pseudocode is here.
    for e in range(EPISODES):
        gameEnv.Reset()
        state = gameEnv.GetFlatGrid()
        state = np.reshape(state, [1, state_size])
        reward = 0.0
        prevMaxNumber = 0
        while True:
            action = agent.get_action(state)
            (moveScore, isValid) = gameEnv.Move(action + 1)
            next_state = gameEnv.GetFlatGrid()
            next_state = np.reshape(next_state, [1, state_size])
            if isValid:
                # Reward for step score
                reward += moveScore
                # Reward for New Max Number
                if gameEnv.GetMaxNumber() > prevMaxNumber:
                    reward += 10.0
                    prevMaxNumber = gameEnv.GetMaxNumber()
                gameEnv.AddNewNumber()
            else:
                reward = -50.0
            done = gameEnv.CheckGameOver()
            if done:
                reward = -100.0
            agent.append_sample(state, action, reward, next_state, done)
            agent.train_model()
            state = next_state
            if done:
                agent.update_target_model()

My two cents,
RL algorithms don't learn randomly. I suggest you take a look at 'Sutton and Barto (Second Edition)' for a detailed description of the wide variety of algorithms. Having said that I don't think the git code that you linked does what you expect (Why do you have an ES module? Are you training the network using evolutionary algorithms?). You might want to start off with simpler and stable implementations like this https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html.
2048 is probably a difficult game for a simple Q-network to learn because it requires long-term planning. It's much easier for DQN to learn to play control/instant-action games like Pong or Breakout, but it doesn't do well on games that need some amount of planning (Pacman, for example).

How to implement randomised log space search of learning rate in PyTorch?

I am looking to fine-tune a GNN and my supervisor suggested exploring different learning rates. I came across this tutorial video where the presenter mentions that a randomised log-space search of hyperparameters is typically done in practice. For the sake of keeping the tutorial introductory, this was not covered.
Any help or pointers on how to achieve this in PyTorch is greatly appreciated. Thank you!
Setting the scale in logarithmic terms lets you sample more of the desirable learning-rate values, which are usually lower than 0.1.
Imagine you want to take learning rate values between 0.1 (1e-1) and 0.0001 (1e-4). Then you can set this lower and upper bound on a logarithmic scale by applying a base-10 logarithm to them: log10(0.1) = -1 and log10(0.0001) = -4. Andrew Ng provides a clearer explanation in this video.
In Python you can use np.random.uniform() for this:
searchable_learning_rates = [10**(-4 * np.random.uniform(0.5, 1)) for _ in range(10)]
searchable_learning_rates
>>>
[0.004890650359810075,
0.007894672127828331,
0.008698831627963768,
0.00022779163472045743,
0.0012046829055603172,
0.00071395500159473,
0.005690032483124896,
0.000343368839731761,
0.0002819402550629178,
0.0006399571804618883]
As you can see, you're able to try learning rate values from 0.00022779163472045743 up to 0.008698831627963768, which is close to the upper bound of this snippet (it draws the exponent from [-4, -2], so the values fall between 1e-4 and 1e-2). The longer the array, the more values you will try.
Following the example code in the video you provided, you can implement the randomized log search for the learning rate by replacing learning_rates with searchable_learning_rates:
    for batch_size in batch_sizes:
        for learning_rate in searchable_learning_rates:
            ...
        ...
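To make the bounds explicit, the same idea can be wrapped in a small helper. This is only a sketch: the name sample_log_uniform and its parameters are made up for illustration, and it uses the standard library's random module instead of np.random.uniform so that it runs without extra dependencies.

```python
import math
import random


def sample_log_uniform(low, high, n, seed=None):
    """Draw n values whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    rng = random.Random(seed)
    low_exp, high_exp = math.log10(low), math.log10(high)
    return [10 ** rng.uniform(low_exp, high_exp) for _ in range(n)]


# Ten candidate learning rates between 1e-4 and 1e-1.
searchable_learning_rates = sample_log_uniform(1e-4, 1e-1, n=10, seed=0)
```

Each decade (1e-4 to 1e-3, 1e-3 to 1e-2, 1e-2 to 1e-1) is equally likely to be sampled, which is the point of searching in log space. The resulting list can be passed to the training loop in place of a hand-written list of learning rates.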

Best strategy to reduce false positives: Google's new Object Detection API on Satellite Imagery

I'm setting up the new Tensorflow Object Detection API to find small objects in large areas of satellite imagery. It works quite well - it finds all 10 objects I want, but I also get 50-100 false positives [things that look a little like the target object, but aren't].
I'm using the sample config from the 'pets' tutorial, to fine-tune the faster_rcnn_resnet101_coco model they offer. I've started small, with only 100 training examples of my objects (just 1 class). 50 examples in my validation set. Each example is a 200x200 pixel image with a labeled object (~40x40) in the center. I train until my precision & loss curves plateau.
I'm relatively new to using deep learning for object detection. What is the best strategy to increase my precision? e.g. Hard-negative mining? Increase my training dataset size? I've yet to try the most accurate model they offer, faster_rcnn_inception_resnet_v2_atrous_coco, as I'd like to maintain some speed, but will do so if needed.
Hard-negative mining seems to be a logical step. If you agree, how do I implement it w.r.t setting up the tfrecord file for my training dataset? Let's say I make 200x200 images for each of the 50-100 false positives:
Do I create 'annotation' xml files for each, with no 'object' element?
...or do I label these hard negatives as a second class?
If I then have 100 negatives to 100 positives in my training set - is that a healthy ratio? How many negatives can I include?
I've revisited this topic recently in my work and thought I'd update with my current learnings for any who visit in the future.
The topic appeared on Tensorflow's Models repo issue tracker. SSD allows you to set the ratio of how many negative:positive examples to mine (max_negatives_per_positive: 3), but you can also set a minimum number for images with no positives (min_negatives_per_image: 3). Both of these are defined in the model-ssd-loss config section.
That said, I don't see the same option in Faster-RCNN's model configuration. It's mentioned in the issue that models/research/object_detection/core/balanced_positive_negative_sampler.py contains the code used for Faster-RCNN.
One other option discussed in the issue is creating a second class specifically for lookalikes. During training, the model will attempt to learn class differences which should help serve your purpose.
Lastly, I came across this article on Filter Amplifier Networks (FAN) that may be informative for your work on aerial imagery.
===================================================================
The following paper describes hard negative mining for the same purpose you describe:
Training Region-based Object Detectors with Online Hard Example Mining
In section 3.1 they describe using a foreground and background class:
Background RoIs. A region is labeled background (bg) if its maximum
IoU with ground truth is in the interval [bg_lo, 0.5). A lower
threshold of bg_lo = 0.1 is used by both FRCN and SPPnet, and is
hypothesized in [14] to crudely approximate hard negative mining; the
assumption is that regions with some overlap with the ground truth are
more likely to be the confusing or hard ones. We show in Section 5.4
that although this heuristic helps convergence and detection accuracy,
it is suboptimal because it ignores some infrequent, but important,
difficult background regions. Our method removes the bg_lo threshold.
In fact this paper is referenced and its ideas are used in Tensorflow's object detection losses.py code for hard mining:
    class HardExampleMiner(object):
        """Hard example mining for regions in a list of images.
        Implements hard example mining to select a subset of regions to be
        back-propagated. For each image, selects the regions with highest losses,
        subject to the condition that a newly selected region cannot have
        an IOU > iou_threshold with any of the previously selected regions.
        This can be achieved by re-using a greedy non-maximum suppression algorithm.
        A constraint on the number of negatives mined per positive region can also be
        enforced.
        Reference papers: "Training Region-based Object Detectors with Online
        Hard Example Mining" (CVPR 2016) by Srivastava et al., and
        "SSD: Single Shot MultiBox Detector" (ECCV 2016) by Liu et al.
        """
Based on your model config file, the HardMinerObject is returned by losses_builder.py in this bit of code:
    def build_hard_example_miner(config,
                                 classification_weight,
                                 localization_weight):
        """Builds hard example miner based on the config.
        Args:
            config: A losses_pb2.HardExampleMiner object.
            classification_weight: Classification loss weight.
            localization_weight: Localization loss weight.
        Returns:
            Hard example miner.
        """
        loss_type = None
        if config.loss_type == losses_pb2.HardExampleMiner.BOTH:
            loss_type = 'both'
        if config.loss_type == losses_pb2.HardExampleMiner.CLASSIFICATION:
            loss_type = 'cls'
        if config.loss_type == losses_pb2.HardExampleMiner.LOCALIZATION:
            loss_type = 'loc'
        max_negatives_per_positive = None
        num_hard_examples = None
        if config.max_negatives_per_positive > 0:
            max_negatives_per_positive = config.max_negatives_per_positive
        if config.num_hard_examples > 0:
            num_hard_examples = config.num_hard_examples
        hard_example_miner = losses.HardExampleMiner(
            num_hard_examples=num_hard_examples,
            iou_threshold=config.iou_threshold,
            loss_type=loss_type,
            cls_loss_weight=classification_weight,
            loc_loss_weight=localization_weight,
            max_negatives_per_positive=max_negatives_per_positive,
            min_negatives_per_image=config.min_negatives_per_image)
        return hard_example_miner
which is returned by model_builder.py and called by train.py. So basically, it seems to me that simply generating your true positive labels (with a tool like LabelImg or RectLabel) should be enough for the train algorithm to find hard negatives within the same images. The related question gives an excellent walkthrough.
In the event you want to feed in data that has no true positives (i.e. nothing should be classified in the image), just add the negative image to your tfrecord with no bounding boxes.
I think I went through the same or a very similar scenario, and it's worth sharing with you.
I managed to solve it by passing images without annotations to the trainer.
In my scenario I'm building a project to detect assembly failures in my client's products in real time.
I achieved very robust results (for a production environment) by using detection plus classification for components that have an explicit negative pattern (e.g. a screw that is either present or missing, leaving just the hole) and detection only for things that don't have a negative pattern (e.g. a tape that can be placed anywhere).
In the system it's mandatory that the user records 2 videos, one containing the positive scenario and another containing the negative (or n videos, containing n patterns of positive and negative, so the algorithm can generalize).
After a while testing, I found out that if I registered only tape to be detected, the detector gave very confident (0.999) false positive detections of tape. It was learning the pattern where the tape was inserted instead of the tape itself. When I had another component (like a screw in its negative form), I was passing the negative pattern of tape without being explicitly aware of it, so the FPs didn't happen.
So I found out that, in this scenario, I necessarily had to pass images without tape so the detector could differentiate between tape and no-tape.
I considered two alternatives to experiment with and try to solve this behavior:
Train passing a considerable amount of images that don't have any annotation (10% of all my negative samples) along with all the images that have real annotations.
On the images that don't have annotations, create a dummy annotation with a dummy label so I could force the detector to train with that image (thus learning the no-tape pattern). Later on, when getting the dummy predictions, just ignore them.
I concluded that both alternatives worked perfectly in my scenario.
The training loss got a little messy, but the predictions are robust for my very controlled scenario (the system's camera has its own box and illumination to decrease variables).
I had to make two small modifications for the first alternative to work:
For all images that didn't have any annotation, I passed a dummy annotation (class=None, xmin/ymin/xmax/ymax=-1).
When generating the tfrecord files, I use this information (xmin == -1, in this case) to add an empty list for the sample:
    import io
    import os

    import pandas as pd
    import tensorflow as tf
    from PIL import Image
    from object_detection.utils import dataset_util


    def create_tf_example(group, path, label_map):
        with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
            encoded_jpg = fid.read()
        encoded_jpg_io = io.BytesIO(encoded_jpg)
        image = Image.open(encoded_jpg_io)
        width, height = image.size
        filename = group.filename.encode('utf8')
        image_format = b'jpg'
        xmins = []
        xmaxs = []
        ymins = []
        ymaxs = []
        classes_text = []
        classes = []
        for index, row in group.object.iterrows():
            # Skip missing rows and the dummy annotation (xmin == -1), so that
            # negative images end up with empty box and class lists.
            if not pd.isnull(row.xmin):
                if not row.xmin == -1:
                    xmins.append(row['xmin'] / width)
                    xmaxs.append(row['xmax'] / width)
                    ymins.append(row['ymin'] / height)
                    ymaxs.append(row['ymax'] / height)
                    classes_text.append(row['class'].encode('utf8'))
                    classes.append(label_map[row['class']])
        tf_example = tf.train.Example(features=tf.train.Features(feature={
            'image/height': dataset_util.int64_feature(height),
            'image/width': dataset_util.int64_feature(width),
            'image/filename': dataset_util.bytes_feature(filename),
            'image/source_id': dataset_util.bytes_feature(filename),
            'image/encoded': dataset_util.bytes_feature(encoded_jpg),
            'image/format': dataset_util.bytes_feature(image_format),
            'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
            'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
            'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
            'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
            'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
            'image/object/class/label': dataset_util.int64_list_feature(classes),
        }))
        return tf_example
Part of the training progress:
Currently I'm using tensorflow object detection along with tensorflow==1.15, using faster_rcnn_resnet101_coco.config.
I hope it solves someone's problem, as I didn't find any solution on the internet. I read a lot of people saying that faster_rcnn is not suited to negative training for FP reduction, but my tests proved the opposite.

Choose closest point to origin with reinforcement learning

I am attempting to use reinforcement learning to choose the closest point to the origin out of a given set of points repeatedly, until a complex (and irrelevant) end condition is reached. (This is a simplification of my main problem.)
A 2D array containing possible points is passed to the reinforcement learning algorithm, which makes a choice as to which point it thinks is the most ideal.
A [1, 10]
B [100, 0]
C [30, 30]
D [5, 7]
E [20, 50]
In this case, D would be the true best choice. (The algorithm should ideally output 3, from the range 0 to 4.)
However, whenever I train the algorithm, it seems to not learn what the "concept" is, but instead just that choosing, say, C is usually the best choice, so it should always choose that.
    import numpy as np
    import rl.core as krl

    class FindOriginEnv(krl.Env):
        def observe(self):
            return np.array([
                [np.random.randint(100), np.random.randint(100)] for _ in range(5)
            ])

        def step(self, action):
            observation = self.observe()
            done = np.random.rand() < 0.01 # eventually
            reward = 1 if done else 0
            return observation, reward, done, {}

        # ...
What should I modify about my algorithm such that it will actually learn about the goal it is trying to accomplish?
Observation shape?
Reward function?
Action choices?
Keras code would be appreciated, but is not required; a purely algorithmic explanation would also be extremely helpful.
Sketching out the MDP from your description, there are a few issues:
Your observation function appears to be returning 5 points, so that means a state can be any configuration of 10 integers in [0,99]. That's 100^10 possible states! Your state space needs to be much smaller. As written, observe appears to be generating possible actions, not state observations.
You suggest that you are picking actions from [0,4], where each action is essentially an index into an array of points available to the agent. This definition of the action space doesn't give the agent enough information to discriminate what you say you'd like it to (smaller magnitude point is better), because you only act based on the point's index! If you wanted to tweak the formulation a bit to make this work, you would define an action to be selecting a 2D point with each dimension in [0,99]. This would mean you would have 100^2 total possible actions, but to maintain the multiple choice aspect, you would restrict the agent to selecting amongst a subset at a given step (5 possible actions) based on its current state.
Finally, the reward function that gives zero reward until termination means that you're allowing a large number of possible optimal policies. Essentially, any policy that terminates, regardless of how long the episode took, is optimal! If you want to encourage policies that terminate quickly, you should penalize the agent with a small negative reward at each step.
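As a rough sketch of a reward that depends on the chosen point and also charges a small cost per step: the distance term and the constants below are assumptions added for illustration (the answer itself only proposes the per-step penalty), and step_reward is a hypothetical helper, not part of keras-rl.

```python
import math

STEP_PENALTY = -0.01  # assumed value; any small negative constant plays the same role


def step_reward(points, action, scale=100.0):
    """Reward for selecting points[action]: the step penalty minus a scaled distance to the origin."""
    x, y = points[action]
    return STEP_PENALTY - math.hypot(x, y) / scale


# The five candidate points A..E from the question.
points = [(1, 10), (100, 0), (30, 30), (5, 7), (20, 50)]
rewards = [step_reward(points, action) for action in range(len(points))]
best_action = max(range(len(points)), key=rewards.__getitem__)
```

With the points from the question, the highest reward goes to index 3 (point D), so the reward signal now carries the information the agent is supposed to learn instead of being identical for every choice.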

What's the best objective function for the CartPole task?

I'm doing policy gradient and I'm trying to figure out what the best objective function is for the task. The task is the OpenAI CartPole-v0 environment, in which the agent receives a reward of 1 for each timestep it survives and a reward of 0 upon termination. I'm trying to figure out which is the best way to model the objective function. I've come up with 3 possible functions:
    def total_reward_objective_function(self, episode_data):
        return sum([timestep_data['reward'] for timestep_data in episode_data])

    def average_reward_objective_function(self, episode_data):
        return self.total_reward_objective_function(episode_data) / len(episode_data)

    def sum_of_discounted_rewards_objective_function(self, episode_data, discount_rate=0.7):
        return sum([timestep_data['reward'] * pow(discount_rate, timestep)
                    for timestep, timestep_data in enumerate(episode_data)])
Note that the average reward objective function will always return 1 unless I intervene and modify the reward function to return a negative value upon termination. The reason I'm asking rather than just running a few experiments is that there are errors elsewhere. So if someone could point me towards good practice in this area, I could focus on the more significant mistakes in the algorithm.
You should use the last one (sum of discounted rewards), since the cart-pole problem is an infinite horizon MDP (you want to balance the pole as long as you can). The answer to this question explains why you should use a discount factor in infinite horizon MDPs.
The first one, instead, is just an undiscounted sum of the rewards, which could be used if episodes have a fixed length (for instance, in the case of a robot performing a 10-second trajectory). The second one is usually used in finite horizon MDPs, but I am not very familiar with it.
For the cart-pole, a discount factor of 0.9 should work (or, depending on the algorithm used, you can search for scientific papers and see the discount factor used).
A final note. The reward function you described (+1 at each timestep) is not the only one used in literature. A common one (and I think also the "original" one) gives 0 at each timestep and -1 if the pole falls. Other reward functions are related to the angle between the pole and the cart.
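For reference, a minimal sketch of the sum of discounted rewards over one episode, assuming the rewards are available as a plain list of floats (the function name and the sample episode are illustrative only):

```python
def discounted_return(rewards, discount_rate=0.9):
    """Return sum(rewards[t] * discount_rate**t) for one episode."""
    total = 0.0
    # Accumulate from the last timestep backwards: G_t = r_t + discount_rate * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + discount_rate * total
    return total


# CartPole-style episode: +1 for five surviving timesteps, 0 on termination.
episode_rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
episode_return = discounted_return(episode_rewards, discount_rate=0.9)
```

Longer episodes still yield a larger return, but each extra timestep contributes less than the previous one, which keeps the objective bounded even if the pole never falls.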

How can I set the cost and gamma parameters in libsvm (MATLAB) to improve accuracy?

I use libsvm to classify a database that contains 1000 labels. I am new to libsvm and I have trouble choosing the parameters c and g to improve performance. First, here is the program that I use to set the parameters:
    bestcv = 0;
    for log2c = -1:3,
        for log2g = -4:1,
            cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
            cv = svmtrain(yapp, xapp, cmd);
            if (cv >= bestcv),
                bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
            end
            fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
        end
    end
As a result, this program gives c = 8 and g = 2, and when I use these values of c and g, I get an accuracy rate of 55%. For classification, I use one-against-all SVM.
    numLabels=max(yapp);
    numTest=size(ytest,1);
    %# train one-against-all models
    model = cell(numLabels,1);
    for k=1:numLabels
        model{k} = svmtrain(double(yapp==k),xapp, ' -c 1000 -g 10 -b 1 ');
    end
    %# get probability estimates of test instances using each model
    prob_black = zeros(numTest,numLabels);
    for k=1:numLabels
        [~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
        prob_black(:,k) = p(:,model{k}.Label==1); %# probability of class==k
    end
    %# predict the class with the highest probability
    [~,pred_black] = max(prob_black,[],2);
    acc = sum(pred_black == ytest) ./ numel(ytest) %# accuracy
The problem is that I need to change these parameters to increase performance. For example, when I arbitrarily set c = 10000 and g = 100, I got a better accuracy rate: 70%.
Please, I need help: how can I set these parameters (c and g) to find the optimum accuracy rate? Thank you in advance.
Hyperparameter tuning is a nontrivial problem in machine learning. The simplest approach is what you've already implemented: define a grid of values, and compute the model on the grid until you find some optimal combination. A key assumption is that the grid itself is a good approximation of the surface: that it's fine enough to not miss anything important, but not so fine that you waste time computing values that are essentially the same as neighboring values. I'm not aware of any method to, in general, know ahead of time how fine a grid is necessary. As an illustration: imagine that the global optimum is at $(5,5)$ and the function is basically flat elsewhere. If your grid is $(0,0),(0,10),(10,10),(10,0)$, you'll miss the optimum completely. Likewise, if the grid is $(0,0), (-10,-10),(-10,0),(0,-10)$, you'll never be anywhere near the optimum. In both cases, you have no hope of finding the optimum itself.
Some rules of thumb exist for SVM with RBF kernels, though: a grid of $\gamma\in\{2^{-15},2^{-14},...,2^5\}$ and $C \in \{2^{-5}, 2^{-4},...,2^{15}\}$ is one such recommendation.
If you found a better solution outside of the range of grid values that you tested, this suggests you should define a larger grid. But larger grids take more time to evaluate, so you'll either have to commit to waiting a while for your results, or move to a more efficient method of exploring the hyperparameter space.
Another alternative is random search: define a "budget" of the number of SVMs that you want to try out, and generate that many random tuples to test. This approach is mostly just useful for benchmarking purposes, since it's entirely unintelligent.
Both grid search and random search have the advantage of being stupidly easy to implement in parallel.
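A budgeted random search over the exponents can be sketched as follows. This is only an illustration in Python: random_search and toy_score are hypothetical names, and toy_score is a made-up stand-in for the cross-validation accuracy that a call such as svmtrain with '-v 5' would return. The default exponent ranges are the ones from the rule of thumb above.

```python
import math
import random


def random_search(score_fn, budget, log2c_range=(-5, 15), log2g_range=(-15, 5), seed=None):
    """Evaluate `budget` random (C, gamma) pairs drawn uniformly in log2 space."""
    rng = random.Random(seed)
    best = None
    for _ in range(budget):
        c = 2.0 ** rng.uniform(*log2c_range)
        g = 2.0 ** rng.uniform(*log2g_range)
        score = score_fn(c, g)
        if best is None or score > best[0]:
            best = (score, c, g)
    return best


def toy_score(c, g):
    # Hypothetical response surface that peaks at C = 2**3, gamma = 2**1.
    return -((math.log2(c) - 3) ** 2 + (math.log2(g) - 1) ** 2)


best_score, best_c, best_g = random_search(toy_score, budget=50, seed=0)
```

Because every draw is independent, the loop body can be handed to parallel workers without any coordination, which is the property mentioned above.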
Better options fall in the domain of global optimization. Marc Claesen et al. have devised the Optunity package, which uses particle swarm optimization. My research focuses on refinements of the Efficient Global Optimization algorithm (EGO), which builds up a Gaussian process as an approximation of the hyperparameter response surface and uses that to make educated predictions about which hyperparameter tuples are most likely to improve upon the current best estimate.
Imagine that you've evaluated the SVM at some hyperparameter tuple $(\gamma, C)$ and it has some out-of-sample performance metric $y$. An advantage of EGO-inspired methods is that they assume that the values $y^*$ near $(\gamma,C)$ will be "close" to $y$, so we don't necessarily need to spend time exploring those nearby tuples, especially if $y-y_{min}$ is very large (where $y_{min}$ is the smallest $y$ value we've discovered). EGO will identify and evaluate the SVM at points where it estimates there is a high probability of improvement, so it will intelligently move through the hyperparameter space: in the ideal case, it will skip over regions of low performance in favor of focusing on regions of high performance.
