Monte Carlo Tree Search - intuition behind child selection function for games of two players with opposite goals - machine-learning

A simple question about the hello-world example of MCTS for tic-tac-toe.
Let's assume we are given a board and we want to make an optimal decision. As I understand it, the choice of consecutive nodes during the selection phase (until a leaf is reached) is determined by an exploration/exploitation trade-off function (as described on Wikipedia). I really wonder what the intuition is behind the first component (exploitation) of the function here, especially for games between two players with opposite goals. The meaning of "the most promising" then changes depending on who makes a move. Shouldn't this function change depending on who makes the next move (especially its first component)?

Yes, that exploitation part of the equation should be implemented to take into account the evaluations from the perspective of the agent/player who gets to select an action in that node.
For single-agent settings, the implementation is straightforward; simply always maximize.
For zero-sum, turn-based, two-player settings, you'd want to alternate between maximizing and minimizing that exploitation part of the equation (note: always maximize the exploration term!). This can also be implemented by simply multiplying that term by -1 in nodes where the opponent gets to move.
Other settings are possible too, but require slightly more implementation effort (e.g. keeping separate average scores for different players in settings that are not zero-sum or have more than two players).
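To make that concrete, here is a minimal sketch of a UCT-style selection step that flips the sign of the exploitation term at opponent nodes. The node fields (`visits`, `total_value`, `player_to_move`, `children`) are illustrative names for this sketch, not from any particular library:

```python
import math

def uct_select(node, root_player, c=1.41):
    """Pick the child maximizing UCT from the perspective of the player to move.

    Assumes each node stores `visits`, `total_value` (summed simulation results
    from the root player's perspective), `player_to_move`, and `children`.
    These field names are assumptions for this sketch.
    """
    best_child, best_score = None, -math.inf
    for child in node.children:
        if child.visits == 0:
            return child  # expand unvisited children first
        avg = child.total_value / child.visits
        # Exploitation: negate the average value when the opponent moves here.
        exploit = avg if node.player_to_move == root_player else -avg
        # Exploration: always added, regardless of whose turn it is.
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        score = exploit + explore
        if score > best_score:
            best_child, best_score = child, score
    return best_child
```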

Related

Shortest path in games (StarCraft example)

In games like StarCraft you can have up to 200 units (per player) on a map.
There are small maps but also big ones.
When you, for example, grab 50 units and tell them to go to the other side of the map, some algorithm kicks in and they find a path through the obstacles (rivers, hills, rocks and so on).
My question is: how does the game not slow down when it has 50 paths to calculate? In the meantime other things are happening, like drones collecting minerals and buildings being constructed. And if the map is big, it should be harder and slower.
So even if the algorithm is good, it will take some time for 100 units.
Do you know how this works? Maybe the algorithm is similar in other games.
As I said, when you tell units to move you do not see any delay for calculating the path - they start running toward the destination immediately.
The question is how the units are made to follow the shortest path, and so quickly.
There is no delay in most of these games (StarCraft, WarCraft and so on).
Thank you.
I guess it just needs to subdivide the problem and memoize the results. Example: 2 units. Unit1 goes from A to C but the shortest path goes through B. Unit2 goes from B to C.
B to C only needs to be calculated once and can be reused by both.
See https://en.m.wikipedia.org/wiki/Dynamic_programming
That Wikipedia page specifically mentions Dijkstra's algorithm for pathfinding, which works by subdividing the problem and storing results to be reused.
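As a rough illustration of that reuse (a sketch of the idea, not how StarCraft actually implements it): if many units share the same destination, a single shortest-path computation outward from the destination gives every unit its path, so the per-unit cost is just reading off the precomputed parent pointers.

```python
import heapq

def dijkstra_from(goal, neighbors):
    """One Dijkstra run from the shared goal; `neighbors(n)` yields (next_node, cost).

    Returns a parent map: following parents from any start node walks the
    shortest path toward `goal`, so a single run is reused by every unit.
    """
    dist, parent = {goal: 0.0}, {goal: None}
    pq = [(0.0, goal)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist[node]:
            continue
        for nxt, cost in neighbors(node):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], parent[nxt] = nd, node
                heapq.heappush(pq, (nd, nxt))
    return parent

def path_for_unit(start, parent):
    """Read one unit's path from the shared parent map (cheap per unit)."""
    if start not in parent:
        return None  # unreachable from the goal
    path, node = [], start
    while node is not None:
        path.append(node)
        node = parent[node]
    return path  # start ... goal
```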
There is also a pretty good looking alternative here http://www.gamasutra.com/blogs/TylerGlaiel/20121007/178966/Some_experiments_in_pathfinding__AI.php where it takes into account dynamic stuff like obstacles and still performs very well (video demo: https://www.youtube.com/watch?v=z4W1zSOLr_g).
Another interesting technique takes a completely different approach:
calculate the shortest path from the goal position to every point on the map. See the full explanation here: https://www.youtube.com/watch?v=Bspb9g9nTto - although this one is inefficient for large maps.
First of all, 100 units is not such a large number; pathfinding is fast enough on modern computers that it is not a big resource sink. Even in older games, optimizations were made to make it even faster, and you can see that units will sometimes get lost or stuck, which shouldn't really happen with a general algorithm like A*.
If the map does not change, you can preprocess it to build a set of nodes representing regions of the map. For example, if the map is two islands connected by a narrow bridge, there would be three "regions" - island 1, island 2, bridge. In reality you would probably do this with some graph algorithm, not manually. For instance:
Score every tile with distance to nearest impassable tile.
Put all adjacent tiles with a score above some threshold in the same region.
When done, gradually expand outwards from all regions to encompass low-score tiles as well.
Make a new graph where each region-region intersection is a node, and calculate shortest paths between them.
Then your pathfinding algorithm becomes two stage:
Find which region the unit is in.
Find which region the target is in.
If different regions, calculate shortest path to target region first using the region graph from above.
Once in the same region, calculate path normally on the tile grid.
When moving between distant locations, this should be much faster because you are now searching through a handful of nodes (on the region graph) plus a relatively small number of tiles, instead of the hundreds of tiles that comprise those regions. For example, if we have 3 islands A, B, C with bridges 1 and 2 connecting A-B and B-C respectively, then units moving from A to C don't really need to search all of B every time, they only care about shortest way from bridge 1 to bridge 2. If you have a lot of islands this can really speed things up.
Of course the problem is that regions may change due to, for instance, buildings blocking a path or units temporarily obstructing a passageway. The solution to this is up to your imagination. You could try to carefully update the region graph every time the map is altered, if the map is rarely altered in your game. Or you could just let units naively trust the region graph until they bump into an obstacle. In some games you can see particularly bad cases of the latter, because a unit will continue running towards a valley even after it has been walled off, and only after hitting the wall will it turn back and go around. I think the original StarCraft had this issue when units blocked a narrow path: they would try to take a really long detour instead of waiting for the crowd to free up a bridge.
There are also algorithms that accomplish analogous optimizations without explicitly building the region graph; for instance, JPS (jump point search) works roughly this way.
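A self-contained toy version of the two-stage idea, assuming the region graph has already been extracted (the three-island example from above). The tile-level stage is only indicated in a comment, since it is just an ordinary A*/BFS restricted to a couple of regions:

```python
from collections import deque

# Toy region graph: three islands connected by two bridges (A-B and B-C).
# Nodes are regions; edges mean "directly reachable via a crossing".
REGION_GRAPH = {
    "island_A": ["bridge_1"],
    "bridge_1": ["island_A", "island_B"],
    "island_B": ["bridge_1", "bridge_2"],
    "bridge_2": ["island_B", "island_C"],
    "island_C": ["bridge_2"],
}

def region_route(start_region, goal_region, graph=REGION_GRAPH):
    """Stage 1: BFS over the handful of regions instead of thousands of tiles."""
    parent = {start_region: None}
    queue = deque([start_region])
    while queue:
        region = queue.popleft()
        if region == goal_region:
            route = []
            while region is not None:
                route.append(region)
                region = parent[region]
            return route[::-1]
        for nxt in graph[region]:
            if nxt not in parent:
                parent[nxt] = region
                queue.append(nxt)
    return None

# Stage 2 (not shown): run ordinary A* on the tile grid, but only inside the
# current region and the next one on the route, which keeps each search small.
print(region_route("island_A", "island_C"))
# ['island_A', 'bridge_1', 'island_B', 'bridge_2', 'island_C']
```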

Cluster Analysis for crowds of people

I have location data from a large number of users (hundreds of thousands). I store the current position and a few historical data points (minute data going back one hour).
How would I go about detecting crowds that gather around natural events like birthday parties etc.? Even smaller crowds (let's say starting from 5 people) should be detected.
The algorithm needs to work in almost real time (or at least once a minute) to detect crowds as they happen.
I have looked into many cluster analysis algorithms, but most of them seem like a bad choice. They either take too long (I have seen O(n^3) and O(2^n)) or need to know how many clusters there are beforehand.
Can someone help me? Thank you!
Let each user be its own cluster. When a user gets within distance R of another user, form a new cluster, and separate them again when the person leaves. You have your event when:
Number of people is greater than N
They are in the same place for the timer greater than T
The party is not moving (movement might indicate public transport)
It's not located in public service buildings (hospital, school etc.)
(and a good number of other conditions)
One minute is plenty of time to get this done even for hundreds of thousands of people. A naive implementation would be O(n^2), but mind that there is no point in comparing every individual's location with every other; only those in a close neighbourhood matter. As a first approximation you can divide the "world" into sectors, which also makes it easy to parallelize the task - and, in turn, to scale easily. More users? Just add a few more nodes, and downscale at times of limited activity.
How to avoid "single-linkage problem" mentioned by author in comments? One idea would be to think in terms of 'mass' and centre of gravity. First of all, do not mark something as event until the mass is not greater than e.g. 15 units. Sure, location is imprecise, but in case of events it should average around centre of the event. If your cluster grows in any direction without adding substantial mass, then most likely it isn't right. Look at methods like DBSCAN (density-based clustering), good inspiration can be also taken from physical systems, even Ising model (here you think in terms of temperature and "flipping" someone to join the crowd). It is not a novel problem and I am sure there are papers that cover it (partially), e.g. Is There a Crowd? Experiences in Using Density-Based Clustering and Outlier Detection.
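A rough sketch of that sector/grid idea (assuming flat x/y coordinates in meters for simplicity; the cell size, crowd threshold, and the follow-up checks for duration and movement from the list above are all left as knobs):

```python
from collections import defaultdict

CELL = 50.0        # sector size in meters, roughly the neighbourhood radius R
MIN_PEOPLE = 5     # minimum crowd size N

def candidate_crowds(positions):
    """positions: dict user_id -> (x, y) in meters. Returns candidate crowds.

    Users are bucketed into grid cells of size CELL, so each cell is only
    compared against its 8 neighbours instead of against all n users.
    """
    grid = defaultdict(list)
    for user, (x, y) in positions.items():
        grid[(int(x // CELL), int(y // CELL))].append(user)

    crowds = []
    for (cx, cy), users in grid.items():
        # Gather this cell plus its 8 neighbours to avoid boundary effects.
        nearby = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nearby.extend(grid.get((cx + dx, cy + dy), []))
        if len(nearby) >= MIN_PEOPLE:
            crowds.append(frozenset(nearby))
    return set(crowds)  # deduplicate identical candidate sets from adjacent cells
```

The duration (T), movement, and location filters from the list above would then be applied to these candidates over successive minutes.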
There is little use in doing a full clustering.
Just use a good database index.
Keep a database of the current positions.
Whenever you get a new coordinate, query the database with the desired radius, say 50 meters. A good index will do this in O(log n) for a small radius. If you get enough results, this may be an event, or someone joining an ongoing event.
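An in-memory equivalent of that radius query, using SciPy's k-d tree (in production a geospatial database index, e.g. PostGIS or a geohash index, would play the same role; the numbers below are made up):

```python
import numpy as np
from scipy.spatial import cKDTree

# Current positions of all users, here as flat x/y coordinates in meters.
positions = np.random.rand(200_000, 2) * 10_000  # toy data on a 10 km square
tree = cKDTree(positions)                        # rebuilt (or updated) each minute

def users_near(new_point, radius_m=50.0, min_people=5):
    """Radius query around a freshly reported position.

    Returns the indices of nearby users if there are enough of them to form a
    candidate event (or for the reporter to be joining an ongoing one).
    """
    nearby = tree.query_ball_point(new_point, r=radius_m)
    return nearby if len(nearby) >= min_people else []

print(len(users_near(np.array([5_000.0, 5_000.0]))))
```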

Is this a correct implementation of Q-Learning for Checkers?

I am trying to understand Q-Learning.
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check to see whether it is contained in the lookup table and initialise it if not (with a default utility of 0).
3. Choose an action to take with a probability of:
(ϵ is the probability of taking a random action, with 0 < ϵ < 1)
With probability 1-ϵ: choose the state-action pair with the highest utility.
With probability ϵ: choose a random move.
ϵ decreases over time.
4. Update the current state's utility based on:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
(α is the learning rate and γ the discount factor.)
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results are very poor: even after a couple of hundred games, the Q-Learning agent is losing a lot more than it is winning. Furthermore, the change in win rate is almost non-existent, especially after the first couple of hundred games.
Am I missing something? I have implemented a couple of agents:
(Rote-Learning, TD(0), TD(Lambda), Q-Learning)
But they all seem to be yielding similar, disappointing, results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way. Generally, you'd want a representation that generalizes across states, e.g. function approximation with a neural network, to solve this kind of problem using reinforcement learning.
Also, a couple of caveats:
Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.
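For reference, a minimal tabular version of that update rule with ε-greedy action selection (a sketch only; the checkers environment and state encoding are assumed to exist elsewhere, and the small random initialization follows the caveat above):

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9   # learning rate and discount factor from the update rule
# Small random initial values, as suggested above, instead of plain zeros.
q_table = defaultdict(lambda: random.uniform(-0.01, 0.01))  # (state, action) -> utility

def choose_action(state, actions, epsilon):
    """ε-greedy: a random move with probability ε, otherwise the greedy move."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """Q(s_t,a_t) += α [ r_{t+1} + γ max_a Q(s_{t+1},a) − Q(s_t,a_t) ]."""
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    td_error = reward + GAMMA * best_next - q_table[(state, action)]
    q_table[(state, action)] += ALPHA * td_error
```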

How to find a function that fits a given data set?

The search algorithm is a breadth-first search. I'm not sure how to store terms from an equation in an open list. The function f(x) has the form ax^e1 + bx^e2 + cx^e3 + k, where a, b, c are coefficients and k is a constant. All exponents, coefficients, and constants are integers between 0 and 5.
The initial state of the problem-solving process should be any single term from ax^e1, bx^e2, cx^e3, k.
The algorithm gradually expands the number of terms at each level of the list.
I'm not sure how to add the terms to an equation from an open queue. That is the question.
The general problem you are dealing with belongs to the area of regression analysis, and several techniques are available to find a function that fits a given data set, including the popular least squares method for finding the line of best fit (a brief starting point is the related page on Wikipedia, but if you want to go deeper into this topic, you should look at the research papers out there).
If you want to stick with the breadth-first search algorithm, although this kind of approach is not common for such a problem, you first need to define all the elements of a search problem, namely (for more information, see Chapter 3 of Russell and Norvig, Artificial Intelligence: A Modern Approach):
Initial state: Some initial values for the different terms.
Actions: in your case it should be a change in the different terms. Note that you should discretize the changes in the values.
Transition function: function that determines the new states given a state and an action.
Goal test: a check to recognize whether a state is a goal state or not, and so to terminate the search. There are different ways to define this test in a regression problem. One way is to set a threshold for the sum of the square errors.
Step cost: the cost of an action. In such an abstract problem, you can probably use the unweighted distance from the initial state on the search graph.
Note that you should carefully think about these elements, as, for example, they determine how efficient your search would be or whether you will have cycles in the search graph.
After you defined all of the elements for the search problem, you basically have to implement:
A node structure that contains information about the parent, the state, and the current cost;
A function to expand a given node, returning the successor nodes (according to the transition function, the actions, and the step cost);
The goal test;
The actual search algorithm. At the beginning the queue contains the node with the initial state; afterwards it is updated with the successor nodes.
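A small sketch of what that search could look like for this particular function family, brute-forcing the integer coefficients and exponents in [0, 5]; the sum-of-squared-errors threshold in the goal test and the sample data are arbitrary choices for illustration:

```python
from collections import deque

def f(x, params):
    a, e1, b, e2, c, e3, k = params
    return a * x**e1 + b * x**e2 + c * x**e3 + k

def sse(params, data):
    """Sum of squared errors of f over the data set."""
    return sum((f(x, params) - y) ** 2 for x, y in data)

def bfs_fit(data, threshold=1e-9):
    """BFS over partial assignments of (a, e1, b, e2, c, e3, k), each in {0..5}.

    A state is the tuple of parameters fixed so far; expanding a node fixes
    the next parameter to one of 0..5. A complete state passes the goal test
    if its sum of squared errors on `data` is below `threshold`.
    """
    queue = deque([()])                 # initial state: nothing assigned yet
    while queue:
        state = queue.popleft()
        if len(state) == 7:             # complete assignment: apply the goal test
            if sse(state, data) <= threshold:
                return state
            continue
        for value in range(6):          # expand: fix the next parameter
            queue.append(state + (value,))
    return None                         # no exact fit within the search space

data = [(x, 2 * x**3 + x + 4) for x in range(-3, 4)]  # made-up sample data
print(bfs_fit(data))  # a parameter tuple reproducing 2x^3 + x + 4
```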

SARSA Implementation

I am learning about SARSA algorithm implementation and had a question. I understand that the general "learning" step takes the form of:
Robot (r) is in state s. There are four actions available:
North (n), East (e), West (w) and South (s)
such that the list of Actions,
a = {n,w,e,s}
The robot randomly picks an action, and updates as follows:
Q(a,s) = Q(a,s) + L[r + D·Q(a',s') - Q(a,s)]
Where L is the learning rate, r is the reward associated with (a,s), Q(a',s') is the expected reward from taking action a' in the new state s', and D is the discount factor.
Firstly, I don't understand the role of the term -Q(a,s): why are we re-subtracting the current Q-value?
Secondly, when picking actions a and a', why do these have to be random? I know that in some implementations of SARSA all possible Q(s',a') are taken into account and the highest value is picked. (I believe this is epsilon-greedy?) Why not do this also to pick which Q(a,s) value to update? Or why not update all Q(a,s) for the current s?
Finally, why is SARSA limited to a one-step lookahead? Why not, say, also look into a hypothetical Q(s'',a'')?
I guess overall my questions boil down to: what makes SARSA better than another breadth-first or depth-first search algorithm?
Why do we subtract Q(a,s)? r + D·Q(a',s') is the reward that we got on this run-through from getting to state s by taking action a. In theory, this is the value that Q(a,s) should be set to. However, we won't always take the same action after getting to state s from action a, and the rewards associated with going to future states will change in the future. So we can't just set Q(a,s) equal to r + D·Q(a',s'). Instead, we just want to push it in the right direction so that it will eventually converge on the right value. So we look at the error in prediction, which requires subtracting Q(a,s) from r + D·Q(a',s'). This is the amount we would need to change Q(a,s) by in order to make it perfectly match the reward we just observed. Since we don't want to do that all at once (we don't know if this is always going to be the best option), we multiply this error term by the learning rate, L, and add the result to Q(a,s) for a more gradual convergence on the correct value.
Why do we pick actions randomly? The reason not to always pick the next state or action deterministically is basically that our guess about which state is best might be wrong. When we first start running SARSA, we have a table full of 0s. We put non-zero values into the table by exploring those areas of state space and finding that there are rewards associated with them. As a result, something not terrible that we have explored will look like a better option than something we haven't explored. Maybe it is. But maybe the thing we haven't explored yet is actually way better than anything we've already seen. This is called the exploration vs. exploitation problem - if we just keep doing things we know work, we may never find the best solution. Choosing next steps randomly ensures that we see more of our options.
Why can't we just take all possible actions from a given state? That would force us to basically look at the entire learning table on every iteration. If we're using something like SARSA to solve the problem, the table is probably too big to do this in a reasonable amount of time.
Why can SARSA only do one-step look-ahead? Good question. The idea behind SARSA is that it's propagating expected rewards backwards through the table. The discount factor, D, ensures that in the final solution you'll have a trail of gradually increasing expected rewards leading to the best reward. If you filled in the table at random, this wouldn't always be true. This doesn't necessarily break the algorithm, but I suspect it leads to inefficiencies.
Why is SARSA better than search? Again, this comes down to an efficiency thing. The fundamental reason that anyone uses learning algorithms rather than search algorithms is that search algorithms are too slow once you have too many options for states and actions. In order to know the best action to take from any other state action pair (which is what SARSA calculates), you would need to do a search of the entire graph from every node. This would take O(s*(s+a)) time. If you're trying to solve real-world problems, that's generally too long.
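For comparison with the update rule above, here is a minimal tabular SARSA step in Python (a sketch only; the gridworld environment and state encoding are assumed, and the variable names follow the question's L and D):

```python
import random
from collections import defaultdict

L, D, EPSILON = 0.1, 0.9, 0.1      # learning rate, discount factor, exploration rate
ACTIONS = ["n", "e", "w", "s"]     # the four moves from the question
Q = defaultdict(float)             # Q[(state, action)] -> expected return

def epsilon_greedy(state):
    """Mostly greedy, occasionally random: the exploration/exploitation trade-off."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_step(state, action, reward, next_state):
    """One SARSA update: Q(a,s) += L * [ r + D*Q(a',s') - Q(a,s) ].

    Unlike Q-learning, the next action a' is the one the policy will actually
    take, so it is returned to be executed on the next step (hence
    State-Action-Reward-State-Action).
    """
    next_action = epsilon_greedy(next_state)
    td_error = reward + D * Q[(next_state, next_action)] - Q[(state, action)]
    Q[(state, action)] += L * td_error
    return next_action
```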
