Monte Carlo Tree Search and terminal node handling - machine-learning

I'm trying to implement AlphaZero on a new game using this repository. I'm not sure if they are handling the MCTS search tree correctly.
The logic of their MCTS implementation is as follows:
Get a "canonical form" of the current game state. Basically, switching player colors because the Neural Net always needs the input from the perspective of player with ID = 1. So if the current player is 1, nothing changes. If the current player is -1 the board is inverted.
Call MCTS search. Source code
In the expand-step of the algorithm, a new node is generated like this:
next_s, next_player = self.game.getNextState(canonicalBoard, 1, a)
next_s = self.game.getCanonicalForm(next_s, next_player)
"1" is the current player and "a" is the selected action. Since the input current player is always 1, next_player is always -1 and the board always gets inverted.
The problem occurs once we hit a terminal state:
Assume that action a ends the game
The "getNextState" method returns a next state (next_s) and sets next_player to -1. The board gets inverted one last time (1 becomes -1, -1 becomes 1), so we now view the board from the perspective of the losing player. That means a call to getGameEnded(canonicalBoard, 1) will always return -1 (or 0.0001 if it's a draw), which means we can never observe a win for the player with ID 1.
The getGameEnded function is implemented from the perspective of player with ID = 1. So it returns +1 if player with ID 1 wins, -1 if player with ID 1 loses.
My current understanding of MCTS is that we need to be able to observe all possible game-ending states of a two-player zero-sum game. I tried to use the framework on my game, and it didn't learn or get better. I changed the game logic to explicitly keep track of the current player ID so that I could return all three possible outcomes. Now it at least seems to learn a bit, but I still think that something is wrong.
Questions:
Could this implementation theoretically work? Is it a correct implementation of the MCTS algorithm?
Does MCTS need to observe all possible outcomes of a two-player zero-sum game?
Are there any obvious quick fixes to the code? Am I missing something about the implementation?

Conceptually, the implementation in the linked repo is correct. The terminal value is not checked until we recurse one more time into the perspective of the losing player, but as soon as that value backs up one level it is seen as a win for the last player to make a move, and it keeps backing up the tree, swapping perspective at each level, until it reaches the current state of the real game with the correct value.
This does represent all possible outcomes. The outcomes are player 1 wins, player 2 wins or draw. In the case of a draw, it just returns something close to zero. In the case that player 1 wins, player 1 made the last move and then we recurse to player 2 who is in a losing evaluation. In the case that player 2 wins, player 2 would have made the last move and then we recurse once more to where player 1's evaluation is a loss.
Note that in some games it is possible to make the last move and lose; the logic is still correct in that case!
If you wrote your own game rules and are trying to get this to work for your game, it's best to make sure your implementation adheres to the assumptions made by this one, i.e. that the evaluation is always from the perspective of the active player and that your evaluation function is actually zero-sum.
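To make the perspective swapping concrete, here is a minimal sketch (not the repo's actual code) of how one search step could look, assuming a Game API like the one in the question (getNextState, getCanonicalForm, getGameEnded, plus a getValidMoves helper), and assuming search() always receives a canonical board and returns the value from the point of view of the player about to move:
import random

def search(game, canonicalBoard):
    ended = game.getGameEnded(canonicalBoard, 1)   # always asked from player 1's view
    if ended != 0:
        # terminal: the player to move on this canonical board has just lost
        # (or drawn), so ended is -1 (or ~0 for a draw) from their perspective...
        return ended
    valid = game.getValidMoves(canonicalBoard, 1)
    a = random.choice([i for i, ok in enumerate(valid) if ok])  # stand-in for UCB selection
    next_s, next_player = game.getNextState(canonicalBoard, 1, a)
    next_s = game.getCanonicalForm(next_s, next_player)
    # ...and this sign flip turns that -1 into +1 for the player who actually
    # made the winning move, one level up the tree
    v = -search(game, next_s)
    # (a real implementation would also update visit counts / Q for (board, a) here)
    return v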

Related

How do I use pathfinding to make an NPC chase a player?

I already tried looking up multiple solutions in the Roblox developer forums, but I only found some unfinished code or some code that didn't work with mine. The code in this picture makes my rig find the part and chase it until it gets there. That works perfectly, but I've been trying to find a way for the rig to chase the nearest player instead. I've heard of magnitude, but I'm not sure how to implement it, and I can't even make the rig chase a player in the first place.
First off, .magnitude is the "length of a vector". It is mostly used to find the distance, in vector units, from pointA to pointB. Hence, the distance between the two points would be (pointA.Position - pointB.Position).magnitude.
https://developer.roblox.com/en-us/articles/Magnitude
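If it helps to see the arithmetic outside of Roblox, the same calculation in plain Python (positions written as simple x, y, z tuples, purely for illustration) is just the Euclidean length of the difference vector:
import math

def distance(point_a, point_b):
    # same idea as (pointA.Position - pointB.Position).magnitude
    dx, dy, dz = (a - b for a, b in zip(point_a, point_b))
    return math.sqrt(dx * dx + dy * dy + dz * dz)

print(distance((0, 0, 0), (3, 4, 0)))   # prints 5.0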
Now in order to chase the players, you can loop through all of the players and get the one your NPC will chase.
You can use a for loop over game.Players:GetPlayers(), get each player's character with <player>.Character, and then their HumanoidRootPart, which holds their position. If you want your NPC to have a range or an 'aggro distance', this is where you can implement magnitude. First create a variable for the distance; since we are dealing with vector units, it should be a length in studs, for example local aggroDistance = 30. Added to the code, this means the NPC only tracks a player who is within 30 studs. You would then write an if statement such as if (<NPC HumanoidRootPart Position> - <player HumanoidRootPart Position>).magnitude < aggroDistance then.
Now you could use PathfindingService to compute a path to the player: :ComputeAsync(<NPC HumanoidRootPart Position>, <player HumanoidRootPart Position>) makes a path from the starting position (the first parameter) to the end position (the second parameter), and <NPC Humanoid>:MoveTo(<player HumanoidRootPart Position>) then makes the NPC move to the player.
To listen for when the NPC has reached the end of the path it computed, you can use <NPC Humanoid>.MoveToFinished:Connect(function() ... end) to run a function once it has arrived, or <NPC Humanoid>.MoveToFinished:Wait() to wait before computing the next path.
Tip: You might also want to check whether the player has more than 0 health (i.e. is alive), so the NPC only moves to players who are alive. You can do this by adding and <player>.Humanoid.Health > 0 to the if statement where you check your aggroDistance.
Please let me know if you have any questions.
Code Makeup:
aggroDistance variable < optional
while loop
if statement (can contain aggroDistance variable if you have one) and check player health
:ComputeAsync()
:MoveTo()
:MoveToFinished
if statement end
while loop end

Solitaire card game - how to program a resume game function?

I've been programming a solitaire card game and all has been going well so far; the basic engine works fine and I've even programmed features like auto-move on click, auto-complete when won, unlimited undo/redo, etc. But now I've realised the game cannot be fully resumed, i.e. saved so it can be continued from the exact position it was in the last time it was open.
I'm wondering how an experienced programmer would approach this, since it doesn't seem as simple as in other games, where saving a few numbers, like the level number, is sufficient to resume the game.
The way it is now, all game objects (the cards, the slots for the foundations, tableaus, etc.) are created on a new game, and then the cards are shuffled and dealt out. The deal is random, but the way I see it, the game needs to remember this random deal so it can be dealt again exactly the same when the game is resumed. Then all moves that were executed have to be executed again as well, so the game looks the way it did the last time it was played, even though all moves have actually been replayed from the beginning. I'm not sure if this is the best way to do it, but I am interested in other ways if there are any.
I'm wondering if any experienced programmers could tell me how they would approach this and perhaps give some tips/advice etc.
(I am going to assume this is standard Klondike Solitaire.)
I would recommend designing a save structure. Each card should have a suit and a value variable, so I would write out:
[DECK_UNTURNED]
H 1
H 10
S 7
C 2
...
[DECK_UNTURNED_END]
[DECK_TURNED]
...
[DECK_TURNED_END]
etc
I would do that for each location cards can be stacked (I believe you called them foundations), the unrevealed deck cards, the revealed deck cards, each of the seven main slots, and the four winning slots. Make sure that, however you read them in and out, they end up in the same order, of course.
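For example, a rough sketch of the writing side in Python (the block names and the (suit, value) card representation are just assumptions chosen to match the layout above):
def write_block(f, name, cards):
    # one named block: [NAME], one "SUIT VALUE" line per card, [NAME_END]
    f.write("[" + name + "]\n")
    for suit, value in cards:
        f.write(suit + " " + str(value) + "\n")
    f.write("[" + name + "_END]\n")

with open("save.txt", "w") as f:
    write_block(f, "DECK_UNTURNED", [("H", 1), ("H", 10), ("S", 7), ("C", 2)])
    write_block(f, "DECK_TURNED", [])
    # ...one block per pile, always written in the same fixed order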
When you go to read the file, a simple way is to read the entire file into a vector of strings. Then you iterate through the vector until you find one of your blocks.
if( vector[ iter ] == "[DECK_UNTURNED]" )
Now you go into another loop, using the same vector and iter, and keep reading in those cards until you reach the associated end block.
while( vector[ iter ] != "[DECK_UNTURNED_END]" )
read cards...
++iter
This is how I generally do all my save files. Create [DATA] blocks, and read in until you reach the end block. It is not very elaborate, but it works.
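A sketch of the matching read side in Python, assuming the same block markers and card format as above (read every line, find the start marker, collect cards until the end marker):
def read_block(lines, name):
    cards = []
    i = lines.index("[" + name + "]") + 1      # find the start marker
    while lines[i] != "[" + name + "_END]":    # stop at the end marker
        suit, value = lines[i].split()
        cards.append((suit, int(value)))
        i += 1
    return cards

with open("save.txt") as f:
    lines = [line.strip() for line in f]
deck_unturned = read_block(lines, "DECK_UNTURNED")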
Your idea of replaying the game up to a point is good. Just save the undo info and redo it at load time.

Controlling the phase of signal in pure data

I need to figure out a way of changing the phase of a signal. The objective is to generate two signals, one of them phase-shifted, and observe the patterns when they are combined.
Below is the program I'm using so far:
As in the above setting, I need to use the same signal to generate a phase-shifted signal, and later combine the two signals and observe the patterns.
Can someone help me out on this?
Thanks.
Using the right inlet of the [osc~] object is a valid way to set the phase of an oscillator but it isn't the only or even the most correct way. The right inlet only permits a float at the control level.
A more comprehensive manipulation of phase can be done at the signal level using the [phasor~], [cos~], [wrap~], and [+~] objects. Essentially, you are performing the same function as [osc~] with a technique called a table lookup using [phasor~] and [cos~]. You could read another table with [tabread4~] instead of [cos~] as well.
This technique keeps your oscillators in sync. You can manipulate the phase of your oscillators with other oscillators, table lookups, and still of course floats (so long as the phase value is between 0 and 1, hence the [wrap~] object).
(Figure: phase modulation at the signal level)
Afterwards, like the other examples here, you can add the signals together and write them to corresponding tables or output the signal chain or both.
Here's how you might do the same for a custom table lookup. Of course, you'd replace sometable with your custom table name and num-samp-in-some-table with the number of samples in your table.
(Figure: signal-level phase modulation with custom tables)
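If it helps to see what that object chain computes, here is the same math written out numerically in Python (this is only an illustration of the signal flow, not Pd code):
import math

def osc_with_phase(freq, phase_offset, sample_rate, n_samples):
    out = []
    phase = 0.0                                      # [phasor~] ramps from 0 to 1 at freq
    for _ in range(n_samples):
        shifted = (phase + phase_offset) % 1.0       # [+~] then [wrap~] keeps it in 0..1
        out.append(math.cos(2 * math.pi * shifted))  # [cos~] reads one cosine period over 0..1
        phase = (phase + freq / sample_rate) % 1.0
    return out

a = osc_with_phase(440, 0.0, 44100, 64)
b = osc_with_phase(440, 0.25, 44100, 64)             # the same oscillator, 90 degrees later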
Hope it helps!
To change the phase of an oscillator, use the right-hand side inlet.
Quoting Johannes Kreidler's Programming Electronic Music in Pd:
3.1.2.1.3 Phase
In Pd, you can also set the membrane position at which a sound wave should begin (or to which it should jump). This is called the phase of a wave. You can set the phase in Pd at the right inlet of the "osc~" object with numbers between 0 and 1:
A wave's entire period is encompassed by the range from 0 to 1. However, it is often spoken of in terms of degrees, where the entire period has 360 degrees. One speaks, for example, of a "90 degree phase shift". In Pd, the input for the phase would be 0.25.
So for instance, if you want to observe how two signals can become mute due to destructive interference, you can try something like this:
Note that I connected a bang to adjust the phases of both signals simultaneously. This is important: while you can reset the phase of a signal to any value between 0.0 and 1.0 at any moment, the other oscillator won't be reset, so the results will be quite random (you never know what phase the other signal will be at!). Resetting both does the trick.
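As a quick numeric sanity check of the destructive-interference case (outside Pd, just to show the arithmetic), a phase offset of 0.5, i.e. 180 degrees, cancels the signal sample by sample:
import math

sample_rate, freq = 44100, 440
for i in range(64):
    ph = (freq * i / sample_rate) % 1.0
    s1 = math.cos(2 * math.pi * ph)                  # oscillator at phase 0
    s2 = math.cos(2 * math.pi * ((ph + 0.5) % 1.0))  # same oscillator shifted by half a period
    assert abs(s1 + s2) < 1e-9                       # their sum is (numerically) silence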

Is this a correct implementation of Q-Learning for Checkers?

I am trying to understand Q-Learning,
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check to see if it is contained in the lookup table and initialise it if not (with a default utility of 0).
3. Choose an action to take with a probability of:
(ϵ is the probability of taking a random action, with 0 < ϵ < 1.)
With probability 1 - ϵ: choose the state-action pair with the highest utility.
With probability ϵ: choose a random move.
ϵ decreases over time.
4. Update the current state's utility based on:
Q(s_t, a_t) += α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]    (α = learning rate, γ = discount factor)
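In Python, a minimal tabular sketch of steps 2-4 might look like this (the state is assumed to be hashable, e.g. a tuple describing the board; alpha, gamma and epsilon are the learning rate, discount and exploration rate):
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(dict)                          # Q[state][action] -> estimated utility

def choose_action(state, legal_actions):
    for a in legal_actions:                    # step 2: initialise unseen pairs to 0
        Q[state].setdefault(a, 0.0)
    if random.random() < epsilon:              # step 3: explore with probability epsilon
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[state][a])

def update(state, action, reward, next_state, next_legal_actions):
    Q[state].setdefault(action, 0.0)
    best_next = max((Q[next_state].get(a, 0.0) for a in next_legal_actions), default=0.0)
    # step 4: Q(s_t, a_t) += alpha * [r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])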
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results: very poor. Even after a couple hundred games, the Q-Learning agent is losing a lot more than it is winning, and the change in win rate is almost non-existent, especially after the first couple hundred games.
Am I missing something? I have implemented a couple agents:
(Rote-Learning, TD(0), TD(Lambda), Q-Learning)
But they all seem to be yielding similar, disappointing, results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way. Generally, you'd want a simplified state representation or a function approximator such as a neural network to solve this kind of problem using reinforcement learning.
Also, a couple of caveats:
Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.

What are the things that I should save to a file/db with Reinforcement Learning?

I'm trying to get into machine learning, and decided to try things out for myself. I wrote a small tic-tac-toe game. So far, the computer plays against itself using random moves.
Now, I want to apply reinforcement learning by writing an agent that will explore or exploit based on the knowledge it has on the current state of the board.
The part I don't understand is this:
What does the agent use to train itself for the current state? Let's say an RNG bot (playing o) does this:
[..][..][..]
[..][x][o]
[..][..][..]
Now the agent has to decide what the best move should be. A well-trained one would pick the 1st, 3rd, 7th or 9th square. Does it look up a similar state in the DB that led it to a win? Because if so, I think I will need to save every single move into the DB all the way up to the end state (win/lose/draw), and that would be quite a lot of data for a single game?
If I'm thinking about this wrongly, I would like to know how to do this correctly.
Learning
1) Observe a current board state s;
2) Make a next move based on the distribution of all available V(s') of next moves. Strictly, the choice is often based on a Boltzmann distribution over V(s'), but it can be simplified to the maximum-value move (greedy) or, with some probability epsilon, a random move, as you are doing;
3) Record s' in a sequence;
4) If the game has finished, update the values of the visited states in the sequence and start over again; otherwise, go to 1).
Game Playing
1) Observe a current board state s;
2) Make a next move based on the distribution of all available V(s') of next moves;
3) If the game is over, start over again; otherwise, go to 1).
Regarding your question: yes, the look-up table used in the Game Playing phase is built up in the Learning phase. Each move, the next state is chosen from all the stored V(s), and there are at most 3^9 = 19683 possible states. Here is sample code, written in Python, that runs 10000 games in training.
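The bookkeeping this describes is small. A hypothetical sketch of the learning-phase update in Python (states represented as tuples of the nine cells, V as the look-up table; the helper name and constants are illustrative, not from the linked sample):
from collections import defaultdict

alpha = 0.1
V = defaultdict(float)                  # V[state] -> estimated value, defaults to 0

def update_from_game(visited_states, final_reward):
    # visited_states: the states recorded in step 3, in the order they occurred
    # final_reward: +1 win, ~0 draw, -1 loss, from the learning player's point of view
    target = final_reward
    for s in reversed(visited_states):
        V[s] += alpha * (target - V[s]) # pull each visited state toward what followed it
        target = V[s]
So only one number per distinct state needs to be stored (at most 3^9 entries), not the full move history of every game played.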
