Traverse all transitions in a state machine - code-coverage

I need some help traversing all transitions in a state machine (all-transitions is a coverage criterion).
I have a state machine and I want to cover all transitions with the minimum number of paths from the source state to a target/abort state.
ex:
A->B
B->B
B->C
B->D
so paths are: A->B->B->C, A->B->D
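
For the example above, a set of covering paths can also be generated mechanically. Below is a rough greedy sketch in Python (my own illustration, not from the original post; it assumes every transition is reachable from the source and is not guaranteed to produce a minimal set of paths):

transitions = {"A": ["B"], "B": ["B", "C", "D"], "C": [], "D": []}

def cover_all_transitions(transitions, source, max_steps=100):
    # Walk from the source, preferring transitions not yet covered, until a
    # state with no outgoing transitions is reached; repeat until every
    # transition has been covered.
    uncovered = {(s, t) for s, targets in transitions.items() for t in targets}
    paths = []
    while uncovered:
        state, path = source, [source]
        for _ in range(max_steps):
            targets = transitions[state]
            if not targets:  # target/abort state reached
                break
            nxt = next((t for t in targets if (state, t) in uncovered), targets[0])
            uncovered.discard((state, nxt))
            path.append(nxt)
            state = nxt
        paths.append(path)
    return paths

print(cover_all_transitions(transitions, "A"))
# [['A', 'B', 'B', 'C'], ['A', 'B', 'D']]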

Just write some test code to invoke your state machine:
Two calls where you expect to end with state C.
Two calls where you expect to end with state D.
Your test code should verify the end state.
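
Here is a minimal sketch of what such tests could look like in Python, assuming a toy transition-table-driven state machine (ToyStateMachine and its fire method are placeholders for your real implementation); each test drives one covering path and asserts the end state:

import unittest

class ToyStateMachine:
    # Placeholder implementation driven by a (state, event) -> state table.
    TRANSITIONS = {("A", "b"): "B", ("B", "loop"): "B",
                   ("B", "c"): "C", ("B", "d"): "D"}
    def __init__(self):
        self.state = "A"
    def fire(self, event):
        self.state = self.TRANSITIONS[(self.state, event)]

class AllTransitionsTest(unittest.TestCase):
    def test_path_ending_in_c(self):
        # Covers A->B, B->B and B->C, then verifies the end state.
        sm = ToyStateMachine()
        for event in ("b", "loop", "c"):
            sm.fire(event)
        self.assertEqual(sm.state, "C")

    def test_path_ending_in_d(self):
        # Covers A->B and B->D, then verifies the end state.
        sm = ToyStateMachine()
        for event in ("b", "d"):
            sm.fire(event)
        self.assertEqual(sm.state, "D")

if __name__ == "__main__":
    unittest.main()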

Related

Q-learning: How to include a terminal state in updating rule?

I use Q-learning in order to determine the optimal path of an agent. I know in advance that my path is composed of exactly 3 states (so after 3 states I reach a terminal state). I would like to know how to include that in the update rule of the Q-function.
What I am doing currently:
for t = 1:Nb_Epochs-1
    if rand(1) < Epsilon
        a = randi(size(QIter, 2));   % explore: choose an action at random
    else
        [Maxi, a] = max(QIter(CurrentState, :, t));   % exploit: choose the greedy action
    end
    NextState = FindNextState(CurrentState, a);
    QIter(CurrentState, a, t+1) = (1-LearningRate)*QIter(CurrentState, a, t) ...
        + LearningRate*(Reward(NextState) + Discount*max(QIter(NextState, :, t)));
    CurrentState = NextState;
end
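
The question is left open above, but the usual convention (my assumption here, not something stated in the post) is that no bootstrapping is done from a terminal state, i.e. the update target is just the reward. A minimal Python sketch of one tabular update step:

import numpy as np

def q_update(Q, s, a, r, s_next, terminal, lr=0.5, gamma=0.9):
    # One tabular Q-learning update; Q has shape (n_states, n_actions).
    # When s_next is terminal there is no future return to bootstrap from,
    # so the max-over-actions term is treated as zero.
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - lr) * Q[s, a] + lr * target

# Example: an episode ends in state 3, which is terminal.
Q = np.zeros((4, 2))
q_update(Q, s=2, a=1, r=1.0, s_next=3, terminal=True)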

Reinforcement Learning: How to deal with environments changing state due to external factors

I have a use case where the state of the environment can change due to random events in between the time steps at which the agent takes actions. For example, at t1 the agent takes action a1 and is given the reward and the new state s1. Before the agent takes the next action at t2, some random events occur in the environment that alter the state. Now when the agent acts at t2, it is acting on "stale information", since the state of the environment has changed. Also, the new state s2 will reflect changes not only due to the agent's action, but also due to the prior random events. In the worst case, the action could even have become invalid for the new state introduced by these random events.
How do we deal with this? Does this mean that this use case is not a good fit for RL? If we just ignore the state changes caused by random events in the environment, how would that affect the various learning algorithms? I presume this is not an uncommon or unique problem in real-life use cases...

Describing a 'waiting' step on gherkin language

I'm trying to describe a scenario of my app in the Gherkin language so that I can use it as an executable spec. The scenario is more or less the following: there's a phase of a process in which a check is performed. If all conditions for the check are fulfilled, the process ends. Otherwise, the process waits for any condition to change (it's notified about this) and then checks again, finishing if successful. What I'm having trouble describing is this waiting part. My current version (simplified) is:
Given condition A
And not condition B
When the check is performed
Then the result is negative, pending condition B
What I'm trying to express with "pending condition B" is that the test will be repeated once condition B changes, but I don't particularly like this version, since it's hard to translate one-to-one into a test (the fact that condition B changes would be a new When).
Can anybody with more experience come up with a better formulation?
You can either link the two tests together, like this:
Scenario: When A and not B result is negative, but if B happens then result is positive
Given condition A
But not condition B
Then the check returns negative
But if condition B
Then the check returns positive
This might not be best practice, but it is sometimes the pragmatic way of doing things, especially if the tests run slowly because of the system under test, your test environment, etc.
Or you could make it into two scenarios with some repetition behind the scenes.
Scenario: When A and not B the result is negative
Given condition A
But not condition B
Then the check returns negative
Scenario: When A and B the result should be positive
Given the system has condition A but not B
And the check is returning negative
When condition B
Then the check returns positive
In your case I would say that which one to choose depends on how long your tests take to run. If they are slow, go for one big scenario. If they aren't, or it doesn't matter for some reason, go for the second suggestion. The second suggestion will give more information about the cause of a failure, which is nice to have, but if the tests are slow then I think it would still be quite obvious why the test was failing even if you are using one big scenario.
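
To illustrate the "repetition behind the scenes", here is a rough sketch of shared step definitions using Python's behave library (behave is my assumption, the System class is a hypothetical stand-in for the application under test, and the step texts are simplified rather than matching every phrasing above word for word):

from behave import given, when, then

class System:
    # Hypothetical stand-in for the application under test.
    def __init__(self):
        self.conditions = set()
    def check(self):
        return "positive" if {"A", "B"} <= self.conditions else "negative"

@given("condition {name}")
def given_condition(context, name):
    if not hasattr(context, "system"):
        context.system = System()
    context.system.conditions.add(name)

@given("not condition {name}")
def given_not_condition(context, name):
    if not hasattr(context, "system"):
        context.system = System()
    context.system.conditions.discard(name)

@when("condition {name}")
def when_condition(context, name):
    context.system.conditions.add(name)

@then("the check returns {result}")
def then_check_returns(context, result):
    assert context.system.check() == result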

Q-Learning: Can you move backwards?

I'm looking over a sample exam and there is a question on Q-learning, which I have included below. In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)? It appears the Q-value to go back up to A2 would be 0.18, and the Q-value to go right would be 0.09. So why wouldn't the agent go back to A2 instead of going to B3?
[Image: Maze & Q-Table]
[Image: Solution]
Edit: Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q-value?
Edit 2: Then in step 6, the Q-values for going 'down' and 'right' at state 1,C are equal. At that point does the agent just pick randomly? So for this question should I just pick the best move, since it's possible the agent would pick it?
Edit 3: Would it be true to say the agent doesn't return to the state it previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?
You seem to assume that you should look at the values of the state in the next time step. This is incorrect. The Q-function answers the question:
If I'm in state x, which action should I take?
In non-deterministic environments you don't even know what the next state will be, so it would be impossible to determine which action to take in your interpretation.
The learning part of Q-learning indeed acts on two subsequent time steps, but only after they are already known, and they are used to update the values of the Q-function. This has nothing to do with how these samples (state, action, reinforcement, next state) are collected. In this case, samples are collected by the agent interacting with the environment. In the Q-learning setting, agents interact with the environment according to a policy, which here is based on the current values of the Q-function. Conceptually, a policy works in terms of answering the question quoted above.
In steps 1 and 2, the Q-function is modified only for states 1,A and 2,A. In step 3 the agent is in state 3,A so that's the only part of Q-function that's relevant.
In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2).
In state 3,A the action that has the highest Q-value is "right" (0.2). All other actions have value 0.0.
Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and not possible to go right? Do we just assume thats not a possible move and ignore its Q value?
As I see it, there is no wall to the right of 2,C. Nevertheless, the Q-function is given, and it's irrelevant in this task whether such a Q-function could be reached using Q-learning. You can always start Q-learning from an arbitrary Q-function anyway.
In Q-learning your only knowledge is the Q-function, so you don't know anything about "walls" and other things - you act according to the Q-function, and that's the whole beauty of this algorithm.
Then in step 6, the Q values for going 'down' and 'right' at state 1,C are equal. At that point does the agent just pick randomly? So then for this question I would just pick the best move since it's possible the agent would pick it?
Again, you should look at the values for the state the agent is currently in, so for 1,B "right" is optimal - it has 0.1 and other actions are 0.0.
To answer the last question, even though it's irrelevant here: yes, if the agent is taking the greedy step and multiple actions seem optimal, it chooses one at random in most common policies.
Would it be true to say the agent doesn't return to the state he previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?
No. As I've stated above, the only guideline the agent uses in pure Q-learning is the Q-function. It is not aware that it has been in a particular state before.
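
As a small aside, here is what the greedy step with random tie-breaking mentioned above can look like in Python (an illustration only, not tied to the numbers in this exam question):

import numpy as np

def greedy_action(q_values, rng=None):
    # Pick the highest-valued action, breaking ties uniformly at random.
    if rng is None:
        rng = np.random.default_rng()
    best = np.flatnonzero(q_values == q_values.max())
    return rng.choice(best)

def epsilon_greedy_action(q_values, epsilon=0.1, rng=None):
    # With probability epsilon explore, otherwise act greedily.
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return rng.integers(len(q_values))
    return greedy_action(q_values, rng)

# Two actions tie at 0.2, so either index 1 or 2 may be returned.
print(greedy_action(np.array([0.0, 0.2, 0.2, 0.1])))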

Remote nodes, group leaders and printouts

Given two Erlang nodes, "foo@host" and "bar@host", the following produces a printout on "foo":
(foo@host) rpc:call('bar@host', io, format, ["~p", [test]]).
While the following prints out on "bar":
(foo@host) rpc:call('bar@host', erlang, display, [test]).
Even though erlang:display/1 is supposed to be used for debugging only, both functions are supposed to send output to the standard output. Each process should inherit the group leader from its parent, so I would expect the two functions to behave in a consistent way.
Is there any rationale for the above behaviour?
The reason for this difference in behaviour is where and by whom the output is done:
erlang:display/1 is a BIF and is handled directly by the BEAM which writes it straight out to its standard output without going anywhere near Erlang's io-system. So doing this on bar results in it printed to bar's standard output.
io:format/1/2 is handled by the Erlang io-system. As no IoDevice has been given, it sends an io-request to its group leader. rpc:call/4 is implemented in such a way that the remotely spawned process inherits the group leader of the process doing the RPC call. So the output goes to the standard output of the calling process: doing an RPC call on foo to the node bar results in the output going to foo's standard output.
Hence the difference. It is interesting to note that no special handling of this is needed in the Erlang io-system, once the group leader has been set it all works transparently.
