Defining states, Q and R matrix in reinforcement learning - machine-learning

I am new to RL and have been working through a couple of books and tutorials, yet I still have a basic question that I hope can be answered here.
The primary references are Sutton & Barto, 2nd edition, and a blog.
Problem description (Q-learning approach only): the agent has to get from point A to point B along a straight line. Point B is static, and only the agent's initial position is random.
-----------A(60,0)----------------------------------B(100,0)------------->
To keep it simple, the agent always moves in the forward direction. B is always at X-axis position 100, which is also the goal state, and in the first iteration A is at X-axis position 60. So the actions are just "Go forward" and "Stop". The reward structure gives the agent 100 when A reaches point B, 0 otherwise, and -500 when A crosses B. So the goal for the agent is to reach position B and stop there.
1) How many states are required to go from point A to point B in this case, and how should the Q and R matrices be defined?
2) How do I add a new column and row if a new state is found?
Any help would be greatly appreciated.
Q_matrix implementation:
idx = find(List_Ego_pos_temp == current_state);   % row of the current state
next_idx = idx + 1;                                % states are visited sequentially
Q_matrix(idx, possible_actions) = Q_matrix(idx, possible_actions) + this.learning_rate * ( ...
    Store_reward(this.Ego_pos_counter) + this.discount * max(Q_matrix(next_idx, :)) ...
    - Q_matrix(idx, possible_actions));
This implementation is in MATLAB.
List_Ego_pos_temp is a temporary list which stores all the positions of the agent.
Also, let's say there are ten states, 1 to 10, and we know the speed and distance with which the agent moves in each state until it reaches state 10. The agent can only move sequentially, which means it can go from s1 to s2 to s3 to s4 and so on up to s10, but not from s1 directly to s4 or s10.
Let's say s8 is the goal state with reward = 10, s10 is a terminal state with reward -10, and from s1 to s7 the agent receives a reward of 0.
Is it then a correct approach to calculate the Q table by treating the current state as state 1 and the next state as state 2, then in the next iteration the current state as state 2 and the next state as state 3, and so on? Will this calculate the Q table correctly, given that the next state is already fed in and nothing is predicted?

Since you are defining the problem in this case, many of the variables are up to you.
You can define a minimum state (e.g. 0) and a maximum state (e.g. 150) and treat each step as a state, which gives 151 possible states (0 to 150). Then 100 will be your goal state. Your actions will be defined as +1 (move one step) and 0 (stop). The Q matrix will then be a 151x2 matrix covering all possible states and both actions. The reward will be a scalar, as you have defined it.
You do not need to add a new column and row, since the entire Q matrix is defined up front.
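As a rough illustration of the fixed-size Q table described here, the following is a minimal tabular Q-learning sketch in Python. It makes several assumptions that are not in the question: positions run from 0 to 150, "Stop" ends the episode and is rewarded only at position 100, and the names (step, train) and hyperparameters are purely illustrative.

```python
import random

N_STATES = 151            # assumed positions 0..150, one table row per position
GOAL = 100                # position of B
FORWARD, STOP = 0, 1      # column indices of the two actions


def step(state, action):
    """Return (next_state, reward, done) for the 1-D task (assumed dynamics)."""
    if action == STOP:
        # Assumption: stopping ends the episode; only stopping on B is rewarded.
        return state, (100 if state == GOAL else 0), True
    nxt = state + 1
    if nxt > GOAL:
        return nxt, -500, True        # the agent crossed B
    return nxt, 0, False


def train(episodes=20000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]    # whole table allocated up front
    for _ in range(episodes):
        s = rng.randrange(0, GOAL)               # random start position A
        done = False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)             # explore
            else:
                a = FORWARD if Q[s][FORWARD] >= Q[s][STOP] else STOP
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

Calling Q = train() returns the table; the greedy action in row s is the column with the larger value. Because every position already has a row, no rows or columns are added during learning.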
Best of luck.


Why does MQL4 show error 130 when we use a stop loss in the OrderSend function

I am trying to create an EA in MQL4, but when I pass a non-zero stop-loss value to the OrderSend function, it fails with OrderSend error 130. Please help me solve this problem.
The code line is
int order = OrderSend("XAUUSD",OP_SELL,0.01,Bid,3,Bid+20*0.01,tp,"",0,0,Red);
Error number 130 means "invalid stops".
That means there is a problem with the stops you passed to the OrderSend function.
I suggest you set them like this:
int order = OrderSend("XAUUSD",OP_SELL,0.01,Bid,3,Bid+20*Point,tp,"",0,0,Red);
That way you use Point instead of hard-coding the value.
To check what an error number means, you can refer to: https://book.mql4.com/appendix/errors
You should know that there is a minimum Stop Loss Size (mSLS) given in pips. The mSLS changes with the currency pair and the broker. So you need to add a variable to the OnInit() procedure of your EA to read it:
int mSLS = MarketInfo(symbol,MODE_STOPLEVEL);
The distance (in pips) between your Order Open Price (OOP) and the Stop-Loss Price (SLP) cannot be smaller than the mSLS value.
I will try to explain a general algorithm I use for opening orders in my EAs, and then apply the constraint on the Stop-Loss level (at step 3):
Step 1. I introduce a flag (f) for the type of operation I will open, being:
f = 1 for Buy, and
f = -1 for Sell
You know that there are mql4 constants OP_SELL=1 and OP_BUY=0 (https://docs.mql4.com/constants/tradingconstants/orderproperties).
Once I have defined f, I set my operation type variable to
int OP_TYPE = int(0.5*((1+f)*OP_BUY+(1-f)*OP_SELL));
that takes value OP_TYPE=OP_BUY when f=1, while OP_TYPE=OP_SELL when f=-1.
NOTE: Regarding the color of the orders I put them in an array
color COL[2]= {clrBlue,clrRed};
then, having OP_TYPE, I set
color COLOR=COL[OP_TYPE];
Step 2. Similarly, I set the Order Open Price as
double OOP = 0.5*((1+f)*Ask+(1-f)*Bid);
which takes value OOP=Ask when f=1, while OOP=Bid when f=-1.
Step 3. Then, given my desired Stop Loss in pips (an external POSITIVE parameter of my EA, which I named sl), I make sure that sl > mSLS. In other words, I check
if (sl <= mSLS) // I set my sl as the minimum allowed
{
sl = 1 + mSLS;
}
Step 4. Then I calculate the Stop-Loss Price of the order as
double SLP = OOP - f * sl * Point;
Step 5. Given my desired Take Profit in pips (an external POSITIVE parameter of my EA, I named tp) I calculate the Take-Profit Price (TPP) of the order as
double TPP = OOP + f * tp * Point;
OBSERVATION: I cannot confirm it, but according to the MQL4 documentation, the minimum distance rule between the stop-loss price and the open price also applies to the take-profit price. In that case, a "tp" check similar to the sl check above is needed. That is, before calculating TPP, the control lines below must be executed:
if (tp <= mSLS) // I set my tp as the minimum allowed
{
tp = 1 + mSLS;
}
Step 6. I call for order opening with a given lot size (ls) and slippage (slip) on the operating currency pair (from where I get the Ask and Bid values)
float ls = 0.01;
int slip = 3; //(pips)
int order = OrderSend(Symbol(),OP_TYPE,ls,OOP,slip,SLP,TPP,"",0,0,COLOR);
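To make the arithmetic of the stop-loss and take-profit steps concrete, here is a small Python sketch (not MQL4) that applies the same clamping and price formulas to plain numbers. The function name stop_prices and the sample values are hypothetical; in an EA the inputs would come from Ask/Bid, Point and MarketInfo().

```python
def stop_prices(f, open_price, sl, tp, min_stop, point):
    """Mirror the sl/tp clamping and price formulas from the steps above.

    f is +1 for a buy and -1 for a sell; sl, tp and min_stop are distances
    in the same unit that `point` converts to a price (an assumption here).
    """
    if sl <= min_stop:          # stop loss too close: use the minimum allowed
        sl = 1 + min_stop
    if tp <= min_stop:          # same check for the take profit
        tp = 1 + min_stop
    slp = open_price - f * sl * point   # stop-loss price
    tpp = open_price + f * tp * point   # take-profit price
    return slp, tpp


# Hypothetical sell order: requested sl of 20 is below a minimum of 30,
# so it is raised to 31 before the prices are computed.
slp, tpp = stop_prices(f=-1, open_price=1900.00, sl=20, tp=50,
                       min_stop=30, point=0.01)
```

For a sell (f = -1) the stop-loss price ends up above the open price and the take-profit price below it, which is the direction the broker expects.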
Note that with these few lines it is easy to build a function that opens orders of any type at your command, on whatever currency pair you are trading, without receiving error 130, passing only 3 parameters to the function: f, sl and tp.
It is worth including a warning in the test phase of your EA whenever sl is corrected for being below the allowed minimum. This lets you increase its value so that it does not violate the minimum stop-loss rule, while giving you more control over the risk of your operations. Remember that the "sl" parameter defines how much you will lose if the order fails because the asset price moved too far in the opposite direction to what was expected.
I hope I could help!
Whilst the other two answers are not necessarily wrong (and I will not go over the ground they have already covered), for completeness, they fail to mention that some brokers (specifically ECN brokers) require you to open your order first, without setting a stop loss or take profit. Once the order is opened, use OrderModify() to set your stop loss and/or take profit.

proof of optimality in activity selection

Can someone please explain, in a not-so-formal way, why the greedy choice leads to an optimal solution for the activity selection problem? This is the simplest explanation that I have found, but I don't really get it:
How does Greedy Choice work for Activities sorted according to finish time?
Let the given set of activities be S = {1, 2, 3, ..n} and activities be sorted by finish time. The greedy choice is to always pick activity 1. Why does activity 1 always provide one of the optimal solutions? We can prove it by showing that if there is another solution B with the first activity other than 1, then there is also a solution A of the same size with activity 1 as the first activity. Let the first activity selected by B be k, then there always exists A = {B – {k}} U {1}. (Note that the activities in B are independent and k has the smallest finishing time among them. Since k is not 1, finish(k) >= finish(1).)
The following is my understanding of why the greedy solution always works:
Assertion: If A is the greedy choice (starting with the 1st activity in the sorted array), then it gives an optimal solution.
Proof: Let there be another choice B, starting with some activity k (k != 1, so finishTime(k) >= finishTime(1)), which gives an optimal solution. So B does not contain the 1st activity, and the following relation between A and B can be written:
A = {B - {k}} U {1}
Here:
1. Sets A and B differ only in their first activity (1 in A, k in B)
2. Both A and B contain only mutually compatible activities
Since |A| = |B|, solution A is also optimal.
Let's say the intervals are S={1,2,3,.....m} and A is the best solution that starts with 1, with length n1. If A is not an optimal solution, then there exists another solution B which starts with k!=1, where finishTime(k)>=finishTime(1), and which has length n2.
So, n2>n1.
Now, if we exclude k from solution B, we are left with n2-1 elements.
Since k doesn't overlap with the other intervals in B, 1 will not overlap with them either.
This is because all intervals in B (excluding k) have startTime>=finishTime(k)>=finishTime(1). Hence, if we replace k with 1 in B, we still have length n2. But this new solution starts with 1, and the best solution starting with 1 was A with length n1, so n2<=n1, which contradicts n2>n1. Hence the solution starting with 1 is optimal.
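To see the argument at work, here is a short Python sketch of the earliest-finish-time greedy rule together with a brute-force check on a small instance. The function names and the sample intervals are illustrative only.

```python
from itertools import combinations


def greedy_select(activities):
    """Pick activities by earliest finish time; activities are (start, finish)."""
    chosen = []
    last_finish = float("-inf")
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:          # compatible with everything chosen so far
            chosen.append((start, finish))
            last_finish = finish
    return chosen


def brute_force_size(activities):
    """Size of the largest compatible subset, found by trying every subset."""
    for r in range(len(activities), 0, -1):
        for subset in combinations(activities, r):
            ordered = sorted(subset, key=lambda a: a[1])
            if all(ordered[i][1] <= ordered[i + 1][0]
                   for i in range(len(ordered) - 1)):
                return r
    return 0


sample = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9),
          (6, 10), (8, 11), (8, 12), (2, 14), (12, 16)]
picked = greedy_select(sample)
```

On this sample the greedy rule picks (1, 4), (5, 7), (8, 11), (12, 16), and the brute-force search finds no compatible subset larger than those four.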

Greedy algorithm to finish a task with time constraint

This is a question from my midterm today and I wonder how to solve it. All I know is how to prove a greedy algorithm using induction.
Question:
You are working on a programming project. There are n Java classes C1, C2, ..., Cn (the bossy architect says so). The architect also says that these classes have to be implemented in order (you are not allowed to implement C2 before you have completed C1 and so on).
Each of the Java classes takes at most 8 hours to implement. You work exactly 8 hours a day, and you should not leave a Java class unfinished at the end of the day.
To complete the project as soon as possible, a strategy is to implement as many classes as you can every day. Prove that this greedy strategy is indeed the optimal one.
(Hint: let ti be the total number of classes completed in the first i days using the above strategy. The strategy always stays ahead if ti is no less than the total number of classes completed in the first i days using any other strategy)
This problem is similar to the classic task scheduling case where the waiting time in the system must be minimized.
Let C1, C2, ..., Cn be your projected classes and c[1], c[2], ..., c[n] their required implementation times. Let's suppose you implement C1, C2, ... Cn in this order. Therefore, the total time (waiting + implementation) for each class Ck will be:
c[1] + c[2] + ... + c[k]
Therefore, we have the total time:
T = n·c[1] + (n – 1)·c[2] + ... + 2·c[n – 1] + c[n] = sum(k = 1 to n) of (n – k + 1)·c[k]
(Sorry about the presentation — superscripts, subscripts, and math equations aren't supported...)
Let's suppose the implementation times in our permutation are not sorted by ascending order. We can therefore find two integers a and b such that a < b with c[a] > c[b]. If we switch them in the computation of T, we have:
T' = (n – a + 1)·c[b] + (n – b + 1)·c[a] + sum(k = 1 to n except a, b) of (n – k + 1)·c[k]
We finally compute T – T':
T – T' = (n – a + 1)(c[a] – c[b]) + (n – b + 1)(c[b] – c[a]) = (b – a)(c[a] – c[b])
Following our initial hypothesis (a < b and c[a] > c[b]), we have b – a > 0 and c[a] – c[b] > 0 as well, hence T – T' > 0.
This proves that we decrease the total waiting time by switching any pair of tasks so that the shorter one is done first.
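The identity T – T' = (b – a)(c[a] – c[b]) can be checked numerically with a few lines of Python. The function name total_time and the sample durations are illustrative; note that Python lists are 0-based while the formulas above are 1-based.

```python
def total_time(c):
    """Sum of completion times: T = sum over k of (n - k + 1) * c[k], 1-based k."""
    n = len(c)
    # With 0-based index i = k - 1 the weight (n - k + 1) becomes (n - i).
    return sum((n - i) * c[i] for i in range(n))


c = [5, 2, 7, 1]            # c[1] = 5 > c[4] = 1, so a = 1 and b = 4 (1-based)
swapped = [1, 2, 7, 5]      # same durations with positions a and b exchanged

T = total_time(c)               # 4*5 + 3*2 + 2*7 + 1*1 = 41
T_prime = total_time(swapped)   # 4*1 + 3*2 + 2*7 + 1*5 = 29
difference = T - T_prime        # (b - a) * (c[a] - c[b]) = 3 * 4 = 12
```

The swap lowers the total by exactly (b – a)(c[a] – c[b]), as the derivation predicts.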
Your problem statement is the same, except that before starting to implement a new class, you have to check whether you should start it now (if there is enough time left in the present day) or tomorrow. But the principle proven here holds when it comes to minimizing the total "waiting" time.
This is not a programming question for SO. The problem is not asking for a coding solution; rather, it asks for a proof that greedy is optimal, which can be done with a proof by contradiction (no doubt taught in the class before the midterm).
What you want to do is calculate the total time taken by greedy (there's only one solution) and show that no swap between days leads to a better solution. You probably also have to add something about how swapping would let you permute the order into the optimal solution, if it exists.
I was going to write some formulae, but I realize Jeff Morin already has the equations, just going in the opposite direction. I think starting from the greedy solution might be easier to explain, since 'in order' is pretty much defined by the problem and you can only shift which day the work is done on.
The problem statement is incomplete. There is no indication that any class will take less than 8 hours. Since you can't leave any class unfinished, you must start each class at the start of a day to be sure of having the full 8 hours to work on it. So if C2 really takes 3 hours and C3 really takes 5 hours, then a greedy algorithm would allow both classes to be done the same day. But after C2 takes 3 hours, you have to wait until day 3 to start C3 to be sure that you have enough time to finish, since you don't know how long C3 will take.
So the restrictions really end up dictating that the effort will take n days, 1 day per class. So the implementation algorithm is strictly sequential, not greedy.
Edit: restrictions stated in the problem.
(1) There are n Java classes C1, C2, ..., Cn
(2) these classes have to be implemented in order (you are not allowed to implement C2 before you have completed C1 and so on).
(3) Each of the Java classes takes at most 8 hours to implement
But there is no estimate for any class taking less than 8 hours.
(4) You work exactly 8 hours a day
(5) You should not leave a Java class unfinished at the end of the day.
The gist of this (3, 4 and 5): let's assume that I work on class 1 for 5 minutes. I now have 7 hours 55 minutes left. Can I start on class 2? No, because it might take up to 8 hours and I must finish before the end of my 8-hour day. So I must wait until day 2 to start class 2, and so on. Thus the implementation is strictly sequential and will take n days to complete, 1 day per class.
In order to use the Greedy algorithm you'd need additional information.
(6) You also know that each class has a known number of hours needed to code the class - h1, h2, h3, ..., hn. So class 1 takes h1 hours, class 2 takes h2 hours and so on. (From item 3 no class takes more than 8 hours)
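Under the extra assumption in item (6) that the hours h1, ..., hn are known, the greedy strategy and a brute-force comparison can be sketched in Python as below. The function names and the sample hours are illustrative; the brute force simply tries every way of placing day boundaries between consecutive classes.

```python
from itertools import product


def greedy_days(hours, day_length=8):
    """Days needed when each day is filled with as many classes as fit, in order."""
    days = 1 if hours else 0
    used = 0
    for h in hours:
        if h > day_length:
            raise ValueError("a class may not take longer than one day")
        if used + h > day_length:     # class does not fit today: start a new day
            days += 1
            used = 0
        used += h
    return days


def min_days_brute_force(hours, day_length=8):
    """Minimum days over every valid placement of day boundaries (order fixed)."""
    n = len(hours)
    if n == 0:
        return 0
    best = n                          # one class per day always works
    for breaks in product([False, True], repeat=n - 1):
        days, used, valid = 1, hours[0], True
        for h, new_day in zip(hours[1:], breaks):
            if new_day:
                days, used = days + 1, h
            else:
                used += h
                if used > day_length:
                    valid = False
                    break
        if valid:
            best = min(best, days)
    return best


sample_hours = [3, 5, 4, 4, 2, 6, 8, 1]
```

For sample_hours both functions return 5: the greedy schedule is (3, 5), (4, 4), (2, 6), (8), (1), and no other placement of day boundaries does better.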

Breaking A* admissibility caused exponential speed-up?

I've been working on a generalized version of the sliding tile puzzle where the tiles do not have numbers. Instead, each location either has a tile or a hole and is represented with a boolean as true or false (tile or hole).
The point of the search is to take an initial state with n tiles and a goal state with n target locations and use A* to find the solution of how to move the tiles so that every target location is populated. Here is an example below for a 4x3 grid:
Initial State:
T F T F
F F T F
F F T T
Goal State
T T T T
T F F F
F F F F
I had been working on different heuristics to do this and the most successful had a logic that went something like this:
int heuristicVal = 0
for every tile (i)...
    int closest = infinity
    for every goal location (j)...
        if (manhattan distance of ij < closest) closest = manhattan distance of ij
    end for
    heuristicVal += closest
end for
return heuristicVal
Unfortunately, this was still too slow in situations where two or more tiles were being guided by the heuristic to the same target location. I tried multiplying heuristicVal by the number of tiles and suddenly there was an exponential speed-up. Problems that were taking 28 seconds before were taking less than 1 second.
Edit: It turns out it is not always producing optimal solutions after all with this change. However, I don't understand why it sped up so much or why it is still finding the correct (although suboptimal) answer despite no longer being admissible.
If you break admissibility, A* no longer works correctly. Note that "no longer works correctly" doesn't mean you will never get an optimal result; you are just no longer guaranteed to get one. You can also end up converging faster on a solution, but what's the point if that solution is not the right one?
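For reference, the heuristic from the question and its scaled variant can be sketched in Python as follows. The helper names (parse, closest_goal_heuristic, scaled_heuristic) are made up for this example. Scaling a heuristic by a constant factor is the idea behind what is usually called weighted A*: the search leans harder on the heuristic and tends to expand fewer nodes, at the cost of the optimality guarantee.

```python
def parse(grid):
    """Return the (row, col) positions marked 'T' in a grid given as text rows."""
    return [(r, c)
            for r, row in enumerate(grid)
            for c, cell in enumerate(row.split())
            if cell == "T"]


def closest_goal_heuristic(tiles, goals):
    """Sum over tiles of the Manhattan distance to the nearest goal location."""
    return sum(min(abs(tr - gr) + abs(tc - gc) for gr, gc in goals)
               for tr, tc in tiles)


def scaled_heuristic(tiles, goals):
    """The question's variant: the same sum multiplied by the number of tiles."""
    return len(tiles) * closest_goal_heuristic(tiles, goals)


initial = parse(["T F T F",
                 "F F T F",
                 "F F T T"])
goal = parse(["T T T T",
              "T F F F",
              "F F F F"])
h = closest_goal_heuristic(initial, goal)       # 0 + 0 + 1 + 2 + 2 = 5
h_scaled = scaled_heuristic(initial, goal)      # 5 tiles * 5 = 25
```

On the 4x3 example the plain heuristic is 5 and the scaled one is 25, so the scaled estimate can exceed the true remaining cost, which is why optimal solutions are no longer guaranteed.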

Push-relabel gap heuristics

I don't understand how to implement gap heuristics with push relabel. Wiki described it like this:
"In gap relabeling heuristic we maintain an array A of size n, holding in A[i]
the number of nodes for each label (up to n). If a label d is found, such that
A[d] = 0, then all nodes with label > d are relabeled to label n."
Use a gap heuristic. If there is a 'k' such that no node has height(u) = k, you can set height(u) = max(height(u), height(source) + 1) for all nodes except the source for which height(u) > k. This is because any such 'k' represents a minimal cut in the graph, and no more flow will go from the nodes S = {u where height(u) > k} to the nodes in T = {v where height(v) < k}. Otherwise there would be an edge (u, v) with u in S, v in T and residual capacity > 0. But a valid labeling requires height(u) <= height(v) + 1, contradicting height(u) > k and height(v) < k.
Can someone explain to me in pseudocode how to add the gap heuristic to a FIFO push-relabel as shown in wiki's sample code?
Thanks,
Vince
It might be a little late but here is a link to a Stanford University notebook where you can find a push-relabel maximum flow using a Gap Heuristic in C.
I hope it helps you.
http://www.stanford.edu/~liszt90/acm/notebook.html#file3
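As a rough sketch of how the gap heuristic can be attached to a FIFO push-relabel loop, here is a small Python version. It is not the Stanford notebook code or the Wikipedia sample; it assumes a dense capacity matrix, and all names (max_flow_gap, count, gap, and so on) are chosen for this example. The array count plays the role of A in the quoted description.

```python
from collections import deque


def max_flow_gap(capacity, source, sink):
    """FIFO push-relabel on an n x n capacity matrix, with a gap heuristic."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    excess = [0] * n
    count = [0] * (2 * n + 1)         # count[h] = number of nodes with height h
    height[source] = n
    count[0], count[n] = n - 1, 1
    queue, in_queue = deque(), [False] * n

    def enqueue(v):
        if v != source and v != sink and not in_queue[v]:
            in_queue[v] = True
            queue.append(v)

    def push(u, v):
        d = min(excess[u], capacity[u][v] - flow[u][v])
        flow[u][v] += d
        flow[v][u] -= d
        excess[u] -= d
        excess[v] += d
        enqueue(v)

    def relabel(u):
        count[height[u]] -= 1
        height[u] = 1 + min(height[v] for v in range(n)
                            if capacity[u][v] - flow[u][v] > 0)
        count[height[u]] += 1

    def gap(k):
        # Height k is about to become empty, so no node at height k..n-1 can
        # reach the sink any more: lift all of them above the source at once.
        for v in range(n):
            if v != source and v != sink and k <= height[v] < n:
                count[height[v]] -= 1
                height[v] = n + 1
                count[n + 1] += 1

    def discharge(u):
        while excess[u] > 0:
            for v in range(n):
                if excess[u] == 0:
                    break
                if capacity[u][v] - flow[u][v] > 0 and height[u] == height[v] + 1:
                    push(u, v)
            if excess[u] > 0:
                if count[height[u]] == 1 and height[u] < n:
                    gap(height[u])    # u is alone at its height: a gap appears
                else:
                    relabel(u)

    for v in range(n):                # saturate every edge leaving the source
        if capacity[source][v] > 0:
            flow[source][v] = capacity[source][v]
            flow[v][source] = -capacity[source][v]
            excess[v] += capacity[source][v]
            enqueue(v)
    while queue:
        u = queue.popleft()
        in_queue[u] = False
        discharge(u)
    return excess[sink]
```

The only change relative to a plain FIFO push-relabel is the branch in discharge: when the node about to be relabeled is the last one at its height, gap() lifts it and every node above that height (but below n) in one pass instead of relabeling them one step at a time.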
