I am new to machine learning, but I've read a lot about reinforcement learning in the past 2 days. I have an application that fetches a list of projects (e.g. from Upwork). A moderator manually accepts or rejects each project (based on some parameters explained below). If a project is accepted, I want to send a project proposal; if it is rejected, I'll ignore it. I am looking to replace that moderator with AI (among other reasons), so I want to know which reinforcement learning algorithm I should use for this.
Parameters:
Some of the parameters that should decide whether the agent accepts or rejects a project are listed below. Assuming I only want to accept projects related to web development (specifically backend/server-side), here is how the parameters should influence the agent.
Sector: If the project is related to the IT sector, it should have a higher chance of being accepted.
Category: If the project is in the Web Development category, it should have a higher chance of being accepted.
Employer Rating: Projects from employers with a rating of over 4 (out of 5) should have a higher chance of being accepted.
I thought Q-learning or SARSA would be able to help me out, but most of the examples I saw were related to the Cliff Walking problem, where the states depend on each other. That is not applicable in my case, since each project is independent of the previous one.
Note: I want the agent to be self-learning so that if in the future I start rewarding it for front-end projects too, it should learn that behavior. Therefore, suggesting a "pure" supervised learning algorithm won't work.
Edit 1: I would like to add that I have data (sector, category, title, employer rating etc.) of 3000 projects along with whether that project was accepted or rejected by my moderator.
Your problem can easily be solved using Q-learning. It just depends on how you design your problem. Reinforcement learning itself is a very robust framework that allows an agent to receive states from an environment and then perform actions given those states. Depending on those actions, it gets rewarded accordingly. For your problem, the structure will look like this:
State
States: a 3 x 1 vector, [Sector, Category, Employer Rating].
The sector component is an integer, where each integer represents a different sector. For example, 1 = IT, 2 = Energy, 3 = Pharmaceuticals, 4 = Automotive, etc.
The category component is also an integer, where each integer represents a different category, e.g. 1 = Web Development, 2 = Hardware, etc.
The employer rating component is an integer between 1 and 5, representing the rating. (A small encoding sketch follows.)
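As a tiny illustration, here is one way such an encoding could look; the lookup dictionaries and the encode_state helper are purely illustrative, not a fixed scheme:

SECTORS = {"IT": 1, "Energy": 2, "Pharmaceuticals": 3, "Automotive": 4}
CATEGORIES = {"Web Development": 1, "Hardware": 2}

def encode_state(sector, category, employer_rating):
    # Map the raw project fields onto the integer state vector
    # [sector, category, employer rating] described above.
    return (SECTORS.get(sector, 0),
            CATEGORIES.get(category, 0),
            int(round(employer_rating)))

# Example: an IT / web development project with a 4.6-rated employer
state = encode_state("IT", "Web Development", 4.6)  # -> (1, 1, 5)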
Action
Action: the output is a single integer.
The action space is binary: 1 or 0, where 1 = take the project and 0 = don't take the project.
Reward
The reward provides feedback to your system. In your case, you would only evaluate the reward if the action = 1, i.e., you took the project. This then allows your RL agent to learn how good a job it did taking the project.
Reward would be a function that looks something like this:
def reward(states):
    sector, category, emp_rating = states
    rewards = 0
    if sector == 1:        # the IT sector
        rewards += 1
    if category == 1:      # the web development category
        rewards += 1
    if emp_rating == 5:    # highest rating
        rewards += 2
    elif emp_rating == 4:  # 2nd highest rating
        rewards += 1
    return rewards
To enhance this reward function, you can actually give some sectors negative rewards, so the agent is penalised if it takes those projects. I left that out here to keep things simple.
You can also edit the reward function in the future so your agent learns new behaviour, such as valuing some sectors more than others.
Edit: yes, regarding lejlot's comment, this is basically a multi-armed bandit problem, where there is no sequential decision making. The setup of the bandit problem is basically the same as Q-learning minus the sequential part. All you're concerned with is that you have a project proposal (state), you make a decision (action), and then you get your reward. It does not matter what happens next in your case.
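To make that concrete, here is a minimal sketch of the one-step setup, assuming the state tuples and the reward() function from above; the learning rate and epsilon values are arbitrary:

import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated value
alpha = 0.1             # learning rate
epsilon = 0.1           # exploration rate

def choose_action(state):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        return random.choice([0, 1])
    return max((0, 1), key=lambda a: Q[(state, a)])

def update(state, action, r):
    # one-step (bandit-style) update: no bootstrapping from a next state
    Q[(state, action)] += alpha * (r - Q[(state, action)])

# Example: IT sector, web development, employer rated 5
state = (1, 1, 5)
action = choose_action(state)
r = reward(state) if action == 1 else 0  # reward() as defined above
update(state, action, r)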
I am trying my hand at reinforcement / deep Q-learning these days, and I started with a basic game of Snake,
with the help of this article: https://towardsdatascience.com/how-to-teach-an-ai-to-play-games-deep-reinforcement-learning-28f9b920440a
which I successfully trained to eat food.
Now I want it to eat the food in a specific number of steps, say 20, not more, not less. How should the reward system and policy be changed for this?
I have tried many things, with little to no result.
For example I tried this:
def set_reward(self, player, crash):
    self.reward = 0
    if crash:
        self.reward = -10
        return self.reward
    if player.eaten:
        self.reward = 20 - abs(player.steps - 20) - player.penalty
        if player.steps == 10:
            self.reward += 10  # -abs(player.steps - 20)
        else:
            player.penalty += 1
            print("Penalty:", player.penalty)
    return self.reward
Thank You.
Here is the program:
https://github.com/maurock/snake-ga
I would suggest this approach is problematic because, despite changing your reward function, you haven't included the number of steps in the observation space. The agent needs that information in the observation space to be able to differentiate at what point it should bump into the goal. As it stands, if your agent is next to the goal and all it has to do is turn right, but all it has done so far is five moves, that is exactly the same observation as if it had done 19 moves. The point is that you can't feed the agent the same state and expect it to take different actions: the agent doesn't see your reward function, it only receives a reward based on the state. So you are giving it contradictory targets for identical observations.
Think about when you come to testing the agent's performance. There is no longer a reward. All you are doing is passing the network a state, and you are expecting it to choose different actions for the same state.
I assume your state space is some kind of 2D array. It should be straightforward to alter the code to include the number of steps in the state. Then the reward function would be something like: if observation[num_steps] == 20: reward = 10.
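A minimal sketch of that idea; the helper names, the flattened-grid observation, and the reward values are assumptions, not the snake-ga code:

import numpy as np

def build_observation(grid_state, num_steps, max_steps=40):
    # Append the (normalised) step counter to the existing observation so that
    # identical board positions at different step counts look different to the network.
    return np.append(grid_state.flatten(), num_steps / max_steps)

def step_reward(eaten, crashed, num_steps, target_steps=20):
    if crashed:
        return -10
    if eaten:
        # Largest reward when the food is eaten in exactly target_steps moves,
        # shrinking the further the agent is from that target.
        return 10 - abs(num_steps - target_steps)
    return 0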
Ask if you need more help coding it.
How do people deal with problems where the legal actions differ between states? In my case I have about 10 actions in total, and the legal actions do not overlap, meaning that in certain types of states the same 3 actions are always legal, and those actions are never legal in the other types of states.
I'm also interested in seeing whether the solutions would be different if the legal actions were overlapping.
For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q value to choose when I'm constructing the target value (i.e. instead of choosing the max, I choose the max among the legal actions...).
For policy-gradient type methods I'm less sure of what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?
There are two closely related works from the last two years:
[1] Boutilier, Craig, et al. "Planning and learning with stochastic action sets." arXiv preprint arXiv:1805.02363 (2018).
[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI. 2020.
Currently this problem does not seem to have one universal and straightforward answer. Maybe because it is not that much of an issue?
Your suggestion of choosing the best Q value among the legal actions is actually one of the proposed ways to handle this. For policy gradient methods you can achieve a similar result by masking the illegal actions and properly scaling up the probabilities of the remaining actions.
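A rough sketch of both ideas, assuming a per-state mask of 0s and 1s over the actions is available (the function names and shapes are illustrative):

import numpy as np

def masked_q_target(q_next, legal_mask, reward, gamma=0.99):
    # Q-learning target: bootstrap only over the legal actions of the next state.
    q_legal = np.where(legal_mask.astype(bool), q_next, -np.inf)
    return reward + gamma * q_legal.max()

def masked_policy(logits, legal_mask):
    # Policy gradient: push illegal logits to -inf before the softmax, so illegal
    # actions get probability 0 and the legal ones are renormalised automatically.
    masked = np.where(legal_mask.astype(bool), logits, -np.inf)
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()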
Another approach would be giving a negative reward for choosing an illegal action, or ignoring the choice and not making any change in the environment, returning the same reward as before. In one of my personal experiments (a Q-learning method) I chose the latter, and the agent learned what it had to learn, but it used the illegal actions as a 'no action' action from time to time. That wasn't really a problem for me, but negative rewards would probably eliminate this behaviour.
As you can see, these solutions don't change or differ when the actions are 'overlapping'.
Answering what you've asked in the comments: I don't believe you can train the agent under the described conditions without it learning the legal/illegal action rules. That would require, for example, something like separate networks for each set of legal actions, which doesn't sound like the best idea (especially if there are lots of possible legal action sets).
But is learning these rules hard?
You have to answer some questions yourself: is the condition that makes an action illegal hard to express or articulate? That is, of course, environment-specific, but I would say it is not that hard to express most of the time, and agents simply learn the rules during training. If it is hard, does your environment provide enough information about the state?
Not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible, then you can simply reflect that in the reward function (a big negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are not desirable in those specific states.
In exploration mode, the agent might still choose to take the illegal actions. However, in exploitation mode it should avoid them.
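A minimal sketch of that penalise-or-ignore handling; the env.step interface and the penalty value are assumptions, not from the question:

def step_with_penalty(env, state, action, legal_actions, illegal_penalty=-100):
    # If the chosen action is illegal, leave the environment unchanged and
    # return a large negative reward (alternatively, end the episode here).
    if action not in legal_actions:
        return state, illegal_penalty, False
    return env.step(action)  # assumed to return (next_state, reward, done)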
I recently built a DDQ agent for connect-four and had to address this. Whenever a column was chosen that was already full with tokens, I set the reward equivalent to losing the game. This was -100 in my case and it worked well.
In connect four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equivalent to losing and not a smaller negative number.
So if you use a penalty milder than losing, you'll have to consider what the implications are in your domain of allowing illegal moves to happen during exploration.
Yesterday my colleagues (all researchers) and I were bewildered by something. My colleague was asked to revise her article by the reviewers of a journal. She had done 3 repeated measures ANOVAs to answer each of the research questions, and the reviewers asked her to check for differences in the between subjects measure (in this case gender). She therefore set out to repeat her ANOVA (in SPSS) as a mixed model, with the repeated measures staying the same, but adding one between subjects measure (gender). She found no significant differences in gender and one significant interaction of a within subjects measure and gender (not problematic for this paper).
Here is the baffling part though: Her main within-subjects effects suddenly had changed - different F and p values were shown.
This defies my knowledge and understanding of what a mixed model ANOVA calculates. As I understand it:
F(within subjects main effect) = MS(within) / MS error(within)
F(within subjects * between subjects interaction) = MS(within * between) / MS error(within)
F(between subjects main effect) = MS(between) / MS error(between)
The F value of the within subjects main effect should not change, whether you are including a between subjects variable or not. This calculation is based on the same numbers either way and should therefore yield the same results either way. What are we missing here? What are we not understanding?
Edit: I tried a standalone Spark application (instead of PredictionIO) and my observations are the same. So this is not a PredictionIO issue, but still confusing.
I am using PredictionIO 0.9.6 and the Recommendation template for collaborative filtering. The ratings in my data set are numbers between 1 and 10. When I first trained a model with defaults from the template (using ALS.train), the predictions were horrible, at least subjectively. Scores ranged up to 60.0 or so but the recommendations seemed totally random.
Somebody suggested that ALS.trainImplicit did a better job, so I changed src/main/scala/ALSAlgorithm.scala accordingly:
val m = ALS.trainImplicit( // instead of ALS.train
  ratings = mllibRatings,
  rank = ap.rank,
  iterations = ap.numIterations,
  lambda = ap.lambda,
  blocks = -1,
  alpha = 1.0, // also added this line
  seed = seed)
Scores are much lower now (below 1.0) but the recommendations are in line with the personal ratings. Much better, but also confusing. PredictionIO defines the difference between explicit and implicit this way:
explicit preference (also referred as "explicit feedback"), such as "rating" given to item by users.
implicit preference (also referred as "implicit feedback"), such as "view" and "buy" history.
and:
By default, the recommendation template uses ALS.train() which expects explicit rating values which the user has rated the item.
source
Is the documentation wrong? I still think that explicit feedback fits my use case. Maybe I need to adapt the template with ALS.train in order to get useful recommendations? Or did I just misunderstand something?
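For reference, a rough PySpark equivalent of the two variants (assuming the pyspark.mllib API; the toy data and parameter values are only illustrative):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-comparison")

# Toy ratings: (user, item, rating on a 1-10 scale)
ratings = sc.parallelize([
    Rating(1, 101, 9.0),
    Rating(1, 102, 3.0),
    Rating(2, 101, 8.0),
])

# Explicit variant: tries to reproduce the rating values themselves.
explicit_model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)

# Implicit variant: treats the numbers as confidence that the user likes the item.
implicit_model = ALS.trainImplicit(ratings, rank=10, iterations=10,
                                   lambda_=0.01, alpha=1.0)

print(explicit_model.predict(2, 102))
print(implicit_model.predict(2, 102))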
A lot of it depends on how you gathered the data. Ratings that seem explicit can often actually be implicit. For instance, assume you give users the option of rating items that they have purchased or used before. The very fact that they have spent their time evaluating that particular item suggests the item is of high quality, while items of poor quality are not rated at all because people do not even bother to use them. So even though the dataset is intended to be explicit, you may get better results if you treat it as implicit. Again, this varies significantly based on how the data was obtained.
Explicit data (like ratings) normally comes with a bias: people go and rate a product because they like it! Think about your experience shopping and then rating on Amazon.com :-)
On the contrary, implicit info can often truly reflect a user's preference for a product, such as viewing duration, comment length, etc. Even a like/dislike is better than a rating because it provides a very simple 'bad' option without making the user think, "Should I give a 3, a 3.5, or a 4?".
I am trying to understand Q-learning.
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check whether it is contained in the lookup table and initialise it if not (with a default utility of 0).
3. Choose an action to take:
with probability 1 − ε, choose the state-action pair with the highest utility;
with probability ε, choose a random move
(where 0 < ε < 1, and ε decreases over time).
4. Update the current state's utility based on (a code sketch of steps 3-4 follows below):
Q(s_t, a_t) += α · [r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
where α is the learning rate and γ the discount factor.
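A minimal sketch of steps 3 and 4 in tabular form; the hyperparameter values and the action-list arguments are assumptions, not from my actual code:

import random
from collections import defaultdict

Q = defaultdict(float)  # lookup table: (state, action) -> utility, default 0
alpha, gamma, epsilon = 0.1, 0.99, 0.3  # epsilon should decay over time

def choose_action(state, actions):
    # step 3: epsilon-greedy selection
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    # step 4: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in next_actions) if next_actions else 0.0
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])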
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results are very poor: even after a couple of hundred games, the Q-learning agent is losing a lot more than it is winning. Furthermore, the change in win rate is almost non-existent, especially after the first couple of hundred games.
Am I missing something? I have implemented a couple of agents:
(Rote learning, TD(0), TD(Lambda), Q-learning)
But they all seem to be yielding similar, disappointing results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way. Generally, you'd want a simplified state representation together with a function approximator, such as a neural network, to solve this kind of problem with reinforcement learning.
Also, a couple of caveats:
Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.