Naive Bayes without Naive assumption - machine-learning

I'm trying to understand why the naive Bayes classifier is linearly scalable with the number of features, in comparison to the same idea without the naive assumption. I understand how the classifier works and what's so "naive" about it. I'm unclear as to why the naive assumption gives us linear scaling, whereas lifting that assumption is exponential. I'm looking for a walk-through of an example that shows the algorithm under the "naive" setting with linear complexity, and the same example without that assumption that will demonstrate the exponential complexity.

The problem lies in the following quantity:
P(x1, x2, x3, ..., xn | y)
which you have to estimate. When you assume "naiveness" (feature independence) you get
P(x1, x2, x3, ..., xn | y) = P(x1 | y)P(x2 | y) ... P(xn | y)
and you can estimate each P(xi | y) independently. In a natural way, this approach scales linearly, since if you add another k features you need to estimate another k probabilities, each using some very simple technique (like counting objects with given feature).
Now, without naiveness you do not have any decomposition. Thus you have to keep track of all probabilities of the form
P(x1=v1, x2=v2, ..., xn=vn | y)
for each possible value of vi. In the simplest case, vi is just "true" or "false" (the event happened or not), and this already gives you 2^n probabilities to estimate (one for each possible assignment of "true" and "false" to a series of n boolean variables). Consequently, the complexity of the algorithm grows exponentially. However, the biggest issue here is usually not the computational one but rather the lack of data. Since there are 2^n probabilities to estimate, you need more than 2^n data points to have any estimate for all possible events. In real life you will never encounter a dataset of 1,000,000,000,000 points... and that is roughly the number of required (unique!) points for 40 features with such an approach (2^40 ≈ 1.1 × 10^12).
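To make the two growth rates concrete, here is a minimal sketch that only counts how many probabilities each model has to estimate for boolean features. The function names and the default of two classes are assumptions made for illustration; they do not come from any library.

```python
from itertools import product

def naive_table_size(n_features: int, n_classes: int = 2) -> int:
    # Naive assumption: one number P(x_i = true | y) per feature and class.
    return n_features * n_classes

def full_joint_table_size(n_features: int, n_classes: int = 2) -> int:
    # No decomposition: one number per joint assignment (v1, ..., vn) and class.
    return (2 ** n_features) * n_classes

def joint_assignments(n_features: int):
    # Every assignment of true/false the full model needs an estimate for.
    return list(product([False, True], repeat=n_features))

for n in (3, 10, 20, 40):
    print(n, naive_table_size(n), full_joint_table_size(n))
```

Adding one more feature adds n_classes entries to the naive table, but doubles the full joint table.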

Candy Selection
On the outskirts of Mumbai, there lived an old Grandma, whose quantitative outlook towards life had earned her the moniker Statistical Granny. She lived alone in a huge mansion, where she practised sound statistical analysis, shielded from the barrage of hopelessly flawed biases peddled as common sense by mass media and so-called pundits.
Every year on her birthday, her entire family would visit her and stay at the mansion. Sons, daughters, their spouses, her grandchildren. It would be a big bash every year, with a lot of fanfare. But what Grandma loved the most was meeting her grandchildren and getting to play with them. She had ten grandchildren in total, all of them around 10 years of age, and she would lovingly call them "random variables".
Every year, Grandma would present a candy to each of the kids. Grandma had a large box full of candies of ten different kinds. She would give a single candy to each one of the kids, since she didn't want to spoil their teeth. But, as she loved the kids so much, she took great efforts to decide which candy to present to which kid, such that it would maximize their total happiness (the maximum likelihood estimate, as she would call it).
But that was not an easy task for Grandma. She knew that each type of candy had a certain probability of making a kid happy. That probability was different for different candy types, and for different kids. Rakesh liked the red candy more than the green one, while Sheila liked the orange one above all else.
Each of the 10 kids had different preferences for each of the 10 candies.
Moreover, their preferences largely depended on external factors which were unknown (hidden variables) to Grandma.
If Sameer had seen a blue building on the way to the mansion, he'd want a blue candy, while Sandeep always wanted the candy that matched the colour of his shirt that day. But the biggest challenge was that their happiness depended on what candies the other kids got! If Rohan got a red candy, then Niyati would want a red candy as well, and anything else would make her go crying into her mother's arms (conditional dependency). Sakshi always wanted what the majority of kids got (positive correlation), while Tanmay would be happiest if nobody else got the kind of candy that he received (negative correlation). Grandma had concluded long ago that her grandkids were completely mutually dependent.
It was computationally a big task for Grandma to get the candy selection right. There were too many conditions to consider and she could not simplify the calculation. Every year before her birthday, she would spend days figuring out the optimal assignment of candies, by enumerating all configurations of candies for all the kids together (which was an exponentially expensive task). She was getting old, and the task was getting harder and harder. She used to feel that she would die before figuring out the optimal selection of candies that would make her kids the happiest all at once.
But an interesting thing happened. As the years passed and the kids grew up, they finally left their teenage years behind and turned into independent adults. Their choices became less and less dependent on each other, and it became easier to figure out each one's most preferred candy (all of them still loved candies, and Grandma).
Grandma was quick to realise this, and she joyfully began calling them "independent random variables". It was much easier for her to figure out the optimal selection of candies - she just had to think of one kid at a time and, for each kid, assign a happiness probability to each of the 10 candy types for that kid. Then she would pick the candy with the highest happiness probability for that kid, without worrying about what she would assign to the other kids. This was a super easy task, and Grandma was finally able to get it right.
That year, the kids were finally the happiest all at once, and Grandma had a great time at her 100th birthday party. A few months following that day, Grandma passed away, with a smile on her face and a copy of Sheldon Ross clutched in her hand.
Takeaway: In statistical modelling, having mutually dependent random variables makes it really hard to find out the optimal assignment of values for each variable that maximises the cumulative probability of the set.
You need to enumerate over all possible configurations (which increases exponentially in the number of variables). However, if the variables are independent, it is easy to pick out the individual assignments that maximise the probability of each variable, and then combine the individual assignments to get a configuration for the entire set.
In Naive Bayes, you make the assumption that the variables are independent (even if they are actually not). This simplifies your calculation, and it turns out that in many cases, it actually gives estimates that are comparable to those which you would have obtained from a more (computationally) expensive model that takes into account the conditional dependencies between variables.
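The difference between the two situations in the story can be sketched in a few lines. The happiness table below is made up for illustration (three kids, two candy kinds), and the function names are hypothetical. The dependent case has to score whole configurations, while the independent case looks at one kid at a time.

```python
from itertools import product
from math import prod

# Hypothetical table: happiness[k][c] = P(kid k is happy | kid k gets candy c).
happiness = [
    {"red": 0.9, "green": 0.4},
    {"red": 0.2, "green": 0.7},
    {"red": 0.6, "green": 0.5},
]
candies = ["red", "green"]

def best_joint(score, n_kids, candies):
    # Dependent kids: the score needs the whole configuration, so all
    # len(candies) ** n_kids configurations have to be tried.
    return max(product(candies, repeat=n_kids), key=score)

def best_independent(happiness):
    # Independent kids: pick each kid's best candy on its own,
    # n_kids * len(candies) look-ups in total.
    return tuple(max(row, key=row.get) for row in happiness)

def independent_score(config):
    # Under independence the joint probability is a plain product.
    return prod(happiness[k][c] for k, c in enumerate(config))

print(best_independent(happiness))
print(best_joint(independent_score, len(happiness), candies))
print(len(candies) ** len(happiness), "configurations vs",
      len(candies) * len(happiness), "look-ups")
```

When the score really is a product of per-kid terms, both searches return the same assignment. For Grandma's 10 kids and 10 candy kinds the counts would be 10**10 configurations against 100 look-ups.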
I have not included any math in this answer, but hopefully this made it easier to grasp the concept behind Naive Bayes, and to approach the math with confidence. (The Wikipedia page is a good start: Naive Bayes).
Why is it "naive"?
The Naive Bayes classifier (in its Gaussian form) assumes that X|Y is normally distributed with zero covariance between any of the components of X. Since this is a completely implausible assumption for any real problem, we refer to it as naive.
Naive Bayes will make the following assumption:
If you like Pickles, and you like Ice Cream, naive bayes will assume independence and give you a Pickle Ice Cream and think that you'll like it.
Which may not be true at all.
For a mathematical example see: https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/

Related

Is reinforcement learning suitable for predicting bias in dice?

I would like to analyze a problem similar to the following.
Problem:
You will be given N dice.
You will be given a lot of data about each die (e.g. surface information, material information, location of the center of gravity, etc.).
The features of the dice are randomly generated every game, and the dice are thrown at the same speed, angle and initial position.
As a result of rolling a die, you get 1 point if you roll a 6 and 0 points otherwise.
There is training data for 100000 games (dice data and match results).
I would like to learn a rule for selecting only dice whose probability of rolling a 6 is higher than 1/6.
I apologize for the vague problem statement.
First of all, it was my mistake to say "N dice".
The dice can be considered one at a time.
One die with random characteristics is handed out.
When it is rolled, it is recorded whether a 6 came up or not.
It is easier to understand the problem as "there are 100,000 [characteristics, result] records".
If you roll something other than 6, you get -1 points.
If you roll a 6, you get +5 points.
Example:
X: feature vector of a die
f: function I want to know
f: X -> [0, 1]
(if result > 0.5, I pick this die.)
For example, a die with a 1/5 chance of rolling a 6 still shows a non-6 four times out of five, so I wondered whether it would be better to give an immediate reward.
Is it good to decide the reward by the number of points after 100000 games?
I have read some general reinforcement learning methods, but there is a concept of state transition. However, there is no state transition in this game. (Each game ends in 1 step, and each game is independent.)
I am a student just learning neural networks from scratch. It helps if you give me a hint. Thank you.
By the way,
I expect that what is learned could be summarized as "It is good to choose a die whose face farthest from the center of gravity is the 6."
Let's first talk about Reinforcement-Learning.
Problem setups, in order of increasing generality:
Multi-Armed Bandit - no state, just actions with unknown rewards
Contextual Bandit - rewards also depend on some context (state)
Reinforcement Learning (MDP) - actions can also influence the next state
Common to all three is that you want to maximize the sum of rewards over time, and there is an exploration vs exploitation trade-off. You are not just given a large dataset. If you want to know what the best action is, you have to try it a few times and observe the reward. This may cost you some reward you could have earned otherwise.
Of those three, the Contextual Bandit is the closest match to your setup, although it doesn't quite match your goals. It would go like this: Given some properties of the dice (the context), select the best die from a group of possible choices (the action, e.g. the output from your network), such that you get the highest expected reward. At the same time you are also training your network, so you have to pick bad or unknown properties sometimes to explore them.
However, there are two reasons why it doesn't match:
You already have data from 100,000 games, and do not seem interested in minimizing the cost of trial and error to acquire more data. You assume this data is representative, so no exploration is required.
You are only interested in prediction. You want to classify the dice into "good for rolling 6" vs "bad". This piece of information could be used later to make a decision between different choices if you know the cost of making a wrong decision. If you are just learning f() because you are curious about the properties of a die, then this is a pure statistical prediction problem. You don't have to worry about short- or long-term rewards. You don't have to worry about selection or consequences of any actions.
Because of this, you actually only have a supervised learning problem. You could still solve it with reinforcement learning because RL is more general. But your RL algorithm would be wasting a lot of time figuring out that it really cannot influence the next state.
Supervised Learning
Your die actually behaves like a biased coin: it's a Bernoulli trial with ~1/6 success probability. This is now a standard classification problem: given your features, predict the probability that a die will lead to a good match result.
It seems that your "match results" can easily be converted into the number of rolls and the number of positive outcomes (rolled a six) with the same die. If you have a large number of rolls for every die, you can simply classify this die and use this class (together with the physical properties) as one data point to train your network.
You can do fancier things if you have fewer rolls, but I won't go into that. (If you are interested, have a look at the beta distribution and how the cross-entropy loss works with neural networks.)
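As a rough sketch of the two ideas mentioned here: with the +5 / -1 scoring from the question, the expected reward of a die is 5p - (1 - p) = 6p - 1, which is positive exactly when p > 1/6. With few rolls, a Beta prior gives a smoothed estimate of p. The function names and the Beta(1, 5) prior (chosen so that its mean is 1/6) are assumptions for illustration, not part of the original problem.

```python
def expected_reward(p_six, win=5.0, loss=-1.0):
    # Expected score of one roll with the +5 / -1 scoring from the question.
    return p_six * win + (1.0 - p_six) * loss

def should_pick(p_six):
    # Picking a die pays off on average only if p_six > 1/6.
    return expected_reward(p_six) > 0.0

def posterior_mean_p_six(sixes, rolls, prior_a=1.0, prior_b=5.0):
    # Mean of the Beta(prior_a + sixes, prior_b + rolls - sixes) posterior.
    # The Beta(1, 5) default is an assumed prior with mean 1/6.
    return (prior_a + sixes) / (prior_a + prior_b + rolls)

print(should_pick(0.2))                         # a 1/5 die is worth picking
print(round(posterior_mean_p_six(30, 120), 3))  # estimate after 30 sixes in 120 rolls
```

With no rolls at all the estimate falls back to the prior mean of 1/6, and it moves towards the observed frequency as rolls accumulate.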

Optimize deep Q network with long episode

I am working on a problem which we aim to solve with deep Q-learning. However, training takes too long: each episode takes roughly 83 hours. We envision solving the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode, we need to perform 100*10 iterations of certain operations. Basically we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding the whole matrix as the input.
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure seems not suitable for some "distributed" system, if I understood correctly.
Could anyone shed some light on potential optimization opportunities here, such as extra engineering effort? Any suggestions and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element as empty
1. action space:
each step I will select one element from a candidate pool of 1000 elements. Then insert the element into the matrix one by one.
2. environment:
each step I will have an updated matrix to learn.
An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as the input and performs a very costly computation, and it returns a quantitative value to indicate the quality of the synthesized matrix so far.
This function is essentially used to measure some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", that's just an empirical estimate. But it shouldn't be fewer than 100 episodes, at least.
4. constraints
Ideally, like I mentioned, "All the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than the latest selected element.
Indeed by appending more and more elements in the matrix, the reward could increase, or it could decrease as well.
5. goal
The synthesized matrix should let the oracle function F return a value greater than 25000. Whenever it reaches this goal, I will terminate the learning step.
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up the most time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
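Before profiling, a back-of-the-envelope estimate from the figures in the question already says something. The sketch below takes those figures at face value (100 * 10 steps per episode, about 120 seconds per evaluation of F, about 83 hours per episode, 100 episodes) and assumes that any time not spent in F is unaffected by speeding F up. That split is an assumption for the estimate, not something the question states.

```python
# Figures taken from the question; the split into "reward" and "other"
# time is an assumption made for this estimate.
STEPS_PER_EPISODE = 100 * 10   # one step per matrix entry
REWARD_SECONDS = 120           # one evaluation of F
EPISODE_HOURS = 83.0           # reported wall-clock time per episode
EPISODES = 100                 # envisioned number of episodes

def reward_hours_per_episode():
    # Hours per episode spent purely inside F.
    return STEPS_PER_EPISODE * REWARD_SECONDS / 3600

def episode_hours(reward_speedup=1.0):
    # Episode time if only F gets faster by the given factor.
    reward = reward_hours_per_episode()
    other = EPISODE_HOURS - reward
    return other + reward / reward_speedup

def total_days(reward_speedup=1.0):
    return EPISODES * episode_hours(reward_speedup) / 24

print(round(reward_hours_per_episode(), 1))  # hours per episode inside F
print(round(episode_hours(10), 1))           # episode time with a 10x faster F
print(round(total_days(), 1))                # days for 100 episodes as reported
```

On these numbers the evaluations of F account for about 33 of the 83 hours per episode, so even a much faster F would leave roughly 50 hours per episode unexplained. A profile would show where that remaining time goes.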
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected experience equivalent to 900 years per day. In the original Deep Q-network paper, achieving performance close to that of a typical human required hundreds of millions of game frames, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies using a deep neural network with millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding to your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out if you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether you can use a shallow model (e.g., random trees). However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due to numerous reasons and I perfectly understand.
You have estimated the number of required episodes to solve the problem based on some empirical studies using a smaller version with a 20*10 matrix. Just a note of caution: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the state space dimensionality grows, although maybe that is not your case.
That said, I'm looking forward to seeing an answer that really helps you to solve your problem.

Classification Accuracy Only 5% Higher Than Random Picking

I am trying to predict a public DotA 2 match outcome with given hero picks. It is usually possible for a human. There could only be 2 outcomes for a given side: it is either a win or a loss.
In fact, I am new to machine learning. I wanted to do this mini-project as an exercise but it already took 2 days of my time.
So, I made a dataset of around 2000 matches from about the same skill bracket. Each match contains exactly 13 000 features. Each feature is either 0 or 1 and specifies whether Radiant has a certain hero or not, whether Dire has a certain hero or not, and whether Radiant has one hero and Dire another at the same time (and vice versa). All combinations sum up to around 13000 features. Most of them are 0, of course. Labels are also 0 or 1 and indicate whether the Radiant team won.
I used different sets for training and for testing.
Logistic regression classifier gave me 100% accuracy on training set and around 58% accuracy for test set.
SVM on the other hand scored 55% on training and 53% on test.
When I decreased the number of examples by 1000, I got 54.5% on training and 55% on test.
Should I continue increasing number of examples?
Should I select different features?
If I add more combinations of heroes, the number of features will explode. Or maybe there is no way to predict the match outcome judging only by the heroes selected, and I need to gather data about each player's online rating, the hero they selected, and so on?
Plot of prediction accuracy based on number of training examples:
Check out 2 latest graphs I added. I think I've got pretty decent results.
Also:
1. I asked 2 friends of mine to predict 10 matches and they both predicted 6 right. This amounts to 60%, just as you said. 10 matches is not a big set, but they won't bother with bigger ones.
2. I downloaded the 400 000 latest dota matches. MMR >3000, only all pick mode. Assuming that 1 billion dota matches are played each year, 400k are from the same patch.
3. Concatenating hero picks of both sides was the original idea. Also, there are 114 heroes in dota, so I have 228 features now.
4. In most matches the odds are more or less equal, but there is a fraction of picks where one of the teams has an advantage, from small up to critical.
What I ask you to do is to verify my conclusions, because the results I've got are too good for a linear model.
Probabilities test
Actual probabilities and predicted probability ranges
Distribution of predictions by probability range
The issue here is with your assertion that predicting a dota 2 match based on hero picks is "usually possible for a human". For this particular task it's likely more that there's a low cap on possible accuracy than anything. I watch a lot of dota, and even when you focus on the pro scene the accuracy of casters based on hero picks is quite low. My very preliminary analysis puts their accuracy at within spitting distance of 60%.
Secondly, how many dota matches are actually determined by hero picks? It's not many. In the vast majority of cases, especially in pub matches where skill levels are highly variable, team play matters much more than hero picks.
That's the first issue with your problem, but there are definitely other large issues with the way you've structured the problem; fixing them could help you get another couple of accuracy points (though again I doubt you can get far above 60%).
My first suggestion would be to change the way you're generating features. Feeding 13k features into an LR model with 2k examples is a recipe for disaster, especially in the case of dota, where individual heroes don't matter very much and synergies and counters are drastically more important. I would start by reducing your feature count to ~200 by just concatenating hero picks by both sides: 111 for Radiant, 111 for Dire, 1 if the hero is picked, 0 otherwise. This will help with overfitting, but then you run into the issue that LR is not a particularly good fit for the problem, because individual heroes don't matter as much.
My second suggestion would be to constrain your match search to a single patch, ideally a later one where there's sufficient data. If you can get different data for different patches, so much the better. An LR approach will give you decent accuracy for patches where specific heroes are overpowered, and especially at small data sizes you're a bit hosed if you're mixing patches, as the heroes are actually changing.
My third suggestion would be to change to a model that's better at modelling inter-dependencies between your features. Random forest models are a pretty easy and straightforward approach that should give you better performance than straight LR for a problem like this, and sklearn has a built-in implementation.
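The concatenated encoding can be sketched as follows. The function name, the 0-based hero ids and the example picks are assumptions for illustration; n_heroes defaults to the 111 used above and should be set to whatever the hero pool is for the patch in question.

```python
def encode_picks(radiant, dire, n_heroes=111):
    # First n_heroes slots: 1 if Radiant picked that hero.
    # Next n_heroes slots: 1 if Dire picked that hero.
    features = [0] * (2 * n_heroes)
    for hero in radiant:
        features[hero] = 1
    for hero in dire:
        features[n_heroes + hero] = 1
    return features

# Hypothetical match: hero ids are made up.
x = encode_picks(radiant=[0, 7, 23, 54, 110], dire=[3, 8, 19, 42, 77])
print(len(x), sum(x))
```

Each match then has 2 * n_heroes features with exactly ten ones, instead of roughly 13000 pairwise indicators.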
If you want to go a bit further, then using an MLP-style network model could be relatively effective. I don't see an obvious framing of the problem to take advantage of modern network models though (CNNs and RNNs), so unless you change the problem definition a bit I think that this is going to be more hassle than it's worth.
As always, when in doubt, get more data, and don't forget that people are very, very bad at this problem as well.

Assistance regarding model choice

I'm new to and investigating Machine Learning. I have a use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, the apprentices either have a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and is weighted by complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we will run it, i.e. if we run it on day one of the semester, I assume everyone will fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models like Naive Bayes, Neural Networks, Decision Trees, etc. which can be used depending upon the type of the data. If you are looking for an answer which suggests a particular ML model, then I would need more information about the data. However, in general, this flow-chart can be helpful in making that selection. You can also read about Model Selection in the 5th lecture of Andrew Ng's CS229.
Now coming back to the basic methodology, some of these features like initial interview scores, entry exam scores, etc. you already know in advance. Whereas, some of them like performance in procedures are known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure-scores as 0, take them as a mean/median of the past performances in other procedures by the subject-apprentice.
You can even build a sub-model to analyze the relation between procedure-scores and interview-scores as they are not completely independent. (I will explain this sentence in the later part of the answer)
However, if the semester is the very first semester of the subject-apprentice, then you won't have such data for that apprentice. In that case, you might need to consider the average performance of other apprentices with profiles similar to the subject-apprentice. If the data-set is not very large, a K Nearest Neighbors approach can be quite useful here. However, KNN becomes slow on large data-sets and suffers from the curse of dimensionality when there are many features.
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
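The nearest-neighbour fallback for a first-semester apprentice could look roughly like this. The profile layout (interview score, entry exam score), the example numbers and the function name are all hypothetical. In practice the features should be scaled to comparable ranges before computing distances.

```python
from statistics import mean

def knn_score_estimate(target_profile, past_apprentices, k=3):
    # past_apprentices: list of (profile_vector, average_procedure_score).
    def distance(a, b):
        # Plain Euclidean distance between two profile vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = sorted(
        past_apprentices,
        key=lambda item: distance(target_profile, item[0]),
    )[:k]
    # Use the neighbours' mean score instead of an initial score of 0.
    return mean(score for _, score in nearest)

# Hypothetical history: (interview score, entry exam score) -> procedure score.
history = [
    ((70, 80), 60.0),
    ((72, 78), 64.0),
    ((30, 40), 20.0),
    ((71, 79), 62.0),
]
print(knn_score_estimate((70, 79), history, k=3))
```

The estimate stands in for the missing procedure scores until the apprentice has scores of their own.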
Most probably (although it's just a hypothesis), y will depend more on the initial variables than on the variables obtained later, the reason being that the later variables are not completely independent of the former variables.
My point is, if a model can be created to predict the output of a semester, then, a similar model can be created to predict just the output of the 1st procedure-test.
In the end, as the model might be heavily based on demographic factors and other things, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc., as they are heavily dependent upon real-time dynamic data.
For dynamic predictions based on different procedure performances, Time Series Analysis can be a bit helpful. But in any case, the final result will depend heavily on the apprentice's continuity in motivation and performance, which will become clearer towards the end of each semester.

Is Q-learning without a final state even possible?

I have to solve this problem with Q-learning.
Well, actually I have to evaluate a Q-learning based policy on it.
I am a tourist manager.
I have n hotels, each of which can hold a different number of people.
For each person I put in a hotel I get a reward, based on which room I have chosen.
If I want, I can also murder the person, so they go into no hotel, but that gives me a different reward.
(OK, that's a joke... but it's to say that I can have a self-transition, so the number of people in my rooms doesn't change after that action.)
My state is a vector containing the number of people in each hotel.
My action is a vector of zeroes and ones which tells me where to put the new person.
My reward matrix is formed by the rewards I get for each transition between states (even the self-transition one).
Now, since I can get an unlimited number of people (i.e. I can fill the hotels but I can go on killing them), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...
What am I getting wrong? Should I choose a random state as final? Have I missed the point entirely?
This question is old, but I think merits an answer.
One of the issues is that there is not necessarily the notion of an episode and a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma less than one that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
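A small sketch of the discounted return described here, with a made-up reward stream. It shows why gamma < 1 keeps the return finite even when rewards never stop: a constant reward r per step gives a return of r / (1 - gamma).

```python
def discounted_return(rewards, gamma):
    # G = r_0 + gamma * r_1 + gamma**2 * r_2 + ...
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], 0.5))        # 1 + 0.5 + 0.25
print(discounted_return([1] * 1000, 0.9))       # close to 1 / (1 - 0.9)
print(discounted_return([1] * 1000, 1.0))       # grows with the horizon
```

With gamma = 1 the sum simply grows with the number of steps, which is why the undiscounted return is only usable when episodes end.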
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
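As a minimal illustration, here is tabular Q-learning on a toy version of the hotel problem with no terminal state. Everything concrete in it is an assumption made for the sketch: two hotels with capacities 2 and 1, the reward values, the epsilon-greedy behavior policy and the hyperparameters. Guests never leave, so once the hotels are full the only action left is the self-transition, and the process simply continues forever.

```python
import random
from collections import defaultdict

CAPACITY = (2, 1)          # hypothetical: two hotels with 2 and 1 beds
PLACE_REWARD = (1.0, 3.0)  # hypothetical reward for placing a guest in each hotel
REJECT_REWARD = -0.5       # hypothetical reward for the self-transition
REJECT = len(CAPACITY)     # action index meaning "turn the guest away"

def legal_actions(state):
    # Hotels with a free bed, plus the always-available self-transition.
    return [i for i, cap in enumerate(CAPACITY) if state[i] < cap] + [REJECT]

def step(state, action):
    if action == REJECT:
        return state, REJECT_REWARD
    nxt = list(state)
    nxt[action] += 1
    return tuple(nxt), PLACE_REWARD[action]

def q_learning(steps=5000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)             # Q[(state, action)], zero by default
    state = (0,) * len(CAPACITY)
    for _ in range(steps):             # one long run, never reset
        actions = legal_actions(state)
        if rng.random() < epsilon:     # behavior policy: epsilon-greedy
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        nxt, reward = step(state, action)
        # Off-policy target: greedy value of the next state.
        target = reward + gamma * max(q[(nxt, a)] for a in legal_actions(nxt))
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = nxt
    return q

q = q_learning()
# In the full state the value approaches REJECT_REWARD / (1 - gamma) = -5.
print(round(q[(CAPACITY, REJECT)], 3))
```

The Q table is a dictionary keyed by (state, action), so it only holds the states actually visited and does not need a final state to be defined. For the full problem, the reachable state space grows quickly with the number of hotels and beds.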
RL problems don't need a final state per se. What they need is reward states. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of XP with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with using a discrete approach, you would get a good start and learn something about your problem by limiting the scope (finite number of people and hotels/rooms) of the problem and turning Q-learning loose on the smaller state matrix.
Or, you could jump right into a method that can handle an infinite state space, like a neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe this isn't an answer to "is it possible?", but... read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected (average) reward over time. Joint learning of Q and rho results in a better strategy.
To iterate on the above response, with an infinite state space, you definitely should consider generalization of some sort for your Q Function. You will get more value out of your Q function response in an infinite space. You could experiment with several different function approximations, whether that is simple linear regression or a neural network.
Like Martha said, you will need to have a gamma less than one to account for the infinite horizon. Otherwise, you would be trying to compare N policies whose returns are all infinite, which means you will not be able to determine the optimal policy.
The main thing I wanted to add here though for anyone reading this later is the significance of reward shaping. In an infinite problem, where there isn't that final large reward, sub-optimal reward loops can occur, where the agent gets "stuck", since maybe a certain state has a reward higher than any of its neighbors in a finite horizon (which was defined by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.
