Why do we choose Beta distribution as a prior on hypothesis? - machine-learning

I watched the machine learning lecture videos for course 10-701 (2011) by Tom Mitchell at CMU. He was teaching Maximum Likelihood Estimation when he used the Beta distribution as a prior on theta, and I wonder why he chose that one in particular.

In this lecture, Prof. Mitchell gives an example of coin flipping and estimating its fairness, i.e. the probability of heads, theta. He reasonably chose a binomial distribution for this experiment.
The reason to choose the beta distribution as the prior is to simplify the math when computing the posterior. This works well because the beta is a conjugate prior for the binomial; the professor mentions this at the very end of the same lecture. This doesn't mean that one can't possibly use another prior, e.g. normal, Poisson, etc., but other priors lead to complicated posterior distributions, which are hard to optimize, integrate, and so on.
This is a general principle: prefer a conjugate prior over more complex distributions, even if it doesn't fit the data exactly, because the math is simpler.
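To make the conjugacy concrete, here is a minimal sketch (assuming scipy is available; the flip counts and the Beta hyperparameters are made up for illustration) of the Beta-Binomial update:

```python
from scipy import stats

# Hypothetical coin-flip data: 7 heads out of 10 flips (made up for illustration).
heads, tails = 7, 3

# Beta(alpha, beta) prior on theta, the probability of heads.
alpha_prior, beta_prior = 2.0, 2.0

# Conjugacy: with a binomial likelihood, the posterior is again a Beta,
# obtained by simply adding the observed counts to the prior's parameters.
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails

posterior = stats.beta(alpha_post, beta_post)
print("Posterior mean of theta:", posterior.mean())
print("MAP estimate of theta:", (alpha_post - 1) / (alpha_post + beta_post - 2))
```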

Related

Probability Distribution and Data Distribution in ML

I have been reading about probability distributions lately and got confused: what actually is the difference between a probability distribution and a data distribution, or are they the same? Also, what is the importance of probability distributions in Machine Learning?
Thanks
A data distribution is a function or a listing that shows all the possible values (or intervals) of the data. It can help you decide whether the set of data that you have is good enough to apply any techniques over it. You want to avoid skewed data.
A probability distribution is a statistical function that describes all the possible values, and their likelihoods, that a random variable can take within a given range. This helps you decide what type of statistical methods you can apply to your data. Example: if your data follows a Gaussian distribution, then you already know what values one standard deviation away from the mean look like and what the probability is of observing something more than one standard deviation away.
NOTE: You may want to learn about how hypothesis testing is done for ML models.
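To make the Gaussian example concrete, here is a short sketch (assuming scipy is available) computing the probability of landing more than one standard deviation from the mean:

```python
from scipy import stats

# Standard normal: mean 0, standard deviation 1.
z = stats.norm(loc=0.0, scale=1.0)

# Probability of landing within one standard deviation of the mean (~0.683).
within_1sd = z.cdf(1.0) - z.cdf(-1.0)

# Probability of being more than one standard deviation away (~0.317).
beyond_1sd = 1.0 - within_1sd

print(f"P(|X - mu| <= 1 sd) = {within_1sd:.3f}")
print(f"P(|X - mu| >  1 sd) = {beyond_1sd:.3f}")
```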

Why Normality is considered as an important assumption for dependent and independent variables?

While going through a kernel on Kaggle regarding regression, it was mentioned that the data should look like a normal distribution, but I am not getting why.
I know this question might be very basic, but please help me understand this concept.
Thanks in advance!
Regression models make a number of assumptions, one of which is normality. When this assumption is violated, your p-values and confidence intervals around the coefficient estimates can be wrong, leading to incorrect conclusions about the statistical significance of your predictors.
However, a common misconception is that the data (i.e. the variables/predictors) need to be normally distributed, but this is not true. These models don't make any assumptions about the distribution of the predictors.
For example, imagine a case where you have a binary predictor in a regression (Male/Female, Slow/Fast, etc.): it would be impossible for this variable to be normally distributed, and yet it is still a valid predictor to use in a regression model. The normality assumption actually refers to the distribution of the residuals, not the predictors themselves.
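As a concrete illustration, here is a minimal sketch (assuming statsmodels and scipy are available, with synthetic data) that fits a regression containing a binary predictor and then checks normality of the residuals rather than of the predictors:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Synthetic data: one continuous predictor and one binary predictor (e.g. Male/Female).
x_cont = rng.normal(size=n)
x_bin = rng.integers(0, 2, size=n)          # clearly not normally distributed
y = 1.0 + 2.0 * x_cont + 0.5 * x_bin + rng.normal(scale=1.0, size=n)

X = sm.add_constant(np.column_stack([x_cont, x_bin]))
model = sm.OLS(y, X).fit()

# The normality assumption concerns the residuals, so that is what we test.
stat, p_value = stats.shapiro(model.resid)
print(f"Shapiro-Wilk on residuals: W={stat:.3f}, p={p_value:.3f}")
```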

Is reinforcement learning applicable to a RANDOM environment?

I have a fundamental question on the applicability of reinforcement learning (RL) on a problem we are trying to solve.
We are trying to use RL for inventory management, where the demand is entirely random (it probably has a pattern in real life, but for now let us assume that we have been forced to treat it as purely random).
As I understand it, RL can help learn how to play a game (say chess) or help a robot learn to walk. But all games have rules, and so does the 'cart-pole' (of OpenAI Gym): there are rules of 'physics' that govern when the cart-pole will tip and fall over.
For our problem there are no such rules: the environment changes randomly (the demand made for the product).
Is RL really applicable to such situations?
If it is, then what will improve the performance?
Further details:
- The only two stimuli available from the ‘environment’ are the currently available level of product 'X' and the current demand 'Y'
- And the ‘action’ is binary - do I order a quantity 'Q' to refill or do I not (discrete action space).
- We are using DQN and an Adam optimizer.
Our results are poor. I admit I have trained only for about 5,000 or 10,000; should I let it train for days because it is a random environment?
thank you
Rajesh
You are using 'random' in the sense of non-stationary, so no, RL is not the best fit here.
Reinforcement learning assumes your environment is stationary: the underlying probability distributions of the environment (both the transition and reward functions) must be held constant throughout the learning process.
Sure, RL and deep RL can deal with somewhat non-stationary problems, but they struggle with them. Markov Decision Processes (MDPs) and Partially Observable MDPs assume stationarity, so value-based algorithms, which specialize in exploiting MDP-like environments (SARSA, Q-learning, DQN, DDQN, Dueling DQN, etc.), will have a hard time learning anything in non-stationary environments. The more you move towards policy-based algorithms such as PPO or TRPO, or better still gradient-free methods such as GA, CEM, etc., the better your chances, as these algorithms don't try to exploit that assumption. Tuning the learning rate is also essential to make sure the agent never stops learning.
Your best bet is to go towards black-box optimization methods such as Genetic Algorithms, etc.
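Since black-box methods are the suggestion here, below is a rough, purely illustrative sketch of a cross-entropy-method (CEM) search over a simple "reorder when stock falls below a threshold" policy. The toy simulator, demand model, cost numbers, and the fixed order quantity are all invented for illustration and are not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Toy inventory simulator (entirely hypothetical numbers) --------------
ORDER_QTY = 20          # fixed refill quantity Q
HOLDING_COST = 0.1      # cost per unit held per step
STOCKOUT_COST = 2.0     # cost per unit of unmet demand
HORIZON = 200

def simulate(threshold, episodes=20):
    """Average negative cost of a 'reorder when stock < threshold' policy."""
    total = 0.0
    for _ in range(episodes):
        stock = ORDER_QTY
        cost = 0.0
        for _ in range(HORIZON):
            demand = rng.poisson(5)                 # random demand each step
            if stock < threshold:                   # binary action: order or not
                stock += ORDER_QTY
            unmet = max(demand - stock, 0)
            stock = max(stock - demand, 0)
            cost += HOLDING_COST * stock + STOCKOUT_COST * unmet
        total += cost
    return -total / episodes                        # higher is better

# --- Cross-entropy method over the single threshold parameter -------------
mu, sigma = 10.0, 10.0
for iteration in range(30):
    candidates = rng.normal(mu, sigma, size=50)
    scores = np.array([simulate(c) for c in candidates])
    elite = candidates[np.argsort(scores)[-10:]]    # keep the best 20%
    mu, sigma = elite.mean(), elite.std() + 1e-3

print(f"Learned reorder threshold: {mu:.1f}")
```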
Randomness can be handled by replacing the single average return output with a distribution over possible returns. By introducing a new learning rule, reflecting the transition from Bellman's (average) equation to its distributional counterpart, the value-distribution approach has been able to surpass the performance of all other comparable approaches.
https://www.deepmind.com/blog/going-beyond-average-for-reinforcement-learning
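To make the value-distribution idea more concrete, here is a rough numpy sketch of the categorical (C51-style) projection step: instead of a scalar expected return, a probability distribution over returns is kept on a fixed support, and the Bellman-updated support r + gamma * z is projected back onto it. The support bounds, atom count, and the example transition are all made up for illustration.

```python
import numpy as np

# Fixed support of N atoms between V_MIN and V_MAX, as in the categorical
# (C51-style) value-distribution approach.
N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
z = np.linspace(V_MIN, V_MAX, N_ATOMS)
dz = z[1] - z[0]

def project(next_probs, reward, gamma):
    """Project the distribution of r + gamma * Z back onto the fixed support."""
    tz = np.clip(reward + gamma * z, V_MIN, V_MAX)   # Bellman-updated atom locations
    b = (tz - V_MIN) / dz                            # fractional index on the support
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros(N_ATOMS)
    # Split each atom's probability mass between its two neighbouring bins.
    np.add.at(out, lower, next_probs * (upper - b))
    np.add.at(out, upper, next_probs * (b - lower))
    # If b lands exactly on an atom, both weights above are zero; give it the full mass.
    np.add.at(out, lower, next_probs * (lower == upper))
    return out

# One made-up transition: reward 1.0, discount 0.99, uniform next-state distribution.
next_probs = np.full(N_ATOMS, 1.0 / N_ATOMS)
target = project(next_probs, reward=1.0, gamma=0.99)
print("Target mass sums to:", target.sum())
print("Expected return of the target distribution:", float(np.dot(target, z)))
```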

Data sampling technique and questions

I'm just a little confused about data sampling: what distribution should I expect for my sampled data? In general, do I want my sample to have the same distribution as my whole dataset? What is a reasonable sampling technique and approach?
There are many factors to consider in choosing a sampling technique, such as the purpose or objective of the work, your budget, your time, and even the sample size.
Probability sampling techniques are usually more involved, while non-probability sampling techniques may be less demanding.
The sampling technique chosen goes a long way toward determining the interpretation of the data, as well as the overall outcome of your work.
These notes may be of interest:
Simple Random Sampling and Other Sampling Methods
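If the goal is for the sample to preserve the distribution of the full dataset, one common probability-sampling option is stratified sampling. A minimal sketch, assuming scikit-learn is available and using a made-up imbalanced label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: roughly 90% of rows in class 0, 10% in class 1.
X = rng.normal(size=(1000, 3))
y = np.where(rng.random(1000) < 0.9, 0, 1)

# Stratified sampling keeps class proportions in the sample close to the full data.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0
)

print("Full data class balance:   ", np.bincount(y) / len(y))
print("Sampled data class balance:", np.bincount(y_sample) / len(y_sample))
```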
I did not understand your question well, but I will try to answer.
Student's t distribution is essentially a bell-shaped distribution very similar to the normal; it has heavier tails at small degrees of freedom and approaches the normal distribution as the degrees of freedom grow. That is why statistical programs often include the Student t distribution and use it in place of the normal distribution.
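To see that relationship numerically, here is a short sketch (assuming scipy is available) comparing the tail probability P(X > 2) under the standard normal and under Student's t for increasing degrees of freedom:

```python
from scipy import stats

# P(X > 2) under a standard normal and under Student's t with varying degrees of freedom.
print("normal :", 1 - stats.norm.cdf(2.0))
for df in (3, 10, 30, 100):
    print(f"t(df={df:3d}):", 1 - stats.t.cdf(2.0, df=df))
# The t tail probabilities shrink toward the normal value as df increases.
```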

reasoning about probabilities and indistinguishability

I'm looking for hints on formal reasoning that captures the notion of indistinguishability, that is, two random variables having the same probability distribution. Such a variable could be sampled from the {0,1} space when considering XOR games with random bits, or it could be sampled from a large ring; the latter case would be equipped with modular addition.
Ideally, is there a known way to conclude that the distribution of the sum with a random variable having a flat (uniform) distribution is itself flat?
Alternatively, what kind of reasoning about probabilities is doable with Z3?
The best match that I have come across is probably reasoning about Bayesian belief networks (Michael Huth et al. at ESORICS), but I still feel lost. So, where should I start? Thank you.
Can't really answer the question, but it may be of interest that we recently designed a (very specialized) model checker for (some) probabilistic systems based on Z3. There's a paper about it and an implementation. In our very special setting everything is discretized, so it may be possible to answer questions of 'flatness' or similar, but it would probably still be very expensive.
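On the specific 'flatness' question, the fact that adding an independent uniform ('flat') variable modulo n yields a uniform sum can at least be checked mechanically by exhaustive enumeration. A small illustrative sketch, with an arbitrary ring size and an arbitrary non-flat distribution for the other summand:

```python
from fractions import Fraction
from itertools import product

n = 8  # ring size Z_n (arbitrary choice for illustration)

# X has some arbitrary, non-flat distribution on Z_n.
px = {0: Fraction(1, 2), 1: Fraction(1, 4), 5: Fraction(1, 4)}
# U is uniform ("flat") on Z_n and independent of X.
pu = {u: Fraction(1, n) for u in range(n)}

# Distribution of (X + U) mod n by exhaustive enumeration.
psum = {s: Fraction(0) for s in range(n)}
for (x, p), (u, q) in product(px.items(), pu.items()):
    psum[(x + u) % n] += p * q

print(psum)  # every value has probability 1/n, i.e. the sum is flat
```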
