Assume two attributes have a correlation of 0.02; what does this tell you about the relationship between the two attributes? Answer the same question assuming the correlation is -0.98.
A correlation of 0.02 means the two attributes are barely correlated at all: there is essentially no linear relationship between them, so, as far as correlation can tell, they have very little to do with each other.
For instance, let's talk about the level of someone's craziness compared to the number of cats they own. The more cats, the crazier the person is. That is a high correlation, because the two things are related: number of cats and craziness. The correlation would be around .98, because the two things are strongly correlated.
A weak correlation of around .02, as above, would be something like how often I fart in a day vs. the price of gasoline in India. It's around .02 because the two are not related at all. The closer a correlation is to 0, the weaker the (linear) relationship between the two variables; the closer it is to 1 or -1, the stronger that relationship is.
A negative correlation, like -.98, means that two things are strongly related but in an opposite way: for example, if you exercise more, your risk of cancer goes down. As exercise goes up, risk of cancer goes down. The two are strongly correlated, but negatively: they do not rise and fall together; when one rises, the other falls.
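If you want to see those three cases numerically, here is a minimal sketch with NumPy (the data is randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

unrelated = rng.normal(size=1000)             # nothing to do with x
together  = x + 0.1 * rng.normal(size=1000)   # rises and falls with x
opposite  = -x + 0.1 * rng.normal(size=1000)  # falls when x rises

print(np.corrcoef(x, unrelated)[0, 1])  # close to 0, like the 0.02 case
print(np.corrcoef(x, together)[0, 1])   # close to +1, like the 0.98 case
print(np.corrcoef(x, opposite)[0, 1])   # close to -1, like the -0.98 case
```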
I was reading about the classification algorithm kNN and came across the term "distance-sensitive data". I was not able to find out what exactly distance-sensitive data is, what its categories are, or how to tell whether our data is distance-sensitive or not.
Suppose that xi and xj are vectors of observed features in cases i and j. Then, as you probably know, kNN is based on distances ||xi-xj||, such as the Euclidean one.
Now if xi and xj contain just a single feature, an individual's height in meters, we are fine, as there are no other "competing" features. Suppose that next we add annual salary in dollars. Consequently, we look at distances between vectors like (1.7, 50000) and (1.8, 100000).
Then, in the case of the Euclidean distance, the salary feature clearly dominates the height, and it is almost as if we were using the salary feature alone. That is,
||xi - xj||_2 ≈ |50000 - 100000|.
However, if the two features actually have similar importance, then we are doing a poor job. It is even worse if salary is actually irrelevant and we should be using height alone. Interestingly, under weak conditions our classifier still has nice properties, such as universal consistency, even in such bad situations. The problem is that in finite samples the performance of our classifier is very bad, i.e., the convergence is very slow.
So, to deal with that, one may want to consider different distances, ones that do something about the scale. Commonly people standardize each feature (set its mean to zero and its variance to 1), but that's not a complete solution either. There are various proposals for what could be done (see, e.g., here).
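For a concrete picture (a minimal sketch; the two-feature data is invented just to show the scale effect), standardizing puts height and salary on a comparable footing before distances are computed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# height in meters, salary in dollars -- wildly different scales
X = np.array([[1.70,  50_000],
              [1.80, 100_000],
              [1.75,  52_000]])

# Raw Euclidean distance is dominated by the salary column
print(np.linalg.norm(X[0] - X[1]))   # ~50000, height is invisible

# After standardization each feature has mean 0 and variance 1,
# so height and salary contribute on a comparable scale
Xs = StandardScaler().fit_transform(X)
print(np.linalg.norm(Xs[0] - Xs[1]))
```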
On the other hand, algorithms based on decision trees do not suffer from this, because they just look for a point at which to split each variable. For instance, if salary takes values in [0, 100000] and the split is at 40000, then using Salary/10 would simply move the split to 4000, so the results would not change.
I'm new to machine learning and investigating it for a use case. I have the use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model that runs on a continuous basis, so that it gives a best guess at all times, whether it is run on day one or three months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, an apprentice either has a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and is weighted by its complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we will run it, i.e. if we run it on day one of the semester, I assume everyone will be predicted to fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y is the boolean output in {pass, fail} and the x[i] are the different features.
There is a plethora of ML classification models, like Naive Bayes, neural networks, decision trees, etc., which can be used depending on the type of data. If you are looking for an answer that suggests a particular ML model, I would need more details about the data. In general, though, this flow-chart can be helpful for model selection. You can also read about model selection in the 5th lecture of Andrew Ng's CS229.
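Purely as a hedged starting sketch, not a recommendation of a particular model (the CSV file and column names below are placeholders for whatever your real data looks like), a first pass/fail classifier in scikit-learn could be wired up like this:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical file and column names -- replace with your real data layout
numeric = ["interview_score", "entry_exam_score", "n_simple", "n_intermediate",
           "n_advanced", "avg_simple", "avg_intermediate", "avg_advanced", "age"]
categorical = ["nationality", "gender"]

df = pd.read_csv("apprentices.csv")             # placeholder file name
X, y = df[numeric + categorical], df["passed"]  # y in {0, 1} for {fail, pass}

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = make_pipeline(preprocess,
                      RandomForestClassifier(n_estimators=200, random_state=0))

# Cross-validated accuracy as a first sanity check
print(cross_val_score(model, X, y, cv=5).mean())
```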
Now, coming back to the basic methodology: some of these features, like initial interview scores and entry exam scores, you already know in advance, whereas others, like performance in the procedures, only become known over the course of the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure-scores as 0, take them as a mean/median of the past performances in other procedures by the subject-apprentice.
You can even build a sub-model to analyze the relation between procedure-scores and interview-scores as they are not completely independent. (I will explain this sentence in the later part of the answer)
However, if this is the apprentice's very first semester, you won't already have such data for that apprentice. In that case, you might need to consider the average performance of other apprentices with profiles similar to the subject apprentice's (a small sketch of this follows these suggestions). If the data set is not very large, a K Nearest Neighbors approach can be quite useful here; however, KNN becomes expensive on large data sets and suffers from the curse of dimensionality when there are many features.
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
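To make the mean/median and "similar profiles" suggestions above concrete (a minimal sketch; the numbers are invented, and scikit-learn's SimpleImputer/KNNImputer are just one convenient way to implement those two ideas):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Columns: interview score, entry exam, average procedure score (NaN = not yet available)
X = np.array([
    [78, 82, 71.0],
    [65, 70, 60.0],
    [80, 85, np.nan],   # first-semester apprentice, no procedure scores yet
    [60, 68, np.nan],
])

# Option 1: fill with the median of the known procedure scores
print(SimpleImputer(strategy="median").fit_transform(X))

# Option 2: fill from the most similar apprentice(s), judged by the known columns
print(KNNImputer(n_neighbors=1).fit_transform(X))
```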
Most probably (although this is just a hypothesis), y will depend more on the initial variables than on the variables obtained later, the reason being that the later variables are not completely independent of the former ones.
My point is, if a model can be created to predict the outcome of a semester, then a similar model can be created to predict just the outcome of the first procedure test.
In the end, as the model might be heavily based on demographic factors and other such things, it might not be a very successful model. For the same reason we cannot accurately predict election results, soccer match results, etc.: they are heavily dependent on real-time, dynamic data.
For dynamic predictions based on the different procedure performances, time series analysis can be a bit helpful. But in any case, the final result will depend heavily on the apprentice's continued motivation and performance, which will only become clear towards the end of each semester.
I'm trying to understand why the naive Bayes classifier is linearly scalable with the number of features, in comparison to the same idea without the naive assumption. I understand how the classifier works and what's so "naive" about it. I'm unclear as to why the naive assumption gives us linear scaling, whereas lifting that assumption is exponential. I'm looking for a walk-through of an example that shows the algorithm under the "naive" setting with linear complexity, and the same example without that assumption that will demonstrate the exponential complexity.
The problem here lies in the following quantity
P(x1, x2, x3, ..., xn | y)
which you have to estimate. When you assume "naiveness" (feature independence) you get
P(x1, x2, x3, ..., xn | y) = P(x1 | y)P(x2 | y) ... P(xn | y)
and you can estimate each P(xi | y) independently. In a natural way, this approach scales linearly, since if you add another k features you need to estimate another k probabilities, each using some very simple technique (like counting objects with given feature).
Now, without the naiveness you do not have any such decomposition. Thus you have to keep track of all probabilities of the form
P(x1=v1, x2=v2, ..., xn=vn | y)
for each possible combination of values vi. In the simplest case, each vi is just "true" or "false" (the event happened or not), and this already gives you 2^n probabilities to estimate (one for each possible assignment of "true" and "false" to a series of n boolean variables). Consequently, the algorithm's complexity grows exponentially. However, the biggest issue here is usually not the computational one but rather the lack of data: since there are 2^n probabilities to estimate, you need more than 2^n data points to have any estimate at all for every possible event. In real life you will never encounter a dataset of over 1,000,000,000,000 points... and roughly that many (unique!) points would be required for just 40 boolean features under such an approach.
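A tiny counting sketch (illustrative only) makes the difference in growth visible:

```python
from itertools import product

n = 10  # number of boolean features

# Naive assumption: one table P(x_i | y) per feature -> grows linearly in n
naive_estimates = n  # one Bernoulli parameter per feature (per class)

# Full joint: one entry P(x_1=v_1, ..., x_n=v_n | y) per configuration -> 2^n
joint_estimates = len(list(product([False, True], repeat=n)))

print("naive:", naive_estimates)  # 10
print("joint:", joint_estimates)  # 1024 for n = 10, over a trillion for n = 40
```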
Candy Selection
On the outskirts of Mumbai, there lived an old Grandma, whose quantitative outlook towards life had earned her the moniker Statistical Granny. She lived alone in a huge mansion, where she practised sound statistical analysis, shielded from the barrage of hopelessly flawed biases peddled as common sense by mass media and so-called pundits.
Every year on her birthday, her entire family would visit her and stay at the mansion. Sons, daughters, their spouses, her grandchildren. It would be a big bash every year, with a lot of fanfare. But what Grandma loved the most was meeting her grandchildren and getting to play with them. She had ten grandchildren in total, all of them around 10 years of age, and she would lovingly call them "random variables".
Every year, Grandma would present a candy to each of the kids. Grandma had a large box full of candies of ten different kinds. She would give a single candy to each one of the kids, since she didn't want to spoil their teeth. But, as she loved the kids so much, she took great efforts to decide which candy to present to which kid, such that it would maximize their total happiness (the maximum likelihood estimate, as she would call it).
But that was not an easy task for Grandma. She knew that each type of candy had a certain probability of making a kid happy. That probability was different for different candy types, and for different kids. Rakesh liked the red candy more than the green one, while Sheila liked the orange one above all else.
Each of the 10 kids had different preferences for each of the 10 candies.
Moreover, their preferences largely depended on external factors which were unknown (hidden variables) to Grandma.
If Sameer had seen a blue building on the way to the mansion, he'd want a blue candy, while Sandeep always wanted the candy that matched the colour of his shirt that day. But the biggest challenge was that their happiness depended on what candies the other kids got! If Rohan got a red candy, then Niyati would want a red candy as well, and anything else would make her go crying into her mother's arms (conditional dependency). Sakshi always wanted what the majority of kids got (positive correlation), while Tanmay would be happiest if nobody else got the kind of candy that he received (negative correlation). Grandma had concluded long ago that her grandkids were completely mutually dependent.
It was computationally a big task for Grandma to get the candy selection right. There were too many conditions to consider and she could not simplify the calculation. Every year before her birthday, she would spend days figuring out the optimal assignment of candies, by enumerating all configurations of candies for all the kids together (which was an exponentially expensive task). She was getting old, and the task was getting harder and harder. She used to feel that she would die before figuring out the optimal selection of candies that would make her kids the happiest all at once.
But an interesting thing happened. As the years passed and the kids grew up, they finally left their teenage years behind and turned into independent adults. Their choices became less and less dependent on each other, and it became easier to figure out what each one's most preferred candy was (all of them still loved candies, and Grandma).
Grandma was quick to realise this, and she joyfully began calling them "independent random variables". It was much easier for her to figure out the optimal selection of candies - she just had to think of one kid at a time and, for each kid, assign a happiness probability to each of the 10 candy types for that kid. Then she would pick the candy with the highest happiness probability for that kid, without worrying about what she would assign to the other kids. This was a super easy task, and Grandma was finally able to get it right.
That year, the kids were finally the happiest all at once, and Grandma had a great time at her 100th birthday party. A few months following that day, Grandma passed away, with a smile on her face and a copy of Sheldon Ross clutched in her hand.
Takeaway: In statistical modelling, having mutually dependent random variables makes it really hard to find out the optimal assignment of values for each variable that maximises the cumulative probability of the set.
You need to enumerate all possible configurations (and their number grows exponentially with the number of variables). However, if the variables are independent, it is easy to pick out the individual assignments that maximise the probability of each variable, and then combine the individual assignments to get a configuration for the entire set.
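A minimal sketch of that takeaway (the probabilities are made up): with dependent variables you would have to score every joint configuration, whereas with independent variables you maximise each one on its own and paste the choices together.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
n_kids, n_candies = 5, 3

# Made-up happiness probability of each candy type for each kid
p = rng.random((n_kids, n_candies))

# Independent kids: maximise per kid -- n_kids * n_candies evaluations
independent_choice = p.argmax(axis=1)
print(independent_choice)                 # one best candy per kid

# Dependent kids: every joint assignment would need to be scored
joint_configs = list(product(range(n_candies), repeat=n_kids))
print(len(joint_configs))                 # 3**5 = 243, exponential in the number of kids
```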
In Naive Bayes, you make the assumption that the variables are independent (even if they are actually not). This simplifies your calculation, and it turns out that in many cases, it actually gives estimates that are comparable to those which you would have obtained from a more (computationally) expensive model that takes into account the conditional dependencies between variables.
I have not included any math in this answer, but hopefully this made it easier to grasp the concept behind Naive Bayes, and to approach the math with confidence. (The Wikipedia page is a good start: Naive Bayes).
Why is it "naive"?
The Naive Bayes classifier assumes that X|Y is normally distributed with zero covariance between any of the components of X. Since this is a completely implausible assumption for any real problem, we refer to it as naive.
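Spelled out (a sketch of the class-conditional density that this zero-covariance Gaussian assumption implies; mu_iy and sigma_iy^2 denote the mean and variance of the i-th component of X within class y):

$$
P(X = x \mid Y = y) \;=\; \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}}\,
\exp\!\left(-\frac{(x_i - \mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right)
$$

so each feature contributes its own one-dimensional Gaussian, independently of the others.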
Naive Bayes will make the following assumption:
If you like pickles and you like ice cream, naive Bayes will assume independence, give you a pickle ice cream, and think that you'll like it.
Which may not be true at all.
For a mathematical example see: https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
I don't understand what multiobjective clustering is. Is it using multiple variables for clustering, or what?
I know that Stack Overflow might not be the best place for this kind of question, but I've asked it on another website and did not get a response.
Multiobjective optimization in general means that you have multiple criteria you are interested in, which cannot simply be converted into something comparable. For example, consider the problem where you want a model that is both very fast and very accurate. Time is measured in seconds, accuracy in %. How do you compare (1s, 90%) and (10 days, 92%)? Which one is better? In general there is no answer. Thus what people usually do is look for the Pareto front: you test K models and select M <= K of them such that none of them is clearly "beaten" by any other. For example, if we add (1s, 91%) to the previous example, the Pareto front will be {(1s, 91%), (10 days, 92%)} (as (1s, 90%) is beaten by (1s, 91%), and the remaining two are impossible to compare).
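A minimal sketch of that Pareto-front filtering (the candidate models are the made-up ones from the example above):

```python
# Candidate models as (time_in_seconds, accuracy): time minimised, accuracy maximised
candidates = [(1.0, 0.90), (1.0, 0.91), (864000.0, 0.92)]  # 10 days ~= 864000 s

def dominates(a, b):
    """a dominates b if it is no worse on both criteria and differs from it."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b

pareto_front = [c for c in candidates
                if not any(dominates(other, c) for other in candidates)]
print(pareto_front)  # [(1.0, 0.91), (864000.0, 0.92)] -- (1.0, 0.90) is beaten by (1.0, 0.91)
```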
And now you can apply the same idea in a clustering setting. Say, for example, that you want to build a model which is fast at classifying new instances, minimizes the average distance inside each cluster, and does not put too many special instances labeled X into any one cluster. Then again you get models (clusterings) which are characterized by three incomparable measures, and in multiobjective clustering you try to deal with these problems (for example, by finding the Pareto front of such clusterings).
I am reading about precision and recall in machine learning.
Question 1: When are precision and recall inversely related? That is, when does the situation occur where you can improve your precision but at the cost of lower recall, and vice versa? The Wikipedia article states:
Often, there is an inverse relationship between precision and recall,
where it is possible to increase one at the cost of reducing the
other. Brain surgery provides an obvious example of the tradeoff.
However, I have seen research experiment results where both precision and recall increase simultaneously (for example, as you use different or more features).
In what scenarios does the inverse relationship hold?
Question 2: I'm familiar with the precision and recall concept in two fields: information retrieval (e.g. "return 100 most relevant pages out of a 1MM page corpus") and binary classification (e.g. "classify each of these 100 patients as having the disease or not"). Are precision and recall inversely related in both or one of these fields?
The inverse relation only holds when you have some parameter in the system that you can vary in order to get more or fewer results. Then there is a straightforward mechanism: you lower the threshold to get more results, and among the extra ones some are true positives and some false positives. Even so, this does not mean precision and recall move in strict opposition at every step; the actual trade-off can be mapped with a precision-recall curve (or the closely related ROC curve). As for Q2: likewise, in both of these tasks precision and recall are not necessarily inversely related.
So, how do you increase recall or precision without hurting the other? Usually by improving the algorithm or the model. That is, when you only change the parameters of a given model, the inverse relationship will usually hold (although you should keep in mind that it will usually be non-linear as well). But if you, for example, add more descriptive features to the model, you can increase both metrics at once.
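A quick sketch of the threshold effect (toy labels and scores invented for illustration; scikit-learn's precision_recall_curve does the threshold sweep for you):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy ground truth and classifier scores
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45, 0.9, 0.3])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```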
Regarding the first question, I interpret these concepts in terms of how restrictive your results must be.
If you're more restrictive, I mean, more "demanding on the correctness" of the results, you want them to be more precise. For that, you might be willing to reject some correct results as long as everything you get is correct. Thus, you're raising your precision and lowering your recall. Conversely, if you do not mind getting some incorrect results as long as you get all the correct ones, you're raising your recall and lowering your precision.
As for the second question, if I look at it from the point of view of the paragraphs above, I can say that yes, they are inversely related.
To the best of my knowledge, in order to increase both precision and recall, you'll need either a better model (one more suitable for your problem) or better data (or both, actually).