Is Machine Learning the relationship between input & output [closed] - machine-learning

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
According to an article I read here, Machine Learning is to do with teaching a machine how to do certain tasks through 'learning' input/output relations.
What is a more accurate definition of machine learning?

Machine Learning is to do with teaching a machine how to do certain tasks through input/output relations. Is this kind of correct?
The short answer is yes, kind of. Read on.
Definition of Machine Learning
To understand what Machine Learning is let's first define the term Learning. The often quoted definition by Tom M. Mitchell (1) is as follows:
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at
tasks in T, as measured by P, improves with experience E
Meaning?
This sounds quite formal, however it just says computers learn from experience that they are presented with in terms of data. The data to enable learning exists relative to a specific task and consists of several parameters:
T, a task to accomplish, e.g. predict housing price predictions
E, some value of experience, e.g. prices observed
P, some value of performance, e.g. how many prices are predicted
Example: Housing prices
Once a program has learnt from these inputs, it can take a new, previously unseen experience and from that predict, in our example, the specific housing price. The housing price might be strongly correlated to say location, age and size of house or apartment, and the luxury of its interiors.
What is the result of a learning algorithm?
In its simplest form then a machine learning algorithm for housing prices might implement a multi-variate regression analysis. It takes as input a body of data that relates real, observed prices to the four features location, age, size, luxury. The process of learning produces a regression model that in essence assigns a weight to each feature, of the form
y^ = w_location * location + w_age * age + w_size * size + w_luxury * luxury
That is, the weights w_* are learned from the input data, y^ is the predicted price. The learning is considered successful once the formula is able to successfully predict housing prices based on a list of features alone. Usually a prediction is considered successful if it falls within a certain bound (%-range) of the real price.
Note that the definition of successful very much depends on the kind of task that the program must learn, however the result needs to be substantially better than a pure random guess (that is, the ratio of correct results needs to be statistically significant).
Is there more to it?
Yes, a lot. Some pointers can be found in this Wikipedia article. If you are keen to get into the subject, professor Andrew Ng's Standford lecture is quite famous, although there are many more courses if you look for it. Pick the one that best suits your interests.
References
(1): Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7, p.2. as referenced by Wikipedia

Related

What is scoring and what's the difference between real time scoring and batch scoring in machine learning? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 12 months ago.
Improve this question
I've seen examples of real time scoring used in credit card fraud detection, but I'm not seeing how scoring can achieve such a task. I think I'm fundamentally misunderstanding scoring.
My understanding is: "scoring a model" (in the case of classification models) means predicting on a series of datasets (where we know the answer to) and evaluating the predictions it's made by calculating all the wrong predictions the model made vs. the correct. i.e. if a model made 50 mistakes out of 100 predictions, the model is 50% accurate -- thus the score.
But I don't get how doing this in real time can detect fraud. If we don't know if the transaction is a fraud or not (since it's not historical data), how can scoring achieve fraud detection?
OR Is scoring actually the "confidence" of the prediction? i.e. When I make a prediction on an unseen dataset, a classification model will tell me that the confidence for the prediction can be 80% (the model is 80% sure it has the correct prediction). Is the score 80% in this case?
I've also seen scoring defined as applying a model to a new dataset. Isn't that the same as a prediction?
Firstly scoring depends on what metric you have defined yourself in order to measure the performance of your model. It can be anything whether like confidence or accuracy or any other metric for model evaluation. You have to define yourself which metric to use and which works best and its output will be called score.
The difference in Real Time Scoring & Batch Scoring:
Let us say you are building a fraud detection model. You will have to assign the scores to each transaction. There are two ways to do it.
Real Time Scoring - You get the features in real time and do all the preprocessing and pass it through the model in order to get the predictions. This all should be happening in real time itself giving immediate results. Pros are users or systems will not have to wait in order to get the results.
Batch Scoring - When you create a model which does the predictions in batch periodically, then it is called batch inferencing or batch scoring. Imagine you run your systems for predictions every hour or every midnight then it is doing it in batches.
They both have their pros & cons but generally, these decisions depend on business stakeholders and business requirements.

What are the steps should we take to analyze a dataset? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am new in the field of data science, and I want to know about the key steps to get the properties of any dataset used for machine learning tasks.
What you ask is very general and your request is not well defined, but, I'll try to give you a short introduction to get you started.
knowledge required (as I see it):
Statistics and probability
Basic knowledge in mathematics
Basic knowledge of AI techniques and algorithms
The first step is every research is to define the problem, what are you trying to do?
for instance:
"I would like to predict if the next person who buys this car is a male or a female"
This kind of problem is a Classification problem, which means, the solution will label the "input" person as a male or a female correctly.
This is called a model, a model is a representation of the real world and its properties and using ML tools we wish to create it.
We do that by looking into history data, for example, lets say that out of 1000 male costumers and 1000 females, 850 males bought car X, while the rest bought car Y and 760 females bought car Y and the rest bought X.
now, if I tell you the next costumer bought car X, can you tell me its gender?
you are probably thinking its a male, but theres still a chance for it to be a female, yet, theres a higher probability it is in fact a male since we already know the pattern of male's and female's choices.
that's basically how it works, given a dataset, such as yours, you need to use it in order to predict something out of it.
Note: rather if your dataset is fit for this or not, or how much of an information gain you'll get from it is another story.
Now, each piece of data you can learn from is called a record:
first_name: 'LEROY', last_name: 'JENKINS', age: '25', gender: 'male' car_bought: 'x'
and each property is called a feature.
some features can be useless to you, in our example, only the gender is important, and the rest are useless, learning according to the useless feature may cause your model to learn invalid data.
also, some records may contain invalid data such is NULLs and missing data, first thing needed to do is to pre-process your data and get it ready for the learning.
once your data is ready, you can start the training, for that, you'll have to choose the most suitable algorithm for you, I wont go over the algorithms because there are a lot and you'll have to gain more knowledge about those, but there are many libraries for those and you should just google it.
I'll give you a short code example for a simple neural network usage to get you started to predict the outcome of a simple mathematical function: F(x) = 2*x
# prepare the dataset
X = np.arange(0.0, 1000.0, 1.0)
Y = np.empty(shape=0, dtype=float)
for x in X:
Y = np.append(Y, float(2*x)))
return X, Y
and a simple neural network using keras:
model = Sequential()
model.add(Dense(5, input_shape=(1,)))
model.add(Dense(1, activation='linear'))
# compile model
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['accuracy'])
# train model
m = model.fit(self.x_train, self.y_train, epochs=500, batch_size=1)
predicting using the output model:
for i in np.arange(2000.0, 2010.0, 1.0):
df = pandas.DataFrame()
df['X'] = [i]
print('f(',i,') = ',model.predict(df)[0][0])
will output:
f(2000.0) = 4000.0
f(2001.0) = 4002.0
f(2002.0) = 4004.0
.
.
even if the model never saw these numbers before it can now predict the output from learning the pattern from the dataset.
I dont expect you to understand how keras works or what it does, only to give you the feel of what is it like to use a ML algorithm.
I hope that answered your question and it can help you get started yourself.
Your question is too general you need to specify more. What do you mean by the properties of the dataset?
Nevertheless I'll try to answer what I understood from your question.
After choosing what kind of problem you have (classification or regression) you'll want to try and visualize your data to get a better sense of what you are doing.
Facets is an excellent tool to do this https://pair-code.github.io/facets/ . It will help you better comprehend how your data is distributed and maybe give you some extra insight on how to tackle your problem but how you use it depends on the problem you have.
You should also visualize your correlation matrix to see whether you have features that are heavily correlated and thus you can remove unnecessary features.
I remember when I started working on my first machine learning project things were overwhelming but the best tip I can give you is try to find a step by step guide that deals with a similar problem you are facing I'm sure you'll find plenty and try to clarify more your question we could give you more insight

Difference between Regression and classification in Machine Learning? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am new to Machine Learning. Can anyone tell me the major difference between classification and regression in machine learning?
Regression aims to predict a continuous output value. For example, say that you are trying to predict the revenue of a certain brand as a function of many input parameters. A regression model would literally be a function which can output potentially any revenue number based on certain inputs. It could even output revenue numbers which never appeared anywhere in your training set.
Classification aims to predict which class (a discrete integer or categorical label) the input corresponds to. e.g. let us say that you had divided the sales into Low and High sales, and you were trying to build a model which could predict Low or High sales (binary/two-class classication). The inputs might even be the same as before, but the output would be different. In the case of classification, your model would output either "Low" or "High," and in theory every input would generate only one of these two responses.
(This answer is true for any machine learning method; my personal experience has been with random forests and decision trees).
Regression - the output variable takes continuous values.
Example :Given a picture of a person, we have to predict their age on the basis of the given picture
Classification - the output variable takes class labels.
Example: Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
I am a beginner in Machine Learning field but as far as i know, regression is for "continuous values" and classification is for "discrete values". With regression, there is a line for your continuous value and you can see that your model is good or bad fit. On the other hand, you can see how discrete values gains some meaning "discretely" with classification. If i am wrong please feel free to make correction.

Imbalanced Data for Random ferns [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
For a Multiclass problem, should the data be balanced for machine learning algorithms such as Random forests and Random ferns or is it ok for it to be imbalanced for a certain extent?
The issue with imbalanced classes raises when the disproportion alters the separability of the classes instances. But this does not happen in ever imbalanced dataset: sometimes the more data you have from one class the better you can differentiate the scarse data from it since it lets you find more easily which features are meaningful to create an discriminating plane (even though you are not using discriminative analysis the point is to classify-separate the instances according to classes).
For example I can remember the KDDCup2004 protein classification task in which one class had 99.1% of the instances in the training set but if you tried to use under sampling methods to alleviate the imbalance you would only get worse results. That meaning that the large amount of data from the first class defined the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature that can partitionate the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default ( = always) but only when the imbalance is not representative of the classes real distributions.
So I suggest that you first run a multivariate analysis to try to get the extent of imbalance among classes in your dataset and the run a series of experiments with different undersampling ratios if you still ar ein doubt.
I have used Random Forrests in my task before. Although the data don't need be balanced, however if the positive samples are too few, the pattern of the data maybe drown in the noise. Most of classify methods even (random forrests and ada boost) should have this flaw more or less.'Over sample' may be a good idea to deal with this problem.
Perhaps the paper Logistic Regression in rareis useful with this sort of problem, although its topic is logistic regression.

NLP: Calculating probability a document belongs to a topic (with a bag of words)? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
Given a topic, how can I calculate the probability a document "belongs" to that topic(ie sports)
This is what I have to work with:
1) I know the common words in documents associated with that topics (eliminating all STOP words), and the % of documents that have that word
For instance if the topic is sports, I know:
75% of sports documents have the word "play"
70% have the word "stadium"
40% have the word "contract"
30% have the word "baseball"
2) Given this, and a document with a bunch of words, how can I calculate the probability this document belongs to that topic?
This is fuzzy classification problem with topics as classes and words as features. Normally you don't have bag of words for each topic, but rather set of documents and associated topics, so I will describe this case first.
The most natural way to find probability (in the same sense it is used in probability theory) is to use naive Bayes classifier. This algorithm has been described many times, so I'm not going to cover it here. You can find quite good explanation in this synopsis or in associated Coursera NLP lectures.
There are also many other algorithms you can use. For example, your description naturally fits tf*idf based classifiers. tf*idf (term frequency * inverse document frequency) is a statistic used in modern search engines to calculate importance of a word in a document. For classification, you may calculate "average document" for each topic and then find how close new document is to each topic with cosine similarity.
If you have the case exactly like you've described - only topics and associated words - just consider each bag of words as a single document with, possibly, duplicating frequent words.
Check out topic modeling (https://en.wikipedia.org/wiki/Topic_model) and if you are coding in python, you should check out radim's implementation, gensim (http://radimrehurek.com/gensim/tut1.html). Otherwise there are many other implementations from http://www.cs.princeton.edu/~blei/topicmodeling.html
There are many approaches to solving a clustering problem. I suggest start with simple logistic regression and look at the results. If you already have predefined ontology sets, you can add them as features in next stage to improve accuracy.

Resources