In a lecture from Andrew Ng, he asked whether the problem below is a classification or a regression problem. Answer: It is a regression problem.
You have a large inventory of identical items. You want to predict how
many of these items will sell over the next 3 months.
Looks like I am missing something. Per my understanding it should be classification problem. Reason is we have to classify each item in two categories i.e it can be sold or not, which are discrete value not the continuous ones.
Not sure where is the gap in my understanding.
Your thinking is that you have a database of items with their respective features and want to predict if each item will be sold. At the end, you would simply count the number of items that can be sold. If you frame the problem this way, then it would be a classification problem indeed.
However, note the following sentence in your question:
You have a large inventory of identical items.
Identical items means that all items will have exactly the same features. If you come up with a binary classifier that tells whether a product can be sold or not, since all feature values are exactly the same, your classifier would put all items in the same category.
I would guess that, to solve this problem, you would probably have access to the time-series of sold items per month for the past 5 years, for instance. Then, you would have to crunch this data and interpolate to the future. You won't be classifying each item individually but actually calculating a numerical value that indicates the number of sold items for 1, 2, and 3 months in the future.
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
An numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
Although your end result will be an integer (a discrete set of numbers), note it is still a numerical value, not a category. You can manipulate mathematically numerical values (e.g. calculate the average number of sold items in the next year, find the peak number of sold items in the next 3 months...) but you cannot do that with discrete categories (e.g. what would be the average of a cellphone and a telephone?).
Classification problems are the ones where the output is either categorical or ordinal (discrete categories, as per Bishop). Regression problems output numerical values (continuous variables, as per Bishop).
Your system might be restricted to outputting integers, instead of real numbers, but won't change the nature of the variable from being numerical. Therefore, your problem is a regression problem.
Related
I'm currently evaluating a recommender system based on implicit feedback. I've been a bit confused with regard to the evaluation metrics for ranking tasks. Specifically, I am looking to evaluate by both precision and recall.
Precision#k has the advantage of not requiring any estimate of the
size of the set of relevant documents but the disadvantages that it is
the least stable of the commonly used evaluation measures and that it
does not average well, since the total number of relevant documents
for a query has a strong influence on precision at k
I have noticed myself that it tends to be quite volatile and as such, I would like to average the results from multiple evaluation logs.
I was wondering; say if I run an evaluation function which returns the following array:
Numpy array containing precision#k scores for each user.
And now I have an array for all of the precision#3 scores across my dataset.
If I take the mean of this array and average across say, 20 different scores: Is this equivalent to Mean Average Precision#K or MAP#K or am I understanding this a little too literally?
I am writing a dissertation with an evaluation section so the accuracy of the definitions is quite important to me.
There are two averages involved which make the concepts somehow obscure, but they are pretty straightforward -at least in the recsys context-, let me clarify them:
P#K
How many relevant items are present in the top-k recommendations of your system
For example, to calculate P#3: take the top 3 recommendations for a given user and check how many of them are good ones. That number divided by 3 gives you the P#3
AP#K
The mean of P#i for i=1, ..., K.
For example, to calculate AP#3: sum P#1, P#2 and P#3 and divide that value by 3
AP#K is typically calculated for one user.
MAP#K
The mean of the AP#K for all the users.
For example, to calculate MAP#3: sum AP#3 for all the users and divide that value by the amount of users
If you are a programmer, you can check this code, which is the implementation of the functions apk and mapk of ml_metrics, a library mantained by the CTO of Kaggle.
Hope it helped!
I am building an automated cleaning process that clean null values from the dataset. I discovered few functions like mode, median, mean which could be used to fill NaN values in given data. But which one I should select? if data is categorical it has to be either mode or median while for continuous it has to be mean or median. So to define whether data is categorical or continuous I decided to make a machine learning classification model.
I took few features like,
1) standard deviation of data
2) Number of unique values in data
3) total number of rows of data
4) ratio of unique number of total rows
5) minimum value of data
6) maximum value of data
7) number of data between median and 75th percentile
8) number of data between median and 25th percentile
9) number of data between 75th percentile and upper whiskers
10) number of data between 25th percentile and lower whiskers
11) number of data above upper whisker
12) number of data below lower whisker
First with this 12 features and around 55 training data I used the logistic regression model on Normalized form to predict label 1(continuous) and 0(categorical).
Fun part is it worked!!
But, did I do it the right way? Is it a correct method to predict nature of data? Please advise me if I could improve it further.
The data analysis seems awesome. For the part
But which one I should select?
Mean is always winner as far as I have tested. For every dataset I try out test for all the cases and compare accuracy.
There is a better approach but a bit time consuming. If you want to take forward this system, this can help.
For each column with missing data, find its nearest neighbor and replace it with that value. Suppose you have N columns excluding target, so for each column, treat it as dependent variable and rest of N-1 columns as independent. And find its nearest neighbor and then its output(dependent variable) is desired value for missing attribute.
But which one I should select? if data is categorical it has to be either mode or median while for continuous it has to be mean or median.
Usually for categorical data mode is used. For continuous - mean. But I recently saw an article where geometric mean was used for categorical values.
If you build a model that uses columns with nan you can include columns with mean replacement, median replacement and also boolean column 'index is nan'. But better not to use linear models in this case - you can face correlation.
Besides there are many other methods to replace nan. For example, MICE algorithm.
Regarding the features you use. They are ok but I'd like to advice to add some more features related to distribution, for example:
skewness
kurtosis
similarity to Gaussian Distribution (and other distributions)
a number of 1D GDs you need to fit your column (GMM; won't perform well for 55 rows)
All this items you can get basing on normal data + transformed data (log, exp).
I explain: you can have a column with many categories inside. And it simply may look like numerical column with the old approach but it does not numerical. Distribution matching algorithm may help here.
Also you can use different normalizing. Probably RobustScaler from sklearn may work good (it may help in case where categories have levels very similar to 'outlied' values).
And the last advice: you can use Random forest model for this and get important columns. This list may give some direction for feature engineering/generation.
And, sure, take a look on misclassification matrix and for which features errors happen is also a good thing!
I am building a RNN for a time series model, which have a categorical output.
For example, if precious 3 pattern is "A","B","A","B" model predict next is "A".
there's also a numerical level associated with each category.
For example A is 100, B is 50,
so A(100), B(50), A(100), B(50),
I have the model framework to predict next is "A", it would be nice to predict the (100) at the same time.
For real life examples, you have national weather data.
You are predicting the next few days weather type(Sunny, windy, raining ect...) at the same time, it would be nice model will also predict the temperature.
Or for Amazon, analysis customer's trxns pattern.
Customer A shopped category
electronic($100), household($10), ... ...
predict what next trxn category that this customer is likely to shop and predict at the same time what would be the amount of that trxns.
Researched a bit, have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
Your will need to normalise your output data though. Categories should be normalised with one-hot encoding and numerical values should be normalised by dividing by some maximal value.
Researched a bit, have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.
I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea , is to user neural networks with softmax output function, softmax will give you a probability for every class, when the network is very confident about a class, will give it a high probability, and lower probabilities to the other classes, but if its insecure, the differences between probabilities will be low and none of them will be very high, what if you define a treshold like : if the probability for every class is less than 70% , predict "other"
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag of words approach using significant indicators of for e.g. Tech and Finance. So, you could try to identify such indicators by analyzing the certain website's tags and articles or just browse the web for such inidicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vactor X has n dimensions where n represents the number of indicators. For example Xi then holds the count for the occurence of the word "asset" and Xi+k the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero or more category, train a model which returns probability scores (such as a neural net as Luis Leal suggested) per label/class. You could than rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, dont bother much. Just train on your required categories which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label does not cross a threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classify4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html
i am working on kmeans clustering .
i have 3d dataset as no.days,frequency,food
->day is normalized by means & std deviation(SD) or better to say Standardization. which gives me range of [-2 to 14]
->for frequency and food which are NOMINAL data in my data sets are normalized by DIVIDE BY MAX ( x/max(x) ) which gives me range [0 to 1]
the problem is that the kmeans only considers the day-axis for grouping since there is obvious gap b/w points in this axis and almost ignores the other two of frequency and food (i think because of negligible gaps in frequency and food dims ).
if i apply the kmeans only on day-axis alone (1D) i get the exact similar result as i applied on 3D(days,frequency,food).
"before, i did x/max(x) as well for days but not acceptable"
so i want to know is there any way to normalize the other two nominal data of frequency and food and we get fair scaling based on DAY-axis.
food => 1,2,3
frequency => 1-36
The point of normalization is not just to get the values small.
The purpose is to have comparable value ranges - something which is really hard for attributes of different units, and may well be impossible for nominal data.
For your kind of data, k-means is probably the worst choice, because k-means relies on continuous values to work. If you have nominal values, it usually gets stuck easily. So my main recommendation is to not use k-means.
For k-means to wprk on your data, a difference of 1 must be the same in every attribute. So 1 day difference = difference between food q and food 2. And because k-means is based on squared errors the difference of food 1 to food 3 is 4x as much as food to food 2.
Unless you have above property, don't use k-means.
You can try to use the Value Difference Metric, VDM (or any variant) to convert pretty much every nominal attribute you encounter to a valid numeric representation. An after that you can just apply standardisation to the whole dataset as usual.
The original definition is here:
http://axon.cs.byu.edu/~randy/jair/wilson1.html
Although it should be easy to find implementations for every common language elsewhere.
N.B. for ordered nominal attributes such as your 'frequency' most of the time it is enough to just represent them as integers.