Clustering with 1/3rd of the values as Zero [closed] - machine-learning

I have a dataset on property. It has rental values, deposit amount, number of bedrooms, area, etc.
At least a third of the rental column values are just zero; there is no value in them.
I have to perform clustering, but the rent values are highly skewed.
Can I ignore that third of the rows while performing clustering, or should I impute values? What is the right method to impute them?

It depends on the aim of the clustering. You could ignore the rent data (i.e., delete the column) and proceed with clustering; you would then get clusters based on the remaining features such as size, number of rooms, etc.
If the rent amount is an important feature for distinguishing one property from another, then you should keep the column but remove the rows that have zero (or NaN) values in it. Imputing with the mean is a bad idea: a 10-bedroom apartment will have a very different rent from a 1-bedroom apartment, so you would be adding a lot of noise to the data.
What I would do is a few steps:
(1) Extract the rows with zero rent value and use them as a "test dataset".
(2) Use the remaining data to train a regression model to predict the rent value, i.e., do the usual train/validation/test split to get the best-performing model.
(3) Apply the selected model to the "test dataset" to fill in the rent values.
(4) Combine the two datasets, but do this first (explained below): in the "test dataset" with predicted rent values, add a column called "recognise" and give it a constant value, say 1001; in the dataset with real rent values, add the same column with a constant value, say 1000. Now you have a full dataset with complete rent values to do clustering!
Now let me explain the "recognise" column. It will have very little influence on the clustering, because 1000 is close to 1001, but it lets you recognise which records have real rent values (1000) and which have predicted ones (1001), for later analysis if needed.
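A minimal sketch of this workflow with pandas and scikit-learn follows. The toy data, the column names (bedrooms, area, deposit, rent) and the choice of RandomForestRegressor are assumptions for illustration only, not part of the original question.

```python
# Minimal sketch of the regression-based imputation described above.
# The toy data, the column names and the choice of model are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "area": rng.uniform(30.0, 200.0, n),
})
df["deposit"] = df["area"] * 20 + rng.normal(0, 50, n)
df["rent"] = df["area"] * 10 + df["bedrooms"] * 100 + rng.normal(0, 100, n)
df.loc[rng.choice(n, n // 3, replace=False), "rent"] = 0  # a third of the rents are zero

features = ["bedrooms", "area", "deposit"]
known = df[df["rent"] > 0].copy()      # rows with real rent values
missing = df[df["rent"] == 0].copy()   # rows whose rent has to be predicted

# (1)-(2): train and validate a regressor on the rows with known rent.
X_train, X_val, y_train, y_val = train_test_split(
    known[features], known["rent"], test_size=0.2, random_state=0
)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("validation R^2:", model.score(X_val, y_val))

# (3): fill in predicted rents for the zero-rent rows.
missing["rent"] = model.predict(missing[features])

# (4): mark which rows carry real vs. predicted rents, then recombine.
known["recognise"] = 1000      # real rent
missing["recognise"] = 1001    # predicted rent
full = pd.concat([known, missing], ignore_index=True)  # ready for clustering
```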

Related

Overlapping dependent time series, ML problem approach [closed]

Below is a simplified description of the problem:
Three weeks before delivery of a product, the buyer gives an estimate of the quantity that will be delivered on a certain demand date.
This quantity might change as the time of delivery approaches (illustrated in the image below). This seems quite straightforward, but there is a high correlation between the demand weeks: e.g., if the quantity is lowered for one week, it is likely that a surrounding week will increase.
Is there an approach that will get the model to acknowledge the surrounding demand weeks?
I'm currently using random forest regression with the attributes shown in the image, and the results are OK, but I thought asking for inspiration here might be a good idea.
From your description I understand that you are currently using only the buyer's forecasts as input, and that you would also like to consider the actual quantity of the previous week(s) as input for the next estimate. To achieve this you could create another column in your table that is the actual quantity shifted by one week, giving you a new column "Actual Qty previous week". Then you can train your model to predict using both the buyer forecast and the actual quantity from the previous week. Of course you can do the same thing once more and shift by two weeks to also make the week before that available.
In addition, you can come up with more elaborate calculated features. One idea would be the average deviation of the buyer forecast from the final demand (taking the average over, e.g., the last 10 weeks). That way you would be able to detect that some buyers tend to overestimate and some tend to underestimate.
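A minimal sketch of these lag features with pandas; the DataFrame layout, the toy numbers and the 3-week rolling window are assumptions for illustration (the answer above suggests e.g. 10 weeks).

```python
# Lag features for demand forecasting: previous weeks' actual quantities and a
# rolling forecast bias, computed per buyer. Layout and names are assumptions.
import pandas as pd

demand = pd.DataFrame({
    "buyer": ["A"] * 5 + ["B"] * 5,
    "week": list(range(1, 6)) * 2,
    "buyer_forecast": [100, 110, 95, 105, 100, 50, 55, 60, 52, 58],
    "actual_qty":     [ 98, 120, 90, 107, 102, 48, 60, 63, 50, 59],
}).sort_values(["buyer", "week"])

# Actual quantity of the previous one and two weeks, per buyer.
demand["actual_prev_week"] = demand.groupby("buyer")["actual_qty"].shift(1)
demand["actual_prev_2_weeks"] = demand.groupby("buyer")["actual_qty"].shift(2)

# Average deviation of the buyer forecast from the final demand over the
# previous weeks (3-week window here, e.g. 10 in practice).
demand["deviation"] = demand["buyer_forecast"] - demand["actual_qty"]
demand["avg_forecast_bias"] = (
    demand.groupby("buyer")["deviation"]
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
print(demand)
```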
Since you mentioned that variations of the quantity influence the subsequent weeks, I propose to do just that: create a new feature that shows the variation.
This implies running the predictive algorithm iteratively, one week after the other, each time adding a new feature to the dataset: the variation of the predicted total quantity for the previous weeks.
The method would go like this:
run the prediction model for week 1
add a feature to the dataset: the variation of the predicted quantity for week 1
run the prediction model for week 2
add a feature to the dataset: the variation of the predicted quantity for weeks 1 and 2
run the prediction model for week 3
etc.
This is of course only the idea; it is possible to add different kinds of features (the variation of the last week only, a moving average of the last weeks, whatever makes sense, ...).
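As a concrete illustration, here is a minimal sketch of that iterative loop with pandas and scikit-learn. The data layout (one row per order, one forecast/actual column per demand week), the column names and the choice of RandomForestRegressor are all assumptions, not taken from the question.

```python
# Sketch of the iterative scheme: predict each demand week in turn and, after
# every week, add the variation between prediction and buyer forecast as a new
# feature for the following weeks. Layout and column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_orders, weeks = 200, [1, 2, 3]
df = pd.DataFrame(index=range(n_orders))
for w in weeks:
    df[f"forecast_week{w}"] = rng.uniform(80, 120, n_orders)
    df[f"actual_week{w}"] = df[f"forecast_week{w}"] + rng.normal(0, 5, n_orders)

train = df.index[:150]   # toy split: first 150 orders for training
extra_features = []      # grows by one variation feature per week

for w in weeks:
    cols = [f"forecast_week{w}"] + extra_features
    model = RandomForestRegressor(random_state=0)
    model.fit(df.loc[train, cols], df.loc[train, f"actual_week{w}"])
    df[f"pred_week{w}"] = model.predict(df[cols])

    # New feature: variation of the predicted quantity vs. the forecast for
    # this week, made available to the models of the following weeks.
    df[f"variation_week{w}"] = df[f"pred_week{w}"] - df[f"forecast_week{w}"]
    extra_features.append(f"variation_week{w}")
```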

Is this problem a classification or regression?

In a lecture from Andrew Ng, he asked whether the problem below is a classification or a regression problem. Answer: It is a regression problem.
You have a large inventory of identical items. You want to predict how
many of these items will sell over the next 3 months.
It looks like I am missing something. Per my understanding it should be a classification problem, because we have to classify each item into two categories, i.e., it can be sold or not, which are discrete values rather than continuous ones.
I'm not sure where the gap in my understanding is.
Your thinking is that you have a database of items with their respective features and you want to predict whether each item will be sold; at the end, you would simply count the number of items that can be sold. If you frame the problem this way, then it would indeed be a classification problem.
However, note the following sentence in your question:
You have a large inventory of identical items.
Identical items means that all items have exactly the same features. If you came up with a binary classifier that tells whether a product can be sold or not, then, since all feature values are exactly the same, the classifier would put every item in the same category.
I would guess that, to solve this problem, you would have access to something like the time series of items sold per month for the past 5 years. You would then crunch this data and extrapolate into the future. You would not be classifying each item individually, but calculating a numerical value that indicates the number of items sold 1, 2, and 3 months in the future.
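For instance, a minimal sketch of that regression framing; the 5 years of monthly sales here are synthetic stand-in data and the linear model is just one possible choice.

```python
# Fit a simple regression on monthly sales history and extrapolate the next
# 3 months. The data and the model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
months = np.arange(60).reshape(-1, 1)                     # 5 years of monthly history
sold = 500 + 4 * months.ravel() + rng.normal(0, 30, 60)   # toy counts of items sold

model = LinearRegression().fit(months, sold)
future = np.arange(60, 63).reshape(-1, 1)                 # the next 3 months
print(np.round(model.predict(future)))                    # numerical outputs, not categories
```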
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
A numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
Although your end result will be an integer (a discrete set of numbers), note that it is still a numerical value, not a category. You can manipulate numerical values mathematically (e.g., calculate the average number of items sold in the next year, or find the peak number of items sold in the next 3 months), but you cannot do that with discrete categories (e.g., what would be the average of a cellphone and a telephone?).
Classification problems are the ones where the output is either categorical or ordinal (discrete categories, as per Bishop). Regression problems output numerical values (continuous variables, as per Bishop).
Your system might be restricted to outputting integers instead of real numbers, but that won't change the nature of the variable from being numerical. Therefore, your problem is a regression problem.

How to round a prediction when it should be a (non-categorical) integer? [closed]

Say I am trying to predict a variable y which is a score from 0 to 10 (integer numbers only), and I am using a linear regression model. The model actually produces real numbers in that interval.
I am using regression, and not classification, because I want to be able to say that missing the correct prediction by (say) 2 is worse than missing it by 1. Currently I am using the mean absolute error as the evaluation metric.
Given that the prediction from the model is a real number, what is the best way to constrain it to the allowed set of integers (from 0 to 10)? Should I just round the prediction to the nearest integer, or is there a better way?
You could also use a multinomial logistic regression model and measure its performance with classification accuracy.
Have a range from 0 to 11 and round to the nearest .5 mark. This gives you evenly spaced, equally sized categories, one per integer score. If you can, weight the regression by how close the prediction was to the .5 mark, as results should ideally not be close enough to a boundary to cause ambiguity.
Alternatively, use a range from -0.5 to 10.5 and the integers as the target. It makes no difference but is compatible with your existing model.
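In the simplest case, the -0.5 to 10.5 variant amounts to rounding the raw prediction to the nearest integer and clipping it into [0, 10]; a minimal sketch (the toy predictions are made up):

```python
# Round regression outputs to the nearest integer and clip them to the
# allowed score range 0..10. The raw predictions are illustrative values.
import numpy as np

raw_predictions = np.array([-0.7, 3.2, 5.5, 9.8, 11.4])
scores = np.clip(np.rint(raw_predictions), 0, 10).astype(int)
print(scores)  # [ 0  3  6 10 10]
```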

How to learn a language model? [closed]

I'm trying to train an LSTM language model on the Penn Treebank (PTB) corpus.
I was thinking that I should simply train on every bigram in the corpus so that it could predict the next word given the previous word, but then it wouldn't be able to predict the next word based on multiple preceding words.
So what exactly does it mean to train a language model?
In my current implementation, I have batch size = 20 and the vocabulary size is 10,000, so I have 20 resulting matrices of 10k entries (parameters?), and the loss is calculated by comparison to 20 ground-truth matrices of 10k entries, where only the index of the actual next word is 1 and the other entries are zero. Is this the right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the right range of what it usually is, say around 100.
So what exactly does it mean to train a language model?
I don't think you need to train on every bigram in the corpus. Just use a sequence-to-sequence model, and when you predict the next word given the previous words, choose the one with the highest probability.
so I have 20 resulting matrices of 10k entries (parameters?)
Yes, per step of decoding.
Is this the right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the right range of what it usually is, say around 100.
You could first read some open-source code as a reference, for instance word-rnn-tensorflow and char-rnn-tensorflow. At the start of training the loss is roughly -log(1/10000), which is about 9.2 per word; that corresponds to a model that has not been trained at all and picks words completely at random (the true perplexity of such a model would be the exponential of that, i.e. 10,000). As the model is tuned this loss decreases, so a value around 2 is reasonable. The 100 in your statement probably refers to the loss per sentence rather than per word.
For example, if tf.contrib.seq2seq.sequence_loss is used to calculate the loss, the result will be less than 10 if you keep both average_across_timesteps and average_across_batch at their default of True, but if you set average_across_timesteps to False and the average length of the sequences is about 10, it will be about 100.
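To make those numbers concrete, here is a small back-of-the-envelope check for an untrained model over a 10,000-word vocabulary (the 10-word average sequence length is the assumption mentioned above):

```python
# Loss of a model that picks each of the 10,000 vocabulary words uniformly at
# random, averaged per word vs. summed over a sequence.
import math

vocab_size = 10_000
avg_seq_len = 10  # assumed average sequence length

per_word_loss = -math.log(1.0 / vocab_size)       # ~9.21, averaged across timesteps
per_sequence_loss = per_word_loss * avg_seq_len   # ~92, summed over a 10-word sequence
perplexity = math.exp(per_word_loss)              # 10000, the corresponding perplexity

print(per_word_loss, per_sequence_loss, perplexity)
```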

Best way to detect features based on text [closed]

I have a "simple" problem: I have text sections and based on this it should be decided its whether "Category A" or "Category B".
As training data I have classified sections of text, which the algorithm can be trained.
The text sections look something like this:
Category A
a blue car drives
or
the blue bus stops
or
the blue bike drives
Category B
a red bike drives
or
the red bus stops
(A section contains up to 20 words and the variety is massive.)
If I have trained the algorithm with this example data, it should decide that if a text contains "blue" it is Category A, if it contains "red" it is Category B, and so on.
The algorithm should learn from the training data whether the frequency of a word makes Category A or B more likely.
What's the best way to do this, and which tool should I use?
You can try the Fisher method, in which the probability of both the positive (A) and the negative (B) category is calculated for each feature word (red, blue) in the document. The probability that a sentence containing each of the two specified words (red, blue) belongs to the specified category (A, B) is obtained, assuming there will be an equal number of items in each category. Then a combined probability is obtained.
Since the features are not independent, this won't be a real probability, but it works much like a Bayesian classifier. The value returned by the Fisher method is a much better estimate of probability, which can be very useful when reporting results or deciding cutoffs.
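A minimal sketch of that combination step: -2 times the sum of the log-probabilities is compared against a chi-squared distribution. The helper functions and the example probabilities below are illustrative assumptions, not a library API.

```python
# Fisher's method for combining per-feature category probabilities.
import math

def inv_chi2(chi, df):
    # Survival function of the chi-squared distribution for an even number of
    # degrees of freedom, via its series expansion.
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(feature_probs):
    # Under independence, -2 * sum(ln p) is chi-squared distributed with
    # 2 * len(feature_probs) degrees of freedom.
    chi = -2.0 * sum(math.log(p) for p in feature_probs)
    return inv_chi2(chi, 2 * len(feature_probs))

# Made-up P(category A | word) values for the words "blue", "car", "drives".
print(fisher_score([0.9, 0.9, 0.7]))  # ~0.98: strong evidence for category A
```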
I think a first try should be logistic regression, as you have a binary classification problem. As soon as you have defined your feature vector (e.g., the frequencies of a set of chosen words), you can optimize the parameters of the cost function used for binary classification (e.g., with the sigmoid function).
A step you will probably need is to eliminate 'stop words'.
I really recommend the Coursera Machine Learning classes.
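A minimal sketch of that approach with scikit-learn (an assumed tool choice), using word counts as the feature vector and dropping English stop words:

```python
# Logistic regression on word counts for the two-category text problem above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "a blue car drives", "the blue bus stops", "the blue bike drives",  # Category A
    "a red bike drives", "the red bus stops",                           # Category B
]
labels = ["A", "A", "A", "B", "B"]

# stop_words="english" removes words like "a" and "the" before counting.
clf = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["the red bike stops"]))  # "red" should push this towards B
```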
