Use machine learning to predict the prices of used cars

I have a large table of used cars.
The header looks like this:
maker | model | year | kilometers | transmission | gas_type | price
I built a prediction model that works like this: every time I want to know the price of a car, I filter the data by maker and model, and then run a quadratic regression using year and kilometers as predictors.
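The per-model quadratic regression described above can be sketched roughly like this (the data here is synthetic, and the column layout is only assumed from the header in the question):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data for one (maker, model) subset:
# columns are [year, kilometers], target is price.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(2005, 2020, size=200),       # year
    rng.uniform(10_000, 200_000, size=200),   # kilometers
])
y = 30_000 + 800 * (X[:, 0] - 2005) - 0.08 * X[:, 1] + rng.normal(0, 500, 200)

# Degree-2 polynomial in (year, kilometers), fitted by least squares.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

predicted = model.predict([[2015, 60_000]])
```

This mirrors the described workflow: one such fit per maker/model subset, then predict from year and kilometers.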
The results are OK, but not for every car.
The problem is that there are different "versions" for the same maker and model.
(A FULL version is not the same as a base version, or 4WD, or leather seats, etc.)
How can I identify the differences? Can I use some kind of clustering to identify different versions among cars with the same maker and model?
Any help will be appreciated

That's not a clustering problem, just a sub-model feature. Also, you might want to differentiate a sub-model (standard, Luxury Edition, hatchback, etc.) from model-independent features (4WD, leather seats, premium sound system, sun roof, etc.). The sub-model would likely be a single feature (text column), while the options would be individual features (Boolean columns).
UPDATE AFTER OP CLARIFICATION
I see: those features are output, not input.
Yes, you can use clustering. However, that may or may not identify sub-models (your "versions"). If you cluster only observations that have very similar use (kilometers) and all other features equal, you will find some useful clustering. However, this will work only to the extent that the version is a major factor in the remaining price variation. You may find that your clustering is also affected by geographic region and other factors.
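One way to sketch this idea: fit the usual regression on year and kilometers, then cluster the *residuals*, so that any leftover version-driven price offset becomes the thing being clustered. This is a minimal illustration on synthetic data where a hidden "version" flag drives the residual variation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic subset for one maker/model: two hidden "versions" with a
# constant price offset between them.
rng = np.random.default_rng(1)
n = 300
year = rng.integers(2008, 2020, size=n).astype(float)
km = rng.uniform(20_000, 180_000, size=n)
version = rng.integers(0, 2, size=n)          # hidden: 0 = base, 1 = full
price = (25_000 + 700 * (year - 2008) - 0.06 * km
         + 4_000 * version + rng.normal(0, 300, n))

X = np.column_stack([year, km])
reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, price)
residuals = (price - reg.predict(X)).reshape(-1, 1)

# If "version" drives the leftover variation, the residuals separate cleanly.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(residuals)
```

As the answer notes, this only works when the version really is the dominant factor in the residual variation; region or condition effects would blur the clusters.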

Related

Method to cluster data that also has classification labels?

I have a dataset in which each line represents a person and their payment behavior during a full year. For each person I have 3 possible classification labels (age, gender, nationality). Payment behavior is defined by over 30 metrics such as number of payments and value of payments. Resulting dataset example looks something like this (I included a few random payment behavior metrics on the right):
My goal is to create classes (based on a combination of age/gender/nationality) that represent homogenous groups of people with similar payment behavior. For example: we find that 50-60 year old males from the US all have similar payment behavior. For each class I can then for example determine averages, standard deviations, percentiles etc. Since this seems to be an overlap between clustering and classification, I am stuck in what to research and where to look. Are there any methodologies I can look in to?
An option I'm thinking of would be to first create all possible classes (e.g. 50-M-US, 50-F-US, 51-M-US, etc.) and then merge them based on Euclidian distances (using all payment behavior metrics means) until a desired number of classes is left. Let me know what you think.
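The merge-by-distance idea in the last paragraph is essentially agglomerative clustering on the per-class metric means. A minimal sketch on made-up data (the column names and label buckets are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical person-level data: classification labels plus two payment metrics.
rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame({
    "age_bucket": rng.choice(["20s", "30s", "50s"], n),
    "gender": rng.choice(["M", "F"], n),
    "n_payments": rng.poisson(20, n),
    "total_value": rng.normal(1_000, 200, n),
})

# Step 1: one row per candidate class (age_bucket x gender), using metric means.
group_means = df.groupby(["age_bucket", "gender"])[["n_payments", "total_value"]].mean()

# Step 2: merge classes by Euclidean distance until a desired count remains.
# Ward linkage on standardized means approximates the merging described above.
Z = StandardScaler().fit_transform(group_means)
merged = AgglomerativeClustering(n_clusters=3).fit_predict(Z)
group_means["class"] = merged
```

Standardizing first matters here, since the payment metrics are on very different scales; otherwise the largest-magnitude metric dominates the Euclidean distances.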

Can different summary metrics of a single feature be used as a features for k-means clustering?

I have a scenario where I want to understand customers' behavior patterns and group them into different segments/clusters for an e-commerce platform. I chose an unsupervised machine learning algorithm, k-means clustering, to accomplish this task.
I have purchase_orders data available to me.
In the process of preparing my data set, I had a question: can different summary metrics (Sum, Avg, Min, Max, Standard Deviation) of a feature be treated as separate features? Or should I take only one summary metric (say, the sum of the total transaction amount of a customer over multiple orders) per feature?
Will this affect how the k-means algorithm works?
Which of the two data formats below would be better to feed to the algorithm to get good results?
Format-1:
Customer ID | Total.TransactionAmount | Min.TransactionAmount | Max.TransactionAmount | Avg.TransactionAmount | StdDev.TransactionAmount | TotalNo.ofTransactions and so on...
Format-2:
Customer ID | Total.TransactionAmount | TotalNo.ofTransactions and so on...
(Note: Consider "|" as feature separator)
(Note: Customer ID is not fed as input to the algo)
Yes you can, but whether this is a good idea is anything but clear.
These values will be correlated, and hence they will distort the results. It will likely worsen the problems you already have (such as the values not being on a linear scale, not being equally important and hence needing weighting, and not being of similar magnitude).
With features such as "transaction amount" and "number of transactions" you already have some pretty bad scaling issues to solve, so why add more?
It's straightforward to write down your objective function. Put your features into the equation, and try to understand what you are optimizing - is this really what you need? Or do you just want some random result?
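The scaling issue the answer warns about can be seen directly: in raw units the high-magnitude feature dominates the k-means objective, and standardizing changes which feature the clusters follow. A small synthetic illustration (feature names are invented to match the question):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two customer segments that differ only in number of transactions;
# transaction amount is on a much larger scale and is pure noise here.
rng = np.random.default_rng(3)
n_tx = np.concatenate([rng.normal(5, 1, 100), rng.normal(15, 1, 100)])
amount = rng.normal(50_000, 10_000, 200)   # dominates raw Euclidean distance
X = np.column_stack([amount, n_tx])
true = np.array([0] * 100 + [1] * 100)

def agreement(labels):
    # Fraction of points matching the true segments, up to label swap.
    return max((labels == true).mean(), (labels != true).mean())

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
```

On the raw matrix, k-means essentially clusters by transaction amount noise; after standardization it recovers the true segmentation. Adding several correlated summaries of the same quantity amplifies exactly this effect.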

Using decision tree in Recommender Systems

I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views,Clicks) which gets classified into two classes - Yes or No - which represents buying decision for an item X.
Using these values,
I'm trying to predict the probability of purchase for 1000 samples (customers) that look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get an individual probability or score that will help me recognize which customers are most likely to buy item X. I then want to be able to sort the predictions and show a particular item to only the 5 customers who are most likely to buy item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using a decision tree with a small sample set, you will definitely run into overfitting problems, especially at the lower levels of the tree, where you will have exponentially less data to train your decision boundaries. Your data set should have many more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type for each dimension. For example, 'sex' is categorical data, while 'age', 'time of day', etc. are real-valued inputs (discrete/continuous). So, different parts of your tree will need different formulations. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
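One common way to handle this split between categorical and numeric columns is one-hot encoding the categorical ones and passing the numeric ones through unchanged. A minimal sketch (the frame below is invented to mirror the question's columns, with times pre-converted to minutes since midnight):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training frame mirroring the question's columns.
df = pd.DataFrame({
    "age": [12, 50, 34, 41],
    "sex": ["Male", "Female", "Female", "Male"],
    "minutes": [570, 640, 600, 585],   # e.g. 9:30 -> 570
    "day": ["Monday", "Sunday", "Friday", "Monday"],
    "views": [10, 50, 20, 15],
    "clicks": [3, 6, 2, 4],
})
y = [0, 1, 1, 0]   # buying decision: No / Yes

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "day"])],
    remainder="passthrough",   # numeric columns pass through unchanged
)
clf = make_pipeline(pre, DecisionTreeClassifier(random_state=0)).fit(df, y)
```

With this setup the tree can still split numerically on age or minutes (e.g. `minutes <= 600`) instead of treating each time as its own class.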
Try some other algorithms, starting with simple ones like k-nearest neighbour (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend you something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in data. Say, a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. Very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks means the number of clicks and views by each customer for product X)
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
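The score-and-take-top-5 step can be sketched like this, using `predict_proba` from a tree classifier on toy data (the two numeric features and the buy rule are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data: two numeric features (say, views and clicks) and a
# yes/no purchase label.
rng = np.random.default_rng(4)
X_train = rng.uniform(0, 50, size=(200, 2))
y_train = (X_train[:, 1] > 25).astype(int)   # pretend: bought if clicks > 25

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Score 1000 new customers and keep the 5 with the highest buy probability.
X_new = rng.uniform(0, 50, size=(1000, 2))
buy_prob = clf.predict_proba(X_new)[:, 1]
top5 = np.argsort(buy_prob)[::-1][:5]
```

Note that a shallow tree produces coarse, tied probabilities (one value per leaf), so the "top 5" may be an arbitrary pick among many tied customers; a model with smoother scores (logistic regression, gradient boosting) ranks more finely.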

How to deal with feature vector of variable length?

Say you're trying to classify houses based on certain features:
Total area
Number of rooms
Garage area
But not all houses have garages. But when they do, their total area makes for a very discriminating feature. What's a good approach to leverage the information contained in this feature?
You could incorporate a zero/one dummy variable indicating whether there is a garage, as well as the cross-product of the garage area with the dummy (for houses with no garage, set the area to zero).
The best approach is to build your dataset with all the features and in most cases it is just fine to fill with zeroes those columns that are not available.
Using your example, it would be something like:
Total area | Number of rooms | Garage area
100        | 2               | 0
300        | 2               | 5
125        | 1               | 1.5
Often, the learning algorithm you choose will be powerful enough to use those zeroes to classify that entry properly. After all, the absence of a value is still information for the algorithm. This could become a problem if your data is skewed, but in that case you need to address the skewness anyway.
EDIT:
I just realized there is another answer with a comment from you about being afraid to use zeroes, since they could be confused with small garages. While I still don't see a problem with that (there should be enough difference between a small garage and zero), you can still use the same structure, marking the non-existent garage area with a negative number (say, -1).
The solution indicated in the other answer is perfectly plausible too: having an extra feature indicating whether the house has a garage or not would work fine (especially in decision-tree-based algorithms). I just prefer to keep the dimensionality of the data as low as possible, but in the end this is more a preference than a technical decision.
You'll want to incorporate a zero indicator feature. That is, a feature which is 1 when the garage size is 0, and 0 for any other value.
Your feature vector will then be:
area | num_rooms | garage_size | garage_exists
Your machine learning algorithm will then be able to see this (non-linear) feature of garage size.
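Building that indicator column is a one-liner; a minimal sketch using the example rows from the other answer:

```python
import numpy as np

# Raw data: total area, number of rooms, garage area (0 = no garage).
houses = np.array([
    [100.0, 2, 0.0],
    [300.0, 2, 5.0],
    [125.0, 1, 1.5],
])

garage_area = houses[:, 2]
garage_exists = (garage_area > 0).astype(float)   # 1 when a garage is present

# Final design matrix: area | num_rooms | garage_size | garage_exists.
X = np.column_stack([houses[:, 0], houses[:, 1], garage_area, garage_exists])
```

Linear models can then learn separate effects for "has a garage at all" and "how big the garage is", which a bare zero-filled column conflates.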

Product Categorization?

There are several data sets for automobile manufacturers and models. Each contains several hundreds data entries like the following:
Mercedes GLK 350 W2
Prius Plug-in Hybrid Advanced Toyota
General Motors Buick Regal 2012 GS 2.4L
How can I automatically divide the above entries into manufacturers (e.g. Toyota) and models (e.g. Prius Plug-in Hybrid Advanced) using only those files?
Thanks in advance.
Machine Learning (ML) typically relies on training data which allows the ML logic to produce and validate a model of the underlying data. With this model, it is then in a position to infer the class of new data presented to it (in the classifier application, as the one at hand) or to infer the value of some variable (in the regression case, as would be, say, an ML application predicting the amount of rain a particular region will receive next month).
The situation presented in the question is a bit puzzling, at several levels.
Firstly, the number of automobile manufacturers is finite and relatively small. It would therefore be easy to manually make a list of these manufacturers and then simply use this lexicon to parse out the manufacturers from the model descriptions, using plain string parsing techniques, i.e. no ML is needed or even desired here. (Alas, the requirement that one use "...only those files" seems to preclude this option.)
Secondly, one can think of a few patterns or heuristics that could be used to produce the desired classifier (tentatively a relatively weak one, as the patterns/heuristics that come to mind at the moment seem relatively unreliable). Furthermore, such an approach is also not quite an ML approach in the common understanding of the term.
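For completeness, the lexicon-based string parsing suggested above can be sketched as follows (the maker list is a hypothetical hand-made lexicon; entries are from the question):

```python
# A small manufacturer lexicon; a real one would be curated manually.
MAKERS = ["Toyota", "Mercedes", "Buick", "General Motors"]

def split_entry(entry):
    """Find a known maker anywhere in the entry; the rest is the model."""
    for maker in sorted(MAKERS, key=len, reverse=True):   # prefer longer matches
        if maker.lower() in entry.lower():
            idx = entry.lower().find(maker.lower())
            model = (entry[:idx] + entry[idx + len(maker):]).strip()
            return maker, model
    return None, entry

maker, model = split_entry("Prius Plug-in Hybrid Advanced Toyota")
```

This handles the maker appearing at the start, middle, or end of the entry, which is exactly the irregularity shown in the sample rows; entries whose maker is missing from the lexicon fall through unparsed.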
