Image classification of house build year: regression or classification? - machine-learning

Let's say I want to find out when a house was built by training a CNN on a training set of housing images with the following mapping:
Input Pictures [244, 244, 3] -> Output Year [1850, 1851, ... , 2018]
It's a supervised learning problem so the labels are known (years from 1850-2018).
Should I build a classification or a regression model to solve this problem? I'm unsure because I don't have inputs for every year from 1850-2018, but I want the model to be able to output any of these years for new pictures that I give it after training is done. This would point me to a regression model.
On the other hand, I don't want the model to output continuous Y's, because I'm interested in the concrete year the building was built, not an in-between value.
The answer to this may be super simple but I can't figure it out.

This is clearly a regression problem. If you were to treat each year as a separate class, classes 1900 and 2017 would be equally wrong predictions for 2018 (the numerical value doesn't matter in classification). But obviously the two predictions, 2017 vs. 1900, are very different when the true label is 2018. Also, regression will allow you to generalize to unseen years, as you stated yourself. This is practically impossible in classification if those classes aren't present in training.
If your end result must be an integer, I'd suggest you implement an interpretation of the regression output. For example, it could return a rounded value if the output is within certain bounds of an integer, or two years otherwise (when the model isn't sure):
regression_output=2000.23 -> result_year=2000
regression_output=2000.96 -> result_year=2001
regression_output=2000.45 -> result_year=2000/2001
This way you'll have one more parameter to tune. E.g., having the tolerance=0.5 will make your model always sure.
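The interpretation step above can be sketched in a few lines. This is only an illustration: the function name `interpret_year` and the default `tolerance=0.3` are assumptions, not part of the original answer.

```python
import math

def interpret_year(regression_output, tolerance=0.3):
    """Map a continuous prediction to one year, or to two candidate years.

    `tolerance` (hypothetical name) is the largest distance from the nearest
    integer at which the prediction is still treated as "sure".
    """
    nearest = math.floor(regression_output + 0.5)  # round half up
    if abs(regression_output - nearest) <= tolerance:
        return (nearest,)
    # Not close enough to either neighbour: report both surrounding years.
    lower = math.floor(regression_output)
    return (lower, lower + 1)

print(interpret_year(2000.23))  # one year
print(interpret_year(2000.96))  # one year
print(interpret_year(2000.45))  # two candidate years
```

With `tolerance=0.5` every output is within tolerance of its nearest integer, so the function always returns a single year.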

Related

Can logistic and linear regression produce a prediction on a scale?

I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to start a sort of questionnaire with people. However, before I can correctly set up such a questionnaire, I should have an idea of what kind of labels I should use for my training data.
At first thought, I was thinking about letting people rate the drawings on a scale, for example from 1 to 5 with 1 being bad, 3 being average and 5 being good. Alternatively, I could also reduce the question to a simple good or bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then composed, I would need a machine learning algorithm (model) which, given a drawing, predicts whether the drawing is good or not. Ideally, I would have some way of tuning the strictness of this prediction. For example, instead of simply predicting 'good' or 'bad', the model could predict the likelihood of a drawing being good on a scale of 0 to 1. I could then say "Well, let's say all drawings which are at least 70% likely to be good are considered good". Another example would be that the model predicts the goodness using the same categorical values the people used to rate the drawing initially. So it would predict the drawing being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all drawings which are rated at least a 4 are considered good drawings" and tune this threshold to my liking.
After doing some research, I came up with logistic and linear regression being good candidates. However, which of the two would be the best for my scenario? Equally important, how would I need to format my labels? Just simple 0's and 1's, or a scale?
You could use a 1 vs all representation if you wanted to use a multi-class categorical classification:
Essentially, you train one classifier for every category you have (for example, with 5 rating categories you train 5 classifiers), and each classifier is trained only to predict whether or not a sample belongs to its specific class.
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as using categorical cross-entropy, but given that you'd like to use ordinal data, a linear regression used as a regression model is likely a better fit. You'd predict a value within the range of your rating scale (e.g. between 1 and 5) and then just round to the nearest integer. This way you aren't penalizing close guesses as much as far away guesses.
What keeps you from using a logistic regression model? For lack of a better dataset I used the standard diabetes data. The target variable is an integer between 25 and 346. I normalised it to [0, 1] so that I can use a sigmoid as the activation function. For the loss I decided to use the mean squared error:
import tensorflow as tf
from sklearn import datasets

# Load the diabetes regression data (10 features, integer-valued target).
diabetes = datasets.load_diabetes()
x_train = diabetes.data
# Rescale the target to [0, 1] so that it matches the output range of the sigmoid.
target = diabetes.target
y_train = (target - target.min()) / (target.max() - target.min())

inputs = tf.keras.Input(shape=(x_train.shape[1],))
# A single sigmoid unit: a logistic-regression-style model with a bounded output.
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="mse",
              metrics=["mae"])  # a regression metric; accuracy is not meaningful here
history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=300,
                    # Note: this validates on the training data itself.
                    validation_data=(x_train, y_train))
You could also use a linear regression model. There you only need to replace the activation function with a linear one. However, I think the squashing character of the sigmoid is an advantage, not least because it ensures that no predicted rating is larger than 1 or smaller than 0.
A last alternative would be to train on pair-wise preferences. The idea is to show the human two drawings and ask which one they like more. Then build a binary model, e.g., logistic regression. This approach appears preferable to me, as the question is easier for the human to answer.
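A minimal sketch of the pair-wise idea, assuming one common formulation: a logistic regression on the difference of the two drawings' feature vectors, without an intercept. The function names, the toy features, and the learning rate are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_preference_model(pairs, n_features, lr=0.5, epochs=200):
    """pairs: list of (x_a, x_b, a_preferred) with a_preferred in {0, 1}."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x_a, x_b, label in pairs:
            diff = [a - b for a, b in zip(x_a, x_b)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            # Stochastic gradient step on the logistic log-loss.
            w = [wi + lr * (label - p) * di for wi, di in zip(w, diff)]
    return w

def score(w, x):
    """Higher score means the drawing is predicted to be preferred."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data (hypothetical): three drawings with two features each.
a, b, c = [0.9, 0.2], [0.5, 0.8], [0.1, 0.5]
# Each tuple records one human answer: was the first drawing preferred?
pairs = [(a, b, 1), (b, c, 1), (a, c, 1), (c, a, 0)]
w = fit_preference_model(pairs, n_features=2)
ranking = sorted([("a", a), ("b", b), ("c", c)], key=lambda item: -score(w, item[1]))
print([name for name, _ in ranking])
```

The learned score gives a continuous "goodness" value per drawing, so a strictness threshold can still be tuned afterwards.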

Cleveland heart disease dataset - can’t describe the class

I’m using the Cleveland Heart Disease dataset from UCI for classification but I don’t understand the target attribute.
The dataset description says that the values go from 0 to 4 but the attribute description says:
0: < 50% coronary disease
1: > 50% coronary disease
I’d like to know how to interpret this. Is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1-4 into a single class (presence of disease)?
If you are working on an imbalanced dataset, you should use a re-sampling technique to get better results. With imbalanced datasets, a classifier can end up "predicting" the most common class almost always, without making real use of the features.
You should try SMOTE, which synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing the k nearest neighbors of this point, and adding synthetic points between the chosen point and its neighbors.
I also used k-fold cross-validation along with SMOTE. Cross-validation helps check that the model picks up patterns that generalize rather than artifacts of one particular split.
When measuring the performance of the model, the accuracy metric can mislead: it shows high accuracy even when there are many false positives or false negatives on the minority class. Use metrics such as the F1-score and MCC.
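To see why accuracy can mislead, here is a small sketch that computes accuracy, F1 and MCC from the confusion counts. The 90/10 class split and the "always predict the majority class" model are invented for illustration.

```python
import math

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def mcc(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Hypothetical imbalanced data: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), f1_score(*counts), mcc(*counts))
```

The degenerate model looks good on accuracy, while F1 and MCC both show that it never finds a positive case.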
References :
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
It basically means that the presence of different heart diseases has been denoted by 1, 2, 3, 4, while the absence is simply denoted by 0. Now, most of the experiments that have been conducted on this dataset have been based on binary classification, i.e. presence (1, 2, 3, 4) vs. absence (0). One reason for this might be the class imbalance problem (0 has about 160 samples and the rest, 1, 2, 3 and 4, make up the other half) and the small number of samples (only around 300 in total). So, it makes sense to treat this data as a binary classification problem instead of a multi-class one, given the constraints that we have.
is this dataset meant to be a multiclass or a binary classification problem?
Without changes, the dataset is ready to be used for a multi-class classification problem.
And must i group values 1-4 to a single class (presence of disease)?
Yes, you must, as long as you are interested in using the dataset for a binary classification problem.
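Grouping values 1-4 into a single "presence" class is a one-line transformation. A small sketch, with a made-up list of target values standing in for the dataset's target column:

```python
def to_binary(target_values):
    """Collapse the 0-4 target into absence (0) vs. presence (1)."""
    return [0 if value == 0 else 1 for value in target_values]

# Hypothetical target column with the original 0-4 coding.
labels = [0, 2, 1, 0, 4, 3, 0]
binary_labels = to_binary(labels)
print(binary_labels)
```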

Wrong way to cascade classifiers in Weka

I have a data set with two classes and was trying to get an optimal classifier using Weka. The best classifier I could obtain was about 79% accuracy. Then I tried adding attributes to my data by classifying it and saving the probability distribution generated by this classification in the data itself.
When I reran the training process on the modified data I got over 93% accuracy! I'm sure this is wrong but I can't exactly figure out why.
These are the exact steps I went through:
Open the data in Weka.
Click on add Filter and select AddClassification from Supervised->attribute.
Select a classifier. I select J48 with default settings.
Set "Output Classification" to false and set Output Distribution to true.
Run the filter and restore the class to be your original nominal class. Note the additional attributes added to the end of the attribute list. They will have the names: distribution_yourFirstClassName and distribution_yourSecondClassName.
Go to the Classify tab and select a classifier: again I selected J48.
Run it. In this step I noticed much more accuracy than before.
Is this a valid way of creating classifiers? Didn't I "cheat" by adding classification information within the original data? If it is valid, how would one proceed to create a classifier that can predict unlabeled data? How can it add the additional attribute (the distribution) ?
I did try reproducing the same effect using a FilteredClassifier but it didn't work.
Thanks.
The process that you appear to have undertaken seems somewhat close to the Stacking ensemble method, where classifier outputs are used to generate an ensemble output (more on that here).
In your case, however, the attributes and the output of a previously trained classifier are being used together to predict your class. It is likely that most of the second J48 model's rules will be based on the first (as the class will correlate more strongly with the first J48's output than with the other attributes), but with some fine-tuning to improve model accuracy. In this case, the concept of 'two heads are better than one' is used to improve the overall performance of the model.
That's not to say that it is all good though. If you needed to use your J48 with unseen data, then you would not be able to use the same J48 that was used for your attributes (unless you saved it previously). Additionally, you are adding more processing work by using more than one classifier as opposed to the single J48. These costs would also need to be considered against the problem that you are tackling.
Hope this helps!
Okay, here is how I did cascaded learning:
I have the dataset D and divided into 10 equal sized stratified folds (D1 to D10) without repetition.
I applied algorithm A1 to train a classifier C1 on D1 to D9 and then just like you, applied C1 on D10 to give me the additional distribution of positive and negative classes. I name this D10 with the additional two (or more, depending on what information from C1 you want to be included in D10) attributes/features as D10_new.
Next, I applied the same algorithm to train a classifier C2 on D1 to D8 and D10 and then just like you, applied C2 on D9 to give me the additional distribution of positive and negative classes. I name this D9 with the additional attributes/features as D9_new.
In this way I create D1_new to D10_new.
Then I applied another classifier (perhaps with algorithm A2) on these D1_new to D10_new to predict the labels (a 10 fold CV is a good choice).
In this setup, you removed the bias of seeing the data prior to testing it. Also, it is advisable that A1 and A2 should be different.
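The fold-by-fold procedure above can be sketched as follows. This is a plain-Python illustration, not Weka code: the nearest-centroid learner stands in for algorithm A1, and all function names and the toy data are assumptions.

```python
import math

def stratified_folds(labels, k):
    """Assign sample indices to k folds, spreading each class evenly."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

def train_centroids(X, y):
    """Stand-in for A1: store the mean feature vector of each class."""
    centroids = {}
    for cls in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def centroid_distribution(centroids, x):
    """Class distribution from inverse distances to the class centroids."""
    weights = [1.0 / (1e-9 + math.dist(x, c)) for _, c in sorted(centroids.items())]
    total = sum(weights)
    return [w / total for w in weights]

def add_out_of_fold_distribution(X, y, k):
    """Append to each sample the distribution from a model that never saw it."""
    extra = [None] * len(X)
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            extra[i] = centroid_distribution(model, X[i])
    return [list(x) + e for x, e in zip(X, extra)]

# Toy data (hypothetical): one feature, two classes.
X = [[0.0], [0.2], [0.1], [0.3], [1.0], [0.9], [1.1], [0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X_new = add_out_of_fold_distribution(X, y, k=4)
print(X_new[0])
```

A second-stage classifier (A2) would then be trained and cross-validated on `X_new`. The key property is that no sample's extra attributes come from a model that was trained on that sample.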

Support vector machines for multiple object categorization

I am trying to use linear SVMs for multi-class object category recognition. So far what I have understood is that there are mainly two approaches used- one-vs-all(OVA) and one-vs-one(OVO).
But I am having difficulty understanding its implementation. The steps that I think are used are:
First the feature descriptors are prepared, let's say with SIFT. So I have a 128 x N feature matrix.
Next, to prepare an SVM classifier model for a particular object category (say car), I take 50 images of cars as the positive training set and a total of 50 images from the remaining categories, taken randomly from each category (is this part correct?). I prepare such models for all categories (say 5 of them).
Next when I have an input image, do I need to input the image into all the 5 models and then check their values (+1/-1) for each of these models? I am having difficulty understanding this part.
In the one-vs-all approach, you have to check all 5 models. Then you can take the decision with the highest confidence value. LIBSVM gives probability estimates.
In the one-vs-one approach, you can take the majority vote. For example, you test 1 vs. 2, 1 vs. 3, 1 vs. 4 and 1 vs. 5, and you classify it as 1 in 3 cases. You do the same for the other 4 classes. Suppose the vote counts for the other four classes are [2, 2, 2, 1] (with 5 classes there are 10 pairwise tests, so all counts sum to 10). Class 1 was then obtained the most times, making it the final class. In this case, you could also sum the probability estimates and take the maximum. That would work unless the classification goes badly wrong in one pair. For example, if 1 vs. 4 picks 4 (the true class is 1) with a confidence of 0.7, then just because of this one decision the total of probability estimates for the wrong class may shoot up and give wrong results. This issue can be examined experimentally.
LIBSVM uses one vs. one. You can check the reasoning here. You can read this paper too where they defend one vs. all classification approach and conclude that it is not necessarily worse than one vs. one.
In short, for a given category your positive training samples are always the same. In one-vs-one you train a separate classifier against each of the other classes, with negative samples taken from that class only (n-1 classifiers per category, n(n-1)/2 in total). In one-vs-all you lump all negative samples together and train a single classifier per category. The problem with the former approach is that you have to consider all pairwise outcomes to decide on the class. The problem with the latter approach is that lumping all negative object classes together may create a non-homogeneous class that is hard to process and analyse.
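The two decision rules discussed in this thread can be sketched independently of any SVM library. The confidences and pairwise winners below are invented numbers, and the function names are assumptions for illustration.

```python
from collections import Counter

def one_vs_all_decision(confidences):
    """confidences: {category: confidence of that category's 'vs. rest' model}."""
    return max(confidences, key=confidences.get)

def one_vs_one_decision(pairwise_winners):
    """pairwise_winners: {(class_a, class_b): winning class} for every pair."""
    votes = Counter(pairwise_winners.values())
    # On a tie, most_common keeps the class that was counted first.
    return votes.most_common(1)[0][0], votes

# One-vs-all: run the image through all 5 models, keep the most confident one.
confidences = {"car": 0.81, "bike": 0.40, "cat": 0.12, "dog": 0.09, "tree": 0.30}
print(one_vs_all_decision(confidences))

# One-vs-one with 5 classes: 10 pairwise models, one vote each.
pairwise_winners = {
    (1, 2): 1, (1, 3): 1, (1, 4): 1, (1, 5): 5,
    (2, 3): 2, (2, 4): 4, (2, 5): 2,
    (3, 4): 3, (3, 5): 3,
    (4, 5): 4,
}
winner, votes = one_vs_one_decision(pairwise_winners)
print(winner, dict(votes))
```

Summing probability estimates instead of counting votes would only change what is accumulated per class; the argmax step stays the same.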

Ordinal classification packages and algorithms

I'm attempting to make a classifier that chooses a rating (1-5) for a item i. For each item i, I have a vector x containing about 40 different quantities pertaining to i. I also have a gold standard rating for each item. Based on some function of x, I want to train a classifier to give me a rating 1-5 that closely matches the gold standard.
Most of the information I've seen on classifiers deal with just binary decisions, while I have a rating decision. Are there common techniques or code libraries out there to deal with this sort of problem?
I agree with you that ML problems in which the response variable is on an ordinal scale require special handling. 'Classification mode' (i.e., returning a class label) seems insufficient because the class labels ignore the ordering among the labels ("1st, 2nd, 3rd"). Likewise, 'regression mode' (i.e., treating the ordinal labels as floats, {1, 2, 3}) is insufficient because it assumes a metric distance between the response values that ordinal data does not guarantee (e.g., the gap between 2 and 3 need not equal the gap between 1 and 2).
R has (at least) several packages directed to ordinal regression. One of these is actually called Ordinal, but I haven't used it. I have used the Design package in R for ordinal regression and I can certainly recommend it. Design contains a complete set of functions for solution, diagnostics, testing, and results presentation of ordinal regression problems via the Ordinal Logistic Model. Both packages are available from CRAN. A step-by-step solution of an ordinal regression problem using the Design package is presented on the UCLA Stats Site.
Also, I recently looked at a paper by a group at Yahoo working on ordinal classification using Support Vector Machines. I have not attempted to apply their technique.
Have you tried using Weka? It supports binary, numerical, and nominal attributes out of the box, the latter two of which might work well enough for your purposes.
Furthermore, it looks like one of the classifiers that's available is a meta-classifier called OrdinalClassClassifier.java, which is the result of this research:
Eibe Frank and Mark Hall, A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning, 2001, pp. 145-156.
If you don't need a pre-made approach, then these references (in addition to doug's note about the Yahoo SVM paper) might be useful:
W Chu and Z Ghahramani, Gaussian processes for ordinal regression. Journal of Machine Learning Research, 2006.
Wei Chu and S. Sathiya Keerthi, New approaches to support vector ordinal regression. In Proceedings of the 22nd international conference on Machine Learning, 2005, 145-152.
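For intuition, the Frank and Hall approach cited above trains one binary model per threshold ("is the rating greater than k?") and then combines their outputs into class probabilities. The sketch below shows only that combination step; the function names and the example probabilities are assumptions, and the binary models themselves are left out.

```python
def ordinal_class_probabilities(prob_greater_than):
    """prob_greater_than[k] estimates P(rating > k + 1), for k = 0 .. K - 2.

    Returns the K class probabilities P(rating = 1) .. P(rating = K).
    """
    probs = [1.0 - prob_greater_than[0]]        # P(rating = 1)
    for lower, upper in zip(prob_greater_than, prob_greater_than[1:]):
        probs.append(lower - upper)             # P(rating = k), 1 < k < K
    probs.append(prob_greater_than[-1])         # P(rating = K)
    return probs

def predict_rating(prob_greater_than):
    probs = ordinal_class_probabilities(prob_greater_than)
    return 1 + max(range(len(probs)), key=probs.__getitem__)

# Hypothetical outputs of four binary models for one item on a 1-5 scale:
# P(rating > 1), P(rating > 2), P(rating > 3), P(rating > 4).
estimates = [0.95, 0.80, 0.30, 0.05]
class_probs = ordinal_class_probabilities(estimates)
print(class_probs, predict_rating(estimates))
```

Because the binary models are trained independently, their estimates are not guaranteed to decrease with k, so individual differences can come out negative in practice. This sketch does not correct for that.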
The problems that doug has raised are all valid. Let me add another one. You didn't say how you would like to measure the agreement between the classification and the "gold standard". You have to formulate the answer to that question as soon as possible, as this will have a huge impact on your next step. In my experience, the most problematic part of any (ok, not any, most) optimization task is the score function. Try asking yourself whether all errors are equal. Does misclassifying a "3" as a "4" have the same impact as classifying a "4" as a "3"? What about "1" vs. "5"? Can mistakenly missing one case have disastrous consequences (missing an HIV diagnosis, activating pilot ejection in a plane)?
The simplest way to measure the agreement between categorical classifiers is Cohen's Kappa. More complicated methods are described in the following links here, here, here, and here.
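As a concrete reference point, unweighted Cohen's kappa compares the observed agreement with the agreement expected by chance from the two label distributions. A minimal sketch, with invented ratings:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equally long lists of labels."""
    n = len(ratings_a)
    observed = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    # Undefined (division by zero) if both lists contain one single label.
    return (observed - expected) / (1.0 - expected)

# Hypothetical gold-standard ratings and classifier outputs on a 1-5 scale.
gold = [1, 2, 3, 4, 5, 3, 3, 2]
predicted = [1, 2, 3, 4, 4, 3, 2, 2]
kappa = cohens_kappa(gold, predicted)
print(round(kappa, 3))
```

Note that this unweighted form treats a "3 vs. 4" disagreement exactly like a "1 vs. 5" disagreement. If the size of the error matters, a weighted variant of kappa is the usual next step.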
Having said that, sometimes picking a solution that "just works" instead of "the right one" is faster and easier. If I were you, I would pick a machine learning library (R, Weka, I personally love Orange) and see what I get. Only if you don't get reasonably good results with that, look for more complex solutions.
If you are not interested in fancy statistics, a one-hidden-layer back-propagation neural network with 3 or 5 output nodes will probably do the trick if the training data is sufficiently large. Most NN classifiers try to minimize the mean squared error, which is not always desired. Support Vector Machines, mentioned earlier, are a good alternative.
FANN is a good library for back propagation NNs, it also has some tools to assist in training of the network.
There are two packages in R that might help with taming ordinal data:
ordinalForest on CRAN
rpartScore on CRAN
I'm working on an OrdinalClassifier that is based on the sklearn framework (specifically the OVR multiclass classifier) and which works well with sklearn workflow such as pipelines, cross validation, and scoring.
Through testing, I'm finding that it performs very well vs. standard non-ordinal multiclass classification using SVC. It also gives much greater control over optimizing for precision and recall on the positive class (in my testing, I used sklearn's diabetes dataset and transformed the disease progression target (y) into a low, medium, high class label). Testing via cross-validation is on my repo along with attribution. Scoring is based on weighted F1.
https://github.com/leeprevost/OrdinalClassifier

Resources