Using Reinforcement Learning for Classification Problems

Can I use reinforcement learning for classification, such as human activity recognition? If so, how?

There are two types of feedback: evaluative feedback, which is used in reinforcement learning, and instructive feedback, which is used in supervised learning and is the usual choice for classification problems.
When supervised learning is used, the weights of the neural network are adjusted based on the correct labels provided in the training dataset. So, on selecting a wrong class, the loss increases and the weights are adjusted so that this wrong class is less likely to be chosen again for inputs of that kind.
However, in reinforcement learning, the system explores the possible actions (in this case, the class labels for the various inputs) and decides what is right and what is wrong by evaluating the reward. Until it finds the correct class label, it may keep giving a wrong class name because that is the best output it has found so far. So it doesn't make use of the specific knowledge we have about the class labels, which slows convergence significantly compared to supervised learning.
You can use reinforcement learning for classification problems, but it won't give you any added benefit and will instead slow down your convergence.
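To make the difference between the two kinds of feedback concrete, here is a minimal toy sketch. It is not a real RL algorithm: the function names, the lookup-table "model" and the guess-until-rewarded strategy are all assumptions made for illustration. It simply counts how many feedback signals each learner needs before it knows the label of every input.

```python
import random

def supervised_lookups(true_labels):
    """Instructive feedback: every example arrives with its correct label."""
    learned, steps = {}, 0
    for x, y in enumerate(true_labels):
        learned[x] = y              # the label tells us the answer directly
        steps += 1
    return learned, steps

def evaluative_lookups(true_labels, n_classes, rng):
    """Evaluative feedback: we only learn whether a guess earned a reward."""
    learned, steps = {}, 0
    for x, y in enumerate(true_labels):
        candidates = list(range(n_classes))
        rng.shuffle(candidates)     # explore the labels in random order
        for guess in candidates:
            steps += 1
            reward = 1 if guess == y else 0   # y itself is never revealed
            if reward == 1:
                learned[x] = guess
                break
    return learned, steps

rng = random.Random(0)
n_classes = 6
true_labels = [rng.randrange(n_classes) for _ in range(200)]  # made-up data
sup_model, sup_steps = supervised_lookups(true_labels)
rl_model, rl_steps = evaluative_lookups(true_labels, n_classes, rng)
print("supervised feedback signals:", sup_steps)
print("evaluative feedback signals:", rl_steps)
```

Both learners end up with the same table, but the evaluative one needs between one and n_classes tries per input, which is the slowdown described above in its simplest form.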

Short answer: Yes.
Detailed answer: yes, but it's overkill. Reinforcement learning is useful when you don't have a labeled dataset from which to learn the correct policy, so you need to develop the correct strategy based on rewards. It also allows you to backpropagate through non-differentiable blocks (which I suppose is not your case). The biggest drawback of reinforcement learning methods is that they typically take a VERY long time to converge. So, if you have labels, it would be a LOT faster and easier to use regular supervised learning.

You may be able to develop an RL model that chooses which classifier to use. The ground-truth (gt) labels would be used to train the classifiers, and the change in performance of those classifiers would be the reward for the RL model. As others have said, it would probably take a very long time to converge, if it ever does. This idea may also require many tricks and tweaks to make it work. I would recommend searching for research papers on this topic.


In what order should hyperparameters be tuned?

I am using a neural network for a classification problem and I am now at the point of tuning all the hyperparameters.
So far, I have seen many different hyperparameters that I have to tune:
Learning rate
number of iterations (epoch)
For now, my tuning is quite "manual" and I am not sure I am doing everything in a proper way. Is there a particular order in which to tune the parameters? E.g. learning rate first, then batch size, then ... I am not sure that all these parameters are independent. Which ones are clearly independent and which ones are clearly not? Should the dependent ones be tuned together? Is there any paper or article that discusses properly tuning all the parameters in a particular order?
There are even more than that! E.g. the number of layers, the number of neurons per layer, which optimizer to choose, etc.
So the real work in training a neural network is actually finding the best-suited parameters.
I would say there is no clear guideline because training a machine learning algorithm, in general, is always task-specific.
You see, there are many hyperparameters to tune, and you won't have time to try out every combination of them. For many hyperparameters, you will build up some intuition about what a good choice would be, but for now, a great starting point is always to use what others have shown to work. So if you find a paper on the same or a similar task, you could try to use the same or similar parameters too.
Just to share with you some small experiences I've made:
I rarely vary the learning rate. I mostly choose the Adam optimizer and stick with it.
I try to choose the batch size as big as possible without running out of memory.
The number of iterations you could just set to e.g. 1000. You can always look at the current loss and decide for yourself whether to stop, e.g. when the net isn't learning anymore.
Keep in mind these are in no way rules or strict guidelines. They are just some ideas until you've got a better intuition yourself. The more papers you read and the more nets you train, the better you will understand what to choose when.
Hope this serves as a good starting point at least.
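The "set a large number of iterations and stop when the net isn't learning anymore" advice can be automated with a simple patience rule. This is only a sketch under my own assumptions: the function name should_stop, the patience and min_delta parameters, and the loss curve are all made up for illustration and do not come from any particular framework.

```python
def should_stop(losses, patience=3, min_delta=1e-4):
    """Return True when the last `patience` epochs did not improve on the
    best earlier loss by at least `min_delta`."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent > best_before - min_delta

# Hypothetical loss curve standing in for a real training loop.
simulated_losses = [1.0, 0.5, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.1]
max_epochs = 1000            # generous upper bound, as suggested above

history = []
for epoch, loss in enumerate(simulated_losses):
    if epoch >= max_epochs:
        break
    history.append(loss)     # in practice: train one epoch, record the loss
    if should_stop(history):
        break

print("stopped after", len(history), "epochs")   # 7 on this made-up curve
```

Note that a rule like this would have missed the late improvement to 0.1 in the made-up curve, which is why the patience value is itself something you may want to play with.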

Cross-validation "balancing" for regression problems

Classification problems can exhibit a strong label imbalance in the given dataset. This can be overcome by subsampling certain classes or by assigning class weights, which allows the label distribution to be balanced at least during model training. Stratification, on the other hand, keeps a given label distribution the same in every fold.
For a regression problem, this is not defined in standard libraries such as scikit-learn. There are a few approaches that cover stratification, and a well-written theoretical approach to regression subsampling by Scott Lowe here.
I am wondering why label balancing for regression, as opposed to classification, receives so little attention in the machine learning community. Regression problems also exhibit different characteristics that might be easier or harder to acquire in a data collection setting. And is there any framework or paper that addresses this issue further?
The complexity of the problem lies in the continuous nature of regression. When you have classification, it is very natural to split the data into classes because it is basically already split into classes :) Now, if you have regression, the number of possible splits is basically infinite and, most importantly, it is impossible to know what a good split would be. As in the article you sent, you might apply sorted or fractional approaches, but in the end you have no idea to what extent they would be correct. You can also split the target into intervals. This is what the verstack library does. In the documentation, it says: "For continuous target variable verstack uses binning and categoric split based on bins". What they do is first assign the continuous values to bins (classes) and then apply stratification to them.
There are not many studies on this because everything you can come up with is going to be a heuristic. However, there can be exceptions if you can incorporate some domain knowledge. As an example, let's say that you are trying to predict the frequency of some electromagnetic waves from some set of features. In that case, you have prior knowledge of how the wave frequencies are split. So now it is natural to split them into continuous intervals with respect to their wavelengths and do a regression stratification. But otherwise, it is hard to come up with something that would generalize.
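The bin-then-stratify idea described above can be sketched in a few lines of plain Python. This is my own illustration of the heuristic, not the code of any library: the function names, the rank-based (quantile) binning and the round-robin fold assignment are assumptions.

```python
import random

def quantile_bins(y, n_bins):
    """Assign each target value to one of n_bins roughly equal-sized bins by rank."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    bins = [0] * len(y)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(y)
    return bins

def stratified_folds(y, n_folds=5, n_bins=10, seed=0):
    """Shuffle the members of each bin and deal them round-robin into folds,
    so every fold sees a similar spread of target values."""
    rng = random.Random(seed)
    bins = quantile_bins(y, n_bins)
    folds = [[] for _ in range(n_folds)]
    for b in range(n_bins):
        members = [i for i, bin_id in enumerate(bins) if bin_id == b]
        rng.shuffle(members)
        for k, i in enumerate(members):
            folds[k % n_folds].append(i)
    return folds

# Made-up skewed regression target.
data_rng = random.Random(1)
y = [data_rng.expovariate(1.0) for _ in range(100)]
folds = stratified_folds(y)
print([len(f) for f in folds])
```

Each fold holds the row indices of its validation set. The number of bins is exactly the kind of heuristic choice the answer warns about: nothing tells you that 10 is right.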
I personally never encountered a study on this.

Research papers with definitions of Supervised and Unsupervised Learning

I am looking for research papers or books that have a good, basic definition of what supervised and unsupervised learning are, so that I am able to quote these definitions in my project.
Thank you so much.
I would refer to the following book: Artificial Intelligence: A Modern Approach (3rd Edition) by Stuart Russell and Peter Norvig. More specifically, Chapter 18 (page 693 onward) analyzes supervised and unsupervised learning. About unsupervised learning:
In unsupervised learning, the agent learns patterns in the input
even though no explicit feedback is supplied.
The most common unsupervised learning task is clustering:
detecting potentially useful clusters of input examples.
For example, a taxi agent might gradually develop a concept
of “good traffic days” and “bad traffic days” without ever being
given labeled examples of each by a teacher
While for supervised:
In supervised learning, the agent observes some example input–output pairs
and learns a function that maps from input to output. In component 1 above,
the inputs are percepts and the output are provided by a teacher
who says “Brake!” or “Turn left.” In component 2, the inputs are camera
images and the outputs again come from a teacher who says “that’s a bus.”
In 3, the theory of braking is a function from states and braking actions
to stopping distance in feet. In this case the output value is available
directly from the agent’s percepts (after the fact); the environment
is the teacher.
The examples are mentioned in the text above.
Christopher M. Bishop, "Pattern Recognition and Machine Learning", p.3 (emphasis mine)
Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems...
In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data,
where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
Which is as good as you can get. Basically, the most noticeable difference is whether we have labels with respect to which we want the learning model to optimize. If we are missing some of the labels, it can still be described as weakly-supervised learning. If no labels are available, the only thing left is to find some structure in the data.
Thanks @Pavel Tyshevskyi for the answer. Your answer is perfect, but it seems a little bit hard to understand for beginners like me.
And after an hour of searching, I found my own version of the answer in the book "Machine Learning For Dummies, IBM Limited Edition", in the part "Approaches to Machine Learning" of chapter 1, "Understanding Machine Learning". It has a simpler definition and an example that helps me understand a bit better. Link to the book: Machine Learning For Dummies, IBM Limited Edition
Supervised learning
Supervised learning typically begins with an established set of data and a certain understanding of how that data is classified. Supervised learning is intended to find patterns in data that can be applied to an analytics process. This data has labeled features that define the meaning of data. For example, there could be millions of images of animals and include an explanation of what each animal is and then you can create a machine learning application that distinguishes one animal from another. By labeling this data about types of animals, you may have hundreds of categories of different species. Because the attributes and the meaning of the data have been identified, it is well understood by the users that are training the modeled data so that it fits the details of the labels. When the label is continuous, it is a regression; when the data comes from a finite set of values, it known as classification. In essence, regression used for supervised learning helps you understand the correlation between variables. An example of supervised learning is weather forecasting. By using regression analysis, weather forecasting takes into account known historical weather patterns and the current conditions to provide a prediction on the weather.
The algorithms are trained using preprocessed examples, and at this point, the performance of the algorithms is evaluated with test data. Occasionally, patterns that are identified in a subset of the data can’t be detected in the larger population of data. If the model is fit to only represent the patterns that exist in the training subset, you create a problem called overfitting. Overfitting means that your model is precisely tuned for your training data but may not be applicable for large sets of unknown data. To protect against overfitting, testing needs to be done against unforeseen or unknown labeled data. Using unforeseen data for the test set can help you evaluate the accuracy of the model in predicting outcomes and results. Supervised training models have broad applicability to a variety of business problems, including fraud detection, recommendation solutions, speech recognition, or risk analysis.
Unsupervised learning
Unsupervised learning is best suited when the problem requires a massive amount of data that is unlabeled. For example, social media applications, such as Twitter, Instagram, Snapchat, and.....

Clustering vs. supervised classification in the case of a very small dataset

I'm trying to classify/cluster subjects according to 4 features into two classes: healthy and sick.
Two things to know: I know the labels/classes of each subject + I only have 40 subjects (in total: training + testing set!)
What should I choose in this case, clustering or classification?
Clustering vs. classification is not a choice of method but a choice of problem. What is the problem at hand? You have labeled data and want to get a model that can label more: this is by definition classification. Which specific classification method to use is a whole new, research-driven question rather than a simple programming issue. In particular, many classifiers will try to fit some sort of generative model to the data (and thus learn about the structure even without labels), but in the end the labels are there and should be used.
Clustering is based on unsupervised learning and classification is based on supervised learning. Unsupervised learning is used when you don't have the target labels, it is used to cluster the data into groups. Whereas supervised learning is used when you have labeled data.
Since you mention that you have labels, go for classification algorithms such as logistic regression, SVM, etc. Also, with a small dataset you should guard against overfitting; to do so, prefer simple algorithms.
Classification is a type of supervised learning. In classification, the algorithm needs to predict one of a finite set of outputs. For example, suppose the input data has information about people and whether they take a credit card. The algorithm learns the pattern from the input data and the output column (takes a credit card or not). Once the algorithm has learned, it predicts for unseen data whether someone takes a credit card or not. In this example there is only a finite number of outputs (2 in this case: takes a credit card or not). This problem can be solved using classification.
Clustering belongs to unsupervised learning. It mainly deals with data that is not labelled. A clustering algorithm separates the data into groups based on similar characteristics.
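For a dataset as small as the one in the question, a simple labeled-data baseline plus an honest evaluation is a reasonable first step. The sketch below is only an illustration under my own assumptions: the nearest-centroid classifier, the leave-one-out evaluation and the toy 4-feature data are my choices and are not named anywhere in the thread.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose mean feature vector is closest to x."""
    sums, counts = {}, {}
    for features, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=squared_distance)

def leave_one_out_accuracy(X, y):
    """Hold out each subject once, train on the rest, and count the hits."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)

# Made-up subjects with 4 features each.
X = [[0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
     [5, 5, 5, 5], [6, 5, 6, 5], [5, 6, 5, 6]]
y = ["healthy"] * 3 + ["sick"] * 3
accuracy = leave_one_out_accuracy(X, y)
print(accuracy)
```

With only 40 subjects, holding out one subject at a time uses almost all of the data for training in every round, which is why I picked it here over a single train/test split.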

Best approach to what I think is a machine learning problem

I would like some expert guidance on the best approach to solving a problem. I have looked into machine learning, neural networks, and things like that. I've investigated Weka, some sort of Bayesian solution, R... several different things. I'm not sure how to really proceed, though. Here's my problem.
I have, or will have, a large collection of events.. eventually around 100,000 or so. Each event consists of several (30-50) independent variables, and 1 dependent variable that I care about. Some independent variables are more important than others in determining the dependent variable's value. And, these events are time relevant. Things that occur today are more important than events that occurred 10 years ago.
I'd like to be able to feed some sort of learning engine an event, and have it predict the dependent variable. Then, knowing the real answer for the dependent variable for this event (and all the events that have come along before), I'd like for that to train subsequent guesses.
Once I have an idea of what programming direction to go, I can do the research and figure out how to turn my idea into code. But my background is in parallel programming and not stuff like this, so I'd love to have some suggestions and guidance on this.
Edit: Here's a bit more detail about the problem that I'm trying to solve: It's a pricing problem. Let's say that I want to predict prices for a random comic book. Price is the only thing I care about. But there are lots of independent variables one could come up with. Is it a Superman comic or a Hello Kitty comic? How old is it? What's the condition? Etc. After training for a while, I want to be able to give it information about a comic book I might be considering, and have it give me a reasonable expected value for the comic book. OK. So comic books might be a bogus example. But you get the general idea. So far, from the answers, I'm doing some research on support vector machines and Naive Bayes. Thanks for all of your help so far.
Sounds like you're a candidate for Support Vector Machines.
Go get libsvm. Read "A practical guide to SVM classification", which they distribute, and is short.
Basically, you're going to take your events, and format them like:
dv1 1:iv1_1 2:iv1_2 3:iv1_3 4:iv1_4 ...
dv2 1:iv2_1 2:iv2_2 3:iv2_3 4:iv2_4 ...
run it through their svm-scale utility, and then use their script to search for appropriate kernel parameters. The learning algorithm should be able to figure out differing importance of variables, though you might be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
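A small helper can produce lines in the layout shown above. This is just a sketch: the function name to_libsvm_line and the example events are made up, and it writes every feature explicitly (as in the lines above) rather than skipping zeros.

```python
def to_libsvm_line(dependent, independents):
    """Format one event as '<dv> 1:<iv1> 2:<iv2> ...' (feature indices start at 1)."""
    features = " ".join(f"{i}:{v}" for i, v in enumerate(independents, start=1))
    return f"{dependent} {features}"

# Hypothetical events: (dependent variable, list of independent variables).
events = [
    (12.5, [1, 0, 34, 0.8]),
    (3.0, [0, 1, 5, 0.4]),
]
lines = [to_libsvm_line(dv, ivs) for dv, ivs in events]
print("\n".join(lines))
# 12.5 1:1 2:0 3:34 4:0.8
# 3.0 1:0 2:1 3:5 4:0.4
```

Categorical variables (Superman vs. Hello Kitty, say) would first have to be turned into numbers, for example one 0/1 feature per category as in the first two columns here.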
If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. Only ever so slightly harder to deal with, and a lot more options.
Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.
If you have some classified data (a bunch of sample problems paired with their correct answers), start by training some simple algorithms like K-Nearest-Neighbor and Perceptron and seeing if anything meaningful comes out of it. Don't bother trying to solve it optimally until you know if you can solve it simply or at all.
If you don't have any classified data, or not very much of it, start researching unsupervised learning algorithms.
It sounds like any kind of classifier should work for this problem: find the best class (your dependent variable) for an instance (your events). A simple starting point might be Naive Bayes classification.
This is definitely a machine learning problem. Weka is an excellent choice if you know Java and want a nice GPL lib where all you have to do is select the classifier and write some glue. R is probably not going to cut it for that many instances (events, as you termed it) because it's pretty slow. Furthermore, in R you still need to find or write machine learning libs, though this should be easy given that it's a statistical language.
If you believe that your features (independent variables) are conditionally independent (meaning, independent given the dependent variable), naive Bayes is the perfect classifier, as it is fast, interpretable, accurate and easy to implement. However, with 100,000 instances and only 30-50 features you can likely implement a fairly complex classification scheme that captures a lot of the dependency structure in your data. Your best bet would probably be a support vector machine (SMO in Weka) or a random forest (Yes, it's a silly name, but it helped random forest catch on.) If you want the advantage of easy interpretability of your classifier even at the expense of some accuracy, maybe a straight up J48 decision tree would work. I'd recommend against neural nets, as they're really slow and don't usually work any better in practice than SVMs and random forest.
The book Programming Collective Intelligence has a worked example with source code of a price predictor for laptops which would probably be a good starting point for you.
SVMs are often the best classifier available. It all depends on your problem and your data. For some problems other machine learning algorithms might be better. I have seen problems that neural networks (specifically recurrent neural networks) were better at solving. There is no right answer to this question since it is highly situation-dependent, but I agree with dsimcha and Jay that SVMs are the right place to start.
I believe your problem is a regression problem, not a classification problem. The main difference: In classification we are trying to learn the value of a discrete variable, while in regression we are trying to learn the value of a continuous one. The techniques involved may be similar, but the details are different. Linear Regression is what most people try first. There are lots of other regression techniques, if linear regression doesn't do the trick.
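As a minimal illustration of "try linear regression first", here is ordinary least squares for a single independent variable in plain Python. The function name and the toy age/price numbers are made up; with 30-50 variables you would use a library's multivariate version rather than this one-variable sketch.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: age of a comic in decades vs. price.
ages = [1, 2, 3, 4]
prices = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(ages, prices)
predicted = slope * 5 + intercept   # continuous prediction for an unseen age
print(slope, intercept, predicted)
```

The output is a number on a continuous scale rather than a class, which is the distinction this answer is drawing.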
You mentioned that you have 30-50 independent variables, and some are more important than the rest. So, assuming that you have historical data (what we call a training set), you can use PCA (Principal Component Analysis) or other dimensionality reduction methods to reduce the number of independent variables. This step is of course optional. Depending on the situation, you may get better results by keeping every variable but adding a weight to each one based on how relevant it is. Here, PCA can help you compute how "relevant" each variable is.
You also mentioned that events that occurred more recently should be more important. If that's the case, you can weight recent events higher and older events lower. Note that the importance of an event doesn't have to grow linearly with time. It may make more sense if it grows exponentially, so you can play with the numbers here. Or, if you are not short of training data, perhaps you can consider dropping data that is too old.
Like Yuval F said, this does look more like a regression problem than a classification problem. Therefore, you can try SVR (Support Vector Regression), which is the regression version of SVM (Support Vector Machine).
Some other things you can try:
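One way to express exponentially changing importance is a half-life. The following is a sketch under my own assumptions: the function names and the one-year half-life are arbitrary choices to play with, not something from the question.

```python
def recency_weights(ages_in_days, half_life_days=365.0):
    """Weight 1.0 for an event from today, halving every `half_life_days`."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]

def weighted_mean(values, weights):
    """Average in which recent events count more than old ones."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

ages = [0, 365, 730]                 # today, one year ago, two years ago
weights = recency_weights(ages)      # [1.0, 0.5, 0.25]
estimate = weighted_mean([10.0, 20.0], recency_weights([0, 365]))
print(weights, estimate)
```

Many learners accept per-sample weights like these during training; dropping very old data corresponds to setting its weight to zero.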
Play around with how you scale the value range of your independent variables. Say, usually [-1...1] or [0...1]. But you can try other ranges to see if they help. Sometimes they do. Most of the time they don't.
If you suspect that there is a "hidden" feature vector with a lower dimension, say N << 30, and that it's non-linear in nature, you will need non-linear dimensionality reduction. You can read up on kernel PCA or, more recently, manifold sculpting.
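For the scaling suggestion, min-max scaling to a chosen range looks like this. A minimal sketch; the function name and the handling of constant columns are my own choices.

```python
def min_max_scale(column, lo=0.0, hi=1.0):
    """Linearly map the values of one independent variable onto [lo, hi]."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:
        return [lo for _ in column]   # constant column: nothing to spread out
    return [lo + (hi - lo) * (v - cmin) / (cmax - cmin) for v in column]

ages = [10, 20, 30]
print(min_max_scale(ages))              # [0.0, 0.5, 1.0]
print(min_max_scale(ages, -1.0, 1.0))   # [-1.0, 0.0, 1.0]
```

Keep the minimum and maximum computed on the training data and reuse them for new events, so that training and prediction see the same scale.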
What you described is a classic classification problem. And in my opinion, why code fresh algorithms at all when you have a tool like Weka around? If I were you, I would run through a list of supervised learning algorithms (I don't completely understand why people are suggesting unsupervised learning first when this is so clearly a classification problem) using 10-fold (or k-fold) cross-validation, which is the default in Weka if I remember correctly, and see what results you get! I would try:
-Neural Nets
-Decision Trees (this one worked really well for me when I was doing a similar problem)
-Boosting with Decision trees/stumps
-Anything else!
Weka makes things so easy and you really can get some useful information. I just took a machine learning class and I did exactly what you're trying to do with the algorithms above, so I know where you're at. For me the boosting with decision stumps worked amazingly well. (BTW, boosting is actually a meta-algorithm and can be applied to most supervised learning algs to usually enhance their results.)
A nice thing about using Decision Trees (if you use ID3 or a similar variety) is that it chooses the attributes to split on in order of how well they differentiate the data, in other words, which attributes determine the classification the quickest. So you can check out the tree after running the algorithm and see which attribute of a comic book most strongly determines the price: it should be the root of the tree.
Edit: I think Yuval is right, I wasn't paying attention to the problem of discretizing your price value for the classification. However, I don't know if regression is available in Weka, and you can still pretty easily apply classification techniques to this problem. You need to make classes of price values, as in a number of price ranges for the comics, so that you have a discrete number (like 1 through 10) that represents the price of the comic. Then you can easily run classification on it.
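The criterion ID3 uses to rank attributes is information gain, and it is short enough to compute by hand. The sketch below only picks the root attribute and does not build a whole tree; the function names and the four toy comics are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """How much splitting on `attribute` reduces the entropy of `target`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical comics with a discretized price.
comics = [
    {"hero": "Superman", "condition": "mint", "price_band": "high"},
    {"hero": "Superman", "condition": "worn", "price_band": "high"},
    {"hero": "Hello Kitty", "condition": "mint", "price_band": "low"},
    {"hero": "Hello Kitty", "condition": "worn", "price_band": "low"},
]
gains = {a: information_gain(comics, a, "price_band") for a in ("hero", "condition")}
root = max(gains, key=gains.get)
print(gains, root)
```

In this toy data the hero alone determines the price band, so it gets the full 1 bit of gain and would sit at the root, while the condition gains nothing.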
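Turning a price into one of a fixed number of classes can be as simple as equal-width ranges. A sketch only: the function name, the equal-width choice and the 0-100 price range are assumptions, and other ways of cutting the ranges may suit your data better.

```python
def price_to_class(price, min_price, max_price, n_classes=10):
    """Map a price to a class 1..n_classes using equal-width price ranges."""
    if max_price <= min_price:
        raise ValueError("max_price must be greater than min_price")
    width = (max_price - min_price) / n_classes
    index = int((price - min_price) / width)
    # Clamp so that max_price (and anything outside the range) gets a valid class.
    return min(max(index, 0), n_classes - 1) + 1

print(price_to_class(0, 0, 100))      # 1
print(price_to_class(55, 0, 100))     # 6
print(price_to_class(100, 0, 100))    # 10
```

The classifier then predicts a class, and you can report the matching price range (or its midpoint) as the estimate.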
