Feature combination/joint features in supervised learning

Feature combination/joint features in supervised learning - machine-learning

While trying to come up with appropriate features for a supervised learning problem I had the following idea and wondered if it makes sense and if so, how to algorithmically formulate it.
In an image I want to classify two regions, i.e. two "types" of pixels. Say I have some bounded structure, let's take a circle, and I know I can limit my search space to this circle. Within that circle I want to find a segmenting contour, i.e. a contour that separates my pixels into an inner class A and an outer class B.
I want to implement the following model:
I know that pixels close to the bounding circle are more likely to be in the outer class B.
Of course, I can use the distance from the bounding circle as a feature, then the algorithm would learn the average distance of the inner contour from the bounding circle.
But: I wonder if I can exploit my model assumption in a smarter way. One heuristic idea would be to weigh other features by this distance, so to say, if a pixel further away from the bounding circle wants to belong to the outer class B, it has to have strongly convincing other features.
This leads to a general question:
How can one exploit joint information of features, that were prior individually learned by the algorithm?
And to a specific question:
In my outlined setup, does my heuristic idea make sense? At what point of the algorithm should this information be used? What would be recommended literature or what would be buzzwords if I wanted to search for similar ideas in the literature?

This leads to a general question:
How can one exploit joint information of features, that were prior individually learned by the algorithm?
It is not really clear what you are really asking here. What do you mean by "individually learned by the algorithm" and what would be "joiint information"? First of all, problem is too broad, there is no such tring as "generic supervised learning model", each of them works in at least slightly different way, most falling into three classes:
Building a regression model of some kind, to map input data to the output and then agregate results for classification (linear regression, artificial neural networks)
Building geometrical separation of data (like support vector machines, classification-soms' etc.)
Directly (more or less) estimating probability of given classes (like Naive Bayes, classification restricted boltzmann machines etc.)
in each of them, there is somehow encoded "joint information" regarding features - the classification function is their joint information. In some cases it is easy do interpret (linear regression) and in some it is almost impossible (deep boltzmann machines, generally all deep architectures).
And to a specific question:
In my outlined setup, does my heuristic idea make sense? At what point of the algorithm should this information be used? What would be recommended literature or what would be buzzwords if I wanted to search for similar ideas in the literature?
To my best knowledge this concept is quite doubtfull. Many models tends to learn and work better, if your data is uncorrelated, while you are trying to do the opposite - correlate everything with some particular feature. This leads to one main concern - why are you doing this? To force model to use mainly this feature?
If it is so important - maybe a supervised learning is not the good idea, maybe you can directly model your problem by appling set of simple rules based on this particular feature?
If you know the feature is important, but you are aware that in some cases other things matter, and you cannot model them, then your problem will be how much to weight your feature. Should it be just distance*other_feature? Why not sqrt(distance)*feature? What about log(distance)*feature? There are countless possibilities, and seek for the best weighting scheme may be much more costfull, then finding a better machine learning model, which can learn your data from its raw features.
If you only suspect the importance of the feature, the best possible option would be to... do not trust this belief. Numerous studies have shown, that machine learning models are better in selecting features then humans. In fact, this is the whole point of non-linear models.
In literature, problem they you are trying to solve is generally refered as incorporating expert knowledge into the learning process. There are thousands of examples, where there is some kind of knowledge that cannot be directly encoded in data representation, yet too valuable to omit it. You should research terms like "machine learning expert knowledge", and its possible synomyms.

There's a fair amount of work treating the kind of problem you're looking at (which is called segmentation) as an optimisation to be performed on a Markov Random Field, which can be solved by graph theoretic methods like GraphCut. Some examples are the work of Pushmeet Kohli at Microsoft Research (try this paper).
What you describe is, in that framework, a prior on node membership, where p(B) is inversely proportional to the distance from the edge (in addition to any other connectivity constraints you want to impose, there's normally a connectedness one, and there will certainly be a likelihood term for the pixel's intensity). The advantage of doing this is that if you can express everything as a probability model, you don't need to rely on heuristics and you can use standard mechanisms for performing inference.
The downside is you need a fairly strong mathematical background to attempt this; I don't know what the scale of the project you're proposing is, but if you want results quickly or you're lacking the necessary background this is going to be pretty daunting.

Related

How to understand Markov Localisation Algorithm?

In my thesis project, I need to implement Monte Carlo Localisation algorithm (it's based on Markov Localisation). I have exactly one month of time to understand and implement the algorithm. I understand basics of probability and Bayes theorem. Now which topics I should get familiar with to understand Markov Algorithm? I have read couple of research papers 3-4 times, still I failed to understand everything.
I tried to do Google whichever terms I didn't understand but I couldn't get the essence of the algorithm. I want to understand systematically. I know what it does but I didn't fully understand how it does or why it does.
for e.g. in one of the research paper it was written that Markov algorithm can be used in global indoor positioning system or when you have multi-modal gaussian distribution. whereas Kalman filter can not be used for the same reasons. Now, I completely didn't understand.
second example, Markov Algorithm assume map is static and consider Markov assumption where measurements are independent and doesn't depend on previous measurements. but when environment is dynamic (objects are moving) , Markov assumption is not valid and we need to modify Markov algorithm to incorporate dynamic environment. Now, I don't understand why?
It would be great if someone point me out which topics should I learn to understand the algorithm. please keep in mind that I have only one month.

Particle Filter is what you are looking for to localize a robot.
To implement particle filter, you need an understanding of basic probability(mostly Bayes theorem), Gaussian distributions in 2D.
slides, video
Watch these course videos, which are really good.
for e.g. in one of the research paper it was written that Markov algorithm can be used in global indoor positioning system or when you have multi-modal gaussian distribution. whereas Kalman filter can not be used for the same reasons. Now, I completely didn't understand.
Kalman filter or Extend Kalman filter is used for unimodal distribution and also the initial estimation must be good enough to track.
Particle filter is multi modal, doesn't need an initial guess, but need more particles (or samples) to converge to a better estimate.
second example, Markov Algorithm assume map is static and consider Markov assumption where measurements are independent and doesn't depend on previous measurements. but when environment is dynamic (objects are moving) , Markov assumption is not valid and we need to modify Markov algorithm to incorporate dynamic environment. Now, I don't understand why?
If the objects are humans, it is not difficult to localize (unless the robot is completely covered by humans and robot is not able to see any part of the environment)even in a dynamic environment. A simple modification will be to consider laser rays which are in conformation with the map. Below paper explains this.
check this paper Markov Localization for Mobile Robots
in Dynami Environments

How to calculate distance when we have sparse dataset in K nearest neighbour

I am implementing K nearest neighbour algorithm for a very sparse data. I want to calculate the distance between a test instance and each sample in the training set, but I am confused.
Because most of the features in training samples don't exist in test instance or vice versa (missing features).
How can I compute the distance in this situation?

To make sure I'm understanding the problem correctly: each sample forms a very sparsely filled vector. The missing data is different between samples, so it's hard to use any Euclidean or other distance metric to gauge similarity of samples.
If that is the scenario, I have seen this problem show up before in machine learning - in the Netflix prize contest, but not specifically applied to KNN. The scenario there was quite similar: each user profile had ratings for some movies, but almost no user had seen all 17,000 movies. The average user profile was quite sparse.
Different folks had different ways of solving the problem, but the way I remember was that they plugged in dummy values for the missing values, usually the mean of the particular value across all samples with data. Then they used Euclidean distance, etc. as normal. You can probably still find discussions surrounding this missing value problem on that forums. This was a particularly common problem for those trying to implement singular value decomposition, which became quite popular and so was discussed quite a bit if I remember right.
You may wish to start here:
http://www.netflixprize.com//community/viewtopic.php?id=1283
You're going to have to dig for a bit. Simon Funk had a little different approach to this, but it was more specific to SVDs. You can find it here: http://www.netflixprize.com//community/viewtopic.php?id=1283
He calls them blank spaces if you want to skip to the relevant sections.
Good luck!

If you work in very high dimension space. It is better to do space reduction using SVD, LDA, pLSV or similar on all available data and then train algorithm on trained data transformed that way. Some of those algorithms are scalable therefor you can find implementation in Mahout project. Especially I prefer using more general features then such transformations, because it is easier debug and feature selection. For such purpose combine some features, use stemmers, think more general.

machine learning - svm feature fusion techique

for my final thesis i am trying to build up an 3d face recognition system by combining color and depth information. the first step i did, is to realign the data-head to an given model-head using the iterative closest point algorithm. for the detection step i was thinking about using the libsvm. but i dont understand how to combine the depth and the color information to one feature vector? they are dependent information (each point consist of color (RGB), depth information and also scan quality).. what do you suggest to do? something like weighting?
edit:
last night i read an article about SURF/SIFT features i would like to use them! could it work? the concept would be the following: extracting this features out of the color image and the depth image (range image), using each feature as a single feature vector for the svm?

Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (One-against-all or One-against-many).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).

It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.

What you arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example you can concatenate and do PCA (or some non linear dimensionality reduction method). But it is kind of hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for data in vision that may have thousands of features.

As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, obviously you would have to normalize all the attributes to have the same scale (say transform all features to be in the range [-1,1] or [0,1])

What's the best approach to recognize patterns in data, and what's the best way to learn more on the topic?

A developer I am working with is developing a program that analyzes images of pavement to find cracks in the pavement. For every crack his program finds, it produces an entry in a file that tells me which pixels make up that particular crack. There are two problems with his software though:
1) It produces several false positives
2) If he finds a crack, he only finds small sections of it and denotes those sections as being separate cracks.
My job is to write software that will read this data, analyze it, and tell the difference between false-positives and actual cracks. I also need to determine how to group together all the small sections of a crack as one.
I have tried various ways of filtering the data to eliminate false-positives, and have been using neural networks to a limited degree of success to group cracks together. I understand there will be error, but as of now, there is just too much error. Does anyone have any insight for a non-AI expert as to the best way to accomplish my task or learn more about it? What kinds of books should I read, or what kind of classes should I take?
EDIT My question is more about how to notice patterns in my coworker's data and identify those patterns as actual cracks. It's the higher-level logic that I'm concerned with, not so much the low-level logic.
EDIT In all actuality, it would take AT LEAST 20 sample images to give an accurate representation of the data I'm working with. It varies a lot. But I do have a sample here, here, and here. These images have already been processed by my coworker's process. The red, blue, and green data is what I have to classify (red stands for dark crack, blue stands for light crack, and green stands for a wide/sealed crack).

In addition to the useful comments about image processing, it also sounds like you're dealing with a clustering problem.
Clustering algorithms come from the machine learning literature, specifically unsupervised learning. As the name implies, the basic idea is to try to identify natural clusters of data points within some large set of data.
For example, the picture below shows how a clustering algorithm might group a bunch of points into 7 clusters (indicated by circles and color):
(source: natekohl.net)
In your case, a clustering algorithm would attempt to repeatedly merge small cracks to form larger cracks, until some stopping criteria is met. The end result would be a smaller set of joined cracks. Of course, cracks are a little different than two-dimensional points -- part of the trick in getting a clustering algorithm to work here will be defining a useful distance metric between two cracks.
Popular clustering algorithms include k-means clustering (demo) and hierarchical clustering. That second link also has a nice step-by-step explanation of how k-means works.
EDIT: This paper by some engineers at Phillips looks relevant to what you're trying to do:
Chenn-Jung Huang, Chua-Chin Wang, Chi-Feng Wu, "Image Processing Techniques for Wafer Defect Cluster Identification," IEEE Design and Test of Computers, vol. 19, no. 2, pp. 44-48, March/April, 2002.
They're doing a visual inspection for defects on silicon wafers, and use a median filter to remove noise before using a nearest-neighbor clustering algorithm to detect the defects.
Here are some related papers/books that they cite that might be useful:
M. Taubenlatt and J. Batchelder, “Patterned Wafer Inspection Using Spatial Filtering for Cluster Environment,” Applied Optics, vol. 31, no. 17, June 1992, pp. 3354-3362.
F.-L. Chen and S.-F. Liu, “A Neural-Network Approach to Recognize Defect Spatial Pattern in Semiconductor Fabrication.” IEEE Trans. Semiconductor Manufacturing, vol. 13, no. 3, Aug. 2000, pp. 366-373.
G. Earl, R. Johnsonbaugh, and S. Jost, Pattern Recognition and Image Analysis, Prentice Hall, Upper Saddle River, N.J., 1996.

Your problem falls in the very broad field of image classification. These types of problems can be notoriously difficult, and at the end of the day, solving them is an art. You must exploit every piece of knowledge you have about the problem domain to make it tractable.
One fundamental issue is normalization. You want to have similarly classified objects to be as similar as possible in their data representation. For example, if you have an image of the cracks, do all images have the same orientation? If not, then rotating the image may help in your classification. Similarly, scaling and translation (refer to this)
You also want to remove as much irrelevant data as possible from your training sets. Rather than directly working on the image, perhaps you could use edge extraction (for example Canny edge detection). This will remove all the 'noise' from the image, leaving only the edges. The exercise is then reduced to identifying which edges are the cracks and which are the natural pavement.
If you want to fast track to a solution then I suggest you first try the your luck with a Convolutional Neural Net, which can perform pretty good image classification with a minimum of preprocessing and noramlization. Its pretty well known in handwriting recognition, and might be just right for what you're doing.

I'm a bit confused by the way you've chosen to break down the problem. If your coworker isn't identifying complete cracks, and that's the spec, then that makes it your problem. But if you manage to stitch all the cracks together, and avoid his false positives, then haven't you just done his job?
That aside, I think this is an edge detection problem rather than a classification problem. If the edge detector is good, then your issues go away.
If you are still set on classification, then you are going to need a training set with known answers, since you need a way to quantify what differentiates a false positive from a real crack. However I still think it is unlikely that your classifier will be able to connect the cracks, since these are specific to each individual paving slab.

I have to agree with ire_and_curses, once you dive into the realm of edge detection to patch your co-developers crack detection, and remove his false positives, it seems as if you would be doing his job. If you can patch what his software did not detect, and remove his false positives around what he has given you. It seems like you would be able to do this for the full image.
If the spec is for him to detect the cracks, and you classify them, then it's his job to do the edge detection and remove false positives. And your job to take what he has given you and classify what type of crack it is. If you have to do edge detection to do that, then it sounds like you are not far from putting your co-developer out of work.

There are some very good answers here. But if you are unable to solve the problem, you may consider Mechanical Turk. In some cases it can be very cost-effective for stubborn problems. I know people who use it for all kinds of things like this (verification that a human can do easily but proves hard to code).
https://www.mturk.com/mturk/welcome

I am no expert by any means, but try looking at Haar Cascades. You may also wish to experiment with the OpenCV toolkit. These two things together do face detection and other object-detection tasks.
You may have to do "training" to develop a Haar Cascade for cracks in pavement.

What’s the best approach to recognize patterns in data, and what’s the best way to learn more on the topic?
The best approach is to study pattern recognition and machine learning. I would start with Duda's Pattern Classification and use Bishop's Pattern Recognition and Machine Learning as reference. It would take a good while for the material to sink in, but getting basic sense of pattern recognition and major approaches of classification problem should give you the direction. I can sit here and make some assumptions about your data, but honestly you probably have the best idea about the data set since you've been dealing with it more than anyone. Some of the useful technique for instance could be support vector machine and boosting.
Edit: An interesting application of boosting is real-time face detection. See Viola/Jones's Rapid Object Detection using a Boosted Cascade of Simple
Features (pdf). Also, looking at the sample images, I'd say you should try improving the edge detection a bit. Maybe smoothing the image with Gaussian and running more aggressive edge detection can increase detection of smaller cracks.

I suggest you pick up any image processing textbook and read on the subject.
Particularly, you might be interested in Morphological Operations like Dilation and Erosion‎, which complements the job of an edge detector. Plenty of materials on the net...

This is an image processing problem. There are lots of books written on the subject, and much of the material in these books will go beyond a line-detection problem like this. Here is the outline of one technique that would work for the problem.
When you find a crack, you find some pixels that make up the crack. Edge detection filters or other edge detection methods can be used for this.
Start with one (any) pixel in a crack, then "follow" it to make a multipoint line out of the crack -- save the points that make up the line. You can remove some intermediate points if they lie close to a straight line. Do this with all the crack pixels. If you have a star-shaped crack, don't worry about it. Just follow the pixels in one (or two) directions to make up a line, then remove these pixels from the set of crack pixels. The other legs of the star will recognized as separate lines (for now).
You might perform some thinning on the crack pixels before step 1. In other words, check the neighbors of the pixels, and if there are too many then ignore that pixel. (This is a simplification -- you can find several algorithms for this.) Another preprocessing step might be to remove all the lines that are too thin or two faint. This might help with the false positives.
Now you have a lot of short, multipoint lines. For the endpoints of each line, find the nearest line. If the lines are within a tolerance, then "connect" the lines -- link them or add them to the same structure or array. This way, you can connect the close cracks, which would likely be the same crack in the concrete.
It seems like no matter the algorithm, some parameter adjustment will be necessary for good performance. Write it so it's easy to make minor changes in things like intensity thresholds, minimum and maximum thickness, etc.
Depending on the usage environment, you might want to allow user judgement do determine the questionable cases, and/or allow a user to review the all the cracks and click to combine, split or remove detected cracks.

You got some very good answer, esp. #Nate's, and all the links and books suggested are worthwhile. However, I'm surprised nobody suggested the one book that would have been my top pick -- O'Reilly's Programming Collective Intelligence. The title may not seem germane to your question, but, believe me, the contents are: one of the most practical, programmer-oriented coverage of data mining and "machine learning" I've ever seen. Give it a spin!-)

It sounds a little like a problem there is in Rock Mechanics, where there are joints in a rock mass and these joints have to be grouped into 'sets' by orientation, length and other properties. In this instance one method that works well is clustering, although classical K-means does seem to have a few problems which I have addressed in the past using a genetic algorithm to run the interative solution.
In this instance I suspect it might not work quite the same way. In this case I suspect that you need to create your groups to start with i.e. longitudinal, transverse etc. and define exactly what the behviour of each group is i.e. can a single longitudinal crack branch part way along it's length, and if it does what does that do to it's classification.
Once you have that then for each crack, I would generate a random crack or pattern of cracks based on the classification you have created. You can then use something like a least squares approach to see how closely the crack you are checking fits against the random crack / cracks you have generated. You can repeat this analysis many times in the manner of a Monte-Carlo analysis to identify which of the randomly generated crack / cracks best fits the one you are checking.
To then deal with the false positives you will need to create a pattern for each of the different types of false positives i.e. the edge of a kerb is a straight line. You will then be able to run the analysis picking out which is the most likely group for each crack you analyse.
Finally, you will need to 'tweak' the definition of different crack types to try and get a better result. I guess this could either use an automated approach or a manual approach depending on how you define your different crack types.
One other modification that sometimes helps when I'm doing problems like this is to have a random group. By tweaking the sensitivity of a random group i.e. how more or less likely a crack is to be included in the random group, you can sometimes adjust the sensitivty of the model to complex patterns that don't really fit anywhere.
Good luck, looks to me like you have a real challenge.

You should read about data mining, specially pattern mining.
Data mining is the process of extracting patterns from data. As more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.
A good book on the subject is Data Mining: Practical Machine Learning Tools and Techniques
(source: waikato.ac.nz) ](http://www.amazon.com/Data-Mining-Ian-H-Witten/dp/3446215336 "ISBN 0-12-088407-0")
Basically what you have to do is apply statistical tools and methodologies to your datasets. The most used comparison methodologies are Student's t-test and the Chi squared test, to see if two unrelated variables are related with some confidence.

Best approach to what I think is a machine learning problem [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I am wanting some expert guidance here on what the best approach is for me to solve a problem. I have investigated some machine learning, neural networks, and stuff like that. I've investigated weka, some sort of baesian solution.. R.. several different things. I'm not sure how to really proceed, though. Here's my problem.
I have, or will have, a large collection of events.. eventually around 100,000 or so. Each event consists of several (30-50) independent variables, and 1 dependent variable that I care about. Some independent variables are more important than others in determining the dependent variable's value. And, these events are time relevant. Things that occur today are more important than events that occurred 10 years ago.
I'd like to be able to feed some sort of learning engine an event, and have it predict the dependent variable. Then, knowing the real answer for the dependent variable for this event (and all the events that have come along before), I'd like for that to train subsequent guesses.
Once I have an idea of what programming direction to go, I can do the research and figure out how to turn my idea into code. But my background is in parallel programming and not stuff like this, so I'd love to have some suggestions and guidance on this.
Thanks!
Edit: Here's a bit more detail about the problem that I'm trying to solve: It's a pricing problem. Let's say that I'm wanting to predict prices for a random comic book. Price is the only thing I care about. But there are lots of independent variables one could come up with. Is it a Superman comic, or a Hello Kitty comic. How old is it? What's the condition? etc etc. After training for a while, I want to be able to give it information about a comic book I might be considering, and have it give me a reasonable expected value for the comic book. OK. So comic books might be a bogus example. But you get the general idea. So far, from the answers, I'm doing some research on Support vector machines and Naive Bayes. Thanks for all of your help so far.

Sounds like you're a candidate for Support Vector Machines.
Go get libsvm. Read "A practical guide to SVM classification", which they distribute, and is short.
Basically, you're going to take your events, and format them like:
dv1 1:iv1_1 2:iv1_2 3:iv1_3 4:iv1_4 ...
dv2 1:iv2_1 2:iv2_2 3:iv2_3 4:iv2_4 ...
run it through their svm-scale utility, and then use their grid.py script to search for appropriate kernel parameters. The learning algorithm should be able to figure out differing importance of variables, though you might be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. Only ever so slightly harder to deal with, and a lot more options.
Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.

If you have some classified data - a bunch of sample problems paired with their correct answers -, start by training some simple algorithms like K-Nearest-Neighbor and Perceptron and seeing if anything meaningful comes out of it. Don't bother trying to solve it optimally until you know if you can solve it simply or at all.
If you don't have any classified data, or not very much of it, start researching unsupervised learning algorithms.

It sounds like any kind of classifier should work for this problem: find the best class (your dependent variable) for an instance (your events). A simple starting point might be Naive Bayes classification.

This is definitely a machine learning problem. Weka is an excellent choice if you know Java and want a nice GPL lib where all you have to do is select the classifier and write some glue. R is probably not going to cut it for that many instances (events, as you termed it) because it's pretty slow. Furthermore, in R you still need to find or write machine learning libs, though this should be easy given that it's a statistical language.
If you believe that your features (independent variables) are conditionally independent (meaning, independent given the dependent variable), naive Bayes is the perfect classifier, as it is fast, interpretable, accurate and easy to implement. However, with 100,000 instances and only 30-50 features you can likely implement a fairly complex classification scheme that captures a lot of the dependency structure in your data. Your best bet would probably be a support vector machine (SMO in Weka) or a random forest (Yes, it's a silly name, but it helped random forest catch on.) If you want the advantage of easy interpretability of your classifier even at the expense of some accuracy, maybe a straight up J48 decision tree would work. I'd recommend against neural nets, as they're really slow and don't usually work any better in practice than SVMs and random forest.

The book Programming Collective Intelligence has a worked example with source code of a price predictor for laptops which would probably be a good starting point for you.

SVM's are often the best classifier available. It all depends on your problem and your data. For some problems other machine learning algorithms might be better. I have seen problems that neural networks (specifically recurrent neural networks) were better at solving. There is no right answer to this question since it is highly situationally dependent but I agree with dsimcha and Jay that SVM's are the right place to start.

I believe your problem is a regression problem, not a classification problem. The main difference: In classification we are trying to learn the value of a discrete variable, while in regression we are trying to learn the value of a continuous one. The techniques involved may be similar, but the details are different. Linear Regression is what most people try first. There are lots of other regression techniques, if linear regression doesn't do the trick.

You mentioned that you have 30-50 independent variables, and some are more important that the rest. So, assuming that you have historical data (or what we called a training set), you can use PCA (Principal Componenta Analysis) or other dimensionality reduction methods to reduce the number of independent variables. This step is of course optional. Depending on situations, you may get better results by keeping every variables, but add a weight to each one of them based on relevant they are. Here, PCA can help you to compute how "relevant" the variable is.
You also mentioned that events that are occured more recently should be more important. If that's the case, you can weight the recent event higher and the older event lower. Note that the importance of the event doesn't have to grow linearly accoding to time. It may makes more sense if it grow exponentially, so you can play with the numbers here. Or, if you are not lacking of training data, perhaps you can considered dropping off data that are too old.
Like Yuval F said, this does look more like a regression problem rather than a classification problem. Therefore, you can try SVR (Support Vector Regression), which is regression version of SVM (Support Vector Machine).
some other stuff you can try are:
Play around with how you scale the value range of your independent variables. Say, usually [-1...1] or [0...1]. But you can try other ranges to see if they help. Sometimes they do. Most of the time they don't.
If you suspect that there are "hidden" feature vector with a lower dimension, say N << 30 and it's non-linear in nature, you will need non-linear dimensionality reduction. You can read up on kernel PCA or more recently, manifold sculpting.

What you described is a classic classification problem. And in my opinion, why code fresh algorithms at all when you have a tool like Weka around. If I were you, I would run through a list of supervised learning algorithms (I don't completely understand whey people are suggesting unsupervised learning first when this is so clearly a classification problem) using 10-fold (or k-fold) cross validation, which is the default in Weka if I remember, and see what results you get! I would try:
-Neural Nets
-SVMs
-Decision Trees (this one worked really well for me when I was doing a similar problem)
-Boosting with Decision trees/stumps
-Anything else!
Weka makes things so easy and you really can get some useful information. I just took a machine learning class and I did exactly what you're trying to do with the algorithms above, so I know where you're at. For me the boosting with decision stumps worked amazingly well. (BTW, boosting is actually a meta-algorithm and can be applied to most supervised learning algs to usually enhance their results.)
A nice thing aobut using Decision Trees (if you use the ID3 or similar variety) is that it chooses the attributes to split on in order of how well they differientiate the data - in other words, which attributes determine the classification the quickest basically. So you can check out the tree after running the algorithm and see what attribute of a comic book most strongly determines the price - it should be the root of the tree.
Edit: I think Yuval is right, I wasn't paying attention to the problem of discretizing your price value for the classification. However, I don't know if regression is available in Weka, and you can still pretty easily apply classification techniques to this problem. You need to make classes of price values, as in, a number of ranges of prices for the comics, so that you can have a discrete number (like 1 through 10) that represents the price of the comic. Then you can easily run classification it.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart