While trying to come up with appropriate features for a supervised learning problem I had the following idea and wondered if it makes sense and if so, how to algorithmically formulate it.
In an image I want to classify two regions, i.e. two "types" of pixels. Say I have some bounded structure, let's take a circle, and I know I can limit my search space to this circle. Within that circle I want to find a segmenting contour, i.e. a contour that separates my pixels into an inner class A and an outer class B.
I want to implement the following model:
I know that pixels close to the bounding circle are more likely to be in the outer class B.
Of course, I can use the distance from the bounding circle as a feature, then the algorithm would learn the average distance of the inner contour from the bounding circle.
But: I wonder if I can exploit my model assumption in a smarter way. One heuristic idea would be to weigh other features by this distance, so to say, if a pixel further away from the bounding circle wants to belong to the outer class B, it has to have strongly convincing other features.
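To make the heuristic concrete, here is a minimal sketch of one way it could be encoded as features. The function names, the circle parameters and the choice to simply multiply each feature by the normalised distance are all assumptions for illustration, not an established method.

```python
import math

def distance_to_circle(x, y, cx, cy, radius):
    """Inward distance from pixel (x, y) to the bounding circle; 0 on the circle itself."""
    return radius - math.hypot(x - cx, y - cy)

def weighted_features(x, y, other_features, cx=0.0, cy=0.0, radius=1.0):
    """Return the distance, the raw features, and each feature scaled by the distance."""
    # Normalised distance: 0 at the rim, 1 at the centre of the circle.
    d = distance_to_circle(x, y, cx, cy, radius) / radius
    return [d] + list(other_features) + [d * f for f in other_features]
```

Keeping the raw features next to the weighted ones lets the learner decide whether the interaction terms actually help.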
This leads to a general question:
How can one exploit joint information about features that were previously learned individually by the algorithm?
And to a specific question:
In my outlined setup, does my heuristic idea make sense? At what point of the algorithm should this information be used? What would be recommended literature or what would be buzzwords if I wanted to search for similar ideas in the literature?

It is not really clear what you are asking here. What do you mean by "individually learned by the algorithm", and what would "joint information" be? First of all, the problem is too broad: there is no such thing as a "generic supervised learning model". Each model works in at least a slightly different way, with most falling into three classes:
Building a regression model of some kind to map input data to the output, and then aggregating the results for classification (linear regression, artificial neural networks)
Building a geometrical separation of the data (support vector machines, classification SOMs, etc.)
Directly (more or less) estimating the probability of the given classes (Naive Bayes, classification restricted Boltzmann machines, etc.)
In each of them, the "joint information" regarding features is encoded somehow: the classification function is their joint information. In some cases it is easy to interpret (linear regression) and in some it is almost impossible (deep Boltzmann machines, and generally all deep architectures).
To the best of my knowledge, this concept is quite doubtful. Many models tend to learn and work better if your data is uncorrelated, while you are trying to do the opposite: correlate everything with one particular feature. This leads to one main concern: why are you doing this? To force the model to rely mainly on this feature?
If it is that important, maybe supervised learning is not a good idea; maybe you can model your problem directly by applying a set of simple rules based on this particular feature?
If you know the feature is important, but you are aware that in some cases other things matter and you cannot model them, then your problem will be how much to weight your feature. Should it be just distance*other_feature? Why not sqrt(distance)*feature? What about log(distance)*feature? There are countless possibilities, and searching for the best weighting scheme may be much more costly than finding a better machine learning model that can learn your data from its raw features.
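The candidate weightings named above can be written down in a few lines, which also shows how quickly the search space grows. This is only a sketch: the names are made up, and I use log(1 + distance) instead of log(distance) so that a distance of zero is defined.

```python
import math

# Hypothetical set of candidate weighting schemes; any monotone function would do.
WEIGHTINGS = {
    "linear": lambda d: d,
    "sqrt": math.sqrt,
    "log": math.log1p,  # log(1 + d), an assumption to avoid log(0)
}

def interaction(distance, feature, scheme):
    """Weight a feature by the chosen function of the distance."""
    return WEIGHTINGS[scheme](distance) * feature
```

Each scheme would have to be evaluated separately (for example by cross-validation), which is where the cost comes from.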
If you only suspect the importance of the feature, the best option would be to... not trust this belief. Numerous studies have shown that machine learning models are better at selecting features than humans. In fact, this is the whole point of non-linear models.
In the literature, the problem that you are trying to solve is generally referred to as incorporating expert knowledge into the learning process. There are thousands of examples where there is some kind of knowledge that cannot be directly encoded in the data representation, yet is too valuable to omit. You should research terms like "machine learning expert knowledge" and its possible synonyms.

There's a fair amount of work treating the kind of problem you're looking at (which is called segmentation) as an optimisation to be performed on a Markov Random Field, which can be solved by graph theoretic methods like GraphCut. Some examples are the work of Pushmeet Kohli at Microsoft Research (try this paper).
What you describe is, in that framework, a prior on node membership, where p(B) is inversely proportional to the distance from the edge (in addition to any other connectivity constraints you want to impose, there's normally a connectedness one, and there will certainly be a likelihood term for the pixel's intensity). The advantage of doing this is that if you can express everything as a probability model, you don't need to rely on heuristics and you can use standard mechanisms for performing inference.
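As a rough illustration of how such a prior enters the model, the sketch below computes per-pixel (unary) energies as the negative log of prior times likelihood. The functional form 1 / (1 + d / scale) for the prior, the parameter names and the clamping constant are assumptions of mine; the pairwise (connectivity) terms and the actual graph-cut optimisation are left out entirely.

```python
import math

def prior_b(distance, scale=1.0):
    """Hypothetical prior for class B: 1 at the bounding circle, decaying with distance."""
    return 1.0 / (1.0 + distance / scale)

def unary_energies(distance, lik_a, lik_b, scale=1.0, eps=1e-9):
    """Return (energy_A, energy_B) for one pixel; lower energy means more plausible.

    lik_a and lik_b are the (positive) likelihoods of the pixel's intensity
    under each class, supplied by whatever appearance model is in use.
    """
    # Clamp the prior away from 0 and 1 so that neither logarithm blows up.
    p_b = min(max(prior_b(distance, scale), eps), 1.0 - eps)
    energy_a = -math.log((1.0 - p_b) * lik_a)
    energy_b = -math.log(p_b * lik_b)
    return energy_a, energy_b
```

With equal likelihoods, a pixel far from the circle gets a higher energy for B than for A, so it needs stronger intensity evidence to be labelled B, which is the behaviour asked for in the question.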
The downside is you need a fairly strong mathematical background to attempt this; I don't know what the scale of the project you're proposing is, but if you want results quickly or you're lacking the necessary background this is going to be pretty daunting.


Best approach to what I think is a machine learning problem

I'm looking for some expert guidance on the best approach to a problem. I have looked into machine learning, neural networks, and the like. I've investigated Weka, some sort of Bayesian solution, R, and several other things. I'm not sure how to proceed, though. Here's my problem.
I have, or will have, a large collection of events, eventually around 100,000 or so. Each event consists of several (30-50) independent variables and 1 dependent variable that I care about. Some independent variables are more important than others in determining the dependent variable's value. These events are also time relevant: things that occur today are more important than events that occurred 10 years ago.
I'd like to be able to feed some sort of learning engine an event, and have it predict the dependent variable. Then, knowing the real answer for the dependent variable for this event (and all the events that have come along before), I'd like for that to train subsequent guesses.
Once I have an idea of what programming direction to go, I can do the research and figure out how to turn my idea into code. But my background is in parallel programming and not stuff like this, so I'd love to have some suggestions and guidance on this.
Edit: Here's a bit more detail about the problem that I'm trying to solve: It's a pricing problem. Let's say that I'm wanting to predict prices for a random comic book. Price is the only thing I care about. But there are lots of independent variables one could come up with. Is it a Superman comic, or a Hello Kitty comic. How old is it? What's the condition? etc etc. After training for a while, I want to be able to give it information about a comic book I might be considering, and have it give me a reasonable expected value for the comic book. OK. So comic books might be a bogus example. But you get the general idea. So far, from the answers, I'm doing some research on Support vector machines and Naive Bayes. Thanks for all of your help so far.
Sounds like you're a candidate for Support Vector Machines.
Go get libsvm. Read "A Practical Guide to Support Vector Classification", which they distribute and which is short.
Basically, you're going to take your events, and format them like:
dv1 1:iv1_1 2:iv1_2 3:iv1_3 4:iv1_4 ...
dv2 1:iv2_1 2:iv2_2 3:iv2_3 4:iv2_4 ...
Then run them through their svm-scale utility, and use their grid.py script to search for appropriate kernel parameters. The learning algorithm should be able to figure out the differing importance of variables, though you might be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
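To make the file format above concrete, here is a small sketch that writes one event as a line of "target index:value" pairs with 1-based indices. The function name is my own and not part of libsvm.

```python
def to_libsvm_line(target, features):
    """Format one event as 'target 1:v1 2:v2 ...' using 1-based feature indices."""
    pairs = " ".join(f"{index}:{value}" for index, value in enumerate(features, start=1))
    return f"{target} {pairs}"
```

For example, to_libsvm_line(12.5, [1, 0.5, 3]) gives the line "12.5 1:1 2:0.5 3:3"; writing one such line per event produces a training file.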
If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. Only ever so slightly harder to deal with, and a lot more options.
Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.
If you have some classified data (a bunch of sample problems paired with their correct answers), start by training some simple algorithms like K-Nearest-Neighbor and Perceptron and see if anything meaningful comes out of it. Don't bother trying to solve it optimally until you know whether you can solve it simply, or at all.
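A k-nearest-neighbour baseline really is only a few lines. The sketch below is my own adaptation: since the target in the question is a price, it averages the values of the nearest neighbours rather than taking a majority vote, and it assumes all features are numeric and already on comparable scales.

```python
import math

def knn_predict(train, query, k=3):
    """Predict a value for `query` as the mean value of its k nearest training rows.

    `train` is a list of (feature_list, value) pairs.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(value for _, value in nearest) / len(nearest)
```

If something this simple already gives sensible predictions on held-out events, that is a good sign the features carry usable signal.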
If you don't have any classified data, or not very much of it, start researching unsupervised learning algorithms.
It sounds like any kind of classifier should work for this problem: find the best class (your dependent variable) for an instance (your events). A simple starting point might be Naive Bayes classification.
This is definitely a machine learning problem. Weka is an excellent choice if you know Java and want a nice GPL lib where all you have to do is select the classifier and write some glue. R is probably not going to cut it for that many instances (events, as you termed it) because it's pretty slow. Furthermore, in R you still need to find or write machine learning libs, though this should be easy given that it's a statistical language.
If you believe that your features (independent variables) are conditionally independent (meaning, independent given the dependent variable), naive Bayes is the perfect classifier, as it is fast, interpretable, accurate and easy to implement. However, with 100,000 instances and only 30-50 features you can likely implement a fairly complex classification scheme that captures a lot of the dependency structure in your data. Your best bet would probably be a support vector machine (SMO in Weka) or a random forest. (Yes, it's a silly name, but it helped random forest catch on.) If you want the advantage of easy interpretability of your classifier even at the expense of some accuracy, maybe a straight-up J48 decision tree would work. I'd recommend against neural nets, as they're really slow and don't usually work any better in practice than SVMs and random forests.
The book Programming Collective Intelligence has a worked example with source code of a price predictor for laptops which would probably be a good starting point for you.
SVMs are often the best classifier available, but it all depends on your problem and your data. For some problems other machine learning algorithms might be better. I have seen problems that neural networks (specifically recurrent neural networks) were better at solving. There is no right answer to this question since it depends heavily on the situation, but I agree with dsimcha and Jay that SVMs are the right place to start.
I believe your problem is a regression problem, not a classification problem. The main difference: In classification we are trying to learn the value of a discrete variable, while in regression we are trying to learn the value of a continuous one. The techniques involved may be similar, but the details are different. Linear Regression is what most people try first. There are lots of other regression techniques, if linear regression doesn't do the trick.
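For a feel of what "trying linear regression first" means, here is a least-squares fit of a straight line to a single independent variable. It is only a sketch with made-up names; with 30-50 variables you would use a library routine for multiple regression instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # undefined if all xs are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The output is a continuous prediction (a price), which is exactly what distinguishes this from the classifiers suggested elsewhere in the thread.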
You mentioned that you have 30-50 independent variables, and that some are more important than the rest. So, assuming that you have historical data (what we call a training set), you can use PCA (Principal Component Analysis) or other dimensionality reduction methods to reduce the number of independent variables. This step is of course optional. Depending on the situation, you may get better results by keeping every variable but adding a weight to each one based on how relevant it is. Here, PCA can help you compute how "relevant" a variable is.
You also mentioned that events that occurred more recently should be more important. If that's the case, you can weight recent events higher and older events lower. Note that the importance of an event doesn't have to grow linearly with time. It may make more sense for it to grow exponentially, so you can play with the numbers here. Or, if you are not short of training data, you could consider dropping data that is too old.
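One possible form of the exponential weighting is a half-life: an event loses half its weight every so many years. The sketch below uses hypothetical names and a default half-life of two years, which is exactly the kind of number you would have to play with.

```python
def recency_weight(age_years, half_life_years=2.0):
    """Weight of an event: 1 for a brand-new event, halving every half-life."""
    return 0.5 ** (age_years / half_life_years)

def weighted_mean_price(prices, ages_years, half_life_years=2.0):
    """Average price where recent events count for more than old ones."""
    weights = [recency_weight(age, half_life_years) for age in ages_years]
    return sum(w * p for w, p in zip(weights, prices)) / sum(weights)
```

The same weights can be passed as per-sample weights to any learner that accepts them; the weighted mean is just the simplest place to see their effect.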
Like Yuval F said, this does look more like a regression problem than a classification problem. Therefore, you can try SVR (Support Vector Regression), which is the regression version of SVM (Support Vector Machine).
Some other things you can try:
Play around with how you scale the value range of your independent variables. The usual choices are [-1...1] or [0...1], but you can try other ranges to see if they help. Sometimes they do. Most of the time they don't.
If you suspect that there is a "hidden" feature vector of lower dimension, say N << 30, and that it is non-linear in nature, you will need non-linear dimensionality reduction. You can read up on kernel PCA or, more recently, manifold sculpting.
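The range scaling from the first suggestion can be sketched as a min-max rescale of one variable at a time. The function name and the handling of a constant column are my own choices.

```python
def rescale(column, lo=-1.0, hi=1.0):
    """Linearly map the values of one variable onto the range [lo, hi]."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:
        # A constant column carries no information; put it in the middle of the range.
        return [(lo + hi) / 2.0 for _ in column]
    span = c_max - c_min
    return [lo + (hi - lo) * (value - c_min) / span for value in column]
```

Note that the minimum and maximum should be taken from the training data and then reused for new events, otherwise training and prediction end up on different scales.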
What you described is a classic classification problem. And in my opinion, why code fresh algorithms at all when you have a tool like Weka around? If I were you, I would run through a list of supervised learning algorithms (I don't completely understand why people are suggesting unsupervised learning first when this is so clearly a classification problem) using 10-fold (or k-fold) cross validation, which is the default in Weka if I remember correctly, and see what results you get! I would try:
-Neural Nets
-Decision Trees (this one worked really well for me when I was doing a similar problem)
-Boosting with Decision trees/stumps
-Anything else!
Weka makes things so easy and you really can get some useful information. I just took a machine learning class and I did exactly what you're trying to do with the algorithms above, so I know where you're at. For me the boosting with decision stumps worked amazingly well. (BTW, boosting is actually a meta-algorithm and can be applied to most supervised learning algs to usually enhance their results.)
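In case the k-fold cross validation mentioned above is unfamiliar, this is roughly what it does with your events. It is a bare-bones sketch with names of my own; a real tool may also shuffle or stratify the data, which this does not.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    # Deal the indices round-robin into k folds of nearly equal size.
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = sorted(j for m, fold in enumerate(folds) if m != i for j in fold)
        yield train, test
```

Each event is used for testing exactly once and for training k - 1 times, and the reported score is the average over the k runs.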
A nice thing about using Decision Trees (if you use ID3 or a similar variety) is that they choose the attributes to split on in order of how well they differentiate the data; in other words, which attributes determine the classification the quickest. So you can check out the tree after running the algorithm and see which attribute of a comic book most strongly determines the price: it should be the root of the tree.
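The "how well an attribute differentiates the data" measure that ID3 uses is information gain. A minimal sketch, assuming each event is a dict of categorical attributes and the prices have already been turned into class labels (the names here are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in label entropy obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

The attribute with the highest gain over the whole training set is the one that ends up at the root of the tree.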
Edit: I think Yuval is right; I wasn't paying attention to the problem of discretizing your price value for the classification. However, I don't know if regression is available in Weka, and you can still pretty easily apply classification techniques to this problem. You need to make classes of price values, that is, a number of price ranges for the comics, so that you have a discrete number (like 1 through 10) that represents the price of the comic. Then you can easily run classification on it.
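Turning a price into one of a handful of classes can be done with a sorted list of boundaries. The boundaries below are invented for illustration; in practice you would pick them from the distribution of your own prices.

```python
import bisect

def price_class(price, boundaries):
    """Map a price to a class number 1..len(boundaries)+1 given sorted range boundaries."""
    return bisect.bisect_right(boundaries, price) + 1

# Hypothetical boundaries: under 5, 5 to under 20, 20 to under 100, 100 and up.
EXAMPLE_BOUNDARIES = [5.0, 20.0, 100.0]
```

With these boundaries a 3.00 comic falls in class 1 and a 50.00 comic in class 3. The price of this convenience is that the classifier no longer knows that class 3 is closer to class 4 than to class 1.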
