I am a geophysics student trying to predict shear wave velocity, which is numerical data. Since it is numerical, I assume this would be a regression problem, but I don't have a shear wave log to use as a target, which seems to make the project unsupervised. How do I go about it, please?
I want to know whether it is even possible to predict this numerical data. I have tried picking out logs that I feel should predict it, but how do I check the accuracy?
One solution here is to create labels out of the signal data itself. I was working on a similar kind of problem where I had to predict the intensity of a fall, and the data I had was signal data with x, y, z axes. I managed to solve the problem by first creating labels with a clustering method suited to my use case. Once I had supervised (labelled) data, I proceeded with further analysis and predictions.
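For illustration only, here is a minimal sketch of that idea in Python with scikit-learn. The file name, number of clusters, and choice of models are placeholders, not part of the original workflow; in practice the cluster count should come from domain knowledge.

```python
# Sketch: derive pseudo-labels from unlabelled signal data via clustering,
# then treat them as targets for a supervised model.
# "signal_features.csv" and n_clusters=3 are placeholders for your own data/use case.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.loadtxt("signal_features.csv", delimiter=",")   # rows = samples, columns = signal features (e.g. x, y, z)

# Step 1: create labels with clustering, guided by knowledge of the use case
pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: with labels in hand, proceed as a supervised problem
X_train, X_test, y_train, y_test = train_test_split(X, pseudo_labels, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy on the pseudo-labels:", clf.score(X_test, y_test))
```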
Building a classifier for classical problems, like image classification, is quite straightforward, since by visualizing the image we know the pixel values do contain information about the target.
However, for problems with no obvious visual pattern, how should we evaluate whether the collected features are good enough to carry information about the target? Are there criteria by which we can conclude that the collected features do not work at all? Otherwise we have to try different algorithms or classifiers to verify the predictability of the collected data. Or is there a rule of thumb saying that if classical classifiers, like SVM, random forest, and AdaBoost, cannot reach a reasonable accuracy (say 70%), then we should give up and look for other, more relevant features?
Or, using a high-dimensional visualization tool like t-SNE, if no clear pattern appears in a low-dimensional latent space, should we give up?
First of all, there might be NO features that explain the data well enough. The data may simply be pure noise without any signal, so speaking about a "reasonable accuracy" at any fixed level, e.g. 70%, is not meaningful. For some data sets, a model that explains 40% of the variance will be fantastic.
Having said that, the simplest practical way to evaluate the input features is to calculate correlations between each of them and the target.
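A minimal sketch of that check in Python with pandas, assuming the features and target are numeric columns in one table; the file and column names are placeholders:

```python
# Sketch: rank candidate features by absolute correlation with the target.
# "data.csv" and the column name "target" are placeholders.
import pandas as pd

df = pd.read_csv("data.csv")
corr = df.corr()["target"].drop("target")
print(corr.abs().sort_values(ascending=False))   # features most linearly related to the target come first
```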
Models also have their own ways of evaluating feature importance.
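For example (again only a sketch with placeholder names, and assuming a categorical target), tree ensembles in scikit-learn expose impurity-based importances:

```python
# Sketch: let a model rank the features itself, here via a random forest.
# "data.csv" and "target" are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")
X, y = df.drop(columns="target"), df["target"]
model = RandomForestClassifier(random_state=0).fit(X, y)
for name, score in sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```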
I have a data set with 3 possible events. There are 24 features that affect which of the three events will happen.
I have training data with all 24 features and the event that actually happened.
Using this data, I want to predict which of the three events will happen next, given that all 24 feature values are known.
Could you suggest a machine learning algorithm I should use to solve this problem?
This sounds like a typical classification problem in supervised learning. However, you haven't given us enough information to suggest a particular algorithm.
We would need statistics about the "shape" of your data: relative clustering and range, correlations among the features, etc. The critical points so far are that you have few classes (3) and many more features than classes. What have you considered so far? Backing up a little, what unsupervised classification algorithms have you researched well enough to use?
My personal approach is to hit such a generic problem with Naive Bayes or a multi-class SVM, and use the resulting classification parameters as input for feature reduction. I might also try a small neural network with one hidden layer (or none, just a single fully connected layer) and then examine the weights to eliminate extraneous features.
Given the large dimensionality, you might also try hitting it with k-means clustering to see whether the classification is already cohesive in 24-D space. Try k=6; in most runs, this will give you 3 good clusters and 3 tiny outliers.
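As a rough illustration of both suggestions (on synthetic stand-in data, not your actual 24 features), here is a sketch in Python with scikit-learn: quick Naive Bayes and multi-class SVM baselines, then a k-means run with k=6 to see how the clusters line up with the known labels.

```python
# Sketch: quick supervised baselines plus a k-means cohesion check on
# synthetic 24-feature, 3-class data standing in for the real set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=24, n_informative=12,
                           n_classes=3, random_state=0)   # stand-in for your data

# quick supervised baselines
for name, model in [("naive bayes", GaussianNB()),
                    ("multi-class svm", make_pipeline(StandardScaler(), SVC()))]:
    print(name, cross_val_score(model, X, y, cv=5).mean())

# unsupervised cohesion check: do the k=6 clusters line up with the 3 classes?
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
for c in range(6):
    labels, counts = np.unique(y[clusters == c], return_counts=True)
    print(f"cluster {c}: size={int(counts.sum())}, labels={dict(zip(labels.tolist(), counts.tolist()))}")
```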
Does that get you moving toward a solution?
I have a very large data set extracted from a machine (stream data), where most of the data fall under one category. If I train a classifier on the current data, the accuracy will be very low. How can I identify the key features in the given data? Also, how can I measure the probability of some previous features in the time series?
Typical methods for identifying important features include PCA and ICA. However, even more valuable than these methods is having an understanding of the underlying system your data is representing.
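As a small illustration (synthetic stand-in data, not your stream), PCA in scikit-learn reports how much variance each direction carries and which original features load on it:

```python
# Sketch: PCA on standardized features to see which directions (and hence
# which original features) carry most of the variance.
# The random data below only stands in for the real stream features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                  # stand-in: 1000 samples, 10 features

pca = PCA().fit(StandardScaler().fit_transform(X))
print("explained variance ratio:", pca.explained_variance_ratio_)
print("loadings of first component:", pca.components_[0])   # contribution of each original feature
```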
It's difficult to answer without more information about the data structure. The best classification approach depends on the structure of your data and the aims of your analysis. Some classifiers cope quite well with skewed data; I'd suggest you look at ensemble methods such as boosting and random or rotation forests. Some of these, such as rotation forests, report variable importance as part of the training process. If you just want to work out which features are most important, you could try CART or random forests. For more detailed help, though, I'd strongly suggest you provide more information about your data structure and what you'd like to achieve.
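As one hedged example of the ensemble route on skewed data (synthetic stand-in data; a class-weighted random forest is shown here in place of rotation forests, which scikit-learn does not provide):

```python
# Sketch: a class-weighted random forest for a skewed class distribution,
# plus its impurity-based variable importances. Synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))   # per-class precision/recall, not just accuracy
print(clf.feature_importances_)                             # rough variable-importance ranking
```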
I am trying to solve a classification problem. It seems many classical approaches follow a similar paradigm: train a model on some training set and then use it to predict the class labels for new instances.
I am wondering if it is possible to introduce some feedback mechanism into the paradigm. In control theory, introducing a feedback loop is an effective way to improve system performance.
Currently, a straightforward approach I have in mind is: first, start with an initial set of instances and train a model on them. Then, each time the model makes a wrong prediction, add the misclassified instance to the training set. This differs from blindly enlarging the training set because it is more targeted. In the language of control theory, it can be seen as a kind of negative feedback.
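For concreteness, a minimal sketch of that loop on synthetic data (the model choice, batch size, and number of rounds are arbitrary placeholders):

```python
# Sketch of the feedback idea: start from a small training set, then repeatedly
# fold misclassified examples from the remaining pool back in and retrain.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
train_idx = np.arange(100)                       # initial training set
pool_idx = np.arange(100, len(X))                # examples not yet trained on

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    wrong = pool_idx[model.predict(X[pool_idx]) != y[pool_idx]]
    if len(wrong) == 0:
        break
    picked = wrong[:50]                          # feed back a batch of the mistakes
    train_idx = np.concatenate([train_idx, picked])
    pool_idx = np.setdiff1d(pool_idx, picked)
    print(f"round {round_}: train size={len(train_idx)}, pool errors={len(wrong)}")
```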
Is there any research going on with the feedback approach? Could anyone shed some light?
There are two areas of research that spring to mind.
The first is Reinforcement Learning. This is an online learning paradigm that allows you to get feedback and update your policy (in this instance, your classifier) as you observe the results.
The second is active learning, where the classifier gets to select examples from a pool of unclassified examples to be labelled. The key is to have the classifier choose the examples whose labels would most improve its accuracy, typically the examples that are most difficult under the current classifier hypothesis.
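A minimal sketch of pool-based active learning with least-confidence (uncertainty) sampling on synthetic data; in practice the queried labels would come from a human oracle rather than being read off y as done here:

```python
# Sketch: the classifier repeatedly asks for labels on the pool examples
# it is least sure about. Synthetic stand-in data and an arbitrary model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labelled = np.arange(20)                          # small initial labelled set
pool = np.arange(20, len(X))                      # "unlabelled" pool

for round_ in range(10):
    model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)           # least-confident-prediction criterion
    query = pool[np.argsort(uncertainty)[-10:]]   # ask the oracle for the 10 hardest examples
    labelled = np.concatenate([labelled, query])
    pool = np.setdiff1d(pool, query)

model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
print("accuracy on the remaining pool:", model.score(X[pool], y[pool]))
```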
I have used such feedback for every machine-learning project I have worked on. It lets you train on less data (so training is faster) than selecting data randomly, and model accuracy also improves faster than with randomly selected training data. I work on image processing (computer vision) data, so one other type of selection I do is to add clustered false (wrong) detections instead of adding every single false one. This is because I assume I will always have some failures, so my criterion for treating detections as worth adding is that they are clustered in the same area of the image.
I saw this paper some time ago, which seems to be what you are looking for.
They are basically modeling classification problems as Markov decision processes and solving using the ACLA algorithm. The paper is much more detailed than what I could write here, but ultimately they are getting results that outperform the multilayer perceptron, so this looks like a pretty efficient method.
I've built an algorithm for pedestrian detection using OpenCV tools. To perform classification I use a boosted classifier trained with the CvBoost class.
The problem with this implementation is that I need to feed my classifier the whole set of features I used for training. This makes the algorithm extremely slow, so much so that each image takes around 20 seconds to be fully analysed.
I need a different detection structure, and OpenCV has a Soft Cascade class that seems like exactly what I need. Its basic principle is that there is no need to examine all the features of a test sample, since the detector can reject most negative samples using only a small number of features. The problem is that I have no idea how to train one given a fully labelled set of negative and positive examples.
I can find no information about this online, so I am looking for any tips you can give me on how to use this soft cascade for classification.
Best regards