For example, suppose I implement an SVM and it doesn't work well. The problem could be that I made a wrong choice of alpha when implementing the SMO algorithm, or that I got the KKT conditions wrong. But how can I know what the problem is?
Thanks a lot.
In general, cross-validation is used to make sure that your model generalizes correctly to unseen data.
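For instance, a minimal sketch with scikit-learn (toy data; the RBF kernel is an assumption): if your own SMO implementation scores far below a reference SVM such as sklearn's SVC on the same folds, the bug is most likely in your implementation rather than in the data or the hyperparameters.

```python
# Cross-validate a reference SVM to get a baseline score you can
# compare your own SMO implementation against on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)  # toy data
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
print(scores.mean())  # baseline to compare your implementation against
```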
I have 66 features which I'm using to create a classification machine learning model in Python. However, to prevent issues like overfitting, I was wondering what the best way to reduce the number of features would be. I have read about PCA, but I'm not sure whether any good methodology exists for reducing features, or whether sklearn has any tools to help facilitate this.
Thanks.
The first thing you should do, then, is read through the documentation of scikit-learn's feature selection methods.
Every method has its pros and cons, and which one is best (if there even is one) depends on the specific use case.
That being said, the methods offered in scikit-learn are by no means exhaustive. Discussing the different choices and settling on an appropriate method is probably better asked on platforms like Cross Validated or similar.
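As a rough starting point, here is a sketch of two common options in scikit-learn: univariate selection, which keeps a subset of your original 66 features, and PCA, which projects them onto new components. The toy data and the choice of k = 20 are assumptions for illustration.

```python
# Two ways to shrink 66 features: keep the 20 most predictive ones
# (SelectKBest) or project onto 20 principal components (PCA).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=66, random_state=0)

X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)  # original features kept
X_pca = PCA(n_components=20).fit_transform(X)                  # new, derived components
```

Note that SelectKBest keeps your features interpretable, while PCA components are linear mixtures of all 66 and harder to explain.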
I've written a program to analyze a given piece of text from a website and make classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real time) and takes a few inputs from that as features for its decisions. There are some more features, like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than incorporating a more sophisticated model. I tried using an MLP, but no set of hyperparameters seems to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
You can search with the keyword NLP. The task you are facing is a hot topic among those who study deep learning, and it is called natural language processing.
RandomForest is a machine learning algorithm and probably works quite well. Using other machine learning algorithms might improve your accuracy, or it might not; if you want to try out other lightweight machine learning algorithms, that's fine.
Deep learning will most likely outperform your current model. Starting with the keyword NLP, you'll find many models, notably Word2Vec, BERT, and so on. You can find code for all of them on GitHub.
One tip for you: think carefully about whether you can actually train the model. Trying to train BERT from scratch is a crazy thing to do for a starter, and hard even for an expert. Try bringing in a pretrained model and fine-tuning it, or just use the pretrained word vectors.
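As an example of the "just use the word vectors" route, here is a minimal sketch: average pretrained GloVe vectors over the description's tokens with gensim and feed the result to the RandomForest you already have. The model name and the toy data are assumptions for illustration.

```python
# Turn each description into the mean of its pretrained word vectors,
# then reuse the existing RandomForest on those dense features.
import numpy as np
import gensim.downloader as api
from sklearn.ensemble import RandomForestClassifier

wv = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use

def embed(text):
    """Mean of the pretrained vectors of in-vocabulary tokens."""
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

texts = ["official product documentation page", "win free money now"]  # placeholders
labels = [1, 0]
X = np.stack([embed(t) for t in texts])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```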
I hope that this works out.
I'm trying to use machine learning algorithms for repetitive form filling.
Here is a picture to illustrate that a little bit.
If you enter values in fields A and B, I would like to get a suggestion for field C.
For this case I really would like to implement a machine learning algorithm so that the system stays flexible and only makes suggestions based on the knowledge that has been built up.
I've already started reading Programming Collective Intelligence and Artificial Intelligence: A Modern Approach. I also started to play around with Weka a little bit and found a pretty good Microsoft Research paper on my problem too. But my main problem is that I can't really identify which algorithm group I should use. I'm primarily looking at decision trees like C4.5, but I'm not sure if this is the right way. Could you please give me any suggestions on my problem?
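To make it concrete, here is roughly what I have in mind, sketched with scikit-learn's DecisionTreeClassifier (which implements CART rather than C4.5); the field names and values are made up:

```python
# Learn field C from fields A and B on past form submissions, then
# suggest a value for C when the user fills in A and B again.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

past_forms = [("sales", "Germany"), ("sales", "France"), ("support", "Germany")]
field_c = ["EUR", "EUR", "ticket"]  # what the user picked for C each time

enc = OrdinalEncoder()
X = enc.fit_transform(past_forms)  # encode the categorical fields as numbers

tree = DecisionTreeClassifier().fit(X, field_c)

# New entries in A and B -> suggested value for C
print(tree.predict(enc.transform([("sales", "Germany")]))[0])  # -> "EUR"
```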
It looks like you're starting out... good luck.
Go for a Huffman tree / genetic algorithm randomizer... for a quick solution.
Go for implementing everything you can think of, then using an external efficacy classifier to figure out what to use for the next iteration, and randomizing something along the way... for the more complex solution.
Decision trees are incredibly inflexible when it comes to this type of stuff. Try fuzzy logic algorithms.
There are several normalization methods to choose from: L1/L2 norm, z-score, min-max. Can anyone give some insight into how to choose the proper normalization method for a dataset?
I didn't pay much attention to normalization before, but I just got a small project whose performance was heavily affected not by the parameters or the choice of ML algorithm but by the way I normalized the data. That was kind of a surprise to me, but it may be a common problem in practice. So, could anyone offer some good advice? Thanks a lot!
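For reference, the three options I'm weighing map onto scikit-learn roughly like this (a quick sketch on toy data). One thing I noticed while writing it: L1/L2 normalization scales each sample (row), while z-score and min-max scale each feature (column), so they are not interchangeable.

```python
# The three candidate normalizations side by side on toy data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

z_scored = StandardScaler().fit_transform(X)        # per-feature zero mean, unit variance
min_maxed = MinMaxScaler().fit_transform(X)         # per-feature scaled into [0, 1]
l2_normed = Normalizer(norm="l2").fit_transform(X)  # per-sample unit L2 norm
```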
I'm trying to cluster a really large dataset (3030764 x 162) into 4000 clusters using the cvKmeans2 function in OpenCV 2.1.
I would like to see which iteration the K-means algorithm is currently in (similar to what is displayed in Matlab), but I don't see any documentation that points to how I can do this.
It's kind of frustrating seeing a blank screen and not knowing when the code is going to terminate!
Thank you.
Unfortunate as it seems, the answer is no, you cannot. There are no debugging/informative statements anywhere in the kmeans function as provided by OpenCV. However, you may edit the method and add statements as you deem appropriate.
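Alternatively, if you can use the newer cv2 Python API rather than the legacy cvKmeans2, one workaround is to drive the iterations yourself, one at a time, and print progress between calls. A rough sketch (untested against 2.1; random stand-in data):

```python
# Run k-means one iteration per call so progress can be printed, feeding
# the previous labels back in as the starting point for the next call.
import numpy as np
import cv2

data = np.random.rand(5000, 162).astype(np.float32)  # stand-in for the 3030764 x 162 matrix
K = 50                                               # stand-in for the 4000 clusters
criteria = (cv2.TERM_CRITERIA_MAX_ITER, 1, 0.0)      # exactly one iteration per call

labels = None
flags = cv2.KMEANS_RANDOM_CENTERS
for it in range(100):
    compactness, labels, centers = cv2.kmeans(data, K, labels, criteria, 1, flags)
    print("iteration %d: compactness = %.2f" % (it, compactness))
    flags = cv2.KMEANS_USE_INITIAL_LABELS            # continue from the previous labels
```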
@Sau,
Maybe you need some other way of doing it, though my answer is not specific to OpenCV.
I have not tried this in OpenCV, but I once did k-means clustering on an extremely large dataset, and it was a better option than OpenCV because it ran in distributed mode. It is rather lengthy, but you might still be interested: it's k-means clustering using Mahout.
Check it out.