Is there any model/classifier that works best for NLP based projects like this? - machine-learning

I've written a program to analyze a given piece of text from a website and make conclusory classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real-time) and takes in a few inputs from that as features to make its decisions. There are some more features like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than incorporating a more sophisticated model. I tried using an MLP, but no set of hyperparameters I tried exceeds the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve on this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.

You can search with the keyword NLP. The task you are facing is a hot topic among those who study deep learning, and is called natural language processing.
RandomForest is a machine learning algorithm, and probably works quite well. Other machine learning algorithms might improve your accuracy, or they might not. If you want to try out other lightweight machine learning algorithms, that's fine.
Deep learning will most likely outperform your current model. Starting with the keyword NLP, you'll find many models, hopefully including Word2Vec, BERT, and so on. You can find code for them on GitHub.
One tip: think carefully about whether you can actually train the model. Training BERT from scratch is a crazy thing to attempt for a beginner, and even for an expert. Try to take a pretrained model and fine-tune it, or just use pretrained word vectors.
I hope that this works out.
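On the side question of including the page text: here is a minimal sketch, assuming the standard-library html.parser copes with your pages. The names VisibleTextExtractor and visible_text are made up for illustration. The idea is to collect every text node that is not inside a script or style tag, rather than hunting for one specific element.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of non-visible tags."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML document without script/style content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

The returned string can then be vectorized the same way as the description. Real pages will still include navigation and footer text, so some filtering is probably needed.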

Related

How to give a logical reason for choosing a model

I used machine learning to train depression related sentences. And it was LinearSVC that performed best. In addition to LinearSVC, I experimented with MultinomialNB and LogisticRegression, and I chose the model with the highest accuracy among the three. By the way, what I want to do is to be able to think in advance which model will fit, like ml_map provided by Scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything that contained more detailed information other than that SVM was suitable for text classification. How do I study to get prior knowledge like this ml_map?
How do I study to get prior knowledge like this ml_map?
Try working with different example datasets and data types using different algorithms. There are hundreds to explore. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
Here are my thoughts (I used to ask such questions myself, and I hope this helps if you are struggling): the more you work with different machine learning models on a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum up, in order to achieve good out-of-sample performance (i.e., good generalization) on a problem, it is essential to examine the training/testing process under different combinations of settings and to be mindful of your data (for example, answer this question: does it cover most samples in terms of their distribution in the wild, or just a portion of it?).

Classifying URLs into categories - Machine Learning

[I'm approaching this as an outsider to machine learning. It just seems like a classification problem which I should be able to solve with fairly good accuracy with Machine Learning.]
Training Dataset:
I have millions of URLs, each tagged with a particular category. There are limited number of categories (50-100).
Now given a fresh URL, I want to categorize it into one of those categories. The category can be determined from the URL using conventional methods, but would require a huge unmanageable mess of pattern matching.
So I want to build a box where INPUT is URL, OUTPUT is Category. How do I build this box driven by ML?
As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now I'm much more focused on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking to get. I suppose machine learning is at a point where you can approach reasonably straightforward problems in that manner.
If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.
I'm building this inside an AWS ecosystem so I'm open to using Amazon ML if it makes things quicker and simpler.
I suppose machine learning is at a point where you can approach reasonably straight forward problems in that manner.
It is not. Building an effective ML solution requires an understanding of the problem scope and constraints (in your case: new categories over time? Runtime requirements? Execution frequency? Latency requirements? Cost of errors? And more!). These constraints will then impact what types of feature engineering / processing you may look at, and what types of models you will look at. Your particular problem may also have issues with non-i.i.d. data, whereas most ML methods assume i.i.d. data. This would impact how you evaluate the accuracy of your model.
If you want to learn enough ML to do this problem, you might want to start looking at work done in Malicious URL classification. An example of which can be found here. While you could "hack" your way to something without learning more about ML, I would not personally trust any solution built in that manner.
If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.
Okay, I'll bite.
There are really two schools of thought currently related to prediction: "machine learners" versus statisticians. The former group focuses almost entirely on practical and applied prediction, using techniques like k-fold cross-validation, bagging, etc., while the latter group is focused more on statistical theory and research methods. You seem to fall into the machine-learning camp, which is fine, but then you say this:
As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now I'm much more focused on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking to get.
While a "conceptual understanding of the systems and processes involved" is a prerequisite for doing advanced analytics, it isn't sufficient if you're the one conducting the analysis (it would be sufficient for a manager, who's not as close to the modeling).
With just a general idea of what's going on, say, in a logistic regression model, you would likely throw all statistical assumptions (which are important) to the wind. Do you know whether certain features or groups shouldn't be included because there aren't enough observations in that group for the test statistic to be valid? What can happen to your predictions and hypotheses when you have high variance-inflation factors?
These are important considerations when doing statistics, and oftentimes people see how easy it is to do from sklearn.svm import SVC or something like that and run wild. That's how you get caught with your pants around your ankles.
How do I build this box driven by ML?
You don't seem to have even a rudimentary understanding of how to approach machine/statistical learning problems. I would highly recommend that you take an "Introduction to Statistical Learning"- or "Intro to Regression Modeling"-type course in order to think about how you translate the URLs you have into meaningful features that have significant power predicting URL class. Think about how you can decompose a URL into individual pieces that might give some information as to which class a certain URL pertains. If you're classifying espn.com domains by sport, it'd be pretty important to parse nba out of http://www.espn.com/nba/team/roster/_/name/cle, don't you think?
Good luck with your project.
Edit:
To nudge you along, though: every ML problem boils down to some function mapping input to output. Your outputs are URL classes. Your inputs are URLs. However, machines only understand numbers, right? URLs aren't numbers (AFAIK). So you'll need to find a way to translate information contained in the URLs to what we call "features" or "variables." One place to start, there, would be one-hot encoding different parts of each URL. Think of why I mentioned the ESPN example above, and why I extracted info like nba from the URL. I did that because, if I'm trying to predict to which sport a given URL pertains, nba is a dead giveaway (i.e. it would very likely be highly predictive of sport).
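To make the one-hot idea concrete, here is a rough sketch using only the Python standard library. The helper names (url_tokens, build_vocabulary, one_hot) and the host=/path= prefixes are my own illustrative choices, not part of any library. With millions of URLs you would likely want a sparse representation instead of plain lists.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL into lowercase tokens taken from its host and path."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # hostname is lowercased and has no port
    host_tokens = [f"host={part}" for part in host.split(".") if part]
    path_tokens = [
        f"path={part}"
        for part in re.split(r"[^a-z0-9]+", parsed.path.lower())
        if part
    ]
    return host_tokens + path_tokens


def build_vocabulary(urls):
    """Assign a column index to every token seen in the training URLs."""
    vocab = {}
    for url in urls:
        for token in url_tokens(url):
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(url, vocab):
    """Return a 0/1 vector marking which known tokens occur in the URL."""
    vector = [0] * len(vocab)
    for token in url_tokens(url):
        index = vocab.get(token)
        if index is not None:  # tokens never seen in training are ignored
            vector[index] = 1
    return vector


training_urls = [
    "http://www.espn.com/nba/team/roster/_/name/cle",
    "http://www.espn.com/nfl/team/roster/_/name/dal",
]
vocab = build_vocabulary(training_urls)
print(url_tokens(training_urls[0]))
print(one_hot(training_urls[0], vocab))
```

Each row produced by one_hot can be fed to whatever classifier you pick; the path=nba column is exactly the kind of feature that gives the sport away.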

Online machine learning for obstacle crossing or bypassing

I want to program a robot which will sense obstacles and learn whether to cross over them or bypass around them.
Since my project must be completed in a week and a half, I must use an online learning algorithm (a GA or similar would take a lot of time to test, because the robot needs to try to cross over the obstacle in order to determine whether it is possible to cross).
I'm really new to online learning so I don't really know which online learning algorithm to use.
It would be a great help if someone could recommend me a few algorithms that would be the best for my problem and some link with examples wouldn't hurt.
Thanks!
I think you could start with A* (A-Star)
It's simple and robust, and widely used.
There are some nice tutorials on the web like this http://www.raywenderlich.com/4946/introduction-to-a-pathfinding
An online algorithm is simply one that can collect new data and update a model incrementally without retraining on the full dataset (i.e. it may be used in an online service that runs all the time). What you are probably looking for is reinforcement learning.
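To show what "update incrementally" looks like in code, here is a sketch that uses a classic perceptron as a stand-in. It is not a reinforcement learning method, and the class name and the single "obstacle height" feature in the usage are hypothetical. The point is only that each new labelled example adjusts the model in place, with no retraining pass over old data.

```python
class OnlinePerceptron:
    """Binary linear classifier that learns from one example at a time."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        activation = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if activation > 0 else 0

    def update(self, x, y):
        """Learn from one example; returns the prediction made before learning."""
        prediction = self.predict(x)
        error = y - prediction  # -1, 0 or +1
        if error != 0:
            step = self.learning_rate * error
            self.weights = [w + step * xi for w, xi in zip(self.weights, x)]
            self.bias += step
        return prediction


# Hypothetical stream: feature = obstacle height, label 1 = bypass, 0 = cross.
model = OnlinePerceptron(n_features=1)
stream = [((0.0,), 0), ((1.0,), 1)] * 5
for features, label in stream:
    model.update(features, label)
print(model.predict((0.0,)), model.predict((1.0,)))
```

A perceptron only converges when the classes are linearly separable, so treat this as an illustration of the update loop rather than a recommendation for the robot.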
RL itself is not a method, but rather a general approach to the problem. Many concrete methods may be used with it. Neural networks have proven to do well in this field (useful course). See, for example, this paper.
However, to create a real robot that is able to bypass obstacles you will need much more than just knowing about neural networks. You will need to set up sensors carefully, preprocess data from them, work out your model and collect a dataset. I'm not sure it's even possible to learn all of that in a week and a half.

What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements and which form the metadata (neither the content nor the ads, e.g. TOC, author bio etc.)
2. Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand)
3. ... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and that open source frameworks like libSVM are fairly mature?
In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved, putting these frameworks to use?
I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations to learn specific portions of statistics and probability theory would not be unexpected, so do say so if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learning is the equivalent of having a model. The model can be, for example, a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods works best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step, because it requires domain knowledge and you'll have to extract useful information, or otherwise your classifiers will not be able to make good distinctions. Feature extraction will give you a dataset (a matrix with one row of features per example) from which you'll be able to create your model.
Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you will use at the end to decide on what is the best method. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint on the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models seem to appear good, but this is just memorization.
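A minimal sketch of setting such a test set aside, using only the standard library. The function name train_test_split here is my own helper (scikit-learn has a function with the same name, but this is not it), and the 20% default is just a common choice, not a rule.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split off a held-out test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for feature rows
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Train and tune only on train_rows; touch test_rows once, at the very end, to estimate the generalization error.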
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned, Weka is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik, the inventor of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology might be helpful to you in finding your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now Naive Bayes, Decision Trees (J48 in particular I think), and SVM approaches are giving the best results. However they are each more suited for slightly different applications. Off the top of my head I'm not sure which would suit you the best. With a tool like WEKA you could try all three approaches with some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are just entering the machine learning field, so I'd really suggest having a look at this book: not only does it provide a deep and broad overview of the most common machine learning approaches and algorithms (and their variations), but it also provides a very good set of exercises and links to scientific papers. All of this is written in insightful language and accompanied by a minimal yet useful compendium of statistics and probability.

Best approach to what I think is a machine learning problem [closed]

I am looking for some expert guidance here on the best approach to solving a problem. I have investigated some machine learning, neural networks, and stuff like that. I've investigated Weka, some sort of Bayesian solution, R, several different things. I'm not sure how to really proceed, though. Here's my problem.
I have, or will have, a large collection of events.. eventually around 100,000 or so. Each event consists of several (30-50) independent variables, and 1 dependent variable that I care about. Some independent variables are more important than others in determining the dependent variable's value. And, these events are time relevant. Things that occur today are more important than events that occurred 10 years ago.
I'd like to be able to feed some sort of learning engine an event, and have it predict the dependent variable. Then, knowing the real answer for the dependent variable for this event (and all the events that have come along before), I'd like for that to train subsequent guesses.
Once I have an idea of what programming direction to go, I can do the research and figure out how to turn my idea into code. But my background is in parallel programming and not stuff like this, so I'd love to have some suggestions and guidance on this.
Thanks!
Edit: Here's a bit more detail about the problem that I'm trying to solve: It's a pricing problem. Let's say that I'm wanting to predict prices for a random comic book. Price is the only thing I care about. But there are lots of independent variables one could come up with. Is it a Superman comic, or a Hello Kitty comic. How old is it? What's the condition? etc etc. After training for a while, I want to be able to give it information about a comic book I might be considering, and have it give me a reasonable expected value for the comic book. OK. So comic books might be a bogus example. But you get the general idea. So far, from the answers, I'm doing some research on Support vector machines and Naive Bayes. Thanks for all of your help so far.
Sounds like you're a candidate for Support Vector Machines.
Go get libsvm. Read "A practical guide to SVM classification", which they distribute, and is short.
Basically, you're going to take your events, and format them like:
dv1 1:iv1_1 2:iv1_2 3:iv1_3 4:iv1_4 ...
dv2 1:iv2_1 2:iv2_2 3:iv2_3 4:iv2_4 ...
Run it through their svm-scale utility, and then use their grid.py script to search for appropriate kernel parameters. The learning algorithm should be able to figure out the differing importance of variables, though you might be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
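A minimal sketch of producing that input format from Python, assuming each event is a (dependent value, list of independent values) pair. The function names are hypothetical helpers, not part of libsvm. Zero-valued features are skipped because the format is sparse.

```python
def to_libsvm_line(target, features):
    """Format one event as 'target 1:v1 2:v2 ...' with 1-based feature indices."""
    pairs = [
        f"{index}:{float(value)!r}"
        for index, value in enumerate(features, start=1)
        if value != 0  # the sparse format allows zero-valued features to be omitted
    ]
    return " ".join([repr(float(target))] + pairs)


def write_libsvm_file(path, events):
    """Write (target, features) pairs to a text file, one event per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for target, features in events:
            handle.write(to_libsvm_line(target, features) + "\n")


print(to_libsvm_line(12.5, [1, 0, 3.5]))  # -> 12.5 1:1.0 3:3.5
```

Categorical variables (Superman vs. Hello Kitty) need to be turned into numbers first, for example one 0/1 feature per category.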
If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. Only ever so slightly harder to deal with, and a lot more options.
Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.
If you have some classified data - a bunch of sample problems paired with their correct answers -, start by training some simple algorithms like K-Nearest-Neighbor and Perceptron and seeing if anything meaningful comes out of it. Don't bother trying to solve it optimally until you know if you can solve it simply or at all.
If you don't have any classified data, or not very much of it, start researching unsupervised learning algorithms.
It sounds like any kind of classifier should work for this problem: find the best class (your dependent variable) for an instance (your events). A simple starting point might be Naive Bayes classification.
This is definitely a machine learning problem. Weka is an excellent choice if you know Java and want a nice GPL lib where all you have to do is select the classifier and write some glue. R is probably not going to cut it for that many instances (events, as you termed it) because it's pretty slow. Furthermore, in R you still need to find or write machine learning libs, though this should be easy given that it's a statistical language.
If you believe that your features (independent variables) are conditionally independent (meaning, independent given the dependent variable), naive Bayes is the perfect classifier, as it is fast, interpretable, accurate and easy to implement. However, with 100,000 instances and only 30-50 features you can likely implement a fairly complex classification scheme that captures a lot of the dependency structure in your data. Your best bet would probably be a support vector machine (SMO in Weka) or a random forest (Yes, it's a silly name, but it helped random forest catch on.) If you want the advantage of easy interpretability of your classifier even at the expense of some accuracy, maybe a straight up J48 decision tree would work. I'd recommend against neural nets, as they're really slow and don't usually work any better in practice than SVMs and random forest.
The book Programming Collective Intelligence has a worked example with source code of a price predictor for laptops which would probably be a good starting point for you.
SVM's are often the best classifier available. It all depends on your problem and your data. For some problems other machine learning algorithms might be better. I have seen problems that neural networks (specifically recurrent neural networks) were better at solving. There is no right answer to this question since it is highly situationally dependent but I agree with dsimcha and Jay that SVM's are the right place to start.
I believe your problem is a regression problem, not a classification problem. The main difference: In classification we are trying to learn the value of a discrete variable, while in regression we are trying to learn the value of a continuous one. The techniques involved may be similar, but the details are different. Linear Regression is what most people try first. There are lots of other regression techniques, if linear regression doesn't do the trick.
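To illustrate the simplest case, here is a sketch of ordinary least squares with a single independent variable, written in plain Python. The function name fit_line and the toy numbers are made up; with 30-50 variables you would use a library's multivariate regression instead.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data: age of an item in years vs. its price.
ages = [1, 2, 3, 4]
prices = [3, 5, 7, 9]
slope, intercept = fit_line(ages, prices)
print(slope, intercept)  # -> 2.0 1.0
```

The output is a continuous number, which is what distinguishes this from a classifier that picks one of a fixed set of labels.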
You mentioned that you have 30-50 independent variables, and that some are more important than the rest. So, assuming that you have historical data (what we call a training set), you can use PCA (Principal Component Analysis) or other dimensionality reduction methods to reduce the number of independent variables. This step is of course optional. Depending on the situation, you may get better results by keeping every variable but adding a weight to each one based on how relevant it is. Here, PCA can help you estimate how "relevant" each variable is.
You also mentioned that events that occurred more recently should be more important. If that's the case, you can weight recent events higher and older events lower. Note that the importance of an event doesn't have to grow linearly with time. It may make more sense for it to grow exponentially, so you can play with the numbers here. Or, if you are not short of training data, perhaps you can consider dropping data that is too old.
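One way to sketch the exponential weighting: give each event a weight that halves every fixed number of days. The function names and the one-year half-life are assumptions for illustration, so tune them for your data.

```python
def recency_weight(age_in_days, half_life_days=365.0):
    """Weight that halves every half_life_days; an event from today gets 1.0."""
    if age_in_days < 0:
        raise ValueError("age_in_days must be non-negative")
    return 0.5 ** (age_in_days / half_life_days)


def weighted_mean(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


ages = [0, 365, 730]           # days since each event
weights = [recency_weight(a) for a in ages]
print(weights)                  # -> [1.0, 0.5, 0.25]
print(weighted_mean([10.0, 20.0], [1.0, 0.5]))
```

Many learners accept per-sample weights during training, so the same numbers can be passed there instead of being used in a plain average.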
Like Yuval F said, this does look more like a regression problem than a classification problem. Therefore, you can try SVR (Support Vector Regression), which is the regression version of SVM (Support Vector Machine).
Some other things you can try:
- Play around with how you scale the value range of your independent variables. The usual choices are [-1...1] or [0...1], but you can try other ranges to see if they help. Sometimes they do. Most of the time they don't.
- If you suspect that there is a "hidden" feature vector with a lower dimension, say N << 30, and that it's non-linear in nature, you will need non-linear dimensionality reduction. You can read up on kernel PCA or, more recently, manifold sculpting.
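The scaling suggestion can be sketched like this, assuming plain min-max scaling of one column at a time. The helper names are hypothetical. The range is learned from the training data and then reused for new events, so training and prediction see the same transformation.

```python
def fit_min_max(column):
    """Return the (lowest, highest) value seen in one training column."""
    return min(column), max(column)


def scale(value, lo, hi, new_min=-1.0, new_max=1.0):
    """Map value from the range [lo, hi] linearly onto [new_min, new_max]."""
    if hi == lo:
        # A constant column carries no information; put it mid-range.
        return (new_min + new_max) / 2.0
    return new_min + (value - lo) * (new_max - new_min) / (hi - lo)


train_column = [0.0, 2.5, 5.0, 10.0]
lo, hi = fit_min_max(train_column)
print([scale(v, lo, hi) for v in train_column])            # range [-1, 1]
print([scale(v, lo, hi, 0.0, 1.0) for v in train_column])  # range [0, 1]
```

New values outside the training range will land outside the target range, which is usually acceptable but worth knowing about.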
What you described is a classic classification problem. And in my opinion, why code fresh algorithms at all when you have a tool like Weka around? If I were you, I would run through a list of supervised learning algorithms (I don't completely understand why people are suggesting unsupervised learning first when this is so clearly a classification problem) using 10-fold (or k-fold) cross validation, which is the default in Weka if I remember correctly, and see what results you get! I would try:
-Neural Nets
-SVMs
-Decision Trees (this one worked really well for me when I was doing a similar problem)
-Boosting with Decision trees/stumps
-Anything else!
Weka makes things so easy and you really can get some useful information. I just took a machine learning class and I did exactly what you're trying to do with the algorithms above, so I know where you're at. For me the boosting with decision stumps worked amazingly well. (BTW, boosting is actually a meta-algorithm and can be applied to most supervised learning algs to usually enhance their results.)
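For anyone doing this outside Weka, here is a rough sketch of what k-fold cross validation does, in plain Python. The function names and the fit/score callables are placeholders I made up; plug in whichever learner you are comparing.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(fit, score, rows, k=10, seed=0):
    """Average validation score over k folds.

    fit(train_rows) returns a model; score(model, validation_rows) returns a number.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(rows), k, seed):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in val_idx]))
    return sum(scores) / len(scores)


# Toy baseline: always predict the most common label in the training rows.
def fit_majority(train_rows):
    labels = [label for _, label in train_rows]
    return max(set(labels), key=labels.count)


def accuracy(model, val_rows):
    return sum(1 for _, label in val_rows if label == model) / len(val_rows)


rows = [(i, 1) for i in range(20)]
print(cross_validate(fit_majority, accuracy, rows, k=5))
```

Running the same cross_validate call with each candidate learner gives comparable numbers, which is essentially what the Weka experimenter does for you.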
A nice thing about using Decision Trees (if you use ID3 or a similar variety) is that it chooses the attributes to split on in order of how well they differentiate the data; in other words, which attributes determine the classification the quickest. So you can check out the tree after running the algorithm and see which attribute of a comic book most strongly determines the price: it should be the root of the tree.
Edit: I think Yuval is right, I wasn't paying attention to the problem of discretizing your price value for the classification. However, I don't know if regression is available in Weka, and you can still pretty easily apply classification techniques to this problem. You need to make classes of price values, as in, a number of ranges of prices for the comics, so that you can have a discrete number (like 1 through 10) that represents the price of the comic. Then you can easily run classification on it.
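A small sketch of that discretization step, assuming you pick the range boundaries yourself. The boundary values and the function name are invented for illustration; classes are numbered from 0 here.

```python
from bisect import bisect_right


def price_to_class(price, boundaries):
    """Return the index of the price range containing price.

    boundaries must be sorted ascending; class i covers
    boundaries[i-1] <= price < boundaries[i].
    """
    return bisect_right(boundaries, price)


boundaries = [5.0, 20.0, 100.0]  # hypothetical cut points in dollars
for price in (3.0, 5.0, 50.0, 500.0):
    print(price, price_to_class(price, boundaries))
```

The trade-off is that the classifier no longer knows class 2 is "closer" to class 3 than to class 0, which is one reason regression is often the more natural fit for prices.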
