How to give a logical reason for choosing a model - machine-learning

I used machine learning to train a classifier on depression-related sentences, and LinearSVC performed best. I also experimented with MultinomialNB and LogisticRegression, and chose the model with the highest accuracy among the three. What I want, though, is to be able to reason in advance about which model will fit, like the ml_map provided by Scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything more detailed than that SVMs are suitable for text classification. How do I study to get prior knowledge like this ml_map?

How do I study to get prior knowledge like this ml_map?
Try working with different example datasets of different data types, using different algorithms. There are hundreds to explore. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
Here are my thoughts; I used to ask such questions myself, and I hope this helps if you are struggling. The more you work with different machine learning models on a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum up, in order to achieve good out-of-sample performance (i.e., good generalization) while solving a problem, it is essential to look at the training/testing process with different combinations of settings and to be mindful of your data (for example, ask yourself: does it cover most samples in terms of the distribution in the wild, or just a portion of it?).
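As a rough illustration of comparing candidates on held-out data rather than on the training set, here is a minimal k-fold cross-validation sketch in plain Python. The two candidate "models" (`fit_majority`, `fit_word_vote`) and the toy sentences are hypothetical stand-ins made up for this example; with scikit-learn you would normally pass real estimators such as LinearSVC to `cross_val_score` instead.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(fit, texts, labels, k=5, seed=0):
    """Mean held-out accuracy; fit(train_texts, train_labels) returns a predict function."""
    scores = []
    for held_out in k_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train = [i for i in range(len(texts)) if i not in held]
        predict = fit([texts[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Hypothetical candidate 1: always predict the most common training label.
def fit_majority(train_texts, train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda text: majority

# Hypothetical candidate 2: each word votes for the labels it appeared with in training.
def fit_word_vote(train_texts, train_labels):
    counts = {}
    for text, label in zip(train_texts, train_labels):
        for word in set(text.lower().split()):
            counts.setdefault(word, Counter())[label] += 1
    fallback = Counter(train_labels).most_common(1)[0][0]

    def predict(text):
        votes = Counter()
        for word in set(text.lower().split()):
            votes.update(counts.get(word, {}))
        return votes.most_common(1)[0][0] if votes else fallback

    return predict

# Made-up toy data: 1 = depression-related, 0 = not.
texts = [
    "i feel hopeless and tired", "i feel sad and empty", "feeling hopeless again",
    "so tired and sad today", "nothing matters i feel empty",
    "great day at the park", "happy with my new job", "lovely dinner with friends",
    "excited about the trip", "happy and excited today",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for name, candidate in [("majority", fit_majority), ("word vote", fit_word_vote)]:
    print(name, round(cross_val_accuracy(candidate, texts, labels, k=5), 2))
```

The point is the procedure, not these particular models: every candidate is scored on exactly the same folds, so the comparison is not biased by which samples happened to land in the training set.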

Related

How to decide a predictive model for sales forecasting

I would like to know which model I should choose to forecast monthly sales. Should I go for regression approaches or time-series methods, given a small dataset covering only 1.5 years?
One of the first steps I would make is to clearly determine how many features you have.
In the case of univariate forecasting (observations in time of a single variable), you would most likely resort to statistical approaches such as ARIMA/SARIMA. I assume the concept of seasonality is known; if not, please read up on the properties of time series here: https://www.dummies.com/programming/big-data/data-science/key-properties-of-a-time-series-in-data-analysis/
If you have multiple features (observations in time of multiple variables), you could first try a VAR (vector autoregression).
Try these models first, and only then proceed to more complicated ones such as LSTMs/CNNs.
Supporting @Nicolae Petridean's point, the principle of Occam's Razor should always be applied: start with simple models, and only after having tried several simpler ones should you progress to deep learning techniques.
Also, bear in mind that in the case of the latter, you will need much more data as compared to simpler statistical/mathematical models or even classical machine learning ones.
Depending on the data that you have, either one or the other might work, or some other technique entirely. Try two simple models, one from each of the two approaches, and validate them against a common validation dataset. That way you will have your answer. Nobody can answer your question without good insight into the data you have for training. My gut feeling is to start with a regression, but I assume you will end up using something else. It is always a good option to start with simple models to better understand the problem, and then progressively fine-tune them or move to more complicated models, depending on what the models you already have do or do not learn.
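To make "validate them against a common validation dataset" concrete, here is a small sketch in plain Python that holds out the last three months and compares two deliberately simple baselines. The sales figures are made up for illustration, and the two baselines (a naive last-value forecast and a least-squares linear trend) are my own stand-ins, not the ARIMA or VAR models mentioned above.

```python
def naive_forecast(history, horizon):
    """Baseline 1: repeat the last observed value."""
    return [history[-1]] * horizon

def linear_trend_forecast(history, horizon):
    """Baseline 2: fit y = intercept + slope * t by least squares and extrapolate."""
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    slope = numerator / denominator
    intercept = y_mean - slope * t_mean
    return [intercept + slope * (n + h) for h in range(horizon)]

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical 18 months of sales (made-up numbers).
sales = [100, 104, 109, 113, 118, 121, 127, 130, 136,
         139, 145, 148, 153, 158, 161, 167, 170, 176]

# Time series must be split in time order: train on the past, validate on the future.
train, valid = sales[:-3], sales[-3:]

candidates = {"naive": naive_forecast, "linear trend": linear_trend_forecast}
results = {
    name: mean_absolute_error(valid, model(train, len(valid)))
    for name, model in candidates.items()
}
best = min(results, key=results.get)
print(results, "best:", best)
```

With only 18 points the validation window is tiny, so treat the ranking as a hint rather than proof. Any more sophisticated model should at least beat these baselines on the same held-out months before it is worth the extra complexity.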
Have a look at this Kaggle competition : https://www.kaggle.com/c/competitive-data-science-predict-future-sales
Check several notebooks from there and maybe you will understand more on what works or does not work in this kind of prediction.
Link to notebooks : https://www.kaggle.com/c/competitive-data-science-predict-future-sales/notebooks

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data of the imbalanced class (e.g. find more pictures of zebras). You could also generate more data by simply making copies or duplicating it with transformations (e.g. flip horizontally).
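A minimal sketch of the "duplicate with transformations" idea, assuming images are stored as plain nested lists of pixel values (a simplification; real code would use an image library). The function names here are hypothetical.

```python
def flip_horizontal(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

def oversample_with_flips(images, labels, minority_label):
    """Append a flipped copy of every minority-class image to the dataset."""
    augmented_images = list(images)
    augmented_labels = list(labels)
    for image, label in zip(images, labels):
        if label == minority_label:
            augmented_images.append(flip_horizontal(image))
            augmented_labels.append(label)
    return augmented_images, augmented_labels

# Tiny made-up 2x3 "images": two horses and one zebra.
images = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [1, 1, 1]],
    [[0, 5, 9], [9, 5, 0]],
]
labels = ["horse", "horse", "zebra"]

new_images, new_labels = oversample_with_flips(images, labels, "zebra")
print(new_labels)      # the zebra class now has two examples
print(new_images[-1])  # the mirrored zebra image
```

Apply this only to the training split. If flipped copies of the same photo end up in both training and test data, the test score will look better than it really is.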
You could also pick a classifier that uses an alternate evaluation (performance) metric over the one usually used - accuracy. Look at precision/recall/F1 score.
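To see why accuracy is misleading here, the sketch below computes precision, recall and F1 by hand on made-up data with 95 horses and 5 zebras. The classifier scores are hypothetical numbers chosen for illustration. It also shows one common way to say "I accept some false positives": lower the decision threshold on the zebra score.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up data: 0 = horse, 1 = zebra.
y_true = [0] * 95 + [1] * 5

# "Always a horse" looks excellent on accuracy and useless on recall.
always_horse = [0] * 100
print(accuracy(y_true, always_horse))             # 0.95
print(precision_recall_f1(y_true, always_horse))  # (0.0, 0.0, 0.0)

# Hypothetical zebra scores from some classifier.
scores = [0.1] * 90 + [0.4] * 5 + [0.9, 0.6, 0.45, 0.35, 0.3]

def predict_with_threshold(scores, threshold):
    return [1 if s >= threshold else 0 for s in scores]

strict = precision_recall_f1(y_true, predict_with_threshold(scores, 0.5))
lenient = precision_recall_f1(y_true, predict_with_threshold(scores, 0.3))
print(strict)   # precision 1.0, recall 0.4
print(lenient)  # precision 0.5, recall 1.0
```

Lowering the threshold from 0.5 to 0.3 finds every zebra at the cost of five horses flagged as zebras. Which trade-off is right depends on what a missed zebra costs compared with a false alarm.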
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn the patterns associated with each class, as opposed to simply comparing classes. This can be done by applying unsupervised learning first (such as with autoencoders). A good article on this is available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion: after running the classifier, the confusion matrix can be used to determine where additional data should be collected (e.g. if there are many zebra errors).

Classifying URLs into categories - Machine Learning

[I'm approaching this as an outsider to machine learning. It just seems like a classification problem which I should be able to solve with fairly good accuracy with Machine Learning.]
Training Dataset:
I have millions of URLs, each tagged with a particular category. There are limited number of categories (50-100).
Now given a fresh URL, I want to categorize it into one of those categories. The category can be determined from the URL using conventional methods, but would require a huge unmanageable mess of pattern matching.
So I want to build a box where INPUT is URL, OUTPUT is Category. How do I build this box driven by ML?
As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now much much more focussed on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking to get. I suppose machine learning is at a point where you can approach reasonably straight forward problems in that manner.
If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.
I'm building this inside an AWS ecosystem so I'm open to using Amazon ML if it makes things quicker and simpler.
I suppose machine learning is at a point where you can approach reasonably straight forward problems in that manner.
It is not. Building an effective ML solution requires an understanding of the problem's scope and constraints (in your case: new categories over time? Runtime requirements? Execution frequency? Latency requirements? Cost of errors? And more!). These constraints will then determine what types of feature engineering and processing you may look at, and what types of models you will consider. Your particular problem may also involve non-i.i.d. data, whereas most ML methods assume i.i.d. data. This would affect how you evaluate the accuracy of your model.
If you want to learn enough ML to do this problem, you might want to start looking at work done in Malicious URL classification. An example of which can be found here. While you could "hack" your way to something without learning more about ML, I would not personally trust any solution built in that manner.
If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.
Okay, I'll bite.
There are really two schools of thought currently related to prediction: "machine learners" versus statisticians. The former group focuses almost entirely on practical and applied prediction, using techniques like k-fold cross-validation, bagging, etc., while the latter group is focused more on statistical theory and research methods. You seem to fall into the machine-learning camp, which is fine, but then you say this:
As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now much much more focussed on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking to get.
While a "conceptual understanding of the systems and processes involved" is a prerequisite for doing advanced analytics, it isn't sufficient if you're the one conducting the analysis (it would be sufficient for a manager, who's not as close to the modeling).
With just a general idea of what's going on, say, in a logistic regression model, you would likely throw all statistical assumptions (which are important) to the wind. Do you know whether certain features or groups shouldn't be included because there aren't enough observations in that group for the test statistic to be valid? What can happen to your predictions and hypotheses when you have high variance-inflation factors?
These are important considerations when doing statistics, and oftentimes people see how easy it is to do from sklearn.svm import SVC or something like that and run wild. That's how you get caught with your pants around your ankles.
How do I build this box driven by ML?
You don't seem to have even a rudimentary understanding of how to approach machine/statistical learning problems. I would highly recommend that you take an "Introduction to Statistical Learning"- or "Intro to Regression Modeling"-type course in order to think about how you translate the URLs you have into meaningful features that have significant power predicting URL class. Think about how you can decompose a URL into individual pieces that might give some information as to which class a certain URL pertains. If you're classifying espn.com domains by sport, it'd be pretty important to parse nba out of http://www.espn.com/nba/team/roster/_/name/cle, don't you think?
Good luck with your project.
Edit:
To nudge you along, though: every ML problem boils down to some function mapping input to output. Your outputs are URL classes. Your inputs are URLs. However, machines only understand numbers, right? URLs aren't numbers (AFAIK). So you'll need to find a way to translate information contained in the URLs to what we call "features" or "variables." One place to start, there, would be one-hot encoding different parts of each URL. Think of why I mentioned the ESPN example above, and why I extracted info like nba from the URL. I did that because, if I'm trying to predict to which sport a given URL pertains, nba is a dead giveaway (i.e. it would very likely be highly predictive of sport).
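Here is a rough sketch of that one-hot idea using only the standard library. The token scheme (`host=...`, `path=...`) and the second training URL are assumptions made up for this example; a real system would likely need richer features and a sparse representation for millions of URLs.

```python
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL into host and path tokens (a hypothetical feature scheme)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    tokens = ["host=" + host]
    tokens += ["path=" + part.lower() for part in parsed.path.split("/") if part]
    return tokens

def build_vocabulary(urls):
    """Map every token seen in the training URLs to a column index."""
    vocabulary = sorted({token for url in urls for token in url_tokens(url)})
    return {token: index for index, token in enumerate(vocabulary)}

def one_hot(url, vocabulary):
    """Return a 0/1 vector marking which known tokens occur in the URL."""
    vector = [0] * len(vocabulary)
    for token in url_tokens(url):
        if token in vocabulary:  # tokens never seen in training are ignored
            vector[vocabulary[token]] = 1
    return vector

training_urls = [
    "http://www.espn.com/nba/team/roster/_/name/cle",
    "http://www.espn.com/nfl/team/roster/_/name/dal",  # hypothetical second URL
]
vocabulary = build_vocabulary(training_urls)

print(url_tokens(training_urls[0]))
print(one_hot("http://www.espn.com/nba/scores", vocabulary))
```

The resulting vectors are the numeric inputs; the category labels are the outputs. Any standard classifier can then be trained on the pair, and the column for `path=nba` is exactly the "dead giveaway" feature described above.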

Parsing nonuniform data

I am trying to parse a collection of data that has two (or one) useful pieces, but may be organized in many different ways:
V01C01
Vol 1 Chapter 1
Chapter 1 Volume 1 - Alt title
V1.1
etc.
I don't want to use a massive collection of regexs, because there is no way to predict all of the combinations of how things will be organized (also some will have extraneous text). I feel like there is a branch of machine learning that may be perfect for this, but I'm not experienced in it enough to know.
Well that is an interesting problem for sure and there are a couple of things you could try.
Assuming that you don't have labels on your data, the first thing I would try is to examine the connections between instances using a clustering algorithm like k-means (http://en.wikipedia.org/wiki/K-means_clustering). Keep in mind that this wouldn't solve your problem, but it would help you explore your data and hopefully find a set of features with which to train a supervised learning classifier.
If you do have labels on your data, or you can manually tag your set, then you are facing a more manageable problem. At first glance, it looks a lot like a text or document classification problem (like classifying emails as Spam/NoSpam), in which case a naive Bayes classifier could be a good first attempt, since it is an easy algorithm to implement and can provide reasonably good results.
About the Naive Bayes classifier: https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html
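To show how little code a first attempt needs, here is a minimal multinomial naive Bayes sketch with add-one (Laplace) smoothing, written in plain Python. The class name and the spam/ham toy messages are made up for illustration, and whitespace tokenization is an assumption; for the volume/chapter strings in the question you would need a tokenizer suited to that data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over whitespace-separated tokens."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for document, label in zip(documents, labels):
            self.word_counts[label].update(document.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Work in log space to avoid underflow from multiplying small numbers.
            score = math.log(n_docs / total_docs)
            for word in document.lower().split():
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data.
documents = [
    "win money now", "free money offer", "claim your free prize now",
    "meeting at noon tomorrow", "lunch with the team", "project report due tomorrow",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = NaiveBayes().fit(documents, labels)
print(model.predict("free prize money"))       # spam
print(model.predict("team meeting tomorrow"))  # ham
```

The add-one smoothing is what keeps a word that appears in only one class from driving the other class's probability to zero.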
I made some assumptions here and I might be wrong based on that. Maybe if you clarify some points (like if you are able to manually tag the data) we would be able to help you further.

What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements and which form the metadata (neither the content nor the ads, e.g. TOC, author bio etc.)
2. Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand).
3. ... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and open source frameworks like libSVM are fairly mature?
In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved, putting these frameworks to use?
I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations to learn specific portions of statistics and probability theory would not be unexpected, so say so if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learning is the equivalent of having a model. The model can be, for example, a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods works best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step, because it requires domain knowledge, and you have to extract useful information or your classifiers will not be able to make good distinctions. Feature extraction will give you a dataset (a matrix with one row of features per page) from which you'll be able to create your model.
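As a very rough sketch of what feature extraction from HTML can look like, the following uses the standard library's `html.parser` to turn a page into a row of numbers. The chosen features (counts of a few tags plus the amount of visible text) are assumptions picked for illustration, not a recommended feature set.

```python
from html.parser import HTMLParser

COUNTED_TAGS = ("a", "p", "img", "script")

class PageFeatures(HTMLParser):
    """Collect simple counts while the parser walks through a page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {tag: 0 for tag in COUNTED_TAGS}
        self.text_chars = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.tag_counts:
            self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Count only text a reader would see, not script or style source.
        if not self._skip_depth:
            self.text_chars += len(data.strip())

def extract_features(html):
    """Return one feature row: [a, p, img, script, visible text characters]."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return [parser.tag_counts[tag] for tag in COUNTED_TAGS] + [parser.text_chars]

# Made-up page for illustration.
page = (
    '<html><body><p>Hello world</p><a href="/x">link</a>'
    '<script>var a = 1;</script><img src="ad.png"></body></html>'
)
print(extract_features(page))

# One row per page gives the feature matrix a classifier is trained on.
matrix = [extract_features(html) for html in [page, "<p>Only text here</p>"]]
```

Features like the ratio of links to text could plausibly help separate navigation or ad blocks from article content, but which features actually carry signal is the domain-knowledge part that has to be worked out on real pages.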
Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you will use at the end to decide on what is the best method. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint on the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models seem to appear good, but this is just memorization.
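The memorization effect is easy to reproduce. In the sketch below, a deliberately silly "model" that stores its training rows in a lookup table scores perfectly on the data it was trained on, and only the hidden test set reveals how little it has learned. The `Memorizer` class, the split helper and the parity data are all made up for this illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and set a fraction aside as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

class Memorizer:
    """An overfitted 'model': it remembers training rows and nothing else."""

    def fit(self, rows):
        self.table = {features: label for features, label in rows}
        return self

    def predict(self, features):
        return self.table.get(features, 0)  # unseen input: fall back to label 0

def accuracy(model, rows):
    return sum(model.predict(features) == label for features, label in rows) / len(rows)

# Made-up dataset: the label is the parity of a single integer feature.
rows = [((i,), i % 2) for i in range(50)]

train_rows, test_rows = train_test_split(rows, test_fraction=0.2)
model = Memorizer().fit(train_rows)

print("train accuracy:", accuracy(model, train_rows))  # always 1.0
print("test accuracy:", accuracy(model, test_rows))    # only as good as the fallback
```

On the test rows the model is right only when the fallback label happens to match, which is roughly chance level here. That gap between training and test accuracy is the generalization error showing itself.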
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned Weka is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik wrote "The Nature of Statistical Learning Theory", he is the inventor of SVMs. You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology might be helpful to you in finding your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now Naive Bayes, Decision Trees (J48 in particular I think), and SVM approaches are giving the best results. However they are each more suited for slightly different applications. Off the top of my head I'm not sure which would suit you the best. With a tool like WEKA you could try all three approaches with some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are just entering the machine learning field, so I'd really like to suggest having a look at this book: not only does it provide a deep and broad overview of the most common machine learning approaches and algorithms (and their variations), but it also provides a very good set of exercises and links to scientific papers. All of this is written in insightful language and accompanied by a minimal yet useful compendium on statistics and probability.
