Using TextBlob for sentiment analysis of tweets across multiple domains - twitter

I'm building an application that analyzes sentiment for news-related tweets in different domains, such as sports, disaster, and technology. I'm using TextBlob with the default analyzer (PatternAnalyzer). Does that provide good sentiment scores even though the domains differ? How can I evaluate its performance? Or is it better to provide my own training data for each domain and train a classifier?

TextBlob is a basic sentiment predictor and it won't be very accurate here. Its default model is trained on a movie-review dataset, which won't transfer well to tweets from other domains. I would suggest creating a separate dataset for each domain, if possible.
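As a sanity check, you could hand-label a small sample of tweets per domain and measure how often PatternAnalyzer's polarity sign agrees with your labels. A minimal sketch (the tweets and labels below are invented placeholders):

    from textblob import TextBlob

    # Hypothetical hand-labelled samples: (tweet text, true label) per domain.
    labelled = {
        "sports":   [("What a great win for the team!", "pos"),
                     ("Terrible refereeing cost us the match", "neg")],
        "disaster": [("Rescue teams saved dozens overnight", "pos"),
                     ("Flooding has destroyed hundreds of homes", "neg")],
    }

    for domain, samples in labelled.items():
        correct = 0
        for text, gold in samples:
            polarity = TextBlob(text).sentiment.polarity  # in [-1.0, 1.0]
            pred = "pos" if polarity >= 0 else "neg"
            correct += (pred == gold)
        print(f"{domain}: {correct}/{len(samples)} correct")

Comparing the per-domain accuracies is one simple way to see whether the movie-review model degrades on, say, disaster tweets.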

Related

Why do we use metric learning when we can classify?

So far, I have read some highly cited metric learning papers. The general idea of such papers is to learn a mapping such that mapped data points with the same label lie close to each other, and far from samples of other classes. To evaluate these techniques, they report the accuracy of a KNN classifier on the learned embedding.

So my question is: if we have a labelled dataset and we are interested in increasing the accuracy of a classification task, why don't we learn a classifier on the original data points? I mean, instead of finding a new embedding that suits a KNN classifier, we could learn a classifier that fits the (non-embedded) data points. Based on what I have read so far, the classification accuracy of such classifiers is much better than that of the metric learning approaches. Is there a study that shows metric learning plus KNN performs better than fitting a (good) classifier, at least on some datasets?
Metric learning models can themselves be classifiers, so I will answer the question of why we need metric learning for classification.
Let me give you an example. Suppose you have a dataset with millions of classes and some classes have only a handful of examples, say fewer than 5. If you use classifiers such as SVMs or ordinary CNNs, you will find them impossible to train, because those classifiers (discriminative models) will effectively ignore the classes with few examples.
For metric learning models, however, this is not a problem, since they are based on generative models.
By the way, a large number of classes is a challenge for discriminative models in itself.
Real-life challenges like this inspire us to explore better models.
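To make "mapped points with the same label lie close together" concrete, here is a minimal NumPy sketch of a triplet loss, one common metric-learning objective (the margin of 0.2 is an arbitrary choice for illustration):

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # Penalize embeddings where the anchor is not closer to a
        # same-class sample (positive) than to a different-class
        # sample (negative) by at least the margin.
        d_pos = np.sum((anchor - positive) ** 2)
        d_neg = np.sum((anchor - negative) ** 2)
        return max(d_pos - d_neg + margin, 0.0)

    # Toy 2-D embeddings: the loss is zero once the positive is
    # sufficiently closer to the anchor than the negative.
    print(triplet_loss(np.array([0.0, 0.0]),
                       np.array([0.1, 0.0]),
                       np.array([2.0, 2.0])))  # -> 0.0

Because the loss only compares pairs of distances, a class with two or three examples still contributes useful triplets, which is why the few-examples-per-class regime remains workable.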
As @Tengerye mentioned, you can use models trained with metric learning for classification. KNN is the simplest approach, but you can also take the embeddings of your data and train another classifier on them, be it KNN, SVM, a neural network, etc. The use of metric learning in this case is to change the original input space into one that is easier for a classifier to handle.
Apart from discriminative models being hard to train when data is unbalanced, or, even worse, has very few examples per class, they also cannot be easily extended to new classes.
Take facial recognition, for example. If facial recognition models are trained as classification models, they only work for the faces they have seen and will not work for any new face. Of course, you could add images of the new faces and retrain or fine-tune the model if possible, but this is highly impractical. On the other hand, facial recognition models trained using metric learning can generate embeddings for new faces, which can simply be added to the KNN index, and your system can then identify the new person given his/her image.
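A sketch of that enrollment workflow with scikit-learn, where `embed` stands in for a trained metric-learning model (the stand-in below just averages pixel values; a real system would run the learned network):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def embed(images):
        # Stand-in for a metric-learning model's embedding function.
        return np.asarray([np.mean(img, axis=(0, 1)) for img in images])

    # Enroll known people by embedding a few face images each.
    known_images = [np.random.rand(32, 32, 3) for _ in range(6)]
    known_labels = ["alice", "alice", "bob", "bob", "carol", "carol"]
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(embed(known_images), known_labels)

    # A new person is added by enrolling their embedding;
    # no retraining of the embedding model is needed.
    print(knn.predict(embed([np.random.rand(32, 32, 3)])))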

Collecting machine learning training data

I am very new to machine learning and need a couple of things clarified. I am trying to predict the probability of someone liking an activity based on their Facebook likes. I am using the Naive Bayes classifier, but am unsure about a couple of things:

1. What would my labels/inputs be?
2. What information do I need to collect for training data?

My guess is to create a survey with questions on whether the person would enjoy an activity (on a scale from 1 to 10).
In supervised classification, all classifiers need to be trained on known, labelled data; this is called the training data. Each example should be a vector of features followed by a special value called the class, which in your problem is whether the person enjoyed the activity or not.
Once you have trained the classifier, you should test its behaviour on another dataset so the evaluation is not biased. This test dataset must be labelled with the class just like the training data. If you train and test on the same dataset, your classifier's predictions may look really good, but the evaluation is unfair.
I suggest you take a look at evaluation techniques like K-Fold Cross Validation.
Another thing you should know is that the common Naïve Bayes classifier is used to predict binary data, so your class should be 0 or 1, meaning the surveyed person either enjoyed the activity or did not. It is implemented in packages like WEKA (Java) and scikit-learn (Python).
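A minimal scikit-learn sketch of both points above: a Naive Bayes classifier on binary "like" features, evaluated with K-Fold cross-validation (the feature layout and data are invented for illustration):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.model_selection import cross_val_score

    # Each row: one person's likes as 0/1 features, e.g.
    # [likes_hiking_pages, likes_sports_pages, likes_gaming_pages].
    X = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1],
                  [0, 1, 1], [1, 1, 1], [0, 0, 0]])
    y = np.array([1, 1, 0, 0, 1, 0])  # 1 = enjoyed the activity

    clf = BernoulliNB()
    print("CV accuracy:", cross_val_score(clf, X, y, cv=3).mean())

    clf.fit(X, y)
    # Probability that a new person enjoys the activity.
    print(clf.predict_proba([[1, 0, 1]])[0][1])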
If you are really interested in Bayesian classifiers, I should point out that Naïve Bayes is in fact not the best one for binary classification, because Minsky showed in 1961 that its decision boundaries are hyperplanes. Its Brier score is also quite bad, and it is said that this classifier is not well calibrated. But it makes good predictions after all.
Hope it helps.
This may be fairly difficult with Naive Bayes. You'll need to collect (or compute) samples of whether or not a person likes activity X, along with details of their Facebook likes (organized in some consistent way).
Basically, for Naive Bayes, your training data should have the same form as your testing data.
The survey approach may work if you have access to each person's Facebook like history.

How can we train and test a neural network with the UNB ISCX benchmark dataset?

I have tried the KDD dataset with my neural net, and now I want to extend my work using the ISCX dataset. Part of this dataset contains labelled HTTP DoS attacks and represents a replica of real network traffic, but I couldn't figure out how to convert the records into numeric inputs to train and test my neural net so that it can classify these intrusion vectors.
Any pointers appreciated.
I haven't worked with this dataset, but if you have sufficient information about the features and the values each feature can take, you can quickly create an .arff file and then use WEKA very easily.
Although many applications exist, user-friendly ones such as WEKA's GUI handle discrete and non-numeric features very easily and can help you start working with your dataset as quickly as possible.
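For reference, an .arff file is just a plain-text header that names each attribute and its type, followed by comma-separated data rows. A minimal sketch (the attribute names below are invented placeholders, not actual ISCX fields):

    % Hypothetical network-traffic data in WEKA's ARFF format.
    @relation network_traffic

    @attribute duration    numeric
    @attribute protocol    {tcp, udp, icmp}
    @attribute bytes_sent  numeric
    @attribute class       {normal, http_dos}

    @data
    0.12, tcp, 5432, http_dos
    3.40, udp, 120, normal

WEKA's NominalToBinary filter can then convert the nominal attributes into 0/1 numeric columns, which is one route to the numeric inputs your neural net needs.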

Optimizing Keyword Weights for a Web Crawler

I'm playing around with writing a web crawler that scans for a specific set of keywords and then assigns a global score to each domain it encounters, based on a cumulative score I assigned to each keyword (programming=1, clojure=2, javascript=-1, etc.).
I have set up my keyword scoring on a sliding scale of -10 to 10 and I have based my initial values on my own assumptions about what is and is not relevant.
I feel that my scoring model may be flawed, and I would prefer to feed a list of domains that match the criteria I'm trying to capture into an analysis tool and optimize my keyword weights based on some kind of statistical analysis.
What would be an appropriate analysis technique to generate an optimal scoring model for a list of "known good domains"? Is this problem suited to Bayesian learning, Monte Carlo simulation, or some other technique?
So, given a training set of relevant and irrelevant domains, you'd like to build a model that classifies new domains into one of these categories. I assume the features you will be using are the terms appearing in the domains' pages, i.e. this can be framed as a document classification problem.
Generally, you are correct in assuming that letting statistical machine learning algorithms do the "scoring" for you works better than assigning manual scores to keywords.
A simple way to approach the problem would be to use Bayesian learning; specifically, Naive Bayes might be a good fit.
After generating a dataset from the domains you've manually tagged (e.g. collecting several pages from each domain and treating each as a document), you can experiment with various algorithms using one of the machine learning frameworks, e.g. WEKA.
A primer on how to handle and load text documents into WEKA can be found here. After the data is loaded, you can use the framework to experiment with various classification algorithms, e.g. Naive Bayes, SVM, etc. Once you've found the method that best fits your needs, you can export the resulting model and use it via WEKA's Java API.
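If you prefer Python, the same experiment can be sketched with scikit-learn instead of WEKA (the documents and labels below are invented placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Pages collected from domains you manually tagged.
    docs = ["clojure programming tutorials and functional patterns",
            "celebrity gossip and fashion news",
            "javascript frameworks compared for web programming",
            "horoscopes and daily lottery numbers"]
    labels = ["relevant", "irrelevant", "relevant", "irrelevant"]

    # Term features plus Naive Bayes; the learned weights replace
    # the hand-assigned keyword scores.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["weekly clojure and javascript digest"]))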

How to categorize continuous data?

I have two dependent continuous variables and I want to use their combined values to predict the value of a third, binary variable. How do I go about discretizing/categorizing the values? I am not looking for clustering algorithms; I'm specifically interested in obtaining "meaningful" discrete categories I can subsequently use in a Bayesian classifier.
Pointers to papers, books, online courses, all very much appreciated!
That is the essence of machine learning, and one of the most studied problems.
Least-squares regression, logistic regression, SVMs, and random forests are widely used for this type of problem, which is called binary classification.
If your goal is simply to classify your data in practice, several libraries are available, like scikit-learn in Python and WEKA in Java. They have great documentation.
But if you want to understand the intrinsics of machine learning, just search (here or on Google) for machine learning resources.
If you wanted to be a real nerd, you could generate a bunch of different possible discretizations, train a classifier on each, then characterize the discretizations by their features and run a classifier on those to see what sorts of discretizations work best!
In general, discretizing is more of an art, and it comes down to having a good understanding of what the input variable ranges mean.
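For the mechanical part, scikit-learn's KBinsDiscretizer bins each continuous variable so the result can feed a categorical Bayesian classifier. A minimal sketch (the data and the choice of 3 quantile bins are arbitrary starting points):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    # Two dependent continuous variables, one row per observation.
    X = np.array([[0.1, 5.2], [0.4, 4.9], [2.3, 1.1],
                  [2.7, 0.8], [1.5, 3.0], [0.2, 5.0]])

    # 3 equal-frequency bins per variable; ordinal integer codes out.
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
    X_binned = disc.fit_transform(X)
    print(X_binned)         # discrete category per variable
    print(disc.bin_edges_)  # where the cut points landed

Whether those cut points are "meaningful" is where the domain knowledge mentioned above comes in; quantile bins are only a neutral default.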
