Optimizing Keyword Weights for a Web Crawler - machine-learning

I'm playing around with writing a web crawler that scans for a specific set of keywords and then assigns a global score to each domain it encounters based on a cumulative score I assigned to each keyword (programming=1, clojure=2, javascript=-1, etc...).
I have set up my keyword scoring on a sliding scale of -10 to 10 and I have based my initial values on my own assumptions about what is and is not relevant.
I feel that my scoring model may be flawed, and I would prefer to feed a list of domains that match the criteria I'm trying to capture into an analysis tool and optimize my keyword weights based on some kind of statistical analysis.
What would be an appropriate analysis technique to generate an optimal scoring model for a list of "known good domains"? Is this problem suited for bayesian learning, monte carlo simulation, or some other technique?

So, given a training set of relevant and irrelevant domains, you'd like to build a model which classifies new domains to one of these categories. I assume the features you will be using are the terms appearing in the domains, i.e. this is can be framed as a document classification problem.
Generally, you are correct in assuming that letting statistical-based machine learning algorithms to do the "scoring" for you works better than assigning manual scores to keywords.
A simple way to approach the problem would be to using Bayesian learning, and specifically, Naive Bayes might be a good fit.
After generating a dataset from the domains you've manually tagged (e.g. collecting several pages from each domain and treating each as a document), you can experiment various algorithms using one of the machine learning frameworks, e.g. WEKA.
A primer on how to handle and load text documents to WEKA can be found here. After the data is loaded, you can use the framework to experiment with various classification algorithms, e.g. Naive Bayes, SVM, etc. Once you've found the method best fitting your needs, you can export the resulting model and use it via WEKA's Java API.

Related

How to adjust feature importance in Azure AutoML

I am hoping to have some low code model using Azure AutoML, which is really just going to the AutoML tab, running a classification experiment with my dataset, after it's done, I deploy the best selected model.
The model kinda works (meaning, I publish the endpoint and then I do some manual validation, seems accurate), however, I am not confident enough, because when I am looking at the explanation, I can see something like this:
4 top features are not really closely important. The most "important" one is really not the one I prefer it to use. I am hoping it will use the Title feature more.
Is there such a thing I can adjust the importance of individual features, like ranking all features before it starts the experiment?
I would love to do more reading, but I only found this:
Increase feature importance
The only answer seems to be about how to measure if a feature is important.
Hence, does it mean, if I want to customize the experiment, such as selecting which features to "focus", I should learn how to use the "designer" part in Azure ML? Or is it something I can't do, even with the designer. I guess my confusion is, with ML being such a big topic, I am looking for a direction of learning, in this case of what I am having, so I can improve my current model.
Here is link to the document for feature customization.
Using the SDK you can specify "feauturization": 'auto' / 'off' / 'FeaturizationConfig' in your AutoMLConfig object. Learn more about enabling featurization.
Automated ML tries out different ML models that have different settings which control for overfitting. Automated ML will pick which overfitting parameter configuration is best based on the best score (e.g. accuracy) it gets from hold-out data. The kind of overfitting settings these models has includes:
Explicitly penalizing overly-complex models in the loss function that the ML model is optimizing
Limiting model complexity before training, for example by limiting the size of trees in an ensemble tree learning model (e.g. gradient boosting trees or random forest)
https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistics
Naive
Random Forest
Adaboost
I read a lot about Information Gain technique and it seems it is independent of the machine learning algorithm used. It is like a preprocess technique.
My question follows, is it best practice to perform feature importance for each algorithm dependently or just use Information Gain. If yes what are the technique used for each ?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value <0.05). (same for two-classes Linear Discriminant Analysis)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model. Good in the sense that you need to be satisfied with its performance and it should be Robust, meaning that you should use a validation and/or a test set. These points are very important because we will analyse how the model takes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two level : For the whole dataset (understanding your process), or for a given prediction. For this task I suggest you to look at the SHAP library which computes features contributions (i.e how much does a feature influences the prediction of my classifier) that can be used for both puproses.
For detailled instructions about this process and more tools, you can look fast.ai excellent courses on the machine learning serie, where lessons 2/3/4/5 are about this subject.
Hope it helps!

When true positives are rare

Suppose you're trying to use machine learning for a classification task like, let's say, looking at photographs of animals and distinguishing horses from zebras. This task would seem to be within the state of the art.
But if you take a bunch of labelled photographs and throw them at something like a neural network or support vector machine, what happens in practice is that zebras are so much rarer than horses that the system just ends up learning to say 'always a horse' because this is actually the way to minimize its error.
Minimal error that may be but it's also not a very useful result. What is the recommended way to tell the system 'I want the best guess at which photographs are zebras, even if this does create some false positives'? There doesn't seem to be a lot of discussion of this problem.
One of the things I usually do with imbalanced classes (or skewed data sets) is simply generate more data. I think this is the best approach. You could go out in the real world and gather more data of the imbalanced class (e.g. find more pictures of zebras). You could also generate more data by simply making copies or duplicating it with transformations (e.g. flip horizontally).
You could also pick a classifier that uses an alternate evaluation (performance) metric over the one usually used - accuracy. Look at precision/recall/F1 score.
Week 6 of Andrew Ng's ML course talks about this topic: link
Here is another good web page I found on handling imbalanced classes: link
With this type of unbalanced data problem, it is a good approach to learn patterns associated with each class as opposed to simply comparing classes - this can be done via unsupervised learning learning first (such as with autoencoders). A good article with this available at https://www.r-bloggers.com/autoencoders-and-anomaly-detection-with-machine-learning-in-fraud-analytics/amp/. Another suggestion - after running the classifier, the confusion matrix can be used to determine where additional data should be pursued (I.e. many zebra errors)

Simple machine learning for website classification

I am trying to generate a Python program that determines if a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences for each word.
result for harmful websites
It's a key value dictionary like
{ word : [ # occurrences in harmful websites, # of websites that contain these words] }.
Now I want my program to analyze the words from any websites to check if the website is safe or not. But I don't know which methods will suit to my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of website's data itself (text document) and its label (harmful or safe).
You can certainly use the RNN but there also other natural language processing techniques and much faster ones.
Typically, you should use a proper vectorizer on your training data (think of each site page as a text document), for example tf-idf (but also other possibilities; if you use Python I would strongly suggest scikit that provides lots of useful machine learning techniques and mentioned sklearn.TfidfVectorizer is already within). The point is to vectorize your text document in enhanced way. Imagine for example the English word the how many times it typically exists in text? You need to think of biases such as these.
Once your training data is vectorized you can use for example stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology the test data means to simply take some new data example and test what your ML program outputs).
In either case you will need to experiment with above options. There are many nuances and you need to test your data and see where you achieve the best results (depending on ML algorithm settings, type of vectorizer, used ML technique itself and so on). For example Support Vector Machines are great choice when it comes to binary classifiers too. You may wanna play with that too and see if it performs better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best fitting classifier. On your journey to find the best one you may also wanna use cross validation to determine how well your classifier behaves. Again, already contained in scikit-learn.
N.B. Don't forget about valid cases. For example there may be a completely safe online magazine where it only mentions the harmful topic in some article; it doesn't mean the website itself is harmful though.
Edit: As I think of it, if you don't have any experience with ML at all it could be useful to take any online course because despite the knowledge of API and libraries you will still need to know what it does and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification and is usually done with recurrent neural networks (RNNs) or Long short-term memory networks (LSTMs). This is not an easy topic to start with machine learning. If you are new you should have a look into linear/logistic regression, SVMs and basic neural networks (MLPs) first. Otherwise it will be hard to understand what is going on.
That said: there are many libraries out there for constructing neural networks. Probably easiest to use is keras. While this library simplifies a lot of things immensely, it isn't just a magic box that makes gold from trash. You need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically determine whether a movie review is positive or not) with keras.
For people who have no experience in NLP or ML, I recommend using TFIDF vectorizer instead of using deep learning libraries. In short, it converts sentences to vector, taking each word in vocabulary to one dimension (degree is occurrence).
Then, you can calculate cosine similarity to resulting vector.
To improve performance, use stemming / lemmatizing / stopwords supported in NLTK libraires.

Automating the rumour identification process

Currenlty what we do, check the user discussion based on some keywords on social media. As per the keywords detection we identify that this can be rumour.
Approach to automate the process:
Keyword based : verifying the conversation for 1-2 gram based keywords. If keyword present, marking it as suspected conversation
Classifier based approach : Training the classifier with some prelabeled suspected conversations. Which ever being classified with >50% probability, marked as suspected.
For 2nd approach I am thinking of naive bayes classifier, and identifying the result with precision, recall, F measure value using scikit learn.
Is there any better approach to this? Or some model which can be combination of both approach?
There's no reason that the two approaches would be mutually exclusive. If you are going to be identifying keywords anyway, then you could easily extract a feature for machine-learning. And if you are doing machine-learning, you might as well include features that capture what you know about the keywords you have identified.
Is there a reason that you have chosen a Naive Bayes model? You may want to try a number of models to compare their performance. Your statement about 'identifying the result with precision, recall, F-measure' makes it seem like you don't understand how you make predictions with a machine-learning model. Those three metrics are the result of comparing a model's predictions with 'gold-standard' labels on a number of texts. I would recommend reading through an introduction to machine-learning. If you have already decided that you want to use scikit-learn, then perhaps you could work through their tutorial here. Another python library worth looking into is nltk, which has a free companion book here.
If python is not your preferred language, then there are lots of other options, too. For example, weka is a well-known tool written in java. It has a very user-friendly graphical interface for the basic functions, but it is not difficult to use from the command line as well.
Good luck!

Resources