How to find hyperparameters more easily? - machine-learning

Many models in machine learning include hyperparameters. What is the best practice for finding those hyperparameters using held-out data? Or what is your way of doing that?

Grid search and manual search are the most widely used techniques for optimizing the hyperparameters of machine learning algorithms. However, a recent paper by James Bergstra and Yoshua Bengio argued that random search is better than grid and manual search for hyperparameter optimization. For more information about random (as well as grid and manual) search, please look at their paper:
Random Search for Hyper-Parameter Optimization
Recently, I submitted a paper (which was accepted) to the journal Pattern Recognition Letters. For that paper I used the random search technique.
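As a concrete illustration (not taken from the paper), here is a minimal random-search sketch using scikit-learn's RandomizedSearchCV; the random-forest model, its parameter ranges, and the iris data are just assumptions for the example:

    # Minimal random-search sketch (illustrative model and parameter ranges).
    from scipy.stats import randint
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = load_iris(return_X_y=True)

    # Distributions to sample hyperparameter values from.
    param_distributions = {
        "n_estimators": randint(10, 200),
        "max_depth": randint(2, 10),
    }

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions=param_distributions,
        n_iter=20,   # number of random configurations to try
        cv=5,        # 5-fold cross-validation as the hold-out strategy
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)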

Related

In machine learning, which algorithm should I use to recommend, based on different features like rating, type, gender, etc.?

I am developing a website which will recommend recipes to visitors based on their data. I am collecting data from their profile, website activity, and Facebook.
Currently I have data like [username/userId, rating of recipes, age, gender, type (veg/non-veg), cuisine (Italian/Chinese, etc.)]. With respect to the above features, I want to recommend new recipes which they have not visited.
I have implemented the ALS (alternating least squares) Spark algorithm. For this we have to prepare a CSV which contains [userId, RecipesId, Rating] columns. Then we train on this data and create the model by adjusting parameters like lambda, rank, and iterations. This model generates recommendations, using PySpark:
model.recommendProducts(userId, numberOfRecommendations)
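A minimal sketch of this workflow, assuming the RDD-based pyspark.mllib API; the CSV path, parameter values, and user id are placeholders:

    # Sketch of the ALS workflow described above (pyspark.mllib, RDD-based API).
    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="recipe-recommender")

    # CSV rows of the form: userId,recipeId,rating (placeholder path).
    lines = sc.textFile("ratings.csv")
    ratings = lines.map(lambda l: l.split(",")).map(
        lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

    # rank, iterations and lambda_ are the tuning knobs mentioned above.
    model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)

    # Top-5 recipe recommendations for user 42 (placeholder id).
    print(model.recommendProducts(42, 5))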
The ALS algorithm accepts only three features: userId, RecipesId, Rating. I am unable to include more features (like type, cuisine, gender, etc.) apart from the ones I mentioned above (userId, RecipesId, Rating). I want to include those features, then train the model and generate recommendations.
Is there any other algorithm with which I can include the above features and generate recommendations?
Any help would be appreciated. Thanks.
Yes, there are a couple of other algorithms. For your case, I would suggest the Naive Bayes algorithm.
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
Since you are working on a web application, a JS solution would, I guess, come in handy for you.
(simple) https://www.npmjs.com/package/bayes
or for example:
(a bit more powerful) https://www.npmjs.com/package/naivebayesclassifier
There is a class of algorithms in machine learning called recommender systems. Within this we have content-based recommender systems, which are mainly used to recommend products/movies based on customer reviews. You can apply the same approach using customer reviews to recommend recipes. For a better understanding of this algorithm, refer to these links:
https://www.youtube.com/watch?v=Bv6VkpvEeRw&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=97
https://www.youtube.com/watch?v=2uxXPzm-7FY
You can also go with powerful classification algorithms (see the sketch below), like:
-> SVM: works very well if you have a larger number of attributes.
-> Logistic regression: works well if you have a large amount of customer data.
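For illustration, a minimal scikit-learn sketch of that classification route, folding extra features into a model that predicts whether a user will like a recipe; the feature names, toy data, and the choice of logistic regression are assumptions, not a prescribed solution:

    # Sketch: predict "will like this recipe" from user/recipe features (toy data).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Illustrative columns only; replace with your real profile/activity data.
    data = pd.DataFrame({
        "gender":  ["M", "F", "F", "M"],
        "type":    ["veg", "nonveg", "veg", "nonveg"],
        "cuisine": ["Italian", "Chinese", "Italian", "Chinese"],
        "liked":   [1, 0, 1, 1],   # e.g. derived from rating >= 4
    })

    model = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),  # categorical features -> 0/1 columns
        LogisticRegression(),
    )
    model.fit(data[["gender", "type", "cuisine"]], data["liked"])

    new_pairs = pd.DataFrame({"gender": ["F"], "type": ["veg"], "cuisine": ["Chinese"]})
    print(model.predict_proba(new_pairs)[:, 1])  # probability the user likes the recipe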
You are looking for recommender systems using algorithms like collaborative filtering. I would suggest going through Prof. Andrew Ng's short videos on the collaborative filtering algorithm, low-rank matrix factorization, and building recommender systems. They are part of Coursera's Machine Learning course offered by Stanford University.
The course link:
https://www.coursera.org/learn/machine-learning#%20
You can check week 9 for the content related to recommender systems.

Simple machine learning for website classification

I am trying to generate a Python program that determines if a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences for each word.
The result for harmful websites is a key-value dictionary like
{ word : [ # occurrences in harmful websites, # of websites that contain the word ] }.
Now I want my program to analyze the words from any website to check whether the website is safe or not, but I don't know which methods will suit my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of the website's content itself (a text document) and its label (harmful or safe).
You can certainly use an RNN, but there are also other, much faster natural language processing techniques.
Typically, you should use a proper vectorizer on your training data (think of each site page as a text document), for example tf-idf (but there are other possibilities too; if you use Python, I would strongly suggest scikit-learn, which provides lots of useful machine learning techniques, and the mentioned TfidfVectorizer is already included). The point is to vectorize your text documents in an informative way. Imagine, for example, the English word "the": how many times does it typically appear in a text? You need to account for biases such as these.
Once your training data is vectorized, you can use, for example, a stochastic gradient descent (SGD) classifier and see how it performs on your test data (in machine learning terminology, test data simply means new data examples held aside to check what your ML program outputs).
In either case, you will need to experiment with the above options. There are many nuances, and you need to test on your data and see where you achieve the best results (depending on the ML algorithm's settings, the type of vectorizer, the ML technique itself, and so on). For example, Support Vector Machines are also a great choice when it comes to binary classification. You may want to play with that too and see if it performs better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best-fitting classifier. On your journey to find the best one, you may also want to use cross-validation to determine how well your classifier behaves; again, this is already included in scikit-learn.
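A minimal sketch of that pipeline (tf-idf vectorizer, SGD classifier, cross-validation), assuming scikit-learn; the example pages and labels are made up:

    # Sketch: tf-idf vectorization + SGD classifier + cross-validation (toy data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Placeholder training data: page text and a harmful/safe label.
    pages  = ["buy pills xxx ...", "cooking recipes and tips ...",
              "adult content ...", "local news and weather ..."]
    labels = ["harmful", "safe", "harmful", "safe"]

    clf = make_pipeline(TfidfVectorizer(), SGDClassifier())

    # Cross-validation gives a rough idea of how the classifier behaves.
    print(cross_val_score(clf, pages, labels, cv=2))

    clf.fit(pages, labels)
    print(clf.predict(["free pills and adult chat"]))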
N.B. Don't forget about legitimate cases. For example, a completely safe online magazine may mention a harmful topic in some article; that doesn't mean the website itself is harmful, though.
Edit: Thinking about it, if you don't have any experience with ML at all, it could be useful to take an online course, because beyond knowing the APIs and libraries you will still need to know what the method does and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification and is usually done with recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). This is not an easy topic to start machine learning with. If you are new, you should have a look at linear/logistic regression, SVMs, and basic neural networks (MLPs) first; otherwise it will be hard to understand what is going on.
That said, there are many libraries out there for constructing neural networks. Probably the easiest to use is Keras. While this library simplifies a lot of things immensely, it isn't just a magic box that makes gold from trash; you need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically determining whether a movie review is positive or not) with Keras.
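For orientation, a condensed sketch of that kind of Keras example (an LSTM sentiment classifier on the IMDB dataset); the layer sizes and training settings here are arbitrary choices, not the official example's values:

    # Condensed sketch: LSTM sentiment classifier on the IMDB dataset (Keras).
    from tensorflow.keras.datasets import imdb
    from tensorflow.keras.layers import Dense, Embedding, LSTM
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    vocab_size, max_len = 10000, 200
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
    x_train = pad_sequences(x_train, maxlen=max_len)
    x_test = pad_sequences(x_test, maxlen=max_len)

    model = Sequential([
        Embedding(vocab_size, 32),       # word index -> dense vector
        LSTM(32),                        # sequence model
        Dense(1, activation="sigmoid"),  # positive / negative
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=2, batch_size=128,
              validation_data=(x_test, y_test))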
For people who have no experience in NLP or ML, I recommend using a TF-IDF vectorizer instead of deep learning libraries. In short, it converts sentences to vectors, mapping each word in the vocabulary to one dimension (whose value reflects the word's occurrence).
Then you can compute the cosine similarity between the resulting vectors.
To improve performance, use the stemming / lemmatizing / stopword tools provided in the NLTK libraries.
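A minimal sketch of that approach, assuming NLTK and scikit-learn; the documents are placeholders:

    # Sketch: NLTK stemming/stopwords + tf-idf + cosine similarity (toy documents).
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))  # requires nltk.download("stopwords")

    def preprocess(text):
        # Lowercase, drop stopwords, and stem each remaining token.
        tokens = word_tokenize(text.lower())  # requires nltk.download("punkt")
        return " ".join(stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop)

    docs = ["known harmful page text ...", "some new page to check ..."]  # placeholders
    tfidf = TfidfVectorizer()
    vectors = tfidf.fit_transform([preprocess(d) for d in docs])

    # Similarity between the new page and the known harmful page.
    print(cosine_similarity(vectors[0], vectors[1]))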

Optimizing Keyword Weights for a Web Crawler

I'm playing around with writing a web crawler that scans for a specific set of keywords and then assigns a global score to each domain it encounters based on a cumulative score I assigned to each keyword (programming=1, clojure=2, javascript=-1, etc...).
I have set up my keyword scoring on a sliding scale of -10 to 10 and I have based my initial values on my own assumptions about what is and is not relevant.
I feel that my scoring model may be flawed, and I would prefer to feed a list of domains that match the criteria I'm trying to capture into an analysis tool and optimize my keyword weights based on some kind of statistical analysis.
What would be an appropriate analysis technique to generate an optimal scoring model for a list of "known good domains"? Is this problem suited to Bayesian learning, Monte Carlo simulation, or some other technique?
So, given a training set of relevant and irrelevant domains, you'd like to build a model which classifies new domains into one of these categories. I assume the features you will be using are the terms appearing in the domains, i.e. this can be framed as a document classification problem.
Generally, you are correct in assuming that letting statistics-based machine learning algorithms do the "scoring" for you works better than assigning manual scores to keywords.
A simple way to approach the problem would be to use Bayesian learning; specifically, Naive Bayes might be a good fit.
After generating a dataset from the domains you've manually tagged (e.g. collecting several pages from each domain and treating each as a document), you can experiment with various algorithms using one of the machine learning frameworks, e.g. WEKA.
A primer on how to handle and load text documents to WEKA can be found here. After the data is loaded, you can use the framework to experiment with various classification algorithms, e.g. Naive Bayes, SVM, etc. Once you've found the method best fitting your needs, you can export the resulting model and use it via WEKA's Java API.
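If you prefer Python over WEKA, a roughly equivalent experiment might look like this sketch (bag-of-words features plus multinomial Naive Bayes in scikit-learn; the documents and labels are made up):

    # Sketch: bag-of-words + multinomial Naive Bayes for relevant/irrelevant domains.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder documents: text collected from manually tagged domains.
    docs   = ["clojure programming blog ...", "celebrity gossip news ...",
              "javascript framework tutorial ...", "sports scores today ..."]
    labels = ["relevant", "irrelevant", "relevant", "irrelevant"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(docs, labels)

    print(clf.predict(["functional programming in clojure"]))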

Research papers classification on the basis of title of the research paper

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers, so I want to know how I should start. I have tried to use tf-idf but could not get useful results. Does someone know of a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know the categories in advance, then it's not classification but clustering. Basically, you need to do the following:
Select an algorithm.
Select and extract features.
Apply the algorithm to the features.
Quite simple. You only need to choose the combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't predict the number of clusters even approximately, other algorithms - such as hierarchical clustering or DBSCAN - may work better for you (see discussion here).
As for features, the words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what it all means); a short sketch follows the links below.
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you have another language of preference, you will most probably be able to find similar libraries for it too.
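A minimal sketch of the tokenize/vectorize/cluster pipeline described above, assuming scikit-learn; the titles and the cluster count are placeholders:

    # Sketch: tf-idf vectorization + k-means clustering of paper titles (toy data).
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    titles = [
        "A survey of computer network protocols",
        "Routing in wireless computer networks",
        "Deep learning for image classification",
        "Convolutional networks for vision tasks",
    ]  # placeholders for the 3 million real titles

    vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # guess the cluster count
    print(kmeans.fit_predict(vectors))  # cluster id for each title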
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. For starting out, you can try a simple word-frequency model (bag of words) and later move on to more complex feature extraction methods (string kernels). You can start by using SVMs (Support Vector Machines) to classify the data with LibSVM (a popular SVM package).
Given that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayes implementation that I might be able to look up for this?
Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrences of words other than the ones that are already mapped, I want to add them to the mappings, hence improving and learning about new words that map to a topic, and also changing the probabilities of words.
How should I go about doing this? Is my approach the right one?
Which programming language would be best suited for the implementation?
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that support document classification using naive Bayes, e.g.:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book (there is also a short sketch after this list).
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
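For the Python/NLTK option, the core of that document-classification recipe looks roughly like this sketch (the topics, documents, and whitespace tokenization are illustrative assumptions):

    # Sketch: NLTK NaiveBayesClassifier with bag-of-words features (toy data).
    import nltk

    train_docs = [
        ("the team won the match last night", "sports"),
        ("the election results were announced", "politics"),
        ("the striker scored two goals", "sports"),
        ("parliament passed a new law", "politics"),
    ]  # placeholder (document, topic) pairs

    def bag_of_words(text):
        # "Bag of words" features: each word present maps to True.
        return {word: True for word in text.split()}

    train_set = [(bag_of_words(doc), topic) for doc, topic in train_docs]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(bag_of_words("a late goal won the match")))
    classifier.show_most_informative_features(5)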
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
