I'm taking my first steps in ML, specifically with classifiers for text sentiment analysis. My approach is the usual split: 80% of the data for training and 20% for testing. Once I have a trained model, what is the best way to proceed in a production environment when new features appear (i.e., words in new texts that were not present in the initial dataset)?
In a classification task, all features must be seen at training time; new features cannot be added later at prediction time. For your problem you can use stemming or lemmatizing, or something like LDA or Word2Vec trained on a large collection of documents.
This chapter could be useful: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
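To illustrate how stemming and lemmatization collapse different surface forms onto a shared token (so that a "new" word at prediction time often maps to a known one), here is a minimal sketch using NLTK; the example words are illustrative and the WordNet corpus is assumed to be downloaded:

    # Minimal stemming / lemmatization sketch with NLTK.
    # Assumes nltk.download('wordnet') has been run beforehand.
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    words = ["running", "runs", "ran", "better"]
    print([stemmer.stem(w) for w in words])                   # e.g. ['run', 'run', 'ran', 'better']
    print([lemmatizer.lemmatize(w, pos="v") for w in words])  # e.g. ['run', 'run', 'run', 'better']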
The problem that you are describing is generally known as "out of vocabulary" (OOV) words that appear in the test set but not in the training set. A traditional approach is to represent each OOV word with a special token, such as "UNKNOWN", and actually have those in the training data. This approach is discussed more fully in Section 4.3 of "Speech and Language Processing" by Jurafsky and Martin.
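A minimal sketch of the "UNKNOWN" token idea follows: rare words in the training data (and any word unseen at prediction time) are mapped to a single placeholder token, so the model has actually seen that token during training. The frequency threshold and token name are illustrative choices:

    # Replace rare / unseen words with an UNKNOWN placeholder token.
    from collections import Counter

    def build_vocab(train_texts, min_count=2):
        counts = Counter(word for text in train_texts for word in text.split())
        return {w for w, c in counts.items() if c >= min_count}

    def replace_oov(text, vocab, unk="UNKNOWN"):
        return " ".join(w if w in vocab else unk for w in text.split())

    train_texts = ["the movie was great", "the movie was awful", "the acting was great"]
    vocab = build_vocab(train_texts, min_count=2)
    print(replace_oov("the plot was great", vocab))  # -> 'the UNKNOWN was great'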
A more modern approach is to use Word2Vec, which learns distributed word representations (embeddings) with a neural network; it is a more advanced topic.
I want to build an RL agent that can determine whether a handwritten word was written by the legitimate user or not. The plan is as follows:
Let's say I have written a word 10 times and extracted some geometrical properties from each sample to use as features. I then train an RL agent to make the decision based on the differences between the geometrical properties of a new handwritten sample and those of the 10 old ones. A reward is assigned for correct identification, and nothing (or a negative reward) for an incorrect one.
Am I going in the right direction, or am I missing anything vital? Is it possible to train the agent with only 10 samples? Actually, as a new student of RL, I am confused about its use cases: is it best suited to game playing and robotics problems, or is it also suitable for making predictions from training data?
Reinforcement learning is about decisions made over time. If you were following the stroke of the pen over time, to find out which way it was going, that would be more in reinforcement learning's wheelhouse. The time dimension (a series of states) is why it's used in games like StarCraft II.
You are talking about taking a picture of the written text and ultimately classifying it into a boolean (genuine or not). You are more likely looking for convolutional neural networks to solve your problem (those types of algorithms are good for images).
Eventually you won't be able to tell: there are techniques using GANs (Generative Adversarial Networks) that can train against your discriminator, eventually figure out the pattern it is looking for, and fool it. But this sounds fine as a homework problem.
My knowledge of neural networks is very basic but here is my goal:
Given a set of short inputs (one-word strings and numbers), I want the trained network to generate a paragraph of text related to the input data.
I've messed with RNNs before to do basic natural language generation but never based on a given input.
(I played around with https://github.com/karpathy/char-rnn for example)
There is so much information out there I'm not sure what sort of model I should be using or where to start.
The question is too broad to cover in a single answer, but I have tried to mention a few things that will help you continue your research in this area.
What is text-generation?
The problem you mentioned is generally known in the literature as text generation. Given a piece of text (e.g., a sequence of characters, words, or paragraphs), the model tries to complete the rest of the text. The better your model is, the better the semantic and syntactic structure of the generated text will be.
Text generation itself is a type of language modeling problem. Language modeling is the core problem for many natural language processing (NLP) tasks. A trained language model learns the likelihood of occurrence of a word based on the previous sequence of words used in the text. What does that mean? For instance, in the sentence A cat sits on the ..., the probability that the next word will be mat is larger than the probability that it will be water. This simple idea is the main intuition behind language modeling. See chapter 4 of this book for a thorough explanation of this topic.
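To make that intuition concrete, here is a tiny bigram count sketch (the toy corpus is made up for illustration); a real language model is of course trained on far more text and handles unseen contexts more gracefully:

    # Estimate the probability of the next word from bigram counts.
    from collections import Counter

    corpus = "a cat sits on the mat . a dog sits on the floor .".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])

    def next_word_prob(context, word):
        return bigrams[(context, word)] / context_counts[context]

    print(next_word_prob("the", "mat"))    # 0.5  ("the" is followed by "mat" half the time)
    print(next_word_prob("the", "water"))  # 0.0  ("the water" never occurs in the corpus)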
Different kinds of language modeling:
Different kinds of approaches have been proposed for language modeling, mostly categorized into statistical and neural language models. For a comparison between these two approaches, take a look at this blog post.
Recently, the use of neural networks has become the dominant approach to building language models because:
"Nonlinear neural network models solve some of the shortcomings of traditional language models: they allow conditioning on increasingly large context sizes with only a linear increase in the number of parameters, they alleviate the need for manually designing backoff orders, and they support generalization across different contexts."
(Page 109, Neural Network Methods in Natural Language Processing, 2017.)
Different kinds of neural networks for language modeling:
A number of neural network architectures have been proposed for language modeling, using recurrent neural networks, feedforward neural networks, convolutional neural networks, etc., each with their own pros and cons. According to here, the state-of-the-art benchmarks are achieved by RNN-based models.
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far. Visit here for further details on RNNs.
How to implement RNN for text-generation?
See the official TensorFlow example here.
I would suggest you start with some toy examples, like:
https://medium.com/phrasee/neural-text-generation-generating-text-using-conditional-language-models-a37b69c7cd4b
https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
Natural text generation is a complex task. It can be done with an n-gram approach or with RNNs (as you mentioned); you can find out how in the links above.
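For a sense of what the RNN route looks like in code, here is a minimal character-level language model sketch in Keras, in the spirit of the char-rnn repository linked in the question. The toy corpus, sequence length, and hyperparameters are illustrative only; a real model needs far more data and training, and conditioning on your one-word/number inputs would require extending the model (e.g., feeding them as an extra input):

    # Minimal character-level RNN language model (toy example).
    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    text = "hello world. hello machine learning. " * 50   # toy corpus
    chars = sorted(set(text))
    char2idx = {c: i for i, c in enumerate(chars)}

    seq_len = 10
    X, y = [], []
    for i in range(len(text) - seq_len):
        X.append([char2idx[c] for c in text[i:i + seq_len]])
        y.append(char2idx[text[i + seq_len]])

    X = np.eye(len(chars))[np.array(X)]   # one-hot: (samples, seq_len, vocab)
    y = np.eye(len(chars))[np.array(y)]   # one-hot: (samples, vocab)

    model = Sequential([
        LSTM(64, input_shape=(seq_len, len(chars))),
        Dense(len(chars), activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    model.fit(X, y, epochs=5, batch_size=64)

    # Generate text by repeatedly predicting the most likely next character.
    seed = "hello worl"
    for _ in range(20):
        x = np.eye(len(chars))[[char2idx[c] for c in seed[-seq_len:]]][np.newaxis]
        seed += chars[int(model.predict(x, verbose=0).argmax())]
    print(seed)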
I am trying to write a Python program that determines whether a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences for each word.
The result for harmful websites is a key-value dictionary like:
{ word : [ # of occurrences in harmful websites, # of websites that contain the word ] }
Now I want my program to analyze the words from any website to check whether the website is safe or not, but I don't know which methods will suit my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of the website's text itself (a text document) and its label (harmful or safe).
You could certainly use an RNN, but there are also other, much faster natural language processing techniques.
Typically, you should apply a proper vectorizer to your training data (think of each site page as a text document), for example tf-idf (there are other possibilities too; if you use Python, I would strongly suggest scikit-learn, which provides lots of useful machine learning techniques and already includes the mentioned TfidfVectorizer). The point is to vectorize your text documents in a way that accounts for uninformative terms. Imagine, for example, how many times the English word "the" typically appears in a text; you need to account for biases such as this.
Once your training data is vectorized, you can use, for example, a stochastic gradient descent (SGD) classifier and see how it performs on your test data (in machine learning terminology, the test data simply means new examples held out so you can check what your ML program outputs).
In either case, you will need to experiment with the above options. There are many nuances, and you need to test on your data to see where you achieve the best results (depending on the algorithm settings, the type of vectorizer, the ML technique itself, and so on). For example, Support Vector Machines are also a great choice for binary classification; you may want to try them as well and see whether they perform better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best-fitting classifier. On your journey to find the best one, you may also want to use cross-validation to determine how well your classifier behaves; again, this is already included in scikit-learn.
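A minimal sketch of how these pieces fit together in scikit-learn (TF-IDF vectorization, an SGD-based linear classifier, and cross-validation); the documents and labels below are placeholders standing in for your scraped website text and your own labels:

    # TF-IDF + SGD classifier + cross-validation (placeholder data).
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    docs = [
        "explicit adult content page text",      # placeholder harmful example
        "gambling and adult offers page text",   # placeholder harmful example
        "adult videos and explicit material",    # placeholder harmful example
        "news article about local elections",    # placeholder safe example
        "cooking recipes and kitchen tips",      # placeholder safe example
        "school homework help and tutorials",    # placeholder safe example
    ]
    labels = [1, 1, 1, 0, 0, 0]                  # 1 = harmful, 0 = safe

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), SGDClassifier())
    scores = cross_val_score(clf, docs, labels, cv=3)
    print(scores.mean())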
N.B. Don't forget about edge cases. For example, there may be a completely safe online magazine that only mentions a harmful topic in some article; that doesn't mean the website itself is harmful.
Edit: Come to think of it, if you don't have any experience with ML at all, it could be useful to take an online course, because beyond knowing the APIs and libraries you still need to know what they do and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification, and it is usually done with recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). This is not an easy topic to start machine learning with. If you are new, you should first have a look at linear/logistic regression, SVMs, and basic neural networks (MLPs); otherwise it will be hard to understand what is going on.
That said, there are many libraries out there for constructing neural networks. Probably the easiest to use is Keras. While this library simplifies a lot of things immensely, it isn't a magic box that turns trash into gold; you need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically, determining whether a movie review is positive or not) with Keras.
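For reference, a minimal sketch of that kind of IMDB sentiment model in Keras is shown below; the vocabulary size, sequence length, and layer sizes are illustrative rather than tuned:

    # Minimal IMDB sentiment classification with an LSTM in Keras.
    from tensorflow.keras.datasets import imdb
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    vocab_size, max_len = 10000, 200
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
    x_train = pad_sequences(x_train, maxlen=max_len)
    x_test = pad_sequences(x_test, maxlen=max_len)

    model = Sequential([
        Embedding(vocab_size, 32),       # learn a 32-dim vector per word index
        LSTM(32),                        # summarize the review into one vector
        Dense(1, activation="sigmoid"),  # probability that the review is positive
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=2, batch_size=128,
              validation_data=(x_test, y_test))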
For people who have no experience in NLP or ML, I recommend using a TF-IDF vectorizer instead of deep learning libraries. In short, it converts sentences to vectors, mapping each word in the vocabulary to one dimension (whose weight reflects the word's occurrence).
Then you can calculate the cosine similarity between the resulting vectors.
To improve performance, use the stemming / lemmatizing / stopword functionality provided by the NLTK library.
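A minimal sketch of the TF-IDF plus cosine-similarity idea with scikit-learn; the example sentences are placeholders:

    # TF-IDF vectors and cosine similarity between documents.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "this movie was fantastic and I loved it",
        "a fantastic film, I really loved it",
        "the food at that restaurant tasted terrible",
    ]
    vectors = TfidfVectorizer().fit_transform(sentences)
    print(cosine_similarity(vectors[0], vectors[1]))  # similar sentences -> higher score
    print(cosine_similarity(vectors[0], vectors[2]))  # unrelated sentences -> lower score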
What are the fundamental criteria for choosing supervised or unsupervised learning?
When is one better than the other?
Are there specific cases where you can only use one of them?
Thanks
If you have a labeled dataset you can use both. If you have no labels, you can only use unsupervised learning.
It's not a question of "better"; it's a question of what you want to achieve. E.g. clustering data is usually unsupervised: you want the algorithm to tell you how your data is structured. Categorizing is supervised, since you need to teach your algorithm what is what in order to make predictions on unseen data.
See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
It depends on your needs. If you have a set of existing data including the target values that you wish to predict (labels), then you probably need supervised learning (e.g. is something true or false, or does this data represent a fish, a cat, or a dog? Simply put, you already have examples of right answers and you are just telling the algorithm what to predict). You also need to distinguish whether you need classification or regression. Classification is when you need to categorize the predicted values into given classes (e.g. is it likely that this person develops diabetes, yes or no? In other words, discrete values), and regression is when you need to predict continuous values (1, 2, 4.56, 12.99, 23, etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive Bayes, SVM, ridge regression, ...).
On the contrary, use unsupervised learning if you don't have the labels (or target values). You're simply trying to identify clusters in the data as it comes (e.g. k-means, DBSCAN, spectral clustering, ...).
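A small sketch of the practical difference in scikit-learn: a supervised classifier is trained on features plus labels, while an unsupervised algorithm only sees the features. The toy data is made up for illustration:

    # Supervised (needs labels) vs. unsupervised (no labels) in scikit-learn.
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans

    X = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]  # features
    y = [0, 0, 1, 1]                                       # labels, only used by the classifier

    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)    # supervised: fit(X, y)
    print(clf.predict([[1.05, 1.0]]))                      # -> [0]

    km = KMeans(n_clusters=2, n_init=10).fit(X)            # unsupervised: fit(X) only
    print(km.labels_)                                      # cluster assignment per sample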
So it depends; there is no exact answer, but generally speaking you need to:
Collect and inspect your data. You need to know your data, and only then decide which way to go and which algorithm will best suit your needs.
Train your algorithm. Be sure to have clean, good data, and bear in mind that in the case of unsupervised learning there are no target values, so you effectively test your algorithm right away.
Test your algorithm. Run it and see how well it behaves. In the case of supervised learning, you can hold out some of the training data to evaluate how well your algorithm is doing.
There are many books online about machine learning and many online lectures on the topic as well.
It depends on the dataset that you have.
If you have the target feature at hand, then you should go for supervised learning. If you don't, then it is an unsupervised problem.
Supervised learning is like teaching the model with examples. Unsupervised learning is mainly used to group similar data; it also plays a major role in feature engineering.
Thank you..
Sorry if my question sounds too naive... I am really new to machine learning and regression.
I have recently joined a machine learning lab as a master's student. My professor wants me to write the "experiments and analysis" section of a paper the lab is about to submit about a regression algorithm that they have developed.
The problem is I don't know what I have to do. He said the algorithm is stable and complete, they have written the first part of the paper, and I need to write the evaluation part.
I really don't know what to do. I have participated in coding the algorithm and I understand it pretty well, but I don't know what steps I must take to evaluate and analyze its performance.
- Where do I get data?
- What is the testing process?
- What analyses need to be done?
I am new to research and paper writing and really don't know what to do.
I have read a lot of papers recently, but I have no experience in analyzing ML algorithms.
Could you please guide me and explain the process (at a newbie level)?
Detailed answers are appreciated.
Thanks.
You will need a test dataset to evaluate the performance. If you don't have one, divide your training dataset (the one you're currently running this algorithm on) into a training set and a cross-validation set (non-overlapping).
Create the test set by stripping out the target values (y values) from the cross-validation set.
Run the algorithm on the training dataset to train the model.
Once your model is trained, test its performance using the held-out test set.
To evaluate the performance, you can use the RMSE (Root Mean Squared Error) metric. You will need the predictions that your algorithm made for each sample in the test set and their corresponding actual values (the ones you stripped off earlier). You can find more information here.
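A minimal sketch of that workflow with scikit-learn is shown below; the synthetic dataset and the LinearRegression model are placeholders standing in for the lab's own data and algorithm:

    # Train/test split and RMSE evaluation (placeholder data and model).
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)    # stand-in for your own algorithm
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # Root Mean Squared Error
    print(rmse)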
Machine learning model evaluation
Take a look at this paper. It has been written for people without a computer science background, so it should be fairly easy to follow. It covers:
model evaluation workflow
holdout validation
cross-validation
k-fold cross-validation
stratified k-fold cross-validation
leave-one-out cross-validation
leave-p-out cross-validation
leave-one-group-out cross-validation
nested cross-validation
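As a small illustration of one of these, here is a k-fold cross-validation sketch with scikit-learn; the iris dataset and logistic regression model are placeholders:

    # 5-fold cross-validation on a placeholder dataset and model.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores, scores.mean())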