How to find the main word or substring in a sentence using neural networks? - machine-learning

I'm quite confused. Is it possible to find the main word or substring in a sentence (with the help of a training set)? I'm parsing vacancies and trying to build a text-mining app that could guess which skills are mentioned in the text. Yes, maybe this is a task for some kind of global text search with a skills dictionary, but I'm really curious: can a NN help?
As you have guessed, I'm a newbie at ML.

Word2Vec is a basic application of neural networks that creates numerical representations of words, which you could use to build a smarter interpretation of your sentences.
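For instance, here is a minimal gensim sketch (the toy sentences are placeholders for real vacancy text; in practice you would train on your whole corpus):

```python
# Minimal Word2Vec sketch with gensim (gensim >= 4.0 uses vector_size;
# older versions call the same parameter size).
from gensim.models import Word2Vec

sentences = [
    ["experience", "with", "python", "and", "sql"],
    ["strong", "python", "and", "machine", "learning", "skills"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

print(model.wv["python"])               # the learned vector for "python"
print(model.wv.most_similar("python"))  # nearest words in vector space
```

Words that appear in similar contexts end up with similar vectors, which is what lets you compare candidate skill words numerically.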
More interestingly, an LSTM can handle context and identify key words in sentences, as in this paper: http://www.clsp.jhu.edu/~guoguo/papers/icassp2015_myhotword.pdf . The paper is about identifying key words in sentences to allow for faster, more useful applications of voice-recognition software. Here is the GitHub repository: https://github.com/MajerMartin/lstm-dtw-keyword-spotting . It's too much to explain in this post, but that should keep you busy and get you started training a neural network for keyword identification.

Short answer: no, NNs cannot help.
Long answer: maybe they can, if you really, REALLY want them to and have tons of time and skill.
The problem is that neural networks are built to handle numbers, not words.
Most types of neural networks rely on being able to decide whether two values are close to each other. This is still not easy with strings in a language context.
So if you don't want to spend the next few years researching neural networks, I would look for a different approach ;)

Related

Is there any model/classifier that works best for NLP-based projects like this?

I've written a program to analyze a given piece of text from a website and draw conclusions about its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real time) and uses a few values derived from that as features for its decisions. There are some more features, like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than incorporating a more sophisticated model. I tried using an MLP, but no set of hyperparameters seems to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve on this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
You can search with the keyword NLP. The task you are facing is a hot topic among those who study deep learning, and it is called natural language processing.
RandomForest is a machine learning algorithm, and it probably works quite well. Using other machine learning algorithms might improve your accuracy, or maybe not. If you want to try out other lightweight machine learning algorithms, that's fine.
Deep learning will most likely outperform your current model. Starting with the keyword NLP, you'll find many models, such as Word2Vec, BERT, and so on. You can find implementations on GitHub.
One tip: think carefully about whether you can actually train the model. Trying to train BERT from scratch is a crazy thing to do for a starter, and even for an expert. Instead, take a pretrained model and fine-tune it, or just use the word vectors, as in the sketch below.
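A hedged sketch of the "just use the word vectors" option, using gensim's pretrained-vector downloader (the model name, texts, and labels here are illustrative placeholders):

```python
# Average pretrained GloVe vectors as features for a classic classifier.
import numpy as np
import gensim.downloader as api
from sklearn.ensemble import RandomForestClassifier

wv = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe model

def embed(text):
    # Average the vectors of the words the model knows; zeros otherwise.
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

texts = ["trusted news site", "scam giveaway page"]  # placeholder descriptions
labels = [1, 0]
X = np.array([embed(t) for t in texts])

clf = RandomForestClassifier().fit(X, labels)
```

This keeps your existing RandomForest setup but swaps hand-counted keywords for dense word-vector features.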
I hope that this works out.

What type of neural network should I be using to generate a paragraph based on some input?

My knowledge of neural networks is very basic but here is my goal:
Given a set of short inputs (one-word strings and numbers), I want the trained network to generate a paragraph of text related to the input data.
I've messed with RNNs before to do basic natural language generation but never based on a given input.
(I played around with https://github.com/karpathy/char-rnn for example)
There is so much information out there I'm not sure what sort of model I should be using or where to start.
The question is too broad to cover in a single answer, but here are a few things that should help you continue your research in this area.
What is text-generation?
The problem you mentioned is mainly known as text-generation in the literature. Given a piece of text (e.g., a sequence of characters, words, or paragraphs), the model tries to complete the rest of the text. The better your model is, the better the semantic and syntactic structure of the generated text will be.
Text generation itself is a type of language modelling problem. Language modelling is the core problem for many natural language processing (NLP) tasks. A trained language model learns the likelihood of a word occurring based on the previous sequence of words in the text. What does that mean? For instance, in the sentence A cat sits on the ..., the probability that the next word will be mat is higher than the probability that it will be water. This simple idea is the main intuition behind language modeling. See chapter 4 of this book for a thorough explanation of this topic.
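To make the intuition concrete, here is a toy bigram language model in plain Python (the corpus is a placeholder; real language models train on far more text):

```python
# Count word pairs and turn the counts into next-word probabilities.
from collections import Counter, defaultdict

corpus = "a cat sits on the mat . a dog sits on the mat .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def prob(prev, nxt):
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

print(prob("the", "mat"))    # high: "mat" often follows "the" here
print(prob("the", "water"))  # zero: never observed in this corpus
```

Neural language models learn the same kind of conditional distribution, but generalize to contexts they have never seen verbatim.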
Different kinds of language modeling:
Different approaches have been proposed for language modeling, mostly categorized into statistical and neural language models. For a comparison between these two approaches, take a look at this blog post.
Recently, the use of neural networks in language modeling has become dominant because:
Nonlinear neural network models solve some of the shortcomings of traditional language models: they allow conditioning on increasingly large context sizes with only a linear increase in the number of parameters, they alleviate the need for manually designing backoff orders, and they support generalization across different contexts.
Page 109, Neural Network Methods in Natural Language Processing, 2017.
Different kinds of Neural Network for language modeling:
A number of neural network architectures have been proposed for language modeling, using recurrent neural networks, feedforward neural networks, convolutional neural networks, etc., each with its own pros and cons. According to here, the state-of-the-art benchmarks are achieved by RNN models.
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been computed so far. Visit here for further details on RNNs.
How to implement RNN for text-generation?
See the official example in TensorFlow here.
I would suggest you start with some toy examples, like:
https://medium.com/phrasee/neural-text-generation-generating-text-using-conditional-language-models-a37b69c7cd4b
https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
Natural text generation is a complex task. It can be done with an n-gram approach or with RNNs (as you mentioned); you can find out how from the links above and the sketch below.
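As a starting point, here is a minimal character-level LSTM sketch in Keras (the corpus, sequence length, and layer sizes are illustrative, not tuned):

```python
# Train an LSTM to predict the next character from the previous seq_len ones.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

corpus = "the cat sits on the mat. the dog sits on the rug. " * 20
chars = sorted(set(corpus))
char_to_idx = {c: i for i, c in enumerate(chars)}

seq_len = 10
X, y = [], []
for i in range(len(corpus) - seq_len):
    X.append([char_to_idx[c] for c in corpus[i:i + seq_len]])
    y.append(char_to_idx[corpus[i + seq_len]])

# One-hot encode the input windows: (samples, seq_len, vocabulary size).
X_onehot = np.zeros((len(X), seq_len, len(chars)), dtype=np.float32)
for i, seq in enumerate(X):
    for t, idx in enumerate(seq):
        X_onehot[i, t, idx] = 1.0

model = Sequential([
    LSTM(64, input_shape=(seq_len, len(chars))),
    Dense(len(chars), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X_onehot, np.array(y), epochs=5, batch_size=32)
```

To condition the output on your one-word inputs, you would prepend them to the seed sequence (or feed them through a separate encoder), which is what the conditional-language-model post above covers.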

Simple machine learning for website classification

I am trying to generate a Python program that determines if a website is harmful (porn etc.).
First, I made a Python web-scraping program that counts the number of occurrences of each word.
The result for harmful websites is a key-value dictionary like
{ word : [ # occurrences in harmful websites, # of websites that contain this word ] }.
Now I want my program to analyze the words from any website to check whether the site is safe or not. But I don't know which methods will suit my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of the website's data itself (a text document) and its label (harmful or safe).
You can certainly use an RNN, but there are also other natural language processing techniques, and much faster ones.
Typically, you should apply a proper vectorizer to your training data (think of each site page as a text document), for example tf-idf (there are other possibilities too; if you use Python I would strongly suggest scikit-learn, which provides lots of useful machine learning techniques, including the mentioned TfidfVectorizer). The point is to vectorize your text documents in an informed way. Consider, for example, how often the English word the occurs in typical text; you need to account for biases such as this, and tf-idf down-weights words that appear in most documents.
Once your training data is vectorized, you can use, for example, a stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology, test data simply means new data examples used to check what your ML program outputs).
In either case you will need to experiment with the above options. There are many nuances, and you need to test your data and see where you achieve the best results (depending on the ML algorithm's settings, the type of vectorizer, the ML technique itself, and so on). For example, Support Vector Machines are a great choice when it comes to binary classification too. You may want to play with that as well and see if it performs better than SGD.
In any case, remember that you will need quality training data with labels (harmful vs. safe) and to find the best-fitting classifier. On your journey to find the best one, you may also want to use cross-validation to determine how well your classifier behaves. Again, it's already included in scikit-learn.
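A minimal sketch of that pipeline with scikit-learn (the documents and labels are placeholders, not real data):

```python
# TF-IDF vectorizer + SGD classifier, evaluated with cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

docs = ["some page text ...", "more page text ...",
        "other page text ...", "yet more text ..."]  # one string per page
labels = [1, 1, 0, 0]                                # 1 = harmful, 0 = safe

clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("sgd", SGDClassifier(loss="hinge")),  # hinge loss = linear SVM fit by SGD
])

scores = cross_val_score(clf, docs, labels, cv=2)
print(scores.mean())
```

Swapping SGDClassifier for sklearn.svm.LinearSVC is a one-line change, which makes the experimentation described above cheap.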
N.B. Don't forget about legitimate cases. For example, there may be a completely safe online magazine that only mentions a harmful topic in some article; that doesn't mean the website itself is harmful.
Edit: Come to think of it, if you don't have any experience with ML at all, it could be useful to take an online course, because beyond knowing the API and libraries you will still need to understand what it does and the math behind the curtain (at least roughly).
What you are trying to do is a text classification task (closely related to sentiment classification) and is often done with recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). This is not an easy topic to start machine learning with. If you are new, you should have a look at linear/logistic regression, SVMs, and basic neural networks (MLPs) first. Otherwise it will be hard to understand what is going on.
That said, there are many libraries out there for constructing neural networks. Probably the easiest to use is Keras. While this library simplifies a lot of things immensely, it isn't a magic box that makes gold from trash. You need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically, determining whether a movie review is positive or not) with Keras.
For people who have no experience in NLP or ML, I recommend using a TF-IDF vectorizer instead of deep learning libraries. In short, it converts sentences to vectors, mapping each word in the vocabulary to one dimension (the value reflects how often the word occurs, down-weighted by how common it is across documents).
Then, you can calculate cosine similarity between the resulting vectors.
To improve performance, use the stemming / lemmatizing / stopword tools provided in the NLTK library.
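For example, a small sketch with scikit-learn (the sentences are placeholders):

```python
# TF-IDF vectors plus pairwise cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "this site reviews family movies",
    "this site streams family movies",
    "buy cheap pills online now",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
print(cosine_similarity(vectors))  # similar pages score close to 1.0
```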

Training a neural network with complex-valued weights (complex-valued initial weights, real-valued inputs)

As the question states, I am aiming to train a neural network whose weights are complex numbers. Using the default scikit-learn networks and building on them (editing the source code), the main problem I have encountered is that the optimization functions scikit-learn takes from scipy only support numerical optimization of functions whose inputs are real numbers.
Scikit-learn seems rather poor for neural networks, especially if you wish to fork and edit it; the structure is rather inflexible.
As I have noticed, and read in a paper here, I need to change things such as the error function to ensure that at the top level the error remains in the domain of real numbers, or the problem becomes ill-defined.
My question is: are there any standard libraries that already do this? Or any easy tweaks I could make to Lasagne or TensorFlow to save my life?
P.S.: Sorry for not posting any working code. It is a difficult question to format to Stack Overflow standards, and I admit it may be off-topic, in which case I apologize.
The easiest way to do this is to split each feature into its real and imaginary components. I've done similar work with vector input from a Leap Motion, and it significantly simplifies things if you split vectors into their component axes.
TensorFlow has elementary complex-number support.
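For illustration, a minimal TensorFlow 2 sketch of the real/imaginary split, keeping the loss real-valued as the paper above requires (shapes and names are illustrative):

```python
# Represent one complex weight matrix as two real-valued variables,
# combine them with tf.complex, and keep the loss real via |y|^2.
import tensorflow as tf

w_real = tf.Variable(tf.random.normal([4, 3]))
w_imag = tf.Variable(tf.random.normal([4, 3]))
x = tf.complex(tf.random.normal([2, 4]), tf.zeros([2, 4]))  # real inputs

with tf.GradientTape() as tape:
    w = tf.complex(w_real, w_imag)          # complex64 weights
    y = tf.matmul(x, w)                     # complex forward pass
    loss = tf.reduce_mean(tf.abs(y) ** 2)   # real-valued loss

# Gradients flow back to the underlying real-valued variables.
grads = tape.gradient(loss, [w_real, w_imag])
```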
If you have to build the neural network nodes yourself, you can take a glance at this blog.
For holomorphic functions, complex backpropagation is fairly straightforward.
For non-holomorphic functions, it needs careful treatment.

Research papers classification on the basis of title of the research paper

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers, so I want to know how I should start. I have tried tf-idf but could not get useful results. Does someone know about a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know the categories in advance, then it's not classification but clustering. Basically, you need to do the following:
Select an algorithm.
Select and extract features.
Apply the algorithm to the features.
Quite simple. You only need to choose the combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess at the number of clusters. If you can't predict the number of clusters even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see the discussion here).
As for features, the words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what all that means).
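A minimal sketch of that flow with scikit-learn (the titles and cluster count are placeholders):

```python
# Vectorize titles with TF-IDF, then cluster them with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = [
    "a survey of computer network protocols",
    "routing in wireless computer networks",
    "deep learning for image classification",
    "convolutional networks for vision tasks",
]

X = TfidfVectorizer(stop_words="english").fit_transform(titles)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster index assigned to each title
```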
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you prefer another language, you will most probably be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you can try a simple word-frequency model (bag of words) and later move on to more complex feature-extraction methods (string kernels). You can start by classifying the data with SVMs (Support Vector Machines) using LibSVM (a widely used SVM library).
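For example, a brief bag-of-words + SVM sketch with scikit-learn, whose SVC wraps libsvm (the titles and labels are placeholders):

```python
# Count word occurrences per title, then fit a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

titles = ["routing in computer networks", "deep learning for vision",
          "network congestion control", "image classification models"]
labels = ["computer network", "machine learning",
          "computer network", "machine learning"]

vec = CountVectorizer()
X = vec.fit_transform(titles)
clf = SVC(kernel="linear").fit(X, labels)

print(clf.predict(vec.transform(["wireless network protocols"])))
```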
Given that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.
