Looking for a working example Colab/Notebook showing training or fine-tuning of a text generation model capable of converting "short text" -> "programming code text".
I'm learning the topic and would like to fine-tune it with a custom metric on some public GitHub repos.
All I have found so far are models that "continue a sentence" or simply generate text out of the blue. Many thanks!
First, take a look at CodeXGLUE and its repository; it covers four categories:
code-code (clone detection, defect detection, cloze test, code completion, code repair, and code-to-code translation)
text-code (natural language code search, text-to-code generation)
code-text (code summarization)
text-text (documentation translation)
You want the text-to-code generation task. Based on the CodeXGLUE benchmark, one of the best models for this task is CoTexT. CoTexT supports these programming languages: "go", "java", "javascript", "php", "python", "ruby". You can find the pre-trained model on huggingface here and an explanation of how to fine-tune it here.
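For a quick start, loading the pre-trained checkpoint with the transformers library might look roughly like the sketch below; the model id is an assumption, so check the model card on the Hub for the exact checkpoint name and the expected prompt format.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "razent/cotext-1-ccg"  # assumed checkpoint id, verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# CoTexT is T5-based, so text-to-code is plain sequence-to-sequence generation
prompt = "python: return the sum of two numbers a and b"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))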
I am building a tool from scratch that takes a sample of text and turns it into a list of categories. I am not using any libraries for this at the moment, but I am interested if anyone has experience in this territory, as the hardest part I'm struggling with is building sentiment into the search. It's easy to match words, but sentiment is much more challenging.
The goal would be to take something like this paragraph;
"Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"
and turn it into
categories = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
If possible I'd like to end up adding a filter for negative sentiment so that if the text said;
"I hate cooking"
'cooking' wouldn't be included in the categories.
Any help is greatly appreciated. TIA 👍
You seem to have at least two tasks: 1. Sequence classification by topics; 2. Sentiment analysis. [Edit, I only noticed now that you are using Ruby/Rails, but the code below is in Python. But maybe this answer is still useful for some people and the steps can be applied in any language.]
1. For sequence classification by topics, you can simply define categories with a list of words, as you said. Depending on the use case, this might be the easiest option. If that list of words would be too time-intensive to create, you can use a pre-trained zero-shot classifier instead. I would recommend the zero-shot classifier from HuggingFace; see details with code here.
Applied to your use-case, this would look like this:
# pip install transformers # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence = ["Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"]
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
classifier(sequence, candidate_labels, multi_class=True)
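# note: in newer versions of transformers the argument is called multi_label instead of multi_class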
# output:
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}
The classifier returns scores depending on how certain it is that each candidate_label is represented in your sequence. It doesn't catch everything, but it works quite well and is quick to put into practice.
2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:
classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)
# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]
Putting 1. and 2. together:
I would probably (a) first take your entire text and split it into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those that have a high negative sentiment score (see step 2. above); and then (c) run your labeling/topic classification on the remaining sentences (see 1. above).
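Putting it all together, a rough sketch could look like this; sentence splitting via NLTK is just one option, and the 0.9 negativity cut-off and 0.5 label threshold are arbitrary assumptions you would tune:
import nltk
from transformers import pipeline

nltk.download("punkt", quiet=True)

sentiment = pipeline("sentiment-analysis")
topic_classifier = pipeline("zero-shot-classification")
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']

text = "I hate cooking. Whenever I am out walking with my son, I like to take portrait photographs of him."

sentences = nltk.sent_tokenize(text)                      # (a) split into sentences
kept = [s for s in sentences
        if not (sentiment(s)[0]["label"] == "NEGATIVE"    # (b) drop strongly negative sentences
                and sentiment(s)[0]["score"] > 0.9)]

for s in kept:                                            # (c) label the remaining sentences
    result = topic_classifier(s, candidate_labels, multi_label=True)
    labels = [l for l, sc in zip(result["labels"], result["scores"]) if sc > 0.5]
    print(s, "->", labels)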
I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If the incoming text data doesn't fit any label, it is tagged as 'Other'. In the example below, I built a text classifier to classify text data as 'breakfast' or 'italian'. In the test scenario, I included a couple of text samples that do not fit into the labels that I used for training. This is the challenge I'm facing. Ideally, I want the model to say 'Other' for 'i like hiking' and 'everyone should understand maths'. How can I do this?
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])

y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]

X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])
classifier.fit(X_train, y_train_text)
predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)
['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]
I consider the 'Other' category as noise and I cannot model this category.
I think Kalsi might have suggested this, but it was not clear to me. You could define a confidence threshold for your classes. If the predicted probability does not reach the threshold for any of your classes ('italian' and 'breakfast' in your example), you were not able to classify the sample, yielding the 'other' "class".
I say "class" because "other" is not exactly a class. You probably don't want your classifier to be good at predicting "other", so this confidence threshold might be a good approach.
You cannot do that.
You have trained the model to predict only two labels, i.e. breakfast or italian, so the model doesn't have any idea about a third label, a fourth, etc.
You and I know that "i like hiking" is neither breakfast nor italian. But how would a model know that? It only knows breakfast & italian. So there has to be a way to tell the model: if you get confused between breakfast & italian, then predict the label as other.
You can achieve this by training a model that has other as a label, with some texts like "i like hiking", etc.
But in your case, a little hack can be done as follows.
So what does it mean when a model predicts a label with 0.5 probability (or approximately 0.5)? It means that the model is getting confused between the labels breakfast and italian. So here you can take advantage of this.
You can take all the predicted probability values & assign the label other if the probability value is between 0.45 & 0.55. In this way you can predict the other label (obviously with some errors) without letting the model know there is a label called other.
You can try setting class priors when creating the MultinomialNB. You could create a dummy "Other" training example and then set the prior high enough for "Other" so that instances default to "Other" when there isn't enough evidence to select the other classes.
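A sketch of that idea, reusing the training data from the question (the dummy text and the prior values are illustrative assumptions, not tuned):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# add one dummy "other" example so the class exists at all
X_train_other = np.append(X_train, ["placeholder text for the other class"])
y_train_other = y_train_text + ["other"]

# class_prior order follows the sorted class labels: breakfast, italian, other
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(class_prior=[0.3, 0.3, 0.4]))])
classifier.fit(X_train_other, y_train_other)
print(classifier.predict(X_test))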
No, you cannot do that.
You have to define a third category, "other" or whatever name suits you, and give your model some data for that category. Make sure that the number of training examples for all three categories is roughly equal; otherwise "other", being a very broad category, could skew your model towards the "other" category.
Another way to approach this is to extract noun phrases from all your sentences for the different categories, including other, and feed those into the model; consider this a feature selection step for your machine learning model. In this way the noise added by irrelevant words is removed, giving better performance than tf-idf alone.
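The noun-phrase idea could be sketched with spaCy (just one possible chunker; any noun-phrase extractor works), feeding the extracted chunks to the vectorizer instead of the raw sentences:
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def noun_phrases(text):
    # keep only the noun chunks, dropping the rest of the sentence
    return " ".join(chunk.text for chunk in nlp(text).noun_chunks)

X_train_np = [noun_phrases(t) for t in X_train]  # then fit the pipeline on X_train_np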
If you have huge data, go for deep learning models, which do feature selection automatically.
Don't go with the approach of manipulating probabilities yourself: a 50-50% probability means that the model is confused between the two classes which you have defined; it has no idea about a third "other" class.
Let's say the sentence is "I want italian breakfast". The model will be confused whether this sentence belongs to the "italian" or the "breakfast" category, but that doesn't mean it belongs to the "other" category.
The example training exercise labels single-term names after tokenizing with something like a simple split(' ').
I need to train for and recognize names that include spaces. How do I train the recognizer?
Example: "I saw a Big Red Apple Tree." -- How would I tokenize for training and then recognize "Big Red Apple Tree" instead of recognizing four separate words?
Will this work for the training data?
I\tO
saw\tO
a\tO
Big Red Apple Tree\tMyName
.\tO
Would the output from the recognizer look the same as that?
The training section in the FAQ says "The training file parser isn't very forgiving: You should make sure each line consists of solely content fields and tab characters. Spaces don't work."
The problem you are trying to solve belongs to phrase identification. There are different ways in which you can tag the words; for example, you can tag the words with IOB tags. Train the Stanford NER model on this newly created data, then write a post-processing step to concatenate the predicted tokens (a sketch of that step follows the example below).
For Example :
your training data should look like this:
I\tO
saw\tO
a\tO
Big\tB-MyName
Red\tI-MyName
Apple\tI-MyName
Tree\tI-MyName
.\tO
So basically, you are using [ O, B-MyName, I-MyName ] as tags.
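The post-processing step mentioned above could look roughly like this (merge_entities is a hypothetical helper; the tag names follow the example):
def merge_entities(tagged_tokens):
    # tagged_tokens: list of (token, tag) pairs, e.g. [("Big", "B-MyName"), ...]
    entities, current = [], []
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
                current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(merge_entities([("I", "O"), ("saw", "O"), ("a", "O"),
                      ("Big", "B-MyName"), ("Red", "I-MyName"),
                      ("Apple", "I-MyName"), ("Tree", "I-MyName"), (".", "O")]))
# ['Big Red Apple Tree']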
I have solved a similar problem and it works great. But make sure you have enough data to train on.
I'm working on implementation of LSTM Neural Network for sequence classification. I want to design a network with the following parameters:
Input : a sequence of n one-hot-vectors.
Network topology : two-layer LSTM network.
Output: the probability that a given sequence belongs to a class (binary classification). I want to take into account only the last output from the second LSTM layer.
I need to implement this in CNTK, but I'm struggling because its documentation is not written very well. Can someone help me with that?
There is a sequence classification example that does exactly what you're looking for.
The only difference is that it uses just a single LSTM layer. You can easily change this network to use multiple layers by changing:
LSTM_function = LSTMP_component_with_self_stabilization(
embedding_function.output, LSTM_dim, cell_dim)[0]
to:
num_layers = 2 # for example
encoder_output = embedding_function.output
for i in range(0, num_layers):
encoder_output = LSTMP_component_with_self_stabilization(encoder_output.output, LSTM_dim, cell_dim)
However, you'd be better served by using the new layers library. Then you can simply do this:
encoder_output = Stabilizer()(input_sequence)
for i in range(0, num_layers):
encoder_output = Recurrence(LSTM(hidden_dim)) (encoder_output.output)
Then, to get your final output that you'd put into a dense output layer, you can first do:
final_output = sequence.last(encoder_output)
and then
z = Dense(vocab_dim) (final_output)
Here you can find a straightforward approach; just add the additional layer like:
Sequential([
Recurrence(LSTM(hidden_dim), go_backwards=False),
Recurrence(LSTM(hidden_dim), go_backwards=False),
Dense(label_dim, activation=sigmoid)
])
train it, test it and apply it...
CNTK published a hands-on tutorial for language understanding that has an end to end recipe:
This hands-on lab shows how to implement a recurrent network to process text, for the Air Travel Information Services (ATIS) task of slot tagging (tag individual words to their respective classes, where the classes are provided as labels in the training data set). We will start with a straight-forward embedding of the words followed by a recurrent LSTM. This will then be extended to include neighboring words and run bidirectionally. Lastly, we will turn this system into an intent classifier.
I'm not familiar with CNTK. But since the question has been left unanswered for so long, I can perhaps suggest some advice to help you with the implementation?
I'm not sure how experienced you are with these architectures, but before moving to CNTK (which seemingly has a less active community), I'd suggest looking at other popular frameworks (like Theano, TensorFlow, etc.).
For instance, a similar task in theano is given here: kyunghyuncho tutorials. Just look for "def lstm_layer" for the definitions.
A torch example can be found in Karpathy's very popular tutorials
Hope this helps a bit..
I am trying to use the NLTK toolkit to extract place, date, and time from text messages. I just installed the toolkit on my machine and I wrote this quick snippet to test it out:
sentence = "Let's meet tomorrow at 9 pm";
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print nltk.ne_chunk(pos_tags, binary=True)
I was assuming that it will identify the date (tomorrow) and time (9 pm). But, surprisingly it failed to recognize that. I get the following result when I run my above code:
(S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN)
Can someone help me understand if I am missing something, or is NLTK just not mature enough to tag time and date properly? Thanks!
The default NE chunker in nltk is a maximum entropy chunker trained on the ACE corpus (http://catalog.ldc.upenn.edu/LDC2005T09). It has not been trained to recognise dates and times, so you need to train your own classifier if you want to do that.
Have a look at http://mattshomepage.com/articles/2016/May/23/nltk_nec/, the whole process is explained very well.
Also, there is a module called timex in nltk_contrib which might help you with your needs. https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
Named entity recognition is not an easy problem, do not expect any library to be 100% accurate. You shouldn't make any conclusions about NLTK's performance based on one sentence. Here's another example:
sentence = "I went to New York to meet John Smith";
I get
(S
I/PRP
went/VBD
to/TO
(NE New/NNP York/NNP)
to/TO
meet/VB
(NE John/NNP Smith/NNP))
As you can see, NLTK does very well here. However, I couldn't get NLTK to recognise today or tomorrow as temporal expressions. You can try Stanford SUTime; it is a part of Stanford CoreNLP. I have used it before and it works quite well (it is in Java though).
If you wish to correctly identify the date or time from the text messages you can use Stanford's NER.
It uses a CRF (Conditional Random Field) classifier. CRF is a sequential classifier, so it takes the sequence of words into consideration.
How you frame or design a sentence affects how the data gets classified.
If your input sentence had been Let's meet on wednesday at 9am., then Stanford NER would have correctly identified wednesday as a date and 9am as a time.
NLTK supports Stanford NER. Try using it.
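A minimal sketch of calling Stanford NER through NLTK (the model and jar paths are placeholders that must point to your local Stanford NER download; the 7-class model includes DATE and TIME):
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

st = StanfordNERTagger('english.muc.7class.distsim.crf.ser.gz',  # placeholder model path
                       'stanford-ner.jar')                        # placeholder jar path

tokens = word_tokenize("Let's meet on wednesday at 9am.")
print(st.tag(tokens))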