Dataset of students' textual responses to programming tasks - machine-learning

I'm working on an Intelligent Tutoring System for programming where the tutor asks questions about code and the student answers in natural language (English). As part of analyzing an answer, I'm using text similarity. However, this does not tell me what is wrong with the answer, e.g., that the student misunderstands a concept. Therefore, I'm thinking of using ML to classify the responses and identify any misconceptions.
My question is: where can I find a dataset that contains textual answers to programming tasks (Java)?

You can find plenty of Java questions and answers in a Stack Overflow dataset.
Here is the link for downloading the data and querying it with the BigQuery API: https://www.kaggle.com/stackoverflow/stackoverflow
Filter on the java tag and you are ready to explore the data and do some NLP on it.
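For example, here is a rough sketch of pulling Java questions and answers from the public BigQuery copy of that dataset; the table and column names follow the bigquery-public-data.stackoverflow schema, and the limit is arbitrary:

```python
# Hedged sketch: query the public Stack Overflow dataset on BigQuery for Java Q&A.
# Assumes google-cloud-bigquery is installed and credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT q.title, q.body AS question, a.body AS answer
    FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
    JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
      ON a.parent_id = q.id
    WHERE REGEXP_CONTAINS(q.tags, r'(^|\|)java($|\|)')  -- the java tag, not javascript
    LIMIT 1000
"""
df = client.query(query).to_dataframe()
print(df.head())
```

The tags column in that dataset is pipe-delimited (e.g., "java|android"), which is why the filter uses a small regex rather than a plain LIKE.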

Related

Can we use a clustering algorithm to cluster the Stack Overflow posts.xml data dump?

I'm trying to group question posts in the Stack Overflow posts.xml data dump by their programming language (e.g., put Java question posts into a Java class). We could do this with classification, but I want to use clustering instead. Is it possible to use clustering here, and if so, how? (My final objective is to cluster posts by programming language and predict trending programming languages.)
So far I have preprocessed the Stack Overflow posts.xml data and it is ready to cluster; the text preprocessing part is complete.
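If it helps, here is a minimal clustering sketch with scikit-learn, assuming the preprocessed post texts are already in a Python list; the sample posts and cluster count are placeholders:

```python
# Minimal sketch: TF-IDF vectors + k-means over preprocessed Stack Overflow posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "how to read a file in java",              # placeholder preprocessed post bodies
    "java string comparison with equals",
    "python list comprehension example",
]
vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # inspect each cluster's top terms to map clusters to languages
```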

Fine-tuning GPT-2/3 on new data [closed]

I'm trying to wrap my head around training OpenAI's language models on new data sets. Is there anyone here with experience in that regard?
My idea is to feed either GPT-2 or GPT-3 (though I do not have API access to 3) a textbook, train it on that text, and be able to "discuss" the content of the book with the language model afterwards. I don't think I'd have to change any of the hyperparameters; I just need more data in the model.
Is it possible?
Thanks a lot for any (also conceptual) help!
Presently GPT-3 cannot be fine-tuned the way GPT-2 or GPT-Neo / NeoX can, because the model is kept on OpenAI's servers and requests have to be made via the API. A Hacker News post says that fine-tuning for GPT-3 is planned or under construction.
Having said that, OpenAI's GPT-3 provides an Answers API to which you can supply context documents (up to 200 files / 1 GB). That API can then be used as a way to discuss the documents with the model.
EDIT:
OpenAI has recently introduced a fine-tuning beta:
https://beta.openai.com/docs/guides/fine-tuning
So the best answer to this question is to follow the description at that link.
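To give a feel for it, here is a hedged sketch of that beta-era workflow: prompt/completion pairs in a JSONL file, then a fine-tune launched with the openai command-line tool. The file name, model name, and example pair are placeholders.

```python
# Sketch only: prepare prompt/completion training data for the OpenAI fine-tuning beta.
import json

examples = [
    {"prompt": "Q: What does chapter 1 of the book cover?\n\nA:",
     "completion": " An overview of the main concepts ..."},  # placeholder pair
]
with open("book_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then, from the shell (legacy beta CLI; model name illustrative):
#   openai api fine_tunes.create -t book_finetune.jsonl -m curie
```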
You can definitely retrain GPT-2. Are you only looking to train it for language generation purposes, or do you have a specific downstream task you would like to adapt GPT-2 to?
Both of these are possible and not too difficult. If you want to train the model for language generation, i.e., have it generate text on a particular topic, you can train it exactly as it was trained during the pre-training phase: a next-token prediction task with a cross-entropy loss function. As long as you have a dataset and decent compute power, this is not too hard to implement.
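As a rough sketch (not a tuned recipe), the Hugging Face Trainer makes this next-token objective only a few lines; the file name and hyperparameters below are placeholders:

```python
# Sketch: fine-tune GPT-2 on a plain-text corpus with the causal LM objective.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling,
                          TextDataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Next-token prediction: mlm=False gives the causal (cross-entropy) LM objective.
dataset = TextDataset(tokenizer=tokenizer, file_path="book.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="gpt2-book", num_train_epochs=3,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
```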
When you say, 'discuss' the content of the book, it seems to me that you are looking for a dialogue model/chatbot. Chatbots are trained in a different way and if you are indeed looking for a dialogue model, you can look at DialoGPT and other models. They can be trained to become task-oriented dialog agents.
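For illustration, here is a small sketch of querying DialoGPT with the transformers library; the prompt is arbitrary, and a real "discuss the book" chatbot would still need task-specific fine-tuning:

```python
# Sketch: single-turn chat with DialoGPT, a dialogue-tuned GPT-2.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

user_input = "What is a Java interface?"  # placeholder prompt
input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
# Generate a reply; pad_token_id is set to eos to silence the padding warning.
reply_ids = model.generate(input_ids, max_length=200,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```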

Using NLP or machine learning to extract keywords off a sentence

I'm new to the ML/NLP field so my question is what technology would be most appropriate to achieve the following goal:
We have a short sentence - "Where to go for dinner?" or "What's your favorite bar?" or "What's your favorite cheap bar?"
Is there a technology that would enable me to train it by providing the following data sets:
"Where to go for dinner?" -> Dinner
"What's your favorite bar?" -> Bar
"What's your favorite cheap restaurant?" -> Cheap, Restaurant
so that next time we have a similar question about an unknown activity, say, "What is your favorite expensive [whatever]" it would be able to extract "expensive" and [whatever]?
The goal is to train it with hundreds (or thousands) of variations of the questions asked, together with the expected output, so that it can work with everyday language.
I know how to make it even without NLP/ML if we have a dictionary of expected terms like Bar, Restaurant, Pool, etc., but we also want it to work with unknown terms.
I've seen examples with Rake and scikit-learn for classification of "things", but I'm not sure how I would feed text into those, and all those examples had predefined outputs for training.
I've also tried Google's NLP API, Amazon Lex and Wit to see how good they are at extracting entities, but the results are disappointing to say the least.
Reading about summarization techniques, I'm left with the impression it won't work with small, single-sentence texts, so I haven't delved into it.
As #polm23 mentioned, for simple cases you can use POS tagging to do the extraction. The services you mentioned, like LUIS and Dialogflow, use what is called Natural Language Understanding. They make use of intents and entities (a detailed explanation with examples can be found here). If you are concerned about your data going online, or you sometimes have to work offline, you can always go for RASA.
Things you can do with RASA:
Entity extraction and sentence classification. You specify which terms should be extracted from a sentence by tagging the word positions across a variety of training sentences, so a word that did not appear in your training set can still be detected.
It uses rule-based learning and also a Keras LSTM for detection.
One downside compared with the online services is that you have to manually tag the position numbers in the training JSON file, as opposed to the click-and-tag features of the online services (a sketch of that file layout appears at the end of this answer).
You can find the tutorial here.
For example, I have trained RASA with a variety of sentences for identifying a body part and a symptom (I limited it to 2 entities; you can add more). When an unknown sentence such as "I am having pain in my leg" appears, it will correctly identify "pain" as a "symptom" and "leg" as a "body part".
Hope this answers your question!
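As promised above, here is a hedged sketch of what that manually tagged training file can look like in the (legacy) RASA NLU JSON format, written out from Python; the intent name and character offsets are illustrative:

```python
# Sketch: build and save a tiny legacy-style RASA NLU training file.
import json

training_data = {
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "I am having pain in my leg",
                "intent": "report_symptom",           # illustrative intent name
                "entities": [
                    {"start": 12, "end": 16, "value": "pain", "entity": "symptom"},
                    {"start": 23, "end": 26, "value": "leg", "entity": "body_part"},
                ],
            }
        ]
    }
}
with open("nlu_data.json", "w") as f:
    json.dump(training_data, f, indent=2)
```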
Since "hundreds to thousands" sound like you have very little data for training a model from scratch. You might want to consider training (technically fine-tuning) a DialogFlow Agent to match sentences ("Where to go for dinner?") to intents ("Dinner"), then integrating via API calls.
Alternatively, you can invest time in fine-tuning a small pre-trained model such as a distilled BERT classifier from Hugging Face, since you won't need the hundreds of thousands to billions of samples required to train a production-worthy model from scratch. This can also be run offline and will equip you to solve other NLP problems in the future without much low-level understanding of the underlying statistics.
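As a sketch of that route (the tiny toy dataset, label set, and hyperparameters are purely illustrative):

```python
# Sketch: fine-tune DistilBERT as an intent classifier with Hugging Face Transformers.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

sentences = ["Where to go for dinner?", "What's your favorite cheap bar?"]
labels = [0, 1]  # 0 = dinner, 1 = bar (placeholder label scheme)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = tok(sentences, truncation=True, padding=True)

class IntentDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
Trainer(model=model,
        args=TrainingArguments(output_dir="intent-clf", num_train_epochs=3),
        train_dataset=IntentDataset()).train()
```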

Generating ques-answer pairs from unstructured text

I have to create a system that generates all possible question-answer pairs from unstructured text in a specific domain. Many questions may have the same answer, but the system should generate every type of question that an answer can have. The questions formed should be meaningful and grammatically correct.
For this purpose, I used NLTK and trained an NER model, creating entities for my domain, and then I wrote some rules to identify the question word using a combination of the NER-identified entities and POS-tagged words. But this approach isn't working well, as I am not able to create meaningful questions from the text. Moreover, some question words are wrongly identified and some are missed. I have also read research papers on using RNNs for this purpose, but I don't have a large training dataset since the domain is quite small. Can anyone suggest a better approach?
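For reference, here is a minimal sketch of the NER + POS pipeline described above with NLTK; the sentence and the single template rule are purely illustrative:

```python
# Sketch: POS-tag a sentence, chunk named entities, and apply a naive question rule.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)  # resource names may differ in newer NLTK versions

sentence = "James Gosling designed the Java language at Sun Microsystems."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Very naive rule: turn a PERSON entity into a "Who ...?" question template.
for subtree in tree.subtrees(lambda t: t.label() == "PERSON"):
    person = " ".join(word for word, _tag in subtree.leaves())
    print(f"Who designed the Java language? -> {person}")
```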

Better text documents clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried clustering the stream with an online clustering algorithm using tf-idf and cosine similarity, but I found the results to be quite bad.
The main disadvantage of tf-idf is that it clusters documents that share keywords, so it is only good at identifying near-identical documents. For example, consider the following sentences:
1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.
The previous two sentences will likely be clustered together with a reasonable threshold value since they share a lot of keywords. But now consider the following two sentences:
1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.
Now, using tf-idf, the clustering algorithm will fail miserably because the sentences share only one keyword, even though they both talk about the same topic.
My question: are there better techniques for clustering documents?
In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
Topic models such as LDA might work even better.
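As a small sketch of that LSA pipeline with scikit-learn (TF-IDF, truncated SVD, then k-means); the component and cluster counts are placeholders and would be much larger on real data:

```python
# Sketch: LSA-style clustering - TF-IDF -> truncated SVD -> k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

tweets = ["The website Stackoverflow is a nice place.",
          "I visit Stackoverflow regularly.",
          "I'm eating the most delicious apple."]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(tweets)
# Project into a low-dimensional "semantic" space and L2-normalise so that
# Euclidean k-means behaves like cosine similarity.
lsa = make_pipeline(TruncatedSVD(n_components=2), Normalizer(copy=False))
reduced = lsa.fit_transform(tfidf)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced))
```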
As mentioned in other comments and answers, using LDA can give good tweet-to-topic weights.
If these weights do not cluster well enough for your needs, you could look at clustering the topic distributions themselves with a clustering algorithm.
While it is training-set dependent, LDA could easily bundle tweets containing stackoverflow, stack-overflow, and stack overflow into the same topic. However, "my stack of boxes is about to overflow" might instead go into another topic about boxes.
Another example: A tweet with the word Apple could go into a number of different topics (the company, the fruit, New York and others). LDA would look at the other words in the tweet to determine the applicable topics.
"Steve Jobs was the CEO at Apple" is clearly about the company
"I'm eating the most delicious apple" is clearly about the fruit
"I'm going to the big apple when I travel to the USA" is most likely about visiting New York
Long answer:
TF-IDF is currently one of the most widely used retrieval methods. What you need is some preprocessing from Natural Language Processing (NLP). There are a lot of resources that can help you for English (for example, the nltk library in Python).
You should apply the NLP analysis both to your queries (questions) and to your documents before indexing.
The point is: while TF-IDF (or TF-IDF^2, as in Lucene) is good, you should use it on resources annotated with meta-linguistic information. That can be hard and requires extensive knowledge of your core search engine, grammatical (syntactic) analysis, and the document domain.
Short answer: a better technique is to use TF-IDF with light grammatical NLP annotations, rewriting both the queries and the indexed documents.
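One lightweight way to add such annotations, as a hedged sketch: lemmatise with NLTK before TF-IDF indexing, and run the same tokenizer over incoming queries so that documents and queries share normalised word forms:

```python
# Sketch: lemma-based TF-IDF indexing; the same tokenizer should be applied to queries.
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for pkg in ("punkt", "wordnet"):
    nltk.download(pkg, quiet=True)

lemmatizer = WordNetLemmatizer()

def lemma_tokenize(text):
    return [lemmatizer.lemmatize(tok) for tok in nltk.word_tokenize(text.lower())]

docs = ["The website Stackoverflow is a nice place.",
        "I visit Stackoverflow regularly."]
vectorizer = TfidfVectorizer(tokenizer=lemma_tokenize)
matrix = vectorizer.fit_transform(docs)   # query vectors: vectorizer.transform([...])
```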
