I've been searching the web and found that media outlets such as CNN and NPR provide links to their transcripts. Obtaining them, however, requires writing something like a crawler, which is not very convenient. The reason I ask is that I'm trying to use transcripts of TV shows, interviews, radio programs, and movies as training data in my natural language processing projects. So I'm wondering whether there is any collection or database freely available on the web that would let me download them all at once without writing a crawler myself.
I would recommend the British National Corpus. I would also mention the American National Corpus, but the transcripts there are only of phone calls or face-to-face conversations - no news, TV shows, etc.
You also mentioned CNN and NPR. There are transcripts from 1996 available as an LDC corpus here.
So I'm working on a project for my university where our current plan is to use the YouTube API and do some data analysis. We have some ideas, and we've been looking at the Terms of Service and the Developer Policies, but we're not entirely sure about a few things.
Our project does not focus on monetary gain, predicting estimated income from a video, or anything of that nature, nor on trying to determine user data such as passwords or usernames. It's much more about the content and statistics of the videos than anything else.
Our current ideas, which we want to be sure are OK to pursue and use:
Determine the category of a video given its title
Determine the category of a video given its tags
Determine the category of a video given its description
Determine the category of a video given its thumbnail
Some combination of above to create an ensemble model
Clustering videos by category/view counts
Sentiment analysis on comments
Trending topics over time
This is just a vague list for now, but I would love to reach out further to figure out what we're allowed to use the data for.
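For context, here's a rough sketch of how we'd expect to pull those fields (title, tags, description, category, thumbnail, view counts) with the YouTube Data API v3 via google-api-python-client; the API key and video ID are placeholders:

```python
# Rough sketch only: fetching the video metadata fields mentioned above
# via the YouTube Data API v3. "YOUR_API_KEY" and the video ID are placeholders.
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

response = youtube.videos().list(
    part="snippet,statistics",
    id="dQw4w9WgXcQ",  # placeholder video ID
).execute()

for item in response["items"]:
    snippet = item["snippet"]
    stats = item["statistics"]
    print(snippet["title"])                      # title
    print(snippet.get("tags", []))               # tags (may be absent)
    print(snippet["description"][:100])          # description
    print(snippet["categoryId"])                 # category ID
    print(snippet["thumbnails"]["high"]["url"])  # thumbnail URL
    print(stats.get("viewCount"), stats.get("commentCount"))
```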
I'm looking for all the major models developed for NLP from word2vec until now. I'm thinking of writing a detailed article on the timeline of NLP models. Please help me out here.
It would be great if answers were given as:
model name, year published, link to the paper, and a little summary of the model.
This repository contains landmark research papers in Natural Language Processing that came out in this century.
Efficient Estimation of Word Representations in Vector Space, Google
Distributed Representations of Words and Phrases, Google
Distributed Representations of Sentences and Documents, Google
Enriching Word Vectors with Subword Information, Facebook
Bag of Tricks for Efficient Text Classification, Facebook
Hierarchical Probabilistic Neural Network Language Model
A Scalable Hierarchical Distributed Language Model
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google
Language Models are Unsupervised Multitask Learners, OpenAI
Wav2Letter, Facebook
Misspelling Oblivious Word Embeddings, Facebook
Refer to this repo: https://github.com/Akshat4112/NLP-research-papers
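If it helps with the timeline article, here's a minimal sketch (assuming gensim is installed) of loading and querying the pre-trained word2vec vectors from the first paper listed above:

```python
# Minimal sketch: loading pre-trained word2vec embeddings with gensim and
# querying nearest neighbours. Assumes gensim is installed and the
# GoogleNews vectors (~1.6 GB) can be downloaded.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # returns KeyedVectors

print(wv.most_similar("king", topn=5))      # semantically similar words
print(wv.similarity("coffee", "tea"))       # cosine similarity between words
```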
I'm new to the ML/NLP field, so my question is: what technology would be most appropriate to achieve the following goal?
We have a short sentence - "Where to go for dinner?" or "What's your favorite bar?" or "What's your favorite cheap bar?"
Is there a technology that would enable me to train it providing the following data sets:
"Where to go for dinner?" -> Dinner
"What's your favorite bar?" -> Bar
"What's your favorite cheap restaurant?" -> Cheap, Restaurant
so that the next time we have a similar question about an unknown activity, say, "What is your favorite expensive [whatever]?", it would be able to extract "expensive" and [whatever]?
The goal is to train it with hundreds (or thousands) of variations of the question and the relevant expected output, so that it can work with everyday language.
I know how to do this even without NLP/ML if we have a dictionary of expected terms like Bar, Restaurant, Pool, etc., but we also want it to work with unknown terms.
I've seen examples with Rake and scikit-learn for classification of "things", but I'm not sure how I would feed text into them, and all of those examples had predefined outputs for training.
I've also tried Google's NLP API, Amazon Lex and Wit to see how good they are at extracting entities, but the results are disappointing to say the least.
Reading about summarization techniques, I'm left with the impression they won't work with small, single-sentence texts, so I haven't delved into them.
As @polm23 mentioned, for simple cases you can use POS tagging to do the extraction. The services you mentioned, like LUIS, Dialogflow, etc., use what is called Natural Language Understanding (NLU). They make use of intents & entities (a detailed explanation with examples can be found here). If you are concerned about your data going online, or you sometimes have to work offline, you can always go for RASA.
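To make the POS-tagging option concrete, here's a small sketch using spaCy (my choice; any POS tagger would do) that pulls out adjectives and nouns, which covers cases like "expensive" plus an unseen activity word:

```python
# Sketch of POS-based extraction with spaCy (assumes the en_core_web_sm model
# is installed: python -m spacy download en_core_web_sm). Keeps adjectives and
# nouns, so unseen activity words are still picked up.
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("What is your favorite expensive rooftop bar?")
keywords = [tok.text for tok in doc if tok.pos_ in ("ADJ", "NOUN")]
print(keywords)   # e.g. ['favorite', 'expensive', 'rooftop', 'bar']
```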
Things you can do with RASA:
Entity extraction and sentence (intent) classification. You specify which terms should be extracted by tagging the word positions across a variety of sentences. That way, even if a word appears that was not in your training set, it can still be detected.
It uses rule-based learning and also a Keras LSTM for detection.
One downside compared with the online services is that you have to manually tag the position numbers in the JSON training file (a rough sketch of this format is shown after the example below), as opposed to the click-and-tag features of the online services.
You can find the tutorial here.
"I am having pain in my leg."
E.g., I have trained RASA with a variety of sentences for identifying body part and symptom (I limited it to 2 entities only; you can add more). Then, when an unknown sentence like the one above appears, it will correctly identify "pain" as "symptom" and "leg" as "body part".
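For illustration, the training data for the example above would look roughly like this in the older Rasa NLU JSON format (the intent and entity names are my own; start/end are character offsets into the text):

```python
# Rough illustration of the older Rasa NLU JSON training format, written here
# as a Python dict. Intent/entity names are hypothetical; start/end are
# character offsets (end exclusive) into the text.
training_example = {
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "I am having pain in my leg",
                "intent": "report_symptom",   # hypothetical intent name
                "entities": [
                    {"start": 12, "end": 16, "value": "pain", "entity": "symptom"},
                    {"start": 23, "end": 26, "value": "leg", "entity": "body_part"},
                ],
            }
        ]
    }
}
```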
Hope this answers your question!
Since "hundreds to thousands" sound like you have very little data for training a model from scratch. You might want to consider training (technically fine-tuning) a DialogFlow Agent to match sentences ("Where to go for dinner?") to intents ("Dinner"), then integrating via API calls.
Alternatively, you can invest time in fine-tuning a small pre-trained model like "Distilled BERT classifier" from "HuggingFace" as you won't need the 100s of thousands to billions of data samples required to train a production-worthy model. This can also be assessed offline and will equip you to solve other NLP problems in the future without much low-level understanding of the underlying statistics.
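As a rough sketch of what that fine-tuning could look like with the transformers library (the intent labels and example sentences here are just illustrative, not a production setup):

```python
# Minimal sketch: fine-tuning DistilBERT for intent classification with
# HuggingFace transformers. Labels and sentences are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["Dinner", "Bar", "Restaurant"]     # hypothetical intent set
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)

# One toy training step; in practice, loop over your labelled questions with an
# optimizer (e.g. torch.optim.AdamW) for a few epochs.
texts = ["Where to go for dinner?", "What's your favorite bar?"]
targets = torch.tensor([0, 1])               # indices into `labels`
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=targets).loss
loss.backward()                              # gradients for one step
```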
I've been doing some research on the feasibility of building a mobile/web app that lets users say a phrase and detects the user's accent (Boston, New York, Canadian, etc.). There will be about 5 to 10 predefined phrases a user can say. I'm familiar with some of the speech-to-text APIs that are available (Nuance, Bing, Google, etc.), but none seem to offer this additional functionality. The closest examples I've found are Google Now and Microsoft's Speaker Recognition API:
http://www.androidauthority.com/google-now-accents-515684/
https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api
Because there are going to be 5-10 predefined phrases, I'm thinking of using machine learning software like TensorFlow or Wekinator. I'd have initial audio created in each accent to use as the initial data. Before I dig deeper into this path, I just wanted to get some feedback on this approach, or whether there are better approaches out there. Let me know if I need to clarify anything.
There is no public API for such a rare task.
Accent detection, like language detection, is commonly implemented with i-vectors. A tutorial is here. An implementation is available in Kaldi.
You need a significant amount of data to train the system even if your sentences are fixed. It might be easier to collect accented speech without focusing on the specific sentences you have.
An end-to-end TensorFlow implementation is also possible, but it would probably require too much data, since you need to separate speaker-intrinsic factors from accent-intrinsic factors (basically performing the factorization that i-vectors do). You can find descriptions of similar work like this and this one.
You could use (this is just an idea; you will need to experiment a lot) a neural network with as many outputs as you have possible accents, with a softmax output layer and a cross-entropy cost function.
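A rough sketch of that idea with Keras (the feature dimension, number of accents, and dummy data are placeholders; in practice you'd feed per-utterance features such as averaged MFCCs):

```python
# Rough sketch of the softmax/cross-entropy idea above with Keras.
# N_FEATURES and N_ACCENTS are placeholders; inputs would be fixed-length
# per-utterance features (e.g. averaged MFCCs), labels are accent indices.
import numpy as np
import tensorflow as tf

N_FEATURES, N_ACCENTS = 13, 5

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_ACCENTS, activation="softmax"),  # one output per accent
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data just to show the shapes; replace with real accented-speech features.
X = np.random.rand(100, N_FEATURES).astype("float32")
y = np.random.randint(0, N_ACCENTS, size=100)
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```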
I know how to communicate with Twitter and how to retrieve tweets, but I am looking to do further work on these tweets.
I have two categories, food and sports. Now I want to categorize tweets into food and sports. Can anyone please suggest how to categorize them on the basis of a computer algorithm?
Regards,
Gaurav
I've been doing some work recently with Latent Dirichlet Allocation (LDA). The general idea is that documents contain words that are generated from topics. What you could try is loading a corpus of documents known to be about the topics you are interested in, updating it with the tweets of interest, and then selecting the tweets that have strong probabilities for the same topics as your known documents.
I use R for LDA (package:topicmodels and package:lda), but I think there are some prebuilt Python tools for this too. I would probably steer away from trying to write your own unless you have a solid grounding in Bayesian statistics.
Here's the documentation for the topicmodels package: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf
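In Python, one of the prebuilt options is gensim (my suggestion, not something named above); the same idea looks roughly like this:

```python
# Rough sketch of the LDA idea in Python with gensim (assumed installed).
# The toy documents stand in for a real corpus about food and sports.
from gensim import corpora, models

docs = [
    "the pizza and pasta at that restaurant were great".split(),
    "the team scored a late goal to win the match".split(),
    "i baked fresh bread and made soup for dinner".split(),
    "the striker trained hard before the football game".split(),
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Score an unseen tweet against the learned topics.
tweet = "grabbing a burger and fries after the game".split()
print(lda.get_document_topics(dictionary.doc2bow(tweet)))
```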
I doubt that any set of algorithms could categorize tweets in an open domain. In other words, I don't think a set of rules can categorize open-domain tweets. You need to parse the tweets into a semantic representation customized for the categorization.