Where can I find a dataset of lecture videos together with transcripts and perhaps notes? I have a machine learning project that requires these, but I can't seem to find any existing datasets of lecture videos with transcripts. Any help would be greatly appreciated!
Coursera's course "Machine Learning Foundations: A Case Study Approach" provides videos, PDFs with the notes and contents, IPython notebooks (now called Jupyter notebooks), practical exercises, and of course the datasets for those exercises:
https://www.coursera.org/learn/ml-foundations
Friends,
We are trying to work on a problem where we have a dump of reviews only, with no ratings, in a .csv file. Each row in the .csv is one review given by a customer of a particular product, let's say a TV.
I want to classify that text into the pre-defined categories below, given by the domain expert for that product:
Quality
Customer Support
Positive Feedback
Price
Technology
Some reviews are as below:
Bought this product recently, feeling a great product in the market.
Was waiting for this product since long, but disappointed
The built quality is not that great
LED screen is picture perfect. Love this product
Damm! bought this TV 2 months ago, guess what, screen showing a straight line, poor quality LED screen
This has very complicated options, documentation of this TV is not so user-friendly
I cannot use my smart device to connect to this TV. Simply does not work
Customer support is very poor. I don't recommend this
Works great. Great product
Now, with the above reviews by different customers, how do I categorize them into the given buckets? (You could call it multilabel classification, named entity recognition, information extraction with sentiment analysis, or anything else.)
I have tried word-frequency-based NLP approaches (in R) and referred to Stanford NLP's CRF-NER (https://nlp.stanford.edu/software/CRF-NER.shtml), among others, but could not get to a concrete solution.
Can anybody please guide me on how to tackle this problem? Thanks!
Most NLP frameworks will handle multi-class classification. Word counts by themselves in R are unlikely to be very accurate. A Python library you can explore is spaCy. Commercial APIs from Google, AWS, and Microsoft can also be used. You will need quite a few labeled examples per category for training. Feel free to post your code and the specific problem or performance gap you see for further help.
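As a concrete starting point, here is a minimal supervised baseline in Python with scikit-learn (my choice here, not spaCy or a commercial API). The tiny training set is invented for illustration; in practice you would need many labeled reviews per category:

```python
# Minimal sketch: TF-IDF features + a linear classifier for the review buckets.
# The training reviews/labels below are invented; real training needs many
# labeled examples per category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "The built quality is not that great",
    "Customer support is very poor",
    "Works great. Great product",
    "Way too expensive compared to similar TVs",
    "The smart connectivity features keep failing",
]
train_labels = ["Quality", "Customer Support", "Positive Feedback", "Price", "Technology"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Classify a new, unlabeled review.
print(model.predict(["screen showing a straight line, poor quality LED screen"]))
```

If a single review can belong to several buckets at once, the same pipeline can be wrapped in a one-vs-rest scheme to make it a true multilabel classifier.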
I'm learning machine learning with TensorFlow. I've worked with some datasets like the Iris flowers and Boston housing data, but all of those values were floats.
Now I'm looking for a dataset whose values are strings so I can practice with them. Can you give me some suggestions?
Thanks
I'll point you to just two easy places to start:
The TensorFlow website has three very good tutorials dealing with word embeddings, language modeling, and sequence-to-sequence models. I don't have enough reputation to link them directly, but you can easily find them on the site. They provide TensorFlow code for working with human language.
Moreover, if you want to build a model from scratch and only need the dataset, try the NLTK corpora. They are easy to download directly from code, as sketched below.
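As an illustration of how little code the download takes, here is a minimal sketch using the movie_reviews corpus (my example pick, not one the answer names):

```python
# Sketch: downloading and reading an NLTK corpus (movie_reviews chosen as an example).
import nltk

nltk.download("movie_reviews")  # fetches the corpus on first run
from nltk.corpus import movie_reviews

print(len(movie_reviews.fileids()))                         # number of review documents
print(movie_reviews.raw(movie_reviews.fileids()[0])[:200])  # start of the first review
```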
Facebook's ParlAI project lists a good number of datasets for natural language processing tasks.
The IMDB reviews dataset is also a classic example, as are Amazon reviews for sentiment analysis. If you take a look at the kernels posted on Kaggle, you'll get a lot of insight into the dataset and the task.
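If you want those IMDB reviews as raw strings inside TensorFlow, one option (assuming the separate tensorflow_datasets package is installed) is roughly:

```python
# Sketch: loading IMDB reviews as raw text via the tensorflow_datasets package
# (an assumption on my part; downloading the CSVs from Kaggle works just as well).
import tensorflow_datasets as tfds

train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True)
for text, label in train_ds.take(2):
    print(label.numpy(), text.numpy()[:80])  # label 0 = negative, 1 = positive
```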
I can't find the image files for the WDRef dataset. Where can I get them?
In the publication, the authors wrote:
To address this issue, we introduce a new dataset, Wide and Deep Reference dataset (WDRef), which is both wide (around 3,000 subjects) and deep (2,000+ subjects with over 15 images, 1,000+ subjects with more than 40 images). To facilitate further research and evaluation on supervised methods on the same test bed, we also share two kinds of extracted low-level features of this dataset. The whole dataset can be downloaded from our project website http://home.ustc.edu.cn/~chendong/JointBayesian/.
But there are only LE and LBP features on their website.
Only the features are available for the WDRef dataset.
Another large dataset is here:
http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
That webpage also confirms that WDRef is publicly available as features only.
I'm at a crossroads. I've been using Mahout to classify some documents and have stumbled across the OpenNLP document classifier.
They seem to do very similar things, and I can't figure out whether it's worth converting what I currently have written in Mahout and providing an OpenNLP implementation instead.
Are there any obvious advantages Mahout has over OpenNLP for document classification?
My situation is that I have several hundred thousand news articles, and I only want to extract a subset of them. Mahout does this reasonably well: I'm using Naive Bayes over term counts, and then TF-IDF, to determine which category the documents fall into. The model is updated as and when new articles are found, so it is consistently improving over time.
It seems the OpenNLP document classifier does something very similar (although I have not tested how accurate it is). Does anyone have experience using both who can say definitively why one would be used over the other?
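For reference only, here is a rough Python/scikit-learn stand-in for the kind of pipeline described in the question: Naive Bayes over term counts with incremental updates as new articles arrive. It is not Mahout or OpenNLP code, it omits the TF-IDF weighting step, and the articles and labels are placeholders:

```python
# Sketch: an incrementally updated Naive Bayes text classifier in scikit-learn
# (a stand-in for the Mahout pipeline, not Mahout or OpenNLP itself).
# HashingVectorizer gives a fixed feature space, so partial_fit can be called
# whenever new labeled articles are found. Articles/labels below are placeholders.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

classes = np.array(["finance", "sports", "technology"])
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = MultinomialNB()

# Initial batch of labeled articles.
batch1_texts = [
    "Central bank raises interest rates amid inflation fears",
    "Local team wins championship after dramatic overtime",
]
model.partial_fit(vectorizer.transform(batch1_texts),
                  ["finance", "sports"], classes=classes)

# Later, update the model as new labeled articles are found.
batch2_texts = ["New smartphone release features improved camera sensor"]
model.partial_fit(vectorizer.transform(batch2_texts), ["technology"])

print(model.predict(vectorizer.transform(["Stock markets rally after rate decision"])))
```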
I don't have experience with these two, but while trying to figure out whether one of them would make a difference in a personal project, I stumbled upon a blog, and I quote:
Data categorization with OpenNLP is another approach with more accuracy and performance rate as compared to mahout.
You can check the blog post here.
I know how to communicate with Twitter and how to retrieve tweets, but I am looking to do further work on these tweets.
I have two categories, food and sports. Now I want to categorize tweets into food and sports. Can anyone please suggest how I can categorize them with a computer algorithm?
regards
Gaurav
I've been doing some work recently with Latent Dirichlet Allocation. The general idea is that documents contain words that are generated from topics. What you could try doing is loading a corpus of documents known to be about the topics you are interested in, updating it with the tweets of interest, and then selecting the tweets that have strong probabilities for the same topics as your known documents.
I use R for LDA (package topicmodels and package lda), but I think there are some prebuilt Python tools for this too. I would probably steer away from trying to write your own unless you have a solid grounding in Bayesian statistics.
Here's the documentation for the topicmodels package: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf
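For anyone who prefers Python, a minimal sketch of the same idea using scikit-learn's LatentDirichletAllocation (rather than the R packages mentioned above) might look like this; the "known" documents and tweets are placeholders:

```python
# Sketch: fit LDA on a small corpus of documents with known topics, then project
# tweets into the same topic space and inspect their topic probabilities.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

known_docs = [
    "recipe pasta sauce tomato garlic dinner restaurant",
    "football match goal score team league season",
    "breakfast pancakes syrup coffee brunch cafe",
    "basketball playoffs dunk coach player court",
]
tweets = [
    "trying a new ramen place tonight",
    "what a comeback in the fourth quarter!",
]

vectorizer = CountVectorizer()
X_known = vectorizer.fit_transform(known_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_known)

# Rows are tweets, columns are topic probabilities; keep tweets whose
# dominant topic matches the topic of your known food or sports documents.
topic_probs = lda.transform(vectorizer.transform(tweets))
print(topic_probs)
```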
I doubt that a set of algorithms could categorize tweets in an open domain. In other words, I don't think a set of rules can categorize open-domain tweets. You need to parse tweets into a semantic representation customized for the categorization task.