Requesting a geoparsing benchmark Twitter dataset - twitter

Can anyone help me find a benchmark dataset for geoparsing social media texts? I need a benchmark dataset to evaluate my algorithms. I came across the GEOPARSE TWITTER BENCHMARK DATASET from the University of Southampton, but I am unable to access it. Can anyone help me with this?

Related

Machine Learning Dataset for Cyberbullying Tweets Detection

We are currently working on cyberbullying tweet detection using machine learning, but we are unable to find a dataset for it. Could someone please point us to a suitable dataset? Our further work depends on it.
We tried a few specific sites and also created a dataset ourselves, but that does not seem to do the job, so please help us find a dataset for this task.
You may use the Offensive Language Identification Dataset here:
https://sites.google.com/site/offensevalsharedtask/olid

How to gather a dataset of labeled resumes for training ML models

I am building an algorithm for classifying resumes using a CNN. I have found some pretty cool concepts, but I can't test them because I can't find a labeled database of resumes.
Is there a legal and free way to obtain such a dataset?
You can browse Kaggle to see if anything matches what you need; here is one from Kaggle that could be useful:
https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset

Looking for a dataset that contains string values in Machine Learning

I'm learning Machine Learning with TensorFlow. I've worked with some datasets like the Iris flower data and the Boston housing data, but all of those values were floats.
Now I'm looking for a dataset whose values are strings so I can practice. Can you give me some suggestions?
Thanks
I'll point you to two easy-to-start places:
The TensorFlow website has three very good tutorials dealing with word embeddings, language modeling, and sequence-to-sequence models. I don't have enough reputation to link them directly, but you can easily find them on the site. They provide you with TensorFlow code for working with human language.
Moreover, if you want to build a model from scratch and only need the dataset, try the NLTK corpora. They are easy to download directly from code; see the sketch below.
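For instance, a minimal sketch of that in Python, using the movie_reviews corpus as an assumed example (any NLTK corpus works similarly):

```python
# Minimal sketch: download an NLTK corpus from code and read string-valued data.
# The movie_reviews corpus is just an assumed choice for illustration.
import nltk
nltk.download("movie_reviews")  # fetches the corpus the first time only

from nltk.corpus import movie_reviews

# Build (text, label) pairs where both values are plain strings.
data = [
    (movie_reviews.raw(fileid), category)
    for category in movie_reviews.categories()   # "neg" / "pos"
    for fileid in movie_reviews.fileids(category)
]

print(len(data))                     # 2000 reviews
print(data[0][1], data[0][0][:80])   # label and the first 80 characters of text
```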
Facebook's ParlAI project lists a good number of datasets for Natural Language Processing tasks.
The IMDB reviews dataset is another classic example, as are Amazon reviews for sentiment analysis. If you take a look at the kernels posted on Kaggle, you'll get a lot of insight into the dataset and the task.
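Since the question mentions TensorFlow, a minimal sketch of loading the IMDB reviews as raw strings, assuming the tensorflow_datasets package is installed, could be:

```python
# Minimal sketch, assuming tensorflow_datasets is installed.
# Loads IMDB movie reviews as raw text strings plus integer sentiment labels.
import tensorflow_datasets as tfds

train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True)

for text, label in train_ds.take(2):
    # text is a tf.string tensor; label is 0 (negative) or 1 (positive)
    print(label.numpy(), text.numpy()[:80])
```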

Semantic analysis of tweets

I know how to communicate with Twitter and how to retrieve tweets, but I am looking to do further work on these tweets.
I have two categories, food and sports. Now I want to categorize tweets into food and sports. Can anyone please suggest how to categorize them algorithmically?
Regards,
Gaurav
I've been doing some work recently with Latent Dirichlet Allocation. The general idea is that documents contain words that are generated from topics. What you could try is loading a corpus of documents known to be about the topics you are interested in, updating the model with the tweets of interest, and then selecting the tweets that have strong probabilities for the same topics as your known documents.
I use R for LDA (package:topicmodels and package:lda), but I think there are some prebuilt python tools for this too. I would probably steer away from trying to write your own unless you have a solid grounding in Bayesian statistics.
Here's the documentation for the topicmodels package: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf
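If you would rather use one of the prebuilt Python tools the answer alludes to, a rough sketch of the "fit on known documents, then score tweets" idea with scikit-learn (an assumed choice; the answer itself used the R packages) might look like this:

```python
# Rough sketch of the LDA approach described above, using scikit-learn
# (an assumed choice; the answer used R's topicmodels/lda packages).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder documents known to be about the topics of interest.
known_docs = [
    "football match score goal team stadium",
    "pasta recipe cheese tomato dinner kitchen",
]
tweets = ["great goal in last night's game", "trying a new tomato sauce tonight"]

vectorizer = CountVectorizer(stop_words="english")
X_known = vectorizer.fit_transform(known_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_known)

# Project tweets into the same topic space and inspect their topic weights.
X_tweets = vectorizer.transform(tweets)
tweet_topics = lda.transform(X_tweets)   # each row sums to 1: P(topic | tweet)
for tweet, dist in zip(tweets, tweet_topics):
    print(tweet, "->", dist.round(2))
```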
I doubt that a set of algorithms could categorize tweets in an open domain. In other words, I don't think a set of rules can categorize open-domain tweets. You need to parse the tweets into a semantic representation customized for the categorization.

sentiment analysis and efficient clustering of raw text with minimal context

Say I have an email chain where two people discuss a problem and its solution. I have some context too; for example, the email chain is about some problem using an iPhone 6 with iOS 7. That's it. From the content/text of these emails, I need to figure out what exactly the problem is and what exactly the proposed solution is.
Now, if we port this problem to big data, i.e. millions of such email chains, I want to know how to classify or cluster them.
I am using Apache Spark's MLlib - LDA, FP-growth and k-means (plus a huge list of stop words). But my results don't look correct. Playing around with the parameters of these algorithms teaches me things but doesn't give good results. My biggest problem is not having training data. Unfortunately, most solutions I see online use manually created training data. Any help?
Try word2vec. You can use it to create word vectors or sentence vectors, and then do k-means clustering on top of them.
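A rough sketch of that pipeline with gensim and scikit-learn (assumed libraries; parameter names follow gensim 4.x) could be:

```python
# Rough sketch: train word2vec on the e-mail texts, average word vectors into
# a document vector, then cluster with k-means. gensim and scikit-learn are
# assumed libraries; parameter names follow gensim 4.x (vector_size, not size).
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

emails = [
    "iphone 6 battery drains quickly after ios update",
    "battery fix worked after resetting settings",
    "cannot sync calendar with exchange server",
]
tokenized = [e.split() for e in emails]

w2v = Word2Vec(sentences=tokenized, vector_size=50, window=5, min_count=1, epochs=50)

def doc_vector(tokens):
    # Average the vectors of in-vocabulary tokens to get one document vector.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(t) for t in tokenized])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```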
If you are looking for noisy text datasets, you can check out
Ubuntu Chat Corpus
Enron e-mail dataset
