how to collect millions of tweets? - twitter

I was browsing through fflick, a nicely made app built on top of Twitter. How do they
1) collect millions of tweets?
2) accurately (mostly) categorize tweets into positive or negative sentiments?

They probably collect millions of tweets by crawling Twitter through its API, most likely by tracking film-related keywords with the Streaming API, or just by reading their own timeline to see what their followers have to say about films.
Don't know. Probably using some natural language processing techniques from good old AI textbooks. :-)

2) look for smileys - ;), :), :D, :(
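
The smiley idea above can be sketched in a few lines of Python. This is only a rough illustration, not how fflick actually works: the positive and negative groupings are an assumption based on the four smileys listed, and `label_by_smiley` is a made-up helper name.

```python
# Assumed grouping of the smileys mentioned in the answer above.
POSITIVE = (":)", ";)", ":D")
NEGATIVE = (":(",)


def label_by_smiley(text):
    """Label a tweet by counting smileys; ties and no smileys give 'unknown'."""
    pos = sum(text.count(s) for s in POSITIVE)
    neg = sum(text.count(s) for s in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical example tweets, for illustration only.
    for tweet in ["Loved Inception :D", "Worst sequel ever :(", "Saw a film today"]:
        print(tweet, "->", label_by_smiley(tweet))
```

Most tweets carry no smiley at all, so something like this is probably more useful for cheaply labelling training data for a real classifier than as a classifier on its own.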

A few places provide the latter as a service now. Check out ViralHeat and Evri:
http://www.viralheat.com/home/features
http://www.readwriteweb.com/archives/sentiment_analysis_is_ramping_up_in_2009.php

Related

How to filter out unwanted/official Twitter posts

I am now doing an NLP project which needs some resources from twitter.
I want to get those tweets posted by "real people" instead of any kind of "official accounts", including celebrities, ads, institutions, media, etc. such as #CNN #TodayWeather #obama #DailySale #BestPrice #FashionTrend.
Hence, is there a better way to do so?
I have thought about this for a long time. The JSON returned by Twitter's API includes a key called "verified", which can be used to detect whether an account is that kind of "official account". However, the blue "V" tick is no longer only for shining celebrities: anyone can apply for it as long as they are a real person. So I think this approach would rule out a lot of valuable data.
I also considered using a textual spam filter. These work quite well in most cases, but some accounts, such as #FT, never post anything that sounds like a spammy ad, yet they are still not what I want.
I am asking for a better solution. It can be a long-term one, such as using NLP and neural networks to learn from labels, but a quick solution would be very welcome.
THX

How to get the greatest Tweets flow?

I am working on a project using the Twitter API, doing data analysis on Twitter data. I need to stream as many tweets as possible to run my algorithm on them. Everything works well, but I am quite disappointed by the number of tweets I can actually collect.
Using the Streaming API, I only get around 4,500 tweets per hour in an area like London. The more tweets I can collect, the better my analysis. I read somewhere that someone was collecting more than 100,000 per hour...
Do you think my authentication might be rate limited?
Does the Search API allow us to collect more tweets than the Streaming API?
Thank you.

How do I collect 1000 recent tweets of a lot of users (say 80000) from Twitter?

I am trying to collect 1000 recent tweets of a lot of users (around 80000) for my research work. I tried using the REST API, but because of the rate limit it is not practical for 80000 users. Many research papers say they collected tweets and other information from thousands of users, but I could not figure out how they did it. What is the best way to do the same?
Use the Twitter Streaming API.
You will be able to get real-time tweets from everyone. You can filter it down with search parameters if you want.

About data mining by using twitter data

I plan to write a thesis about using sentiment information to improve the predictive power of a financial trading model for currencies.
The sentiment data would be Twitter threads containing a keyword such as "EUR.USD". I will extract sentiment words to identify the sentiment. Simple idea. Then we try to see whether there is any relation between the degree of sentiment and the movement of EUR.USD.
My big concern is the Twitter data. As we all know, Twitter limits how far back you can see historical data: you can only browse back about 5 days. That is not enough, since our strategy is based on daily sentiment.
I noticed that Google has a fantastic feature, a timeline of Twitter updates: http://www.readwriteweb.com/archives/googles_twitter_timeline_lets_you_explore_the_past.php
But first of all, I am in Switzerland and it seems I do not have this function on my Google, which is smart enough to identify my location and may block some US-only Google features like this one. Secondly, even if I could see the fancy interactive Google timeline control in my Firefox, how could I extract the data from my query and save it? Does Google supply such an API?
The Google service you mentioned has shut down recently so you won't be able to use it. (http://www.searchenginejournal.com/google-realtime-shuts-down-as-twitter-deal-expires/31007/)
If you need a longer timespan of data to analyze, I see the following options:
1) pay for historical data :) (https://dev.twitter.com/docs/twitter-data-providers)
2) if you don't want to pay, you need to fetch tweets containing EUR/USD or whatever else (you could use the streaming API for this) and store them somehow. Run this service for a while (if possible) and you'll have more than just 5 days of data.
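
The "store them somehow" part of the second option could look roughly like the sketch below, which uses SQLite from the Python standard library. Everything here is an assumption made for illustration: the table layout, the helper names (`open_store`, `save_tweet`, `tweets_per_day`), and the expectation that whatever streaming client you use hands over a tweet id, an ISO 8601 UTC timestamp and the text. The streaming connection itself is left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the tweet archive; pass a file path to keep data across runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, created_at TEXT NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def save_tweet(conn, tweet_id, created_at, text):
    """Store one tweet; created_at is assumed to be an ISO 8601 string."""
    # INSERT OR IGNORE makes a tweet delivered twice harmless.
    conn.execute(
        "INSERT OR IGNORE INTO tweets (id, created_at, text) VALUES (?, ?, ?)",
        (tweet_id, created_at, text),
    )
    conn.commit()


def tweets_per_day(conn):
    """Return (day, count) pairs, using the first 10 characters of created_at as the day."""
    rows = conn.execute(
        "SELECT substr(created_at, 1, 10) AS day, COUNT(*) "
        "FROM tweets GROUP BY day ORDER BY day"
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_store()
    # Hypothetical tweets, for illustration only.
    save_tweet(conn, 1, "2011-07-01T09:00:00Z", "EUR/USD looking strong")
    save_tweet(conn, 2, "2011-07-02T08:00:00Z", "EUR/USD dropping again")
    print(tweets_per_day(conn))
```

Once the collector has run for a few weeks, grouping by day like this gives the daily buckets that a daily sentiment score would be computed over.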

Twitter Streaming API - tracking exact multiple keywords in exact order

I'm just beginning to play with the Twitter Streaming API.
If I specify
$sc->setTrack(array('just bought from'));
This will correctly pull only tweets that have all 3 keywords - but doesn't maintain the order.
1) I want the keywords to appear in the same order like
"I just bought apple from itunes"
but the above also returns tweets like
"I bought some apples and just removed them from the bag"
2) Is there a way to specify an exact phrase, say "NBA basketball", with nothing in between? That is, I don't want tweets like this to be returned:
"Watching basketball on NBA tv"
I just want tweets that contain the exact phrase, like
"I love watching NBA basketball"
3) Also, is there a way to specify negative keywords?
Any tips on whether this is possible?
Thanks
Currently the answer to all three questions is no. The general recommendation is to do this post-processing on your side. Negative keywords have been asked for quite a bit, but currently we don't have a scheme that would let us support them in a scalable way.
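
As a rough idea of what that client-side post-processing might look like, here is a minimal Python sketch covering the three cases from the question. The helper names are made up for this example, matching is case-insensitive by choice, and the word-boundary handling assumes plain alphanumeric keywords (hashtags or @mentions would need a different pattern).

```python
import re


def contains_in_order(text, keywords):
    """True if all keywords occur in the given order, with anything in between."""
    pattern = r"\b" + r"\b.*?\b".join(re.escape(k) for k in keywords) + r"\b"
    return re.search(pattern, text, re.IGNORECASE | re.DOTALL) is not None


def contains_phrase(text, phrase):
    """True if the words of the phrase occur adjacent to each other and in order."""
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def contains_none(text, negatives):
    """True if none of the negative keywords occur as whole words."""
    return not any(
        re.search(r"\b" + re.escape(n) + r"\b", text, re.IGNORECASE)
        for n in negatives
    )


if __name__ == "__main__":
    # The example tweets are the ones quoted in the question.
    words = ["just", "bought", "from"]
    print(contains_in_order("I just bought apple from itunes", words))
    print(contains_in_order("I bought some apples and just removed them from the bag", words))
    print(contains_phrase("I love watching NBA basketball", "NBA basketball"))
    print(contains_phrase("Watching basketball on NBA tv", "NBA basketball"))
```

The stream would still be tracked with the broad keyword set, and each incoming tweet would be passed through these checks before it is kept.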
