Ways to pull (potentially) large amounts of data from Twitter - twitter

I've been playing around with the Twitter API using Twitter4j. I am trying to pull data given a keyword and date, and example of a query I would run using the REST API would be
bagels since:2014-12-27
Which would give me all tweets containing the keyword 'bagels' since 2014-12-27.
This works in theory, but I've quickly exceeded the rate limits since each query allows up to 100 results, and only 180 queries are allowed within a 15-minute interval. There are many keywords that return more than 18k results.
Is there a better way to pull large amounts of data from Twitter? I looked at the Streaming API but I don't know if I can pull data from a certain date range.

There are a few things you can do to improve your rates:
Make sure your count is maxed at 100, which it looks like you're doing.
Use Application-Only authorization - it increases your rate limit to 450.
Use the max_id, since_id parameters to page through data and avoid querying for results you're already received. See the Working with Timelines docs to see what I mean.
Consider using Gnip if you're willing to pay to remove rate limits.

Related

Getting realtime twitter search results using the streaming API

I have an application where I need to get complete, realtime search results from twitter (preferably polling every 500ms or less). Based on my understanding, doing this using the search API will run into rate limits very quickly. However, the streaming API doesn't seem to support getting complete anything (only a 5% sample).
More specifically, I have a search query term which typically comes up with <20 matching tweets per hour, and I would like to be informed of these new tweets within 1-2 seconds, and it is considered a failure if I am not notified within 5 seconds. Due to the relatively low frequency of posting, missing even one tweet is very undesirable.
Is there any way I can realistically do this using twitter API, or is my only choice to write a browser extension to repeatedly refresh the search page?
The answer is "yes". Although you are rate limited (the limit is closer to 1% than 5%), that is only a cutoff based on your query results. Very roughly, you can stream about 60 tweets per second max. In your case, you say you expect under 20 tweets per hour, so you should have no problem getting all those tweets.
You also require a latency less than 5 seconds. In my experience latency has always been a second or two. I think you should be fine.

Twitter rate limiting confusion?

So I'm currently using NodeXL to get search for a particular Twitter hashtag, and I'm having trouble on understand how exactly the rate-limiting works. I looked it up in Twitter's API Rate Limits page, and also this SO post, but even after reading both, I don't really understand. The API page says:
Search will be limited at 180 queries per 15 minute window for the time being.
and also
Rate limiting in version 1.1 of the API is primarily considered on a per-user basis — or more accurately described, per access token in your control. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window per leveraged access token.
But I'm totally confused... probably because I've never really worked with anything database, or social network analysis before.
When it says that it always 180 queries per 15 minutes, what exactly constitutes a query? The way the search works on NodeXL is that you limit the amount of tweets you are searching for. So if I search once and set my tweet limit to 1000 tweets, is that only 1 query?
Sorry if this seems like a stupid or really elementary question, but I just don't have any experience with this stuff at all, and any help would be much appreciated, thanks!
When it says that it always 180 queries per 15 minutes, what exactly
constitutes a query?
Whenever you make one request to Twitter, its considered as one query. For Search API you can make 180 calls per 15 minute.
So if I search once and set my tweet limit to 1000 tweets, is that
only 1 query?
Yes, but you can't set count to 1000 since the maximum tweets you can return per request is 100 as it mentioned here.
You can retrieve the latest 100 tweets with the normal search query and for pagination you should use since_id and max_id to retrieve the next 100 tweets for fresh tweets.
The number of queries you can make per 15 min windows varies by API. For example, you can query 180 requests per 15 min window if you use Search API. But, if you use API like GET friends/ids, it's limited to 15 query per 15 min windows. i.e you can make call only 15 times per 15 minutes.
Here's the Rate limits chart where you can find how many requests you can make per 15 window for each API.

How does Twitter Search API rate limit work?

I am not clear about what the Twitter rate limit, "350 requests per hour per access token/user", means. How they are limiting the request? In 1 request how much data i can get?
The rate limits are based on request, not the amount of data (e.g. bytes) you receive. With that in mind, you can maximize requests by using the available parameters of the particular endpoint you're calling. I'll give you a couple examples to explain what I mean:
One way is to set count, if supported, to the highest available value. On statuses/home_timeline, you can max out count at 200. If you aren't using it now, you're getting the default of 20, which means that you would (theoretically) need to do 10 more queries the get the same amount of data. More queries mean you eat up rate limit.
Using statuses/home_timeline again, notice that you can page through data using since_id and max_id, as described in Working with Timelines. Essentially, you keep track of the tweets you already requested so you can save on requests by only getting the newest tweets.
Rate limits are in 15 minute windows, so you can pace your requests to minimize the chance of running out in any give time window.
Use a combination of Streams and requests, which increases your rate limit.
There are more optimizations like this that help you save request limit, some more subtle than others. Looking at the rate limits per API, studying parameters, and thinking about how the API is used can help you minimize rate limit usage.

Getting as much tweets associated to a day's trends as possible

I am storing in a database, every 30 minutes, Twitter's trending topics of a country Y. No problem with that.
Now, I want to get as much tweets as possible matching those trending topics for research purposes.
Since I would like to study the patterns of the trends, I would like continuous tweet data of at least 3 days centered in the day the trend peak was detected, for every trending topic. In order to achieve that, I thought of doing the following:
Suppose I am in day X. I could retrieve the unique trends of day X-2, and for every trend, look for tweets matching the trend in the interval [X-3, X-1], that is 3 days. However, the problem here is Twitter rate limitations. If I have 100 trending topics in day X-2, and I make 20 GET search requests/trend, I would end up doing a total of 2,000 requests, which overpasses Twitter's 350 hourly rate limit. If make 300 req/hour, it would take more than 6 hours to get the data for only one day...
Does anybody know any other (better) way for getting tweets associated with trends?
Thanks in advance
Twitter Streaming API?
Twitter Streaming API doesn't deliver any past tweets. You only receive tweets starting from the time the server connection is established. The search API will return tweets matching the current query up to 7 days old in theory, but that is entirely up to Twitter’s current load. (Note*-At times this interval has been as short as 24 hours. In addition, you are limited by the ability to only receive up to 1,500 tweets regardless of how old they are.)
Is there any way to get more tweets from the streaming?
None that I know. But, do refer the below mentioned information if you are considering to switch among search or streaming API.
Please choose your case:
If you need real time data and your number of requests are high:
Go for Streaming API
The streaming API requires that you keep the connection active. This requires a server process with an infinite loop, to get the latest tweets.
Advantage
1)Lag in retrieving results: Tweets delivered with this method are basically real-time, with a lag of a second or two at most between the time the tweet is posted and it is received from the API
2)Not rate limited.
If you need aggregate data regardless of its time range and your number of requests are not high:
Go for Search API
The search API is the easier of the two methods to implement but it is rate limited .Each request will return up to 100 tweets, and you can use a page parameter to request up to 15 pages, giving you a theoretical maximum of 1,500 tweets for a single query.
Advantage
1)Finding tweets in the past:The search API wins by default in this area, because the streaming API doesn’t deliver any past tweets
2)Easier to implement

Twitter Search API rate limit

I'm using the Twitter search API (for example: http://search.twitter.com/search.rss?q=%23juventus&rpp=100&page=4)
I read here: http://search.twitter.com/api/ this:
We do not rate limit the search API under ordinary circumstances, however we have put measures in place to limit the abuse of our API. If you find yourself encountering these limits, please contact us and describe your app's requirements.
The limit seems random: sometimes I do 150 requests sometimes 300, generally, after 5 minutes I can do other requests.
I was wondering if is it possible do more requests
They'll detect floods and throttle accordingly rather than publisher defined limits, which is why it appears random. It'll also no doubt be based on load from other sources at the time.
If you need lots more, then they gave you the answer - contact them telling them why.

Resources