Getting as much tweets associated to a day's trends as possible - twitter

I am storing in a database, every 30 minutes, Twitter's trending topics of a country Y. No problem with that.
Now, I want to get as much tweets as possible matching those trending topics for research purposes.
Since I would like to study the patterns of the trends, I would like continuous tweet data of at least 3 days centered in the day the trend peak was detected, for every trending topic. In order to achieve that, I thought of doing the following:
Suppose I am in day X. I could retrieve the unique trends of day X-2, and for every trend, look for tweets matching the trend in the interval [X-3, X-1], that is 3 days. However, the problem here is Twitter rate limitations. If I have 100 trending topics in day X-2, and I make 20 GET search requests/trend, I would end up doing a total of 2,000 requests, which overpasses Twitter's 350 hourly rate limit. If make 300 req/hour, it would take more than 6 hours to get the data for only one day...
Does anybody know any other (better) way for getting tweets associated with trends?
Thanks in advance

Twitter Streaming API?
Twitter Streaming API doesn't deliver any past tweets. You only receive tweets starting from the time the server connection is established. The search API will return tweets matching the current query up to 7 days old in theory, but that is entirely up to Twitter’s current load. (Note*-At times this interval has been as short as 24 hours. In addition, you are limited by the ability to only receive up to 1,500 tweets regardless of how old they are.)
Is there any way to get more tweets from the streaming?
None that I know. But, do refer the below mentioned information if you are considering to switch among search or streaming API.
Please choose your case:
If you need real time data and your number of requests are high:
Go for Streaming API
The streaming API requires that you keep the connection active. This requires a server process with an infinite loop, to get the latest tweets.
Advantage
1)Lag in retrieving results: Tweets delivered with this method are basically real-time, with a lag of a second or two at most between the time the tweet is posted and it is received from the API
2)Not rate limited.
If you need aggregate data regardless of its time range and your number of requests are not high:
Go for Search API
The search API is the easier of the two methods to implement but it is rate limited .Each request will return up to 100 tweets, and you can use a page parameter to request up to 15 pages, giving you a theoretical maximum of 1,500 tweets for a single query.
Advantage
1)Finding tweets in the past:The search API wins by default in this area, because the streaming API doesn’t deliver any past tweets
2)Easier to implement

Related

What's the most efficient way to handle quota for the YouTube Data API when developing a chat bot?

I'm currently developing a chat bot for one specific YouTube channel, which can already fetch messages from the currently active livechat. However I noticed my quota usage shooting up, so I took the "liberty" to calculate my quota cost.
My API call currently looks like this https://www.googleapis.com/youtube/v3/liveChat/messages?liveChatId=some_livechat_id&part=snippet,authorDetails&pageToken=pageTokenIfProvided, which uses up 5 units. I checked this by running one API call and comparing the quota usage before and after (so apologies, if this is inaccurate). The response contains pollingIntervalMillis set to 5086 milliseconds. Currently, my bot adds that interval to the current datetime and schedules the next fetch at that time (using Celery), so it currently fetches messages at a rate of 4-6 seconds. I'm gonna take the liberty and always wait for 6 seconds.
Calculating my API quota would result in a usage of 72.000 units per day:
10 requests per minute * 60 minutes * 24 hours = 14.400 requests per day
14.400 requests * 5 units per request = 72.000 units per day
This means that if I used the pollingIntervalMillis as a guideline for how often to request, I'd easily reach the maximum quota of 10.000 units by running the bot for 3 hours and 20 minutes. In order to not use up the quota by just fetching chat messages, I would need to run 1 API call per minute (1,3889 approximately). This is very unfeasible for a chatbot, since this is only for fetching messages and not even sending any messages to the chat.
So my question is: Is there maybe a more efficient way to fetch chat messages which won't use up the quota so much? Or will I only get this resolved by applying for a quota extension? And if this is only resolved by a quota extension, how much would I need to ask for reliably? Around 100k units? Even more?
I am also asking myself how something like Streamlabs Chatbot (previously known as AnkhBot) accomplishes this without hitting the quota limit despite thousands of users using their API client, their quota must probably be in the millions or billions.
And another question would be how I'd actually fill out the form, if the bot is still in this "early" state of development?
You pretty much hit the nail on the head. Services like Streamlabs are owned by larger companies, in their case Logitech. They not only have the money to throw around for things like increasing their API quota, but they also have professional relationships with companies like Google to decrease their per unit cost.
As for efficiency, the API costs are easily found in the documentation, but for live chat as you've found, you're going to be hitting the API for 5 units per hit. The only way to improve your overall daily cost with your calls is to perform them less frequently. While once per minute is clearly excessively long, once every 15-18 seconds could reduce the overall cost of your API quota increase, while making the chat bot adequately responsive.
Of course that all depends on your desired usage of the data, but still a recommendation if you're implementing the bot still in the realm of hobbyist usage.

Getting realtime twitter search results using the streaming API

I have an application where I need to get complete, realtime search results from twitter (preferably polling every 500ms or less). Based on my understanding, doing this using the search API will run into rate limits very quickly. However, the streaming API doesn't seem to support getting complete anything (only a 5% sample).
More specifically, I have a search query term which typically comes up with <20 matching tweets per hour, and I would like to be informed of these new tweets within 1-2 seconds, and it is considered a failure if I am not notified within 5 seconds. Due to the relatively low frequency of posting, missing even one tweet is very undesirable.
Is there any way I can realistically do this using twitter API, or is my only choice to write a browser extension to repeatedly refresh the search page?
The answer is "yes". Although you are rate limited (the limit is closer to 1% than 5%), that is only a cutoff based on your query results. Very roughly, you can stream about 60 tweets per second max. In your case, you say you expect under 20 tweets per hour, so you should have no problem getting all those tweets.
You also require a latency less than 5 seconds. In my experience latency has always been a second or two. I think you should be fine.

Ways to pull (potentially) large amounts of data from Twitter

I've been playing around with the Twitter API using Twitter4j. I am trying to pull data given a keyword and date, and example of a query I would run using the REST API would be
bagels since:2014-12-27
Which would give me all tweets containing the keyword 'bagels' since 2014-12-27.
This works in theory, but I've quickly exceeded the rate limits since each query allows up to 100 results, and only 180 queries are allowed within a 15-minute interval. There are many keywords that return more than 18k results.
Is there a better way to pull large amounts of data from Twitter? I looked at the Streaming API but I don't know if I can pull data from a certain date range.
There are a few things you can do to improve your rates:
Make sure your count is maxed at 100, which it looks like you're doing.
Use Application-Only authorization - it increases your rate limit to 450.
Use the max_id, since_id parameters to page through data and avoid querying for results you're already received. See the Working with Timelines docs to see what I mean.
Consider using Gnip if you're willing to pay to remove rate limits.

Twitter rate limiting confusion?

So I'm currently using NodeXL to get search for a particular Twitter hashtag, and I'm having trouble on understand how exactly the rate-limiting works. I looked it up in Twitter's API Rate Limits page, and also this SO post, but even after reading both, I don't really understand. The API page says:
Search will be limited at 180 queries per 15 minute window for the time being.
and also
Rate limiting in version 1.1 of the API is primarily considered on a per-user basis — or more accurately described, per access token in your control. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window per leveraged access token.
But I'm totally confused... probably because I've never really worked with anything database, or social network analysis before.
When it says that it always 180 queries per 15 minutes, what exactly constitutes a query? The way the search works on NodeXL is that you limit the amount of tweets you are searching for. So if I search once and set my tweet limit to 1000 tweets, is that only 1 query?
Sorry if this seems like a stupid or really elementary question, but I just don't have any experience with this stuff at all, and any help would be much appreciated, thanks!
When it says that it always 180 queries per 15 minutes, what exactly
constitutes a query?
Whenever you make one request to Twitter, its considered as one query. For Search API you can make 180 calls per 15 minute.
So if I search once and set my tweet limit to 1000 tweets, is that
only 1 query?
Yes, but you can't set count to 1000 since the maximum tweets you can return per request is 100 as it mentioned here.
You can retrieve the latest 100 tweets with the normal search query and for pagination you should use since_id and max_id to retrieve the next 100 tweets for fresh tweets.
The number of queries you can make per 15 min windows varies by API. For example, you can query 180 requests per 15 min window if you use Search API. But, if you use API like GET friends/ids, it's limited to 15 query per 15 min windows. i.e you can make call only 15 times per 15 minutes.
Here's the Rate limits chart where you can find how many requests you can make per 15 window for each API.

How does Twitter Search API rate limit work?

I am not clear about what the Twitter rate limit, "350 requests per hour per access token/user", means. How they are limiting the request? In 1 request how much data i can get?
The rate limits are based on request, not the amount of data (e.g. bytes) you receive. With that in mind, you can maximize requests by using the available parameters of the particular endpoint you're calling. I'll give you a couple examples to explain what I mean:
One way is to set count, if supported, to the highest available value. On statuses/home_timeline, you can max out count at 200. If you aren't using it now, you're getting the default of 20, which means that you would (theoretically) need to do 10 more queries the get the same amount of data. More queries mean you eat up rate limit.
Using statuses/home_timeline again, notice that you can page through data using since_id and max_id, as described in Working with Timelines. Essentially, you keep track of the tweets you already requested so you can save on requests by only getting the newest tweets.
Rate limits are in 15 minute windows, so you can pace your requests to minimize the chance of running out in any give time window.
Use a combination of Streams and requests, which increases your rate limit.
There are more optimizations like this that help you save request limit, some more subtle than others. Looking at the rate limits per API, studying parameters, and thinking about how the API is used can help you minimize rate limit usage.

Resources