Twitter Streaming API to follow thousands of users - twitter

I'm considering using the Twitter Streaming API (public streams) to keep track of the latest tweets for many users (up to 100k). Despite having read various sources regarding the different rate limits, I still have a couple of questions:
According to the documentation, the default access level allows up to 400 track keywords and 5,000 follow user IDs. What are the best practices to follow more than 5k users? Creating, for example, 20 applications to get 20 different access tokens?
If I follow just one single user, does the rule of thumb "You get about 1% of all tweets" indeed apply? And how does this change if I add more users, up to 5k?
Might using the REST API be a reasonable alternative somehow, e.g., by polling the latest tweets of users on a minute-by-minute basis?

What are the best practices to follow more than 5k users? Creating, for example, 20 applications to get 20 different access tokens?
You don't want to use multiple applications. This response from a mod sums up the situation well. The Twitter Streaming API documentation also specifically calls out devs who attempt to do this:
Each account may create only one standing connection to the public endpoints, and connecting to a public stream more than once with the same account credentials will cause the oldest connection to be disconnected.
Clients which make excessive connection attempts (both successful and unsuccessful) run the risk of having their IP automatically banned.
A rate limit is a rate limit--you can't get more than Twitter allows.
If I follow just one single user, does the rule of thumb "You get about 1% of all tweets" indeed apply? And how does this change if I add more users up to 5k?
The 1% rule still applies, but it is very unlikely, if not impossible, for one user to be responsible for even 1% of all tweet volume in a given time interval. More users means more tweets, but unless all 5k are very high-volume tweeters you shouldn't have a problem.
Might using the REST API be a reasonable alternative somehow, e.g., by polling the latest tweets of users on a minute-by-minute basis?
Interesting idea, but probably not. You're also rate-limited in the REST API. For GET statuses/user_timeline, the rate limit is 180 queries per 15 minutes, and you can only get the tweets of one user per request with this endpoint. The regular GET search/tweets doesn't accept a user ID as a parameter, so you can't take advantage of that either (it is also limited to 180 queries per 15 minutes).
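To see why minute-by-minute polling doesn't scale to 100k users, here is a rough back-of-the-envelope sketch in Python, assuming a single access token and the 180-calls-per-15-minutes limit quoted above:

    # Back-of-the-envelope: one token, GET statuses/user_timeline, 180 calls / 15 min.
    USERS = 100_000            # accounts to track (from the question)
    CALLS_PER_WINDOW = 180     # v1.1 per-token limit for statuses/user_timeline
    WINDOW_MINUTES = 15

    windows = -(-USERS // CALLS_PER_WINDOW)          # ceiling division -> 556 windows
    hours_per_pass = windows * WINDOW_MINUTES / 60   # ~139 hours for a single full pass
    print(f"one full pass over {USERS} users takes about {hours_per_pass:.0f} hours")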
The Twitter Streaming and REST API overviews are excellent and merit a thorough reading. Tweepy unfortunately has spotty documentation and Twython isn't much better, but both are fairly thin wrappers around the Twitter APIs, so understanding the APIs themselves will give you a good sense of how everything works. Good luck!
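For reference, a single authorized connection following up to 5,000 user IDs looks roughly like this with Tweepy; this is a sketch against the 3.x streaming interface, and the credentials and the user_ids list are placeholders:

    import tweepy

    class PrintListener(tweepy.StreamListener):
        def on_status(self, status):
            # Handle each incoming tweet from one of the followed accounts
            print(status.user.screen_name, status.text)

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)    # placeholders
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    stream = tweepy.Stream(auth, PrintListener())
    stream.filter(follow=user_ids)   # at most 5,000 string user IDs per connection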

To get past the 400 keywords and 5,000 followed user IDs, you need to apply for enterprise access.
Basic:
400 keywords, 5,000 user IDs and 25 location boxes
One filter rule on one allowed connection; disconnection required to adjust the rule
Enterprise:
Up to 250,000 filters per stream, up to 2,048 characters each
Thousands of rules on a single connection; no disconnection needed to add/remove rules using the Rules API
https://developer.twitter.com/en/enterprise

Related

Desire2Learn Max API Limit

Is there a max limit on the Valence API? I've made a number of calls, but I put some self-throttling in the program. It makes a call to the user page, loops through the data, and then makes another call. It probably averaged 1 call every second or so.
I'm looking at expanding some functionality and I'm worried that we may reach a limit if we aren't careful about how we go about doing everything.
So, is there a limit to how often we can call the Valence API?
The back-end LMS can be configured to rate limit Valence Learning Framework API calls; however, by default this does not get configured as active. To be sure, you should consult with the administrators of your back-end LMS.
Update: Brightspace no longer supports the kind of rate limiting mentioned above. As Brightspace evolved, D2L found that the rate limiting was not providing the value originally intended, and as a result D2L deprecated the feature. D2L no longer rate limits the Brightspace APIs and instead depends on developer self-governance and on asynchronous APIs for more resource-intensive operations (the APIs around importing courses, for example). When you use the Brightspace APIs, you should be mindful that you are using the same computing resources made available to end users interacting with the web UI, and if you over-stress these resources (as can easily be done through any API), you can have a negative impact on those end users.
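Self-governance can be as simple as the self-throttling the question already describes. A minimal sketch of such a client-side limiter, with a hypothetical fetch() standing in for whatever Valence call you actually make:

    import time

    class SelfThrottle:
        """Crude client-side limiter: allow at most `rate` calls per second."""
        def __init__(self, rate=1.0):
            self.min_interval = 1.0 / rate
            self.last_call = 0.0

        def wait(self):
            elapsed = time.monotonic() - self.last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self.last_call = time.monotonic()

    throttle = SelfThrottle(rate=1.0)   # roughly one call per second, as in the question
    # for page in pages_to_fetch:       # pages_to_fetch and fetch() are placeholders
    #     throttle.wait()
    #     fetch(page)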

Will Twitter's rate limits allow me to do the data mining necessary to construct a complete social network graph of about 600K users?

Primary question: Will Twitter's rate limits allow me to do the data mining necessary to construct a complete social network graph with all directed edges among about 600K users?
Here is the idea:
The edges/ties/relations in the network will be follower/followed relationships.
Start with a specific list of approximately 600 Twitter users, chosen because they are all affiliated with the news outlets in a large city.
Collect all of the followers and friends (people they follow) for all 600 users. These users probably average about 2,000 followers and about 500 friends (people they follow) each.
Since these 600 accounts are all in the same city, many of their followers are expected to overlap. So let's approximate and guess that these 600 users have roughly 600,000 distinct followers and friends in total. This would be a subgraph/network of 600,600 total Twitter users.
So once I have collected the 600,000 followers and friends of these 600 people, I want to be able to construct a social network of all 600,600 people AND their followers. This would require me to at least find all of the directed edges among these 600,600 users (whether or not each of them follows each of the others).
With Twitter rate limits, would this kind of data mining be feasible?
I'll answer these questions in reverse order, starting with David Marx first:
Well, I do have access to a pretty robust computer research center with a ton of storage capacity, so that should not be an issue. I don't know if the software can handle it, however.
Chances are that I will have to scale down the project, which is OK. The idea for me is to start out with a bigger idea, figure out how big it can be, and then pare down accordingly.
Following up on Anony-Mousse's question now: Part of my problem is that I am not sure I am interpreting the Twitter rate limits correctly. I'm not sure if it's 15 requests per 15 minutes, or 30 requests per 15 minutes. And I think 1 request will get 5000 followers/friends, so you could presumably collect 75,000 friends or followers every 15 minutes if the limit is 15 requests per 15 minutes. I'm also trying to figure out if there is any process for requesting higher rate limits for any kind of research purposes.
Here is where they list the limits:
https://dev.twitter.com/docs/rate-limiting/1.1/limits
Primary question: Will Twitter's rate limits allow me to do the data mining (...)
Yes, it is technically feasible; however, it will take ages if you are using only one API access token. I mean probably more than 6 months of uninterrupted running.
To be more precise:
the extraction of nodes (Twitter users) can be done very quickly, as you will use the users/lookup API endpoint, which lets you extract 100 nodes per request and make 180 requests per 15-minute window (per access token you have)
the extraction of edges (follow relationships between users) is the slow part: you will use the friends/ids and followers/ids API endpoints, limited to 15 queries per 15 minutes and returning at most 5,000 friends or followers of a single user per request
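As an illustration of the slow part, here is a rough Tweepy 3.x sketch that pulls follower IDs for one user while sleeping through the 15-requests-per-15-minutes windows; the credentials and the celebrity cutoff are assumptions:

    import tweepy

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)    # placeholders
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    # wait_on_rate_limit makes Tweepy sleep until the 15-minute window resets
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    def follower_ids(user_id, max_followers=1_000_000):
        """Collect follower ids for one user, skipping celebrity-sized accounts."""
        user = api.get_user(user_id=user_id)
        if user.followers_count > max_followers:
            return []                # skip celebrities, as advised below
        ids = []
        for page in tweepy.Cursor(api.followers_ids, user_id=user_id, count=5000).pages():
            ids.extend(page)         # each page is one request, up to 5,000 ids
        return ids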
You can use the node metadata (description texts, locations, languages, time zones) to perform some interesting analysis, even without having extracted the 'graph' (the follow relationships between everyone).
A workaround is to parallelize sub-parts of the extraction by spreading it across several access tokens. This seems compliant with the terms of use to me, as long as you respect protected accounts.
In any case you should filter out the extraction of edges for celebrities (you probably do not want to extract the followers of hootsuite; there are almost 6 million of them).
Disclaimer (self-promotion here): in case you do not want to develop this yourself, I could do the extraction for you and provide you the graph file, as I am extracting Twitter graphs at tribalytics. (I have read this and that before posting.)
I'm also trying to figure out if there is any process for requesting higher rate limits for any kind of research purposes
Officially, there are no more white-listed apps with higher rate limits, as there were with previous versions of Twitter's API. You probably should still contact Twitter and see whether they can help you, as your work is for academic purposes.
Chances are that I will have to scale down the project, which is OK
I would advise you to reduce your initial list of 600 users as much as you can. Only keep those who are really central to your topic and whose audience is not too large. Extracting the graphs of local celebrities will give you a graph with many people not related at all to the population you want to study.

Is It Possible to Apply SQS Limits for IAM Users?

I'm currently working on a project which has a large amount of IAM users, each of whom need limited access to particular SQS queues.
For instance, let's say I have an IAM user named 'Bob' and an SQS queue named 'BobsQueue'. What I'd like to do is grant Bob full permission to manage his queue (BobsQueue), but I'd like to restrict his usage such that:
Bob can make only 10 SQS requests per second to BobsQueue.
Bob cannot make more than 1,000,000 SQS requests per month.
I'd essentially like to apply arbitrary usage restrictions to this SQS queue.
Any ideas?
Off the top of my head, none of the available AWS services offers resource usage limits at all, except where limits are built into the service's basic modus operandi (e.g. the Provisioned Throughput in Amazon DynamoDB). Amazon SQS is no exception, insofar as the Available Keys supported by all AWS services that adopt the access policy language for access control currently lack such resource limit constraints.
While I can see your use case, I think it's more likely that something like this would see the light of day as an accounting/billing feature, insofar as it would make sense to allow cost control by setting (possibly fine-grained) limits on AWS resource usage; this isn't available yet either, though.
Please note that this feature is frequently requested (see e.g. How to limit AWS resource consumption?) and its absence actually makes it possible to launch what Christofer Hoff aptly termed an Economic Denial of Sustainability attack (see The Google attack: How I attacked myself using Google Spreadsheets and I ramped up a $1000 bandwidth bill for a somewhat ironic and actually non-malicious example).
Workaround
You might be able to achieve an approximation of your specification by using Shared Queues with an IAM policy granting access to user Bob, as outlined in Example AWS IAM Policies for Amazon SQS, and monitoring this queue with Amazon CloudWatch, in turn by Creating Amazon CloudWatch Alarms for one or more of the Amazon SQS Dimensions and Metrics you want to limit, e.g. NumberOfMessagesSent. Once the limit is reached, you could revoke the IAM grant for user Bob for this shared queue until he is in compliance again.
Obviously it is not necessarily trivial to implement the 'per second'/'per month' specification based on this metric alone without some thorough bookkeeping, nor will you be able to 'pull the plug' precisely when the limit is reached; rather, you will need to account for processing time and API delays.
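A minimal sketch of the monitoring half in Python with boto3, assuming the queue is literally named BobsQueue and that a hypothetical SNS topic triggers whatever process revokes Bob's IAM grant:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when more than 3,000 messages are sent in a 5-minute period,
    # i.e. roughly the 10-requests-per-second budget from the question.
    cloudwatch.put_metric_alarm(
        AlarmName="BobsQueue-send-rate",
        Namespace="AWS/SQS",
        MetricName="NumberOfMessagesSent",
        Dimensions=[{"Name": "QueueName", "Value": "BobsQueue"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=3000,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:throttle-bob"],  # placeholder ARN
    )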
Good luck!

Twitter Rate Limiting IP/OAuth Concerns

I have a series of webapps that collects all terms relating to a subject using the Public Streaming API. So far, I've been taking a very, very arduous route of creating a new account for each stream, setting up a new Twitter application on that account, copying the OAuth tokens, spinning up a new EC2 instance, and setting up the stream.
This allows me to have the streams coming from multiple different IPs, OAuth generation is easy with the generator tool when you create an app, and because each stream is in a different account I don't hit any account limits.
I'm wondering if there's anything I can do to speed up the process, specifically in terms of EC2 instances. Can I have a bunch of streams running off the same instance using different accounts?
If you run multiple consumers from a single machine you may be temporarily banned, and repeated bans may end up getting you banned for longer periods.
At least, this happened to me a few times in the past.
What I had found at the time was:
same credentials, same IP -> block/ban
different credentials, same IP -> mostly OK, but banned from time to time
different credentials, different IP -> OK
This was a few years ago, so I am not sure the same is still true, but I'd expect twitter to have tightened the rules, rather than having relaxed them.
(Also, I think you're infringing their ToS)
You should check the new Twitter API version 1.1. It was released a few days ago and many changes were made to how the rate limits are calculated.
One big change is that the IP is completely ignored now. So you don't have to create many instances anymore (profit!)
From the Twitter dev @episod:
Unlike in API v1, all rate limiting is per user per app -- IP address has no involvement in rate limiting consideration. Rate limits are completely partitioned between applications.
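Since limits are tracked per (user, app) pair, you can check where each access token stands with GET application/rate_limit_status. A small Tweepy sketch, with placeholder credentials:

    import tweepy

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)    # placeholders
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    # Remaining budget for this particular token/app pair
    limits = api.rate_limit_status()["resources"]["search"]["/search/tweets"]
    print(limits["remaining"], "of", limits["limit"], "search calls left in this window")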

How to prevent gaming of website rewards for new visitors

I'm about to embark on a website build where a company wants to reward new visitors with a gift. The gift has some monetary value, and I'm concerned about the site being gamed. I'm looking for ways to help reduce the chance that any one person can drain the entire gift inventory.
The plans call for an integration with Facebook, so authenticating with your FB credentials will provide at least a bit of confidence that a new visitor is actually a real person (assuming that scripting the creation of hundreds of FB accounts and then authenticating with them is no simple task).
However, there is also a requirement to reward new visitors who do not have FB accounts, and this is where I'm looking for ideas. An email verification system by itself won't cut it, because it's extremely easy to obtain a countless number of email addresses (me+1@gmail.com, me+2@gmail.com, etc.). I've been told that asking for a credit card number is too much of a barrier.
Are there some fairly solid strategies or services for dealing with situations like this?
EDIT: The "gift" is virtual - like a coupon
Ultimately, this is an uphill, losing battle. If there is an incentive to beat the system, someone will try, and they will eventually succeed. (See for example: every DRM scheme ever implemented.)
That said, there are strategies to reduce the ease of gaming the system.
I wouldn't really consider FB accounts to be that secure. The barrier to creating a new FB account is probably only negligibly higher than creating a new webmail account.
Filtering by IP address is bound to be a disaster. There may be thousands of users behind a proxy on a single IP address (cough, AOL), and a scammer could employ a botnet to distribute each account request to a unique IP. It is likely to be more trouble than it is worth to preemptively block IPs, but you could analyze the requests later (for example, before actually sending the reward) to see if there's a lot of suspicious behavior from an IP.
Requiring a credit card number is a good start, but you've already ruled that out. Also consider that one individual can have 10 or more card numbers between actual credit cards, debit cards, and one-time-use card numbers.
Consider sending a verification code via SMS to PSTN numbers. This will cost you some money (a few cents per message), but it also costs a scammer a decent amount of change to acquire a large number of phone numbers to receive those messages. (Depending on the value of your incentive, the cost of a prepaid SIM may make it cost-prohibitive.) Of course, if a scammer already has many SMS-receiving PSTN numbers at his disposal, this won't work.
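One stateless way to handle the code half of this is to derive a short code from an HMAC over the phone number and a time window. A sketch, where send_sms() stands in for whichever SMS gateway you use:

    import hashlib, hmac, os, time

    SECRET = os.urandom(32)   # in practice, load a fixed secret from configuration

    def issue_code(phone_number):
        """Derive a 6-digit code tied to the phone number and a 5-minute window."""
        window = int(time.time() // 300)
        digest = hmac.new(SECRET, f"{phone_number}:{window}".encode(), hashlib.sha256)
        return str(int(digest.hexdigest(), 16))[-6:]

    def verify_code(phone_number, submitted):
        return hmac.compare_digest(issue_code(phone_number), submitted)

    # send_sms(number, issue_code(number))   # send_sms() is whatever gateway you choose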
First thing I wonder is if these gifts need to be sent to a physical address. It's easy to spoof 100 email addresses or FB accounts but coming up with 100 clearly unique physical addresses is much harder, obviously.
Of course, you may be giving them an e-coupon or something, so an address might not be an option.
Once upon a time I wrote a pretty intense anti-gaming script for a contest judging utility. While this was many months of development and is far too complex to describe in great detail, I can outline the basic features of the script:
For one, we logged every detail we could when a user applied for the contest. It was pretty easy to catch obvious similarities between accounts by factoring in the average time between logins/submissions along with a group of criteria (like IP, browser, etc.; all things that can be spoofed, so by themselves they are unreliable). In addition, I compared account credentials for obvious gaming, like acct1@yahoo.com, acct2@yahoo.com, etc., using a combination of Levenshtein distance (which is not solely reliable on its own) and a parsing script that broke apart the various details of the credentials and looked for patterns.
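As a rough illustration of that kind of credential check, here is a small sketch that flags suspiciously similar sign-up emails; it uses difflib's ratio as a stand-in for a proper Levenshtein distance, and the threshold is an assumption you would tune:

    from difflib import SequenceMatcher
    from itertools import combinations

    def similarity(a, b):
        """Similarity in [0, 1]; a stand-in for a normalized Levenshtein distance."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def flag_similar_accounts(emails, threshold=0.8):
        """Return pairs of sign-up emails whose local parts look suspiciously alike."""
        flagged = []
        for a, b in combinations(emails, 2):
            if similarity(a.split("@")[0], b.split("@")[0]) >= threshold:
                flagged.append((a, b))
        return flagged

    print(flag_similar_accounts(["acct1@yahoo.com", "acct2@yahoo.com", "alice@example.com"]))
    # [('acct1@yahoo.com', 'acct2@yahoo.com')]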
Depending on the scores of each test, we assigned a probability of gaming as well as a list of possible account matches. Then it was up to the admins to exclude them from the results.
You could go on for months refining your algorithm and never get it perfect. That's why my script only flagged accounts and did not take any automatic action.
Since you're talking about inventory, can we therefore assume your gift is an actual physical item?
If so, then delivery of the gift will require a physical address for delivery - requiring unique addresses (or, allowing duplicate addresses but flagging those users for manual review) should be a good restriction.
My premise is this: While you can theoretically run a script to create 100s of Facebook or Google accounts, exercising physical control over hundreds of distinct real world delivery locations is a whole different class of problem.
I would suggest a more 'real world' solution instead of all the security: make it clear that it is one coupon per address, meaning a physical (delivery and/or payment) address. Then just do as you want; maybe limit it by email or something for the looks of it, but in the end, limit it per real end user, not per person receiving the coupon.

Resources