Get public users of a service (Tumblr, Twitter)

Assuming it's not available as part of the API, how can one obtain a full or partial list of public users of a web service, e.g. Twitter, Tumblr, YouTube?
Acceptable alternative: get a random public user.
I was interested in this for testing APIs with a random account. This is useful for catching edge cases when developing an app against the API. For example, when developing a Tumblr theme, it helps to see what volumes of text and images are posted, how special characters are used, and so on.

Can you even imagine a full list of (public) users of a widely used web service? That is a vast amount of data. I doubt that any API would offer it, for several reasons:
performance/load issues,
data/information privacy,
potential for abuse, ...
Regular use of a service's API simply doesn't need such a list; anything that does need it smells of gray- or black-hat techniques.
Anyway, to answer your question objectively: to get a full or partial list of users from a web service, the service has to provide some kind of API that allows it. So a good starting point is the documentation, for example the Twitter API, the YouTube API, etc.
At a quick glance I don't see any method that offers that. It might change in the future but, as mentioned above, I strongly doubt it.
Another option is to mine a partial list of users via the search APIs or by traversing the site with a robot. Obtaining an existing list from elsewhere is also an option. However, I would check whether this is even legal and whether it violates the terms of use.

Related

Is there a generic approach to unshortening URLs without generating "hits"?

I have been tasked with performing realtime and historical analysis of shortened links that users include in their messaging, comparing them against databases of known-malicious URLs (e.g. Google Safe Browsing, OpenPhish).
However, a core use-case for link shortening done by our user-base is the collection of "hit" data, reflecting whether their link was clicked, who clicked it, when, etc. This presents the problem that I need to "unshorten" URLs without actually accessing them directly.
Some providers, such as bit.ly, provide an API method to retrieve information about their shortened links. But each API seems to be different, and each often requires provider-specific API credentials.
Are there any popular libraries, services, or standardized approaches for dealing with this problem?
Which language do you want it in? Here is one for Bitly v4 in PHP:
https://github.com/phplicengine/bitly

Does everyone need their own YouTube API v3 Key?

I'm working on an HTML file that allows people to find unanswered comments for their channel's videos. I'd like to make this available to the general public, or at least to those willing to do a little work on their own. I don't plan on hosting it on a web site, just making the HTML page available, probably on GitHub. At least those are my thoughts right now.
(By the way, to avoid a discussion on authentication/authorization, it currently doesn't require authorization since I'm only accessing public comments, so it does indeed run in a browser, without being hosted.)
Since the web page uses the YouTube API v3, it requires an API key. Am I correct in assuming I don't want everyone using my API key? Does this mean that anyone who wants to use this HTML file needs their own API key?
Or am I thinking about this all wrong, and there's a better way to release this code? Thanks.
Requesting comments costs one unit (docs). The daily limit is 1,000,000 units. If you exceed that, you might want to use multiple YouTube API keys. So technically no, your users don't need their own YouTube API keys, but personally I would make every user use their own API key.
Creating Multiple Google/YouTube Data API Keys

Some general Twitter4J questions

I'm trying to do a write up of Twitter4J for part of a uni project, but I'm getting hung up on a few things. From the Twitter4J api:
void sample()
Starts listening on random sample of all public statuses. The default access level provides a small proportion of the Firehose. The "Gardenhose" access level provides a proportion more suitable for data mining and research applications that desire a larger proportion to be statistically significant sample.
This implies that by default, a "default access" is provided to the stream, but another type of access, "Gardenhose access" is available. Is this correct? And if so, how do you access the higher Gardenhose access?
I'm asking as I've seen some answers on SO suggest that there is only one level of access - the Gardenhose, and I'm trying to clear this up once and for all.
In addition to this, I would like a reference (if possible) to the number of tweets the sample stream allows access to. I've read lots of people cite 1% for "default access" and 10% for "gardenhose access" - but I can't find this anywhere in the API.
So to sum up, two questions:
Does the sample stream have a "default access" and a "gardenhose access", or just one of those?
How much of the Twitter firehose stream can these levels of access gain?
If replying, please have links to reference-able API where possible.
The gardenhose is different from the default sample stream; you would have had to request access from Twitter in order to use it.
However, I am not sure if Twitter still allows access to the gardenhose, or even if it still exists. It seems the current mechanism may be to use one of Twitter's preferred data partners:
Using the Streaming API?
Every Twitter account can connect to a small sampling of the Streaming API. Accounts that need increased access for data gathering or analytical reasons should check out our preferred partners page.
(source)
It may be different for students or educational institutions, and the gardenhose may still be available to you. Previously you would have had to either e-mail api-research@twitter.com or use the following form, but I have no idea whether these methods still work; the post is quite old.
As for the percentage of Tweets that the default sample stream allows access to, the best reference I could find was a comment made by a Twitter employee on the developer forums - emphasis mine:
I would recommend just using the 1% sample stream from https://stream.twitter.com/1/statuses/sample.json that you can connect to with your Twitter account. It's unlikely that you'll be in a situation where you can access all of the data and will have to make do with a sample. At about 230 million tweets a day, you'd still be theoretically getting 2.3 million tweets a day.
(source)
Although, again, this is an old post.
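The arithmetic in that quote is easy to check. A minimal sketch, using only the two figures from the quoted forum post (about 230 million tweets a day, 1% sample); both numbers are old and approximate, and the function name is made up for illustration:

```python
# Figures taken from the quoted forum post; both are dated and approximate.
TWEETS_PER_DAY = 230_000_000
SAMPLE_PERCENT = 1


def sampled_per_day(total_per_day: int, percent: int) -> int:
    """Return how many tweets a given percentage sample would deliver per day."""
    return total_per_day * percent // 100


if __name__ == "__main__":
    print(sampled_per_day(TWEETS_PER_DAY, SAMPLE_PERCENT))  # 2300000
```

That matches the "2.3 million tweets a day" mentioned in the post.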
Regarding the firehose stream: as specified by the documentation, you need to be granted permission to access it. I believe very few people have full access to this stream:
GET statuses/firehose
This endpoint requires special permission to access.
Returns all public statuses. Few applications require this level of access. Creative use of a combination of other resources and various access levels can satisfy nearly every application use case.
Overall, documentation is scarce on the different access levels and what they offer. I suggest contacting Twitter directly to discuss your requirements, or contacting one of their data partners.
Apologies if this wasn't as concrete as you would have liked. Good luck with your research.

Track multiple search terms with twitter streaming

I would like to build a web application that tracks some user-defined search terms in real time and provides a real-time visualization. http://www.monitter.com/ is an app I've found that is similar in its requirements. What is the appropriate API to use for it? Initially I thought the streaming API was the obvious choice, but the limitation of one concurrent connection means that I can only track one search term at a time (with one user account). I could get around this by making multiple user accounts, but that seems like the wrong approach.
I looked at user streams but the language for that API seems to be more geared towards desktop applications.
So, what is the best API for my use case? Thanks.
Actually you can track up to 400 keywords/terms via one streaming API connection.
https://dev.twitter.com/docs/streaming-api/methods#track
Depending on the language you are using, there are multiple interfaces you can use.
If you are using PHP, then I can suggest Phirehose, as it works quite well and includes multiple examples for different usage scenarios.
http://code.google.com/p/phirehose/wiki/Introduction
What's not covered there: when processing received tweets, you will need to work out which tweet corresponds to which keyword/term, because the Twitter streaming API delivers all matching tweets in one stream.
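One way to do that matching on your side is sketched below in Python (not PHP, and not part of Phirehose). It assumes a simplified version of the track rules: a phrase matches when all of its words appear in the tweet text, ignoring case and word order. The function name and the tokenisation are illustrative assumptions, and the real API may match in places this sketch does not look at (for example expanded URLs).

```python
import re


def matching_terms(tweet_text: str, tracked: list[str]) -> list[str]:
    """Return the tracked phrases whose words all occur in the tweet text.

    Simplified approximation: case-insensitive, order-insensitive,
    and '#'/'@' prefixes are ignored on both sides.
    """
    words = set(re.findall(r"\w+", tweet_text.lower()))
    matches = []
    for phrase in tracked:
        terms = re.findall(r"\w+", phrase.lower())
        # Skip empty phrases so they never count as a match.
        if terms and all(term in words for term in terms):
            matches.append(phrase)
    return matches


if __name__ == "__main__":
    tracked = ["python", "django", "today learning", "java"]
    print(matching_terms("Learning Python and #Django today", tracked))
```

A tweet can match several tracked phrases at once, so the function returns a list rather than a single term.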
Investigating further using Firebug, I found that monitter.com simply polls the REST search api every second or so on the client side. This is what I ended up doing as well.

What's the best service to use for filtering out spam/abuse/malware links for a link shortening webapp?

I have two services - Lincr and LinkBunch. Lincr is a plain jane URL shortening service, while LinkBunch lets you shorten multiple links into one link. I've had too much spam posted into the services, so I had to shut down Lincr. Now, even LinkBunch seems to be facing the same problem, and it's been disabled by my web host for that reason.
I can't keep shutting down sites like this because of bad links being posted, so I need a malware-filtering API that I can use to filter out the links as and when they are posted.
There are services that let me download an entire bunch of bad links to check against, but instead, I'd prefer doing a live API call on a per-link basis. What can I use for that?
Finally, what's the best malware filtering service out there?
Lincr is down. On LinkBunch, where is your Captcha?
On either site, do you limit the number of posts by IP? Do you use a delay in your response? What about using hidden fields to reduce spam (http://www.reviewmylife.co.uk/blog/2008/05/30/hidden-field-spam-trap-for-phpformmail/)?
I know I'm dodging the question a bit, but you should at least take basic anti-spam measures before resorting to API calls. Even APIs will still fail for newly-hacked / newly-spammed sites.
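As a rough illustration of two of those basic measures (a per-IP post limit and a hidden "honeypot" field), here is a minimal Python sketch. The class name, the field name `website`, and the limits are all assumptions made up for the example; it keeps state in memory only and is not tied to any particular web framework.

```python
import time
from collections import defaultdict, deque


class SubmissionFilter:
    """Reject posts that fill a hidden field or exceed a per-IP rate limit."""

    def __init__(self, max_posts=5, window_seconds=60.0, honeypot_field="website"):
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self.honeypot_field = honeypot_field  # hypothetical hidden form field
        self._recent = defaultdict(deque)  # ip -> timestamps of accepted posts

    def allow(self, ip, form, now=None):
        now = time.monotonic() if now is None else now
        # Honeypot: the field is hidden from humans, so any value suggests a bot.
        if form.get(self.honeypot_field, "").strip():
            return False
        recent = self._recent[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_posts:
            return False
        recent.append(now)
        return True


if __name__ == "__main__":
    f = SubmissionFilter(max_posts=2, window_seconds=10)
    print(f.allow("203.0.113.5", {"url": "http://example.com"}))
```

A check like this would run before any call to an external malware-filtering API, so obvious bot traffic never reaches the per-link lookup.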
