Is there a generic approach to unshortening URLs without generating "hits"? - hyperlink

I have been tasked with performing realtime and historical analysis of shortened links that users include in their messaging, comparing them against databases of known-malicious URLs (e.g. Google Safe Browsing, OpenPhish).
However, a core use-case for link shortening done by our user-base is the collection of "hit" data, reflecting whether their link was clicked, who clicked it, when, etc. This presents the problem that I need to "unshorten" URLs without actually accessing them directly.
Some providers, such as bit.ly, provide an API method to retrieve information about their shortened links. But, each API seems to be different, and often requires provider-specific API credentials.
Are there any popular libraries, services, or standardized approaches for dealing with this problem?

Which language do you want it in? Here is a client for the Bitly v4 API in PHP:
https://github.com/phplicengine/bitly
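Since each provider's API is different, one way to keep the differences contained is a small registry that maps shortener hostnames to provider-specific request builders. The following is only a minimal sketch in Python, assuming Bitly's v4 `expand` endpoint (a POST with a `bitlink_id` JSON body and a bearer token); check the provider's documentation to confirm that the lookup is not counted as a click. The names `REQUEST_BUILDERS`, `build_bitly_expand_request`, and `build_expand_request` are hypothetical, and the code only builds the request without sending it.

```python
import json
import urllib.request
from urllib.parse import urlsplit

# Assumed endpoint for Bitly's v4 "expand" call; verify against the provider docs.
BITLY_EXPAND_URL = "https://api-ssl.bitly.com/v4/expand"


def _split(short_url: str):
    """Parse a short URL, tolerating a missing scheme."""
    return urlsplit(short_url if "://" in short_url else "https://" + short_url)


def build_bitly_expand_request(short_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an expand request for a Bitly link."""
    parts = _split(short_url)
    bitlink_id = parts.netloc.lower() + parts.path  # e.g. "bit.ly/abc123"
    body = json.dumps({"bitlink_id": bitlink_id}).encode("utf-8")
    return urllib.request.Request(
        BITLY_EXPAND_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical registry: shortener hostname -> provider-specific request builder.
REQUEST_BUILDERS = {"bit.ly": build_bitly_expand_request}


def build_expand_request(short_url: str, credentials: dict):
    """Return a provider API request, or None if the provider is unknown
    or no credentials are configured for it."""
    host = _split(short_url).netloc.lower()
    builder = REQUEST_BUILDERS.get(host)
    if builder is None or host not in credentials:
        return None
    return builder(short_url, credentials[host])
```

Sending the request (for example with `urllib.request.urlopen`) is left to the caller, as is deciding what to do with links from providers that have no registered builder.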

Related

Does everyone need their own YouTube API v3 Key?

I'm working on an HTML file that allows people to find unanswered comments for their channel's videos. I'd like to make this available to the general public, or at least those willing to do a little work on their own. I don't plan on hosting it on a web site - just making the HTML page available, probably on GitHub. At least those are my thoughts right now.
(By the way, to avoid a discussion on authentication/authorization, it currently doesn't require authorization since I'm only accessing public comments, so it does indeed run in a browser, without being hosted.)
Since the web page uses the YouTube API v3, it requires an API key. Am I correct in assuming I don't want everyone using my API key? Does this mean that anyone who wants to use this HTML file needs their own API key?
Or am I thinking about this all wrong, and there's a better way to release this code? Thanks.
Requesting comments costs one quota unit (docs), and the daily limit is 1,000,000 units. If you exceed that, you might want to use multiple YouTube API keys. So technically no, your users don't need their own YouTube API keys, but personally I would make every user supply their own API key.
Creating Multiple Google/YouTube Data API Keys
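To illustrate the "every user brings their own key" approach, here is a minimal sketch in Python (the page in the question is HTML/JavaScript, so treat this as the same idea in another language). It assumes the YouTube Data API v3 `commentThreads` list endpoint with the `part`, `videoId`, `maxResults`, and `key` query parameters; the function name `comment_threads_url` is hypothetical. The key is passed in by the caller rather than hard-coded in the distributed file.

```python
from urllib.parse import urlencode

# Assumed endpoint of the YouTube Data API v3 commentThreads.list method.
COMMENT_THREADS_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def comment_threads_url(video_id: str, api_key: str, max_results: int = 20) -> str:
    """Build a commentThreads request URL using the caller's own API key."""
    if not api_key:
        # Refuse to fall back to a shared key baked into the code.
        raise ValueError("an API key is required")
    query = urlencode({
        "part": "snippet,replies",
        "videoId": video_id,
        "maxResults": max_results,
        "key": api_key,
    })
    return f"{COMMENT_THREADS_URL}?{query}"
```

With this shape, quota usage is charged to whichever key the user enters, so one heavy user cannot exhaust the quota for everyone else.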

Downloading Twitter corpus

I am working on a data mining system and one of the requirements is it being able to perform the analysis without the use of API. Is there a way to download the Twitter database (or a big part of it, at least) and work with it locally?
There is a paper about creating corpora from Twitter called “TWORPUS – An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora”. I recommend reading it because it also covers licensing issues, among other things. The authors also provide their code on GitHub.
In fact, you cannot download Twitter data dumps directly. You can download single tweets and store them in a corpus, but you are not allowed to share that data. Therefore, the authors built the Tworpus client to create private Twitter corpora.
APIs are the official way of getting Twitter data and they work really well, so it is not clear why you do not want to use them. Web scraping is a workaround but is not recommended; in addition, since you want a big part of the data, I do not think you would be satisfied with it. You can also buy the data from Gnip.

Get public users of a service (Tumblr, Twitter)

Assuming it's not available as part of API, how can one obtain a full or partial list of public users of a web service, e.g. Twitter, Tumblr, YouTube?
Acceptable alternative: get a random public user.
I was interested in this for testing APIs with a random account. This is useful to catch edge cases when developing an app for the API; for example, when developing a Tumblr theme, seeing what volumes of text/images are posted, special character use, and so on.
Can you even imagine a full list of (public) users of a widely used web service? That's a vast amount of data. I doubt that any API would offer that, for several reasons:
performance/load issues,
data/information privacy,
potential for abuse, ...
For regular use of a service's API you simply don't need that. Wanting it anyway smells of gray-hat or black-hat techniques.
Anyway, to answer your question objectively: to get a full or partial list of users from a web service, the service has to provide some kind of API that allows it. So a good starting point is the documentation, for example for the Twitter API, the YouTube API, etc.
At a quick look I don't see any method that offers that. It might change in the future, but as mentioned above, I strongly doubt it.
Another option is to mine a partial list of users via the search APIs or by traversing the site with a robot. Obtaining such a list is also an option. However, I would check whether this is even legal and not against the terms of use.

Track multiple search terms with twitter streaming

I would like to build a web application that tracks some user-defined search terms in real time and provides a real-time visualization. http://www.monitter.com/ is an app I've found that is similar in its requirements. What is the appropriate API to use for it? Initially I thought the streaming API was the obvious choice, but the limitation of one concurrent connection means that I can only track one search term at a time (with one user account). I could get around this by making multiple user accounts, but that seems like the wrong approach.
I looked at user streams but the language for that API seems to be more geared towards desktop applications.
So, what is the best API for my use case? Thanks.
Actually you can track up to 400 keywords/terms via one streaming API connection.
https://dev.twitter.com/docs/streaming-api/methods#track
Depending on the language you are using, there are multiple interfaces you can use.
If you are using PHP, I can suggest Phirehose, as it works quite well and includes multiple examples for different usage scenarios.
http://code.google.com/p/phirehose/wiki/Introduction
What's not there: when processing received tweets, you will need to work out which tweet corresponds to which keyword/term, because the Twitter streaming API delivers all matching tweets in one stream.
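A rough way to do that matching yourself is sketched below in Python. It assumes the usual `track` semantics (commas separate terms, and every word of a multi-word term must appear somewhere in the tweet, case-insensitively); the real matching also considers things like expanded URLs, so this is only an approximation. The function names and the tokenizing regex are hypothetical choices, not part of any Twitter library.

```python
import re

MAX_TRACK_TERMS = 400  # limit quoted in the answer above


def build_track_parameter(terms):
    """Join the terms into one comma-separated `track` value for a single connection."""
    if len(terms) > MAX_TRACK_TERMS:
        raise ValueError(f"at most {MAX_TRACK_TERMS} terms per connection")
    return ",".join(terms)


def matching_terms(tweet_text, terms):
    """Return the tracked terms whose words all occur in the tweet text."""
    # Tokenize on word characters, keeping leading # and @ attached.
    words = set(re.findall(r"[\w#@]+", tweet_text.lower()))
    return [
        term for term in terms
        if all(word in words for word in term.lower().split())
    ]
```

Each incoming tweet can then be routed to every user whose terms appear in the result of `matching_terms`.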
Investigating further using Firebug, I found that monitter.com simply polls the REST search API every second or so on the client side. This is what I ended up doing as well.

What's the best service to use for filtering out spam/abuse/malware links for a link shortening webapp?

I have two services - Lincr and LinkBunch. Lincr is a plain jane URL shortening service, while LinkBunch lets you shorten multiple links into one link. I've had too much spam posted into the services, so I had to shut down Lincr. Now, even LinkBunch seems to be facing the same problem, and it's been disabled by my web host for that reason.
I can't keep shutting down sites like this because of bad links being posted, so I need a malware-filtering API that I can use to filter out the links as and when they are posted.
There are services that let me download an entire bunch of bad links to check against, but instead, I'd prefer doing a live API call on a per-link basis. What can I use for that?
Finally, what's the best malware filtering service out there?
Lincr is down. On LinkBunch, where is your Captcha?
On either site, do you limit the number of posts by IP? Do you use a delay in your response? What about using hidden fields to reduce spam (http://www.reviewmylife.co.uk/blog/2008/05/30/hidden-field-spam-trap-for-phpformmail/)?
I know I'm dodging the question a bit, but you should at least take basic anti-spam measures before resorting to API calls. Even APIs will still fail for newly-hacked / newly-spammed sites.
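As a rough illustration of two of those basic measures, the Python sketch below combines a hidden-field trap with a per-IP limit on posts. Everything here is an assumption for illustration: the class name `SubmissionFilter`, the hidden field name `website`, and the default limits are hypothetical, and the in-memory history would need shared storage on a multi-process host.

```python
import time
from collections import defaultdict, deque


class SubmissionFilter:
    """Reject form posts that fill a hidden field or exceed a per-IP rate."""

    def __init__(self, max_posts=5, window_seconds=60.0, honeypot_field="website"):
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self.honeypot_field = honeypot_field  # hidden via CSS; humans leave it empty
        self._history = defaultdict(deque)    # ip -> timestamps of accepted posts

    def allow(self, ip, form, now=None):
        now = time.monotonic() if now is None else now
        # Bots that fill in every field will populate the hidden one.
        if form.get(self.honeypot_field):
            return False
        recent = self._history[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_posts:
            return False
        recent.append(now)
        return True
```

A submission that passes this filter could still be checked against a malware-filtering API afterwards; the point is to discard the cheapest spam before spending an API call on it.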
