Crawling Twitter using multiple keywords - twitter

We have a quick question about putting together a web crawler to collect data from Twitter.
For example, we want to use a few hundred user IDs as keywords to collect the Tweets we need. However, it seems we can only use a limited number of keywords (12?), and we can only launch one crawler at a time.
Any tips?

Due to Twitter API restrictions (rate limiting), it is not feasible to crawl the whole website unless you have a lot of time or you pay for special access to the Twitter firehose, which should be the only way to solve the problem you mentioned.
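If the per-crawler keyword cap is real, one way to live with it is to split the ID list into batches and run the batches one after another. A minimal sketch, assuming a cap of 12 (the number guessed at in the question, not a confirmed limit); `batch_keywords` is a hypothetical helper, not part of any Twitter library:

```python
from typing import Iterator, List, Sequence


def batch_keywords(keywords: Sequence[str], batch_size: int = 12) -> Iterator[List[str]]:
    """Yield consecutive slices of `keywords`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(keywords), batch_size):
        yield list(keywords[start:start + batch_size])


# Illustrative IDs only; 12 is the cap guessed at in the question.
user_ids = [f"user{i}" for i in range(30)]
batches = list(batch_keywords(user_ids))
```

Each batch would then be handed to one crawler run in turn, so the total time grows with the number of batches and with whatever rate limit applies.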

Related

Beginner Question: How to access the number of impressions from *other users'* tweets?

I've got a bunch of free online HTML, CSS, and JS tutorials under my belt and I want to try using them to make a browser extension. But I want to make sure that the data I want to use is actually accessible before getting started.
My goal is to make a browser extension for twitter.com that shows the number of impressions of any tweet next to the likes, retweets, and replies. My basic idea is to get the status URL of any given tweet, poll the Twitter API for the number of impressions of that tweet, store that in a variable, and then use CSS to display a little eye icon and the number stored in the impressions variable.
I know that I can find the number of impressions of all of my tweets, both through Twitter Analytics, and also just going to my profile page and clicking the little bar chart icon next to views, retweets, etc. But I'm not clear on whether I can do that for other people's tweets via Twitter's API or anything else. Can you?
For the record, I'm not too concerned about the varying definition of "impression," since it will be consistently applied across all tweets and I'm mostly interested in giving users a comparison between tweets. This is part of a research project to see how this might change how people engage with social media if they know how many views a given post has. If there's a simpler way to go about that using existing platforms, I'm open to suggestions.
Thanks for the advice!
No, impressions data is private. If you are authenticated to the Twitter API then you can use the new Twitter Developer Labs Tweets API to get private metrics like impressions, but you cannot get that for other people's Tweets. Also, the Twitter API does not support CORS, so I don't think you'll be successful trying to use it from a browser extension.

Building a Twitter Search Box With Search Suggestions

I am developing a site that is integrated with Twitter content and I would like to enhance my search box providing search suggestions for hashtags and handles as the user types. Is there any way to get this autocomplete data generated from Twitter?
thanks().InAdvance();
There isn't anything in the Twitter API that does that. Besides, it wouldn't work anyway because the rate limits would never permit that type of interaction. For example, you might have n queries in a 15-minute window. If autocomplete requests eat up that much of the rate limit, less is left to iterate through the rest of the results and support subsequent queries, leaving the user waiting until the next 15-minute window. I understand what you want to do, but third-party APIs, like Twitter's, have very specific predefined functionality and don't work like a general-purpose database.

Displaying tweets from multiple users (similar to Embedded Timelines) without twitter-side user lists

I am new to Twitter and need some tips.
I need to display a tweet feed from multiple users on a webpage.
The first thing I stumbled upon is Embedded Timelines. It lets you display tweets from a list of users, but the gotcha is that those lists have to be maintained on the Twitter side (i.e. I cannot specify #qwe and #asd only on my side and get a timeline without adding those users to a list on the Twitter side).
The thing is that the list of users to include in the timeline is dynamic, and managing those lists through the Twitter API will probably be painful. Not to mention that my website will probably generate tons of those lists, and I feel that I will violate some API quotas sooner or later.
So my question is: am I stuck with using Embedded Timelines that refer to a user list on the Twitter side and managing those lists through, say, the Twitter REST API, or is there a simpler way to do what I want?
It's pretty simple to display tweets for multiple users.
Links to start with
This post explains some of the search queries you can make
This post is a simple library to make requests to the twitter API that 'just works'
Your Query
Okay, so you want multiple users. The endpoint you're looking at using is the search/tweets one: https://api.twitter.com/1.1/search/tweets.json.
The query string uses the from: operator, and you can combine multiple from: clauses with AND/OR.
An example query for the GET request:
?q=from:user1+OR+from:user2
Read more about the search API queries here.
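To make the query above concrete, here is a small sketch that builds the GET URL for any number of users. The endpoint is the one named in this answer; `build_from_query` and `build_search_url` are hypothetical helper names, and the authentication the API requires is deliberately left out:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_from_query(screen_names):
    """Join one `from:` clause per user with OR."""
    if not screen_names:
        raise ValueError("need at least one screen name")
    return " OR ".join(f"from:{name}" for name in screen_names)


def build_search_url(screen_names):
    """Return the full GET URL; spaces become '+', ':' is left readable."""
    query = build_from_query(screen_names)
    return f"{SEARCH_URL}?{urlencode({'q': query}, safe=':')}"


url = build_search_url(["user1", "user2"])
```

The resulting URL ends in `?q=from:user1+OR+from:user2`, matching the example query. The request itself still has to be signed and sent with whatever HTTP client or Twitter library you use.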
Your "over-the-quota" issue
This is something you're going to need to figure out yourself. Depending on the number of requests you expect to make and the limits Twitter imposes, you could add some sort of caching: save the results as you fetch them, and serve only from the cache while you are over your limit.
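One possible shape for that caching idea, as a rough sketch rather than a finished solution. `RateLimitedCache` is a hypothetical class, `fetch` stands in for whatever function actually calls the API, and the budget of 2 requests per window below is an illustrative number, not a real Twitter limit:

```python
import time


class RateLimitedCache:
    """Serve cached results once the request budget for the window is spent."""

    def __init__(self, fetch, max_requests, window_seconds=900.0, clock=time.monotonic):
        self._fetch = fetch            # function that performs the real request
        self._max = max_requests       # requests allowed per window
        self._window = window_seconds  # 900 s = 15 minutes
        self._clock = clock            # injectable so the class can be tested
        self._calls = []               # timestamps of real requests in the window
        self._cache = {}               # last known result per query

    def get(self, query):
        now = self._clock()
        # Forget requests that have fallen out of the window.
        self._calls = [t for t in self._calls if now - t < self._window]
        if len(self._calls) < self._max:
            self._calls.append(now)
            self._cache[query] = self._fetch(query)
        # Over budget: fall back to the stored result (None if never fetched).
        return self._cache.get(query)


# Demo with a fake fetcher and a fake clock instead of the real API.
fake_now = [0.0]
calls = []


def fake_fetch(query):
    calls.append(query)
    return f"results for {query} #{len(calls)}"


cache = RateLimitedCache(fake_fetch, max_requests=2, clock=lambda: fake_now[0])
first = cache.get("from:user1")   # real request
second = cache.get("from:user1")  # real request, budget now spent
third = cache.get("from:user1")   # served from the cache
```

While under the limit every call goes to the API and refreshes the cache; once the limit is hit, users see the last stored result until the window rolls over.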

Finding top twitter users?

There are a large number of sites, like Twitaholic or Twittergrader, that offer rankings of Twitter users based on the number of followers, influence, etc. I haven't found much information, though, on how they compute these rankings.
My guess is that they begin with a handful of users and keep exploring the follower graph, while periodically updating the information on the users they already know of.
So the question is: is this the right approach, or is there a simpler way of doing it?
The sites you mention started years ago, and at that time they were given whitelisting by Twitter, which means that they can make tens of thousands of API requests per hour. Twitter no longer gives out new whitelisted accounts, so this type of analysis cannot be done by new sites. New accounts are only allowed to make 350 API requests per hour.
It is in fact possible to use just the Twitter API to examine and remember everything about every user, which is what quite a few sites do. See the Twitter Streaming API.
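The approach guessed at in the question (start from a handful of seed users and walk the follower graph) can be sketched as a breadth-first crawl with a request budget. This is only an illustration: `get_followers` is a hypothetical callable standing in for the real API call, the sketch assumes one request per user (real follower lists are paginated), and 350 is the hourly figure quoted above:

```python
from collections import deque


def crawl_follower_graph(seeds, get_followers, max_requests):
    """Breadth-first walk; stop when the request budget is used up."""
    follower_counts = {}
    queue = deque(seeds)
    queued = set(seeds)
    requests_made = 0
    while queue and requests_made < max_requests:
        user = queue.popleft()
        followers = list(get_followers(user))  # assumed: one API request
        requests_made += 1
        follower_counts[user] = len(followers)
        for follower in followers:
            if follower not in queued:
                queued.add(follower)
                queue.append(follower)
    return follower_counts


def rank_users(follower_counts):
    """Most-followed first; ties broken alphabetically."""
    return sorted(follower_counts, key=lambda u: (-follower_counts[u], u))


# Toy graph instead of the real API: user -> list of followers.
graph = {
    "alice": ["bob", "carol", "dave"],
    "bob": ["carol"],
    "carol": [],
    "dave": ["alice", "bob"],
}
counts = crawl_follower_graph(["alice"], lambda u: graph.get(u, []), max_requests=350)
ranking = rank_users(counts)
```

With a budget of a few hundred requests per hour the crawl only ever sees a small slice of the graph, which is why the whitelisting mentioned above mattered so much.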

Search Engine without crawling?

Is there a way to collect web content in order to use it in a search engine without passing by the web crawling phase? Any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (Wikipedia, Stack Overflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
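As a rough idea of what "construct a partial index" means once you have the dump on disk, here is a tiny inverted index. The document IDs and texts are made up for illustration, and a real dump would need parsing (XML, SQL, etc.) before it reaches this step:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up stand-ins for records pulled out of a data dump.
documents = {
    "wiki/Web_crawler": "A web crawler systematically browses the web",
    "so/12345": "How do I build a search engine without a crawler",
    "wiki/Index": "A search index maps terms to documents",
}
index = build_index(documents)
hits = search(index, "search crawler")
```

Here `hits` contains only the one document that mentions both terms. Ranking, stemming, and storage are all left out.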
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of its search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at DataSift as an example. There are lots more resources you could use cleverly to avoid or minimize crawling.
If you want to stay up to date with the latest content on pages, you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use a paid service like Superfeedr that makes use of the same protocol.
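For a feel of what subscribing looks like, a PubSubHubbub subscription is a form-encoded POST to the hub carrying `hub.mode`, `hub.topic`, and `hub.callback`. The sketch below only builds the request and does not send it; all three URLs are placeholders, and `build_subscribe_request` is a hypothetical helper (older hubs may expect extra parameters such as `hub.verify`):

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_subscribe_request(hub_url, topic_url, callback_url):
    """Build (but do not send) a subscription request for one feed."""
    body = urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed you want updates for
        "hub.callback": callback_url,  # your endpoint that receives pushes
    }).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder URLs for illustration only.
request = build_subscribe_request(
    "https://hub.example.com/",
    "https://blog.example.com/feed.xml",
    "https://search.example.com/push-callback",
)
```

After the POST, the hub calls your callback URL with a `hub.challenge` value that your server has to echo back before any updates are delivered, so you need a publicly reachable endpoint as well.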
Directly or indirectly, you have to crawl the web in order to get the content.
Well if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with title, description and tags). So a collaborative link collection can be built.
To avoid spam, a +/- voting system can be used to vote useful sites or tags up and useless ones down.
To stop spammers from mass-voting on SERPs, you can weight votes by user reputation.
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns.
Other abuse patterns would have to be considered too.
Well, you got the point, I think.
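A quick sketch of the reputation-weighted voting idea, with made-up users and numbers. `weighted_score` is a hypothetical function, and treating unknown accounts as having zero reputation is just one possible policy:

```python
def weighted_score(votes, reputation, default_reputation=0.0):
    """Sum votes (+1 or -1), each scaled by the voter's reputation."""
    return sum(
        direction * reputation.get(user, default_reputation)
        for user, direction in votes
    )


# Made-up data: two established users and two brand-new accounts.
reputation = {"alice": 50.0, "bob": 5.0}
votes = [("alice", +1), ("bob", -1), ("spam1", +1), ("spam2", +1)]
score = weighted_score(votes, reputation)
```

With these numbers the two throwaway accounts contribute nothing, so mass-registering voters does not move the score. The flip side is the cold start problem: honest new users have no say either until they earn some reputation.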
As spammers gradually discover weaknesses of traditional search engines (see Google bombs, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold start effect, and while the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange are not spammed to useless levels so far...
PS: http://xkcd.com/810/

Resources