I'm trying to do a write-up of Twitter4J for part of a uni project, but I'm getting hung up on a few things. From the Twitter4J API:
void sample()
Starts listening on random sample of all public
statuses. The default access level provides a small proportion of the
Firehose. The "Gardenhose" access level provides a proportion more
suitable for data mining and research applications that desire a
larger proportion to be statistically significant sample.
This implies that by default, a "default access" is provided to the stream, but another type of access, "Gardenhose access" is available. Is this correct? And if so, how do you access the higher Gardenhose access?
I'm asking as I've seen some answers on SO suggest that there is only one level of access - the Gardenhose, and I'm trying to clear this up once and for all.
In addition to this, I would like a reference (if possible) to the number of tweets the sample stream allows access to. I've read lots of people cite 1% for "default access" and 10% for "gardenhose access" - but I can't find this anywhere in the API.
So to sum up, two questions:
Does the sample stream have a "default access" and a "gardenhose access", or just one of those?
How much of the Twitter firehose stream can these levels of access gain?
If replying, please have links to reference-able API where possible.
The gardenhose is different from the default sample stream; you would have had to request access from Twitter in order to use it.
However, I am not sure if Twitter still allows access to the gardenhose, or even if it still exists. It seems the current mechanism may be to use one of Twitter's preferred data partners:
Using the Streaming API?
Every Twitter account can connect to a small sampling of the Streaming API. Accounts that need increased access for data gathering or analytical reasons should check out our preferred partners page.
(source)
It may be different for students or educational institutions, and the gardenhose may still be available to you. Previously you would have to either e-mail api-research#twitter.com or use the following form, but I have no idea whether these methods still work; the post is quite old.
As for the percentage of Tweets that the default sample stream allows access to, the best reference I could find was a comment made by a Twitter employee on the developer forums - emphasis mine:
I would recommend just using the 1% sample stream from https://stream.twitter.com/1/statuses/sample.json that you can connect to with your Twitter account. It's unlikely that you'll be in a situation where you can access all of the data and will have to make do with a sample. At about 230 million tweets a day, you'd still be theoretically getting 2.3 million tweets a day.
(source)
Although, again this is an old post.
Regarding the firehose stream, as specified by the documentation, you need to be granted permission to access it. I believe very few people have full access to this stream:
GET statuses/firehose
This endpoint requires special permission to access.
Returns all public statuses. Few applications require this level of access. Creative use of a combination of other resources and various access levels can satisfy nearly every application use case.
Overall, documentation is scarce on the different access levels and what they offer. I suggest contacting Twitter directly to discuss your requirements, or contacting one of their data partners.
Apologies if this wasn't as concrete as you would have liked, good luck with your research.
Is Google Forms a privacy-preserving way to conduct a survey?
Some people are not comfortable with it. Is it because most people have a Google account and, if they do not use private mode, they give more information about themselves to Google? Does Google use the responses?
No.
The contents of Google Forms (which usually feed into Google spreadsheets) are shared between the submitters (only their own data, obviously), you as the form owner, and the entirety of Google's internal infrastructure.
Google using the data directly would be a really major infraction, just as it would be if they acted on the contents of a Gmail account. However, they have plenty of scope to use the information in indirect, less obvious ways. For example, the data that someone submits in a form could be used on other sites for ad targeting. Google does this in Gmail; if someone sends you an email about something, you can expect to see ads on that subject both within Gmail and on other sites. To be fair, they may have stopped that particular practice, but the wider point is that you really can't tell.
"Private mode" is irrelevant in this case; it gives very little protection to start with, and if a form requires you to be logged in to a Google account, they know exactly who you are anyway.
On top of this you have the problems caused by the Schrems II judgement that effectively made it illegal to store any personal data (in the GDPR sense) in the US about people in the EU. Prior to this judgement, Google relied on the Privacy Shield arrangement and "Standard Contractual Clauses" (SCCs) to allow this. Privacy Shield is simply dead, and while SCCs are valid in general, they are not usable in the US (though both Google and Facebook have been trying to gaslight to the contrary) because the ongoing lack of US federal privacy laws and the persistent overreach of US security agencies renders it impossible to make their claims valid. This is unlikely to change in the near future.
Assuming it's not available as part of API, how can one obtain a full or partial list of public users of a web service, e.g. Twitter, Tumblr, YouTube?
Acceptable alternative: get a random public user.
I was interested in this for testing APIs with a random account. This is useful to catch edge cases when developing an app for the API; For example when developing a Tumblr theme, seeing what volumes of text/images are posted, special character use, and so on.
Can you even imagine a full list of (public) users of widely used web services? That's a vast load of data. I can hardly believe that any API would offer that, for many reasons:
performance/load issues,
data/information privacy,
potential for abuse, ...
For regular usage of the service's API you simply don't need that. Anything beyond that smells of gray- or black-hat techniques.
Anyway, to answer your question objectively: to get a full or partial list of users from a web service, the service has to provide some kind of API that allows it. So a good starting point is to look at the documentation, for example the Twitter API, YouTube API, etc.
From a quick look I don't see any method that offers that. It might change in the future but, as mentioned above, I strongly doubt it.
Another option is to mine a partial list of users via search APIs or by crawling the site with a robot. Obtaining such a list is also an option. However, I would check whether this is even legal and not against the terms of use.
So I'm working on switching to using the v3 version of the YouTube api (which is so much better it's like a completely different product), but I'm either missing something or it is ...
Being able to fetch an arbitrary list of videos, and their details, in one call is going to make life significantly better, but in the videos list method, the video details "snippet" contains the "channelId", not the "author".
I've spent quite a bit of time looking through the documentation, but can't find any way of getting from a channelId to the human readable author name.
How am I expected to map a video to an author?
It's not possible to get back a display name (either legacy YouTube name or Google+ name) for a channel as part of the video.snippet response. You need to take the channelId and perform a channels.list(id=channelId1,channelId2,...,part=snippet) operation to get that information. The good part is that you can pass in up to 50 channel ids in a single call.
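As a rough illustration of the batching described above, here is a minimal Python sketch. It only prepares the batches and the query parameters; it does not call the API. The helper names (chunk_channel_ids, build_channels_list_params) are made up for this example, and the 50-id limit is simply the figure quoted in this answer.

```python
from typing import Dict, Iterable, Iterator, List

MAX_IDS_PER_CALL = 50  # limit quoted in the answer; treat as an assumption


def chunk_channel_ids(channel_ids: Iterable[str],
                      size: int = MAX_IDS_PER_CALL) -> Iterator[List[str]]:
    """Yield de-duplicated channel ids in batches of at most `size`."""
    # dict.fromkeys keeps first-seen order while dropping repeats,
    # so a channel that owns many videos is only looked up once.
    unique = list(dict.fromkeys(channel_ids))
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


def build_channels_list_params(batch: List[str]) -> Dict[str, str]:
    """Build the query parameters for one channels.list call.

    The ids are joined with commas, mirroring
    channels.list(id=channelId1,channelId2,...,part=snippet).
    """
    return {"part": "snippet", "id": ",".join(batch)}


if __name__ == "__main__":
    # Hypothetical channel ids collected from video snippets.
    ids = ["UC%d" % i for i in range(120)]
    for batch in chunk_channel_ids(ids):
        params = build_channels_list_params(batch)
        print(len(batch), params["id"][:30] + "...")
```

Each set of parameters would then be passed to whatever HTTP or client library you already use, and the returned channel snippets joined back to the videos by channelId.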
This sort of separation of information into different resources with ids effectively serving as keys linking the resources was a deliberate decision. The engineering team is aware that it will require developers to make an additional API call, but they're in favor of that design.
At the same time, the API is still in an experimental release, and if you have any feedback about using the API while doing real-world development, feel free to open a feature request in the issue tracker. If enough people give feedback about a certain aspect of the API, that could factor in to the final revision's design.
The accepted answer may have been correct at the time of writing, but as of 2/2018 the snippet part now includes a channelTitle property.
I would like to build a web application that tracks some user-defined search terms in real time and provides a real-time visualization. http://www.monitter.com/ is an app I've found that is similar in its requirements. What is the appropriate API to use for it? Initially I thought the streaming API was the obvious choice, but the limitation of one concurrent connection means that I can only track one search term at a time (with one user account). I could get around this by making multiple user accounts, but that seems like the wrong approach.
I looked at user streams but the language for that API seems to be more geared towards desktop applications.
So, what is the best API for my use case? Thanks.
Actually you can track up to 400 keywords/terms via one streaming API connection.
https://dev.twitter.com/docs/streaming-api/methods#track
Depending on the language you are using, there are multiple interfaces you can use.
If you are using PHP, then I can suggest Phirehose, as it works quite well and includes multiple examples for different usage scenarios.
http://code.google.com/p/phirehose/wiki/Introduction
What's not there: when processing received tweets, you will need to work out which tweet corresponds to which keyword/term, because the Twitter streaming API delivers all matching tweets in one stream.
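One way to do that matching on your side is sketched below in Python. This is only an approximation under a stated assumption: a tracked phrase counts as a match when all of its space-separated terms appear in the tweet text, ignoring case and order. Twitter's own matching may also look at URLs, mentions and other fields, which this sketch ignores. The function names are invented for the example.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into word-like tokens.

    '#' and '@' are kept so hashtags and mentions stay intact.
    """
    return set(re.findall(r"[\w#@]+", text.lower()))


def match_keywords(tweet_text: str, tracked: List[str]) -> List[str]:
    """Return the tracked phrases whose terms all occur in the tweet text."""
    tokens = tokenize(tweet_text)
    matches = []
    for phrase in tracked:
        terms = phrase.lower().split()
        # Skip empty phrases; require every term of the phrase to be present.
        if terms and all(term in tokens for term in terms):
            matches.append(phrase)
    return matches


if __name__ == "__main__":
    tracked = ["python", "java", "new release"]
    print(match_keywords("Loving the new Python release!", tracked))
```

Each incoming tweet can then be routed to every user or chart that subscribed to one of the returned phrases.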
Investigating further using Firebug, I found that monitter.com simply polls the REST search API every second or so on the client side. This is what I ended up doing as well.
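For anyone taking the same route, the polling idea can be sketched roughly as follows. This is Python rather than client-side JavaScript, and it is not monitter.com's code. The fetch callable is a stand-in you would replace with a real search request; its signature, and the assumption that each result is a dict with a numeric "id" that grows over time, are hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signature: fetch(term, since_id) -> list of result dicts.
Fetch = Callable[[str, Optional[int]], List[Dict]]


def poll_once(fetch: Fetch, term: str,
              since_id: Optional[int]) -> Tuple[List[Dict], Optional[int]]:
    """Run one search and return only results newer than `since_id`."""
    results = fetch(term, since_id)
    new = [r for r in results if since_id is None or r["id"] > since_id]
    # Remember the highest id seen so the next poll can skip old results.
    newest = max((r["id"] for r in new), default=since_id)
    return new, newest


def poll(fetch: Fetch, term: str, rounds: int, interval: float = 1.0,
         sleep: Callable[[float], None] = time.sleep) -> List[Dict]:
    """Poll `rounds` times, pausing `interval` seconds between polls."""
    since_id: Optional[int] = None
    seen: List[Dict] = []
    for _ in range(rounds):
        new, since_id = poll_once(fetch, term, since_id)
        seen.extend(new)
        sleep(interval)
    return seen


if __name__ == "__main__":
    def fake_fetch(term: str, since_id: Optional[int]) -> List[Dict]:
        # Stand-in for a real search call; always returns the same results.
        return [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

    print(poll(fake_fetch, "example", rounds=2, interval=0.0))
```

Tracking the newest id keeps repeated polls from showing the same tweet twice. Check the rate limits of whichever search endpoint you use before settling on a one-second interval.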
I would like to include local business address/phone numbers into my site.
Does anyone have thoughts on using google local search api vs. twitter's geo api vs. purchasing a directory listing?
Mainly depends on your site and needs (real time, offline..).
Google local gives very good results, the best in my experience (compared to other APIs). You should check the terms of service of each service. If I remember correctly, Google doesn't allow using its local API if your site charges users money.
Also, I think Google's TOS limits you to client-side usage, but you should read the TOS to see if that's true.
I haven't tried the Twitter geo API much, but I remember it didn't fit my needs.
Purchasing a directory listing is not cheap. Again, it depends on your needs: do you need US business listings? Worldwide? If you want US businesses, the leading companies for purchasing a DB of listings are Localeze, InfoUSA and Acxiom.
Besides Google Local Search (which actually has been deprecated), there's now SimpleGeo Places, which is free for low volume use and without restrictive terms of service. I don't work for them.
Could also use the Google Places API (which has not been deprecated) using the instructions here.