For a research project, I need to download the top 100 most used words on Twitter, multiple times per hour. However, as far as I can tell, the Twitter API only supports downloading the top 10 most used words ("trends").
My questions therefore are:
Am I missing something in the API? Is there another way to fetch more than 10 trends?
If there isn't, does anybody know of a workaround for this problem?
Put ?count=50 at the end of the URL to get the top 50. I haven't been able to get more than 50.
http://api.twitter.com/1/trends/current.json?count=50
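A minimal sketch of calling it from Python with the requests library (this assumes the old v1 endpoint above is still reachable and that the response carries a top-level "trends" object keyed by timestamp; adjust for whatever API version you're actually on):

```python
import requests

# Old v1 "current trends" endpoint from above; count seems to cap out at 50.
url = "http://api.twitter.com/1/trends/current.json"
resp = requests.get(url, params={"count": 50})
resp.raise_for_status()

# Assumed response shape: {"as_of": ..., "trends": {"<timestamp>": [{"name": ...}, ...]}}
for timestamp, trends in resp.json().get("trends", {}).items():
    for trend in trends:
        print(trend["name"])
```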
You could monitor your timeline, collect all the tweets, save them in a database, and analyze them via NLP to extract words (for example person names). Then aggregate and count them, for example "Obama 50 times, Java 10 times, Linux 5 times".
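A rough sketch of that aggregation step (assuming the tweet texts are already collected; the tokenization here is a placeholder, not a real NLP pipeline):

```python
from collections import Counter
import re

def count_words(tweet_texts):
    """Count word occurrences across a list of tweet texts."""
    counts = Counter()
    for text in tweet_texts:
        # Placeholder tokenization; a real pipeline would use an NLP
        # library to extract entities such as person names.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

tweets = ["Obama talks Java and Linux", "Java and Linux again"]
for word, n in count_words(tweets).most_common(3):
    print(word, n)  # prints the three most frequent words with their counts
```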
The issue I'm seeing is that when I select the "video" filter to exclude channels (through the YouTube API or a manual search on YouTube), many actual videos (i.e., not channels) disappear from my search results.
To see what I'm talking about:
1) search YouTube for "Salesforce Lightning|Apex|API"
2) sort by View Count
==> Note that the first video result in the list is one entitled "Hands-on Training: Get Started with Apex Code for Admins" with about 59,000 views. There are also many other videos in the search results with thousands of views.
3) Now filter for Videos only.
==> Note that the 69 results (including channels) have been reduced to 218 results, and that the top result is now a video with only 402 views.
This doesn't make sense, and seems like a bug to me. But strangely, in some cases the results seem to spontaneously correct themselves a couple of hours after I notice issues with one of these atypical queries (queries probably not run by the typical YouTube user).
Can anyone provide insight into why this is happening, and what I can do to work around this behavior? It almost seems like, for atypical queries (for which no results have been cached), YouTube must have to run some aggregation queries or background processes, which may mean that the initial search fails, and keeps failing for some unknown time, but that after some number of hours or days the search succeeds. Is this true? Is there a reliable time window I can use? I.e., if I know the window, I can run my initial query to prime the pump and then re-run it a couple of hours (or whatever) later and grab the desired data.
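For reference, the API side of the search I'm describing looks roughly like this (a sketch with the google-api-python-client; the API key is a placeholder and the parameters are just my best guess at the equivalent of the manual steps above):

```python
from googleapiclient.discovery import build

# Placeholder key; substitute your own.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

# Same query as the manual search above, sorted by view count and
# restricted to videos only (the filter that drops results).
response = youtube.search().list(
    part="snippet",
    q="Salesforce Lightning|Apex|API",
    order="viewCount",
    type="video",
    maxResults=25,
).execute()

for item in response["items"]:
    print(item["snippet"]["title"])
```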
I appreciate your thoughts and assistance!
I want to retrieve all stocks from a few exchanges, i.e. the stocks listed on those exchanges (taken from http://www.nasdaq.com/screening/company-list.aspx).
Then I will get quotes for all of those stocks from Google or Yahoo.
My question is: if I request quotes for all of them every 5 or 10 seconds, will they block me?
What is the correct way to get all the stocks and their updated data?
Thanks!
David,
tl;dr - Yahoo Finance is OK (scraping 2,000 stocks) if you insert pauses in your code.
I have some clumsy, but working, code (my first attempt at scraping) that pulls some data from Yahoo Finance. While I don't like the code and I will rewrite it for nasdaq.com in the following weeks, I can tell you that I'm not getting blocked.
I have a few-years-old list of stocks for the Russell 2000, so there are around 2,000 tickers I'm slowly going through, pulling some data from the balance sheet. I'm using Selenium (see my question history, there is only one, to see/get working code); the code loads the Chromium web browser (Linux), clicks on Balance Sheet, scrapes some data, clicks the Quarterly link, scrapes more data, and then closes the browser. For every ticker (stock).
Just to be on the safe side, I put several pauses into my code: for every scrape or navigation on the site I added a pause of between 5 and 10 seconds. That way I'm slowly scraping data and Yahoo seems to be OK with this :-) It takes about one minute per ticker. I've been running this scrape job (for the first time!) for over 30 hours now, lol, and I'm currently at a ticker that starts with T, so I have a few more hours to go.
I have read somewhere that some sites can spot this kind of slow scraping too. So, as an idea, instead of a hard-coded pause of, say, 7 seconds, you could use a random number generator for something like 7-15 seconds; that way the pauses are more random and less likely to be spotted... Just a thought. Hope this helps a little bit, even if with a delay.
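A tiny sketch of what I mean by the randomized pause (plain Python, independent of whatever scraping library you use):

```python
import random
import time

def polite_pause(low=7.0, high=15.0):
    """Sleep for a random interval so requests don't look machine-timed."""
    time.sleep(random.uniform(low, high))

# Call polite_pause() after each page load, click, or scrape step.
```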
Ah, and if this answer does help you, please be so kind as to mark it as solved and upvote it. Maybe I can get a point or two for it. My points are so low I can't even upvote other posts that I like and that have helped me.
I've been playing around with the Twitter API using Twitter4J. I am trying to pull data given a keyword and date; an example of a query I would run using the REST API would be
bagels since:2014-12-27
Which would give me all tweets containing the keyword 'bagels' since 2014-12-27.
This works in theory, but I've quickly exceeded the rate limits since each query allows up to 100 results, and only 180 queries are allowed within a 15-minute interval. There are many keywords that return more than 18k results.
Is there a better way to pull large amounts of data from Twitter? I looked at the Streaming API but I don't know if I can pull data from a certain date range.
There are a few things you can do to improve your rates:
Make sure your count is maxed at 100, which it looks like you're doing.
Use Application-Only authorization - it increases your rate limit to 450.
Use the max_id and since_id parameters to page through data and avoid querying for results you've already received. See the Working with Timelines docs (and the paging sketch after this list) for what I mean.
Consider using Gnip if you're willing to pay to remove rate limits.
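For example, a rough sketch of max_id paging against the search endpoint (shown in Python with an older tweepy release for brevity, where api.search and AppAuthHandler still exist; the same pattern applies in Twitter4J):

```python
import tweepy

# Placeholder credentials; application-only auth raises the search limit to 450.
auth = tweepy.AppAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

query = "bagels since:2014-12-27"
max_id = None
while True:
    kwargs = {"q": query, "count": 100}
    if max_id is not None:
        kwargs["max_id"] = max_id  # only ask for tweets older than those already seen
    tweets = api.search(**kwargs)
    if not tweets:
        break
    for tweet in tweets:
        print(tweet.id, tweet.text)
    max_id = min(tweet.id for tweet in tweets) - 1
```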
I'm seeing issues where adding multiple entries to a playlist in a short amount of time seems to fail regularly without any error responses.
I'm using the json-c format with version 2.1 of the API. If I send POST requests to add 7 video entries to a playlist, then I see between 3 and 5 of them actually being added to the playlist.
I am getting back a 201 Created response from the API for all requests.
Here's what a request looks like:
{"data":{"position":0,"video":{"duration":0,"id":"5gYXlTe0JTk","itemsPerPage":0,"rating":0,"startIndex":0,"totalItems":0}}}
and here's the response:
{"apiVersion":"2.1","data":{"id":"PLL_faWZNDjUU42ieNrViacdvqvG714P4QjvSDgGRg1kc","position":4,"author":"Lance Andersen","video":{"id":"5gYXlTe0JTk","uploaded":"2012-08-16T19:27:19.000Z","updated":"2012-09-28T20:20:39.000Z","uploader":"usanahealthsciences","category":"Education","title":"What other products does USANA offer?","description":"Discover USANA's other high-quality products: the Sens skin and hair care line, USANA Foods, the RESET weight-management program, and Rev3 Energy.","thumbnail":{"sqDefault":"http://i.ytimg.com/vi/5gYXlTe0JTk/default.jpg","hqDefault":"http://i.ytimg.com/vi/5gYXlTe0JTk/hqdefault.jpg"},"player":{"default":"http://www.youtube.com/watch?v=5gYXlTe0JTk&feature=youtube_gdata_player","mobile":"http://m.youtube.com/details?v=5gYXlTe0JTk"},"content":{"5":"http://www.youtube.com/v/5gYXlTe0JTk?version=3&f=playlists&d=Af8Xujyi4mT-Oo3oyndWLP8O88HsQjpE1a8d1GxQnGDm&app=youtube_gdata","1":"rtsp://v6.cache3.c.youtube.com/CkgLENy73wIaPwk5JbQ3lRcG5hMYDSANFEgGUglwbGF5bGlzdHNyIQH_F7o8ouJk_jqN6Mp3Viz_DvPB7EI6RNWvHdRsUJxg5gw=/0/0/0/video.3gp","6":"rtsp://v7.cache7.c.youtube.com/CkgLENy73wIaPwk5JbQ3lRcG5hMYESARFEgGUglwbGF5bGlzdHNyIQH_F7o8ouJk_jqN6Mp3Viz_DvPB7EI6RNWvHdRsUJxg5gw=/0/0/0/video.3gp"},"duration":72,"aspectRatio":"widescreen","rating":5.0,"likeCount":"6","ratingCount":6,"viewCount":1983,"favoriteCount":0,"commentCount":0,"accessControl":{"comment":"allowed","commentVote":"allowed","videoRespond":"moderated","rate":"allowed","embed":"allowed","list":"allowed","autoPlay":"allowed","syndicate":"allowed"}},"canEdit":true}}
The problem doesn't change if I set the position attribute.
If I send them sequentially with a 5-second delay between them, then the results are more reliable, with 6 of the 7 usually making it onto the playlist.
It seems like there is a race condition happening on the API server side.
I'm not sure how to handle this problem since I am seeing zero errors in the api call responses.
I have considered doing batch processing, but can't find any documentation on it for the json-c format. I'm not sure if that would make a difference anyway.
Is there a solution to reliably adding playlist entries to a playlist?
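For reference, the sequential-with-delay workaround I described looks roughly like this (a sketch with Python's requests library; the endpoint URL and auth header are placeholders, not the actual v2.1 json-c URLs):

```python
import time
import requests

# Placeholders; substitute the real playlist feed URL and OAuth token.
PLAYLIST_URL = "https://example.invalid/playlists/PLAYLIST_ID/entries"
HEADERS = {"Content-Type": "application/json", "Authorization": "Bearer TOKEN"}

video_ids = ["5gYXlTe0JTk"]  # ...the 7 videos to add

for video_id in video_ids:
    body = {"data": {"position": 0, "video": {"id": video_id}}}
    resp = requests.post(PLAYLIST_URL, json=body, headers=HEADERS)
    print(video_id, resp.status_code)  # returns 201 even when an entry goes missing
    time.sleep(5)  # the delay that gets 6 of 7 entries added reliably
```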
This was fixed in an update to the YouTube Data APIs around the 25th of October.
I am storing in a database, every 30 minutes, Twitter's trending topics of a country Y. No problem with that.
Now, I want to get as much tweets as possible matching those trending topics for research purposes.
Since I would like to study the patterns of the trends, I would like continuous tweet data for at least 3 days, centered on the day the trend peak was detected, for every trending topic. In order to achieve that, I thought of doing the following:
Suppose I am in day X. I could retrieve the unique trends of day X-2, and for every trend, look for tweets matching the trend in the interval [X-3, X-1], that is, 3 days. However, the problem here is Twitter's rate limiting. If I have 100 trending topics in day X-2 and I make 20 GET search requests per trend, I would end up making a total of 2,000 requests, which exceeds Twitter's 350-per-hour rate limit. If I make 300 requests/hour, it would take more than 6 hours to get the data for only one day...
Does anybody know any other (better) way for getting tweets associated with trends?
Thanks in advance
Twitter Streaming API?
The Twitter Streaming API doesn't deliver any past tweets. You only receive tweets starting from the time the server connection is established. The Search API will return tweets matching the current query up to 7 days old in theory, but that is entirely up to Twitter's current load. (Note: at times this interval has been as short as 24 hours. In addition, you are limited to receiving at most 1,500 tweets per query, regardless of how old they are.)
Is there any way to get more tweets from the streaming?
None that I know of. But do refer to the information below if you are considering switching between the Search and Streaming APIs.
Please choose your case:
If you need real-time data and your number of requests is high:
Go for Streaming API
The streaming API requires that you keep the connection active. This requires a server process with an infinite loop, to get the latest tweets.
Advantage
1) Lag in retrieving results: tweets delivered with this method are essentially real-time, with a lag of a second or two at most between the time the tweet is posted and the time it is received from the API.
2) Not rate limited.
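A minimal sketch of such a long-running streaming consumer (Python with an older tweepy release that still exposes StreamListener; credentials are placeholders):

```python
import tweepy

class TrendListener(tweepy.StreamListener):
    def on_status(self, status):
        # Store the tweet as it arrives (database insert would go here).
        print(status.id, status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream on errors such as 420.
        return False

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth=auth, listener=TrendListener())
# Blocks and keeps the connection open, delivering matching tweets in real time.
stream.filter(track=["trending topic 1", "trending topic 2"])
```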
If you need aggregate data regardless of its time range and your number of requests is not high:
Go for Search API
The Search API is the easier of the two methods to implement, but it is rate limited. Each request will return up to 100 tweets, and you can use the page parameter to request up to 15 pages, giving you a theoretical maximum of 1,500 tweets for a single query.
Advantage
1) Finding tweets in the past: the Search API wins by default in this area, because the Streaming API doesn't deliver any past tweets.
2) Easier to implement.
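And a sketch of that Search API paging up to the ~1,500-tweet ceiling (again Python with an older tweepy release where api.search exists; adjust for your own client library):

```python
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Cursor handles the paging; 15 pages x 100 tweets is the ~1,500-tweet cap.
for tweet in tweepy.Cursor(api.search, q="some trending topic", count=100).items(1500):
    print(tweet.created_at, tweet.text)
```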