Rails + geocoder: Finding latitudes and longitudes for 20k addresses - ruby-on-rails

I am using the geocoder gem to work with geolocation data.
Now, I have a list of 20k addresses and need to find their latitude and longitude coordinates. For this I am using Bing Maps, which allows 125k requests per day. So that's good.
But there's a problem: when I send a few requests through geocoder in quick succession, instead of returning the coordinates it returns an empty array (nothing).
I think this is because I'm sending too many requests within a very short period of time, so I was thinking about putting a delay between the geocoder calls, like:
sleep 3 # pause for 3 seconds
This is just a thought - how long should the pause between geocoder calls be? Or is there a better way to process 20k records with geocoder?
Thank you

Unless this is for a Windows app, the limit is 125,000 requests a year, not per day. Windows apps can do 50,000 a day. Note that non-enterprise accounts are rate limited: when you make too many requests in a short period of time, an empty response is returned and a flag in the header indicates that the request was rate limited. This is documented here: http://msdn.microsoft.com/en-us/library/ff701703.aspx
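Since rate limiting shows up as an empty result, one workable approach is a throttled loop that backs off and retries when a lookup comes back empty. A minimal sketch with the geocoder gem; the Address model, the delay, and the retry count are assumptions to adapt, not tested values:

require 'geocoder'

DELAY       = 0.5 # seconds between requests - tune against your quota
MAX_RETRIES = 3

Address.where(latitude: nil).find_each do |address|
  retries = 0
  begin
    results = Geocoder.search(address.full_address)
    raise 'empty response (possibly rate limited)' if results.empty?
    address.update(latitude:  results.first.latitude,
                   longitude: results.first.longitude)
  rescue => e
    retries += 1
    if retries <= MAX_RETRIES
      sleep DELAY * (2**retries) # back off exponentially, then retry
      retry
    end
    Rails.logger.warn "Skipping address #{address.id}: #{e.message}"
  end
  sleep DELAY # base throttle between lookups
end

At half a second per lookup, 20k addresses finish in under 3 hours, comfortably inside a 125k daily allowance (though see the yearly limit caveat above).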

Related

What's the most efficient way to handle quota for the YouTube Data API when developing a chat bot?

I'm currently developing a chat bot for one specific YouTube channel, which can already fetch messages from the currently active livechat. However, I noticed my quota usage shooting up, so I calculated my quota cost.
My API call currently looks like this: https://www.googleapis.com/youtube/v3/liveChat/messages?liveChatId=some_livechat_id&part=snippet,authorDetails&pageToken=pageTokenIfProvided, which uses up 5 units. I checked this by running one API call and comparing the quota usage before and after (so apologies if this is inaccurate). The response contains pollingIntervalMillis set to 5086 milliseconds. Currently, my bot adds that interval to the current datetime and schedules the next fetch at that time (using Celery), so it fetches messages every 4-6 seconds. For the calculation below, I'll assume it always waits 6 seconds.
Calculating my API quota this way results in a usage of 72,000 units per day:
10 requests per minute * 60 minutes * 24 hours = 14,400 requests per day
14,400 requests * 5 units per request = 72,000 units per day
This means that if I used pollingIntervalMillis as the guideline for how often to poll, I'd reach the daily quota of 10,000 units after running the bot for just 3 hours and 20 minutes. To avoid using up the quota just by fetching chat messages, I could make only about 1.39 API calls per minute (10,000 units / 5 units per request = 2,000 requests spread over 1,440 minutes). This is completely unfeasible for a chatbot, since this covers only fetching messages, not sending any messages to the chat.
So my question is: Is there maybe a more efficient way to fetch chat messages which won't use up the quota so much? Or will I only get this resolved by applying for a quota extension? And if this is only resolved by a quota extension, how much would I need to ask for reliably? Around 100k units? Even more?
I am also wondering how something like Streamlabs Chatbot (previously known as AnkhBot) accomplishes this without hitting the quota limit, despite thousands of users using their API client; their quota must be in the millions or billions.
And another question would be how I'd actually fill out the form, if the bot is still in this "early" state of development?
You pretty much hit the nail on the head. Services like Streamlabs are owned by larger companies, in their case Logitech. They not only have the money to throw around for things like increasing their API quota, but they also have professional relationships with companies like Google to decrease their per unit cost.
As for efficiency, the API costs are easy to find in the documentation, and for live chat, as you've found, each call costs 5 units. The only way to lower your overall daily cost is to make the calls less frequently. Once per minute is clearly too slow, but once every 15-18 seconds could shrink the quota increase you'd have to ask for while keeping the chat bot adequately responsive.
Of course that all depends on how you intend to use the data, but it's still a reasonable compromise if you're implementing the bot in the realm of hobbyist usage.
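As an illustration of that polling strategy, here is a rough Ruby sketch that honors the server-suggested pollingIntervalMillis but never polls more often than a fixed floor; the environment variable names and the 15-second floor are assumptions for illustration:

require 'net/http'
require 'json'

API_KEY      = ENV['YOUTUBE_API_KEY'] # placeholder credentials
LIVE_CHAT_ID = ENV['LIVE_CHAT_ID']
MIN_INTERVAL = 15.0                   # the 15-18s floor suggested above

page_token = nil
loop do
  uri = URI('https://www.googleapis.com/youtube/v3/liveChat/messages')
  params = { liveChatId: LIVE_CHAT_ID, part: 'snippet,authorDetails', key: API_KEY }
  params[:pageToken] = page_token if page_token
  uri.query = URI.encode_www_form(params)

  data = JSON.parse(Net::HTTP.get(uri))
  page_token = data['nextPageToken']
  data.fetch('items', []).each { |m| puts m.dig('snippet', 'displayMessage') }

  # Sleep for the longer of the server-suggested interval and our floor.
  suggested = data.fetch('pollingIntervalMillis', 0) / 1000.0
  sleep [suggested, MIN_INTERVAL].max
end

Even at a 15-second floor this is about 5,760 requests (28,800 units) per day, so a quota extension is still needed; the floor just reduces how much you have to request.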

Unlimited reverse geocoding requests

I have implemented an application which can import a GPX file containing many lat/long coordinates (several hundred).
My application performs a reverse geocode for each lat/long, and I want to display all the lat/longs with their addresses in a table view (ordered by country and town).
I perform only one geocoding request at a time, as described in Apple's documentation of reverseGeocodeLocation() in Core Location.
When I get the answer to a geocoding request, I issue the next one, and so on until I have resolved all my lat/longs.
Unfortunately, after I have resolved around 40 lat/longs, reverseGeocodeLocation() raises an error (CLError.network) and I need to wait 60 seconds before resolving the next ones, etc.
I have several questions:
- Is there a way to go faster? Because if I have 800 lat/longs it will take around 20 minutes ((800/40) * 60 = 1200 seconds), which is very long for an app running on an iPhone.
- Even if I don't improve the rate, is there a risk that if I have many users, Apple will reject my app because the number of geocoding requests is too high?
Thanks for your feedback
Regards,
Sebastien
Is the GPX file static, or will it change for every user/download? If it's static, then maybe just reverse geocode all the coordinates beforehand.
Alternatively, look for a different service to do the reverse geocoding.
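If the file is static, that preprocessing can even happen off-device. A rough sketch of the idea using the Ruby geocoder gem discussed earlier on this page; the GPX parsing is simplified, the file names are placeholders, and a configured geocoding provider is assumed:

require 'geocoder'
require 'rexml/document'
require 'json'

# GPX stores trackpoints as <trkpt lat=".." lon=".."> elements.
doc = REXML::Document.new(File.read('track.gpx'))
points = doc.get_elements('//trkpt').map do |pt|
  [pt.attributes['lat'].to_f, pt.attributes['lon'].to_f]
end

# Reverse geocode each point once, throttled, and cache the results so
# the app never has to hit a geocoding service at runtime.
resolved = points.map do |lat, lon|
  result = Geocoder.search([lat, lon]).first
  sleep 0.5 # stay under the provider's rate limit
  { lat: lat, lon: lon, address: result&.address }
end

File.write('addresses.json', JSON.pretty_generate(resolved))

The app then ships with (or downloads) the cached addresses.json instead of geocoding on the device.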

Getting realtime twitter search results using the streaming API

I have an application where I need to get complete, realtime search results from Twitter (preferably polling every 500ms or less). Based on my understanding, doing this with the search API will hit rate limits very quickly, while the streaming API doesn't seem to offer complete results for anything (only a roughly 5% sample).
More specifically, I have a search term which typically turns up fewer than 20 matching tweets per hour. I would like to be informed of these new tweets within 1-2 seconds, and it counts as a failure if I am not notified within 5 seconds. Given the relatively low posting frequency, missing even one tweet is very undesirable.
Is there any way I can realistically do this using twitter API, or is my only choice to write a browser extension to repeatedly refresh the search page?
The answer is "yes". Although the streaming API is capped (the cap is closer to 1% of the full stream than 5%), that cutoff only applies when your query matches more than that volume. Very roughly, you can stream about 60 tweets per second at most. In your case, you expect under 20 tweets per hour, so you should have no problem receiving all of them.
You also require a latency less than 5 seconds. In my experience latency has always been a second or two. I think you should be fine.
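For reference, a filtered stream along these lines might look like the following, sketched with the Ruby twitter gem (v5+); the credentials are placeholders and the gem's streaming support may change between versions:

require 'twitter' # sferik/twitter gem

client = Twitter::Streaming::Client.new do |config|
  config.consumer_key        = ENV['TWITTER_CONSUMER_KEY']
  config.consumer_secret     = ENV['TWITTER_CONSUMER_SECRET']
  config.access_token       = ENV['TWITTER_ACCESS_TOKEN']
  config.access_token_secret = ENV['TWITTER_ACCESS_TOKEN_SECRET']
end

# Matching tweets are pushed to the block within a second or two of being
# posted - no polling, and no request quota to budget.
client.filter(track: 'my search term') do |object|
  puts "#{object.user.screen_name}: #{object.text}" if object.is_a?(Twitter::Tweet)
end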

Ways to pull (potentially) large amounts of data from Twitter

I've been playing around with the Twitter API using Twitter4j. I am trying to pull data given a keyword and date; an example of a query I would run using the REST API would be
bagels since:2014-12-27
Which would give me all tweets containing the keyword 'bagels' since 2014-12-27.
This works in theory, but I quickly exceeded the rate limits, since each query returns at most 100 results and only 180 queries are allowed within a 15-minute interval. Many keywords return more than 18k results.
Is there a better way to pull large amounts of data from Twitter? I looked at the Streaming API but I don't know if I can pull data from a certain date range.
There are a few things you can do to improve your rates:
Make sure your count is maxed out at 100, which it looks like you're doing.
Use Application-Only authorization - it increases your search rate limit to 450 requests per 15-minute window.
Use the max_id and since_id parameters to page through data and avoid querying for results you've already received (see the sketch after this list). See the Working with Timelines docs for what I mean.
Consider using Gnip if you're willing to pay to remove rate limits.
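The max_id paging pattern from the list above looks roughly like this. The question uses Twitter4j, but the idea is identical; this sketch uses the Ruby twitter gem purely for illustration:

require 'twitter'

client = Twitter::REST::Client.new do |config|
  config.consumer_key    = ENV['TWITTER_CONSUMER_KEY']
  config.consumer_secret = ENV['TWITTER_CONSUMER_SECRET']
  # with only app credentials the gem should use Application-Only auth
end

max_id = nil
all_tweets = []

loop do
  options = { count: 100, result_type: 'recent' }
  options[:max_id] = max_id if max_id
  tweets = client.search('bagels since:2014-12-27', options).take(100)
  break if tweets.empty?

  all_tweets.concat(tweets)
  # Page backwards: the next request returns tweets strictly older than
  # the oldest one we have already seen.
  max_id = tweets.map(&:id).min - 1
end

puts "Collected #{all_tweets.size} tweets"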

Getting as many tweets associated with a day's trends as possible

Every 30 minutes, I store Twitter's trending topics for a country Y in a database. No problem with that.
Now, I want to get as many tweets as possible matching those trending topics, for research purposes.
Since I would like to study the patterns of the trends, I would like continuous tweet data covering at least 3 days centered on the day the trend peak was detected, for every trending topic. To achieve that, I thought of doing the following:
Suppose I am in day X. I could retrieve the unique trends of day X-2 and, for every trend, look for tweets matching it in the interval [X-3, X-1], i.e. 3 days. However, the problem is Twitter's rate limits. If I have 100 trending topics in day X-2 and I make 20 GET search requests per trend, I end up making a total of 2,000 requests, which exceeds Twitter's 350/hour rate limit. At 300 requests/hour, it would take more than 6 hours to get the data for just one day...
Does anybody know any other (better) way to get tweets associated with trends?
Thanks in advance
Twitter Streaming API?
The Twitter Streaming API doesn't deliver any past tweets; you only receive tweets from the time the server connection is established onward. The search API will return tweets matching the current query up to 7 days old in theory, but the actual window depends entirely on Twitter's current load. (Note: at times this interval has been as short as 24 hours. In addition, you can receive at most 1,500 tweets per query, regardless of how old they are.)
Is there any way to get more tweets from the streaming?
None that I know of. But refer to the information below if you are considering switching between the search and streaming APIs.
Please choose your case:
If you need real-time data and your number of requests is high:
Go for the Streaming API
The streaming API requires that you keep the connection active, which means a server process with an infinite loop that receives the latest tweets.
Advantages
1) Lag in retrieving results: tweets delivered with this method are essentially real-time, with a lag of a second or two at most between the time a tweet is posted and the time it is received from the API.
2) Not rate limited.
If you need aggregate data regardless of its time range and your number of requests is not high:
Go for the Search API
The search API is the easier of the two to implement, but it is rate limited. Each request returns up to 100 tweets, and you can use a page parameter to request up to 15 pages, giving you a theoretical maximum of 1,500 tweets for a single query.
Advantages
1) Finding tweets in the past: the search API wins by default here, because the streaming API doesn't deliver any past tweets.
2) Easier to implement.
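Applied to the question above: a single streaming connection can track a whole list of keywords at once, so all of day X's trends can be captured live with no request budget, leaving the search API only for backfilling the earlier days. A rough Ruby sketch with a hypothetical trend list (credentials are placeholders):

require 'twitter'

client = Twitter::Streaming::Client.new do |config|
  config.consumer_key        = ENV['TWITTER_CONSUMER_KEY']
  config.consumer_secret     = ENV['TWITTER_CONSUMER_SECRET']
  config.access_token        = ENV['TWITTER_ACCESS_TOKEN']
  config.access_token_secret = ENV['TWITTER_ACCESS_TOKEN_SECRET']
end

trends = ['#trend1', '#trend2', 'some topic'] # hypothetical trend list
client.filter(track: trends.join(',')) do |object|
  next unless object.is_a?(Twitter::Tweet)
  # Persist the tweet for later pattern analysis (storage layer assumed).
  puts "#{object.created_at} #{object.text}"
end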
