Twitter Rate Limits and Cron caching with Rails

I have a small rails app on Heroku that pulls in my client's latest Tweet to display on all pages. It is hitting Twitter rate limits already. I'm trying to come up with a solution. Would the following be a sensible approach ...
Use a cron gem like Whenever to pull down the latest Tweet every minute and write it to a file, then have pages pull the Tweet from that file instead of directly from Twitter.

Yes, this is one possibility. Or you could use caching to store the tweets, for example using Memcached. This will also make your app faster.

I'm not familiar with the specific rate limits on Twitter, but if they're expressed in requests per minute then the cron job might work. Whatever you do, you need to stop letting incoming traffic drive your requests. Typically you'd create a queue and have a single worker pull requests off it. That worker would take care of rate limiting itself so you don't go over.
API rate limits are a necessary evil. Maybe you can make a gem to help other folks easily throttle themselves.
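To make that concrete, here is a minimal sketch of the queue-plus-worker idea, assuming a simple in-process Queue and a single throttling thread; the constant names and the 60-second interval are illustrative, not from the original answer:
require "thread"

# Incoming requests never call Twitter directly; they enqueue work instead.
TWITTER_QUEUE = Queue.new
MIN_INTERVAL  = 60.0 # seconds between Twitter calls; tune to the actual rate limit

# A single worker drains the queue and throttles itself.
Thread.new do
  last_call = Time.at(0)
  loop do
    job  = TWITTER_QUEUE.pop                 # blocks until work arrives
    wait = MIN_INTERVAL - (Time.now - last_call)
    sleep(wait) if wait > 0                  # stay under the rate limit
    job.call
    last_call = Time.now
  end
end

# Elsewhere in the app, enqueue the call instead of making it inline, e.g.:
# TWITTER_QUEUE << -> { Rails.cache.write("latest_tweet", Twitter.user_timeline("sometwitterusername").first.text) }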

I ended up using memcache to cache the requests:
latest_tweet = Rails.cache.read("latest_tweet")
if !latest_tweet
  # Cache miss: fetch from Twitter and cache the text for five minutes
  latest_tweet = Twitter.user_timeline("sometwitterusername").first.text
  Rails.cache.write("latest_tweet", latest_tweet, :expires_in => 5.minutes)
end

Related

Caching an HTTP request made from a Rails API (google-id-token)?

OK, first time making an API!
My assumption is that if data needs to be stored on the back end such that it persists across multiple API calls, it needs to be 1) in a cache or 2) in a database. Is that right?
I was looking at the code for the gem "google-id-token". It seems to do just what I need for my Google login application. My front-end app will send the Google tokens to the API with its requests.
The gem appears to cache the public (PEM) certificates from Google (for an hour by default) and then uses them to validate the Google JWT you provide.
But when I look at the code (https://github.com/google/google-id-token/blob/master/lib/google-id-token.rb) it just seems to fetch the Google certificates and put them into an instance variable.
Am I right in thinking that the next time someone calls the API, it will have no memory of that stored data and will just fetch it again?
I guess it's a two-part question:
If I put something in an @instance_variable in my API, will that data exist when the next API call comes in?
If not, is there any way that "google-id-token" is caching its data correctly? Maybe HTTP requests are somehow cached on the backend and therefore the network request doesn't actually happen over and over? Can I test this?
My impulse is to write the "google-id-token" functionality in a way that caches the Google certs using MemCachier, but since I don't know what I'm doing I thought I would ask. Maybe the gem works fine as is; I don't know how to test it.
Not sure about google-id-token, but Rails instance variables are not available beyond a single request and its views (and definitely not from one user's session to another).
You can low-level cache anything you want with Rails.cache.fetch. It takes a key name and an expiration, and the expensive work goes in a block. So it looks like this:
Rails.cache.fetch("google-id-token", expires_in: 24.hours) do
#instance_variable = something
end
If the cache entry exists and it is not past its expiration date/time, Rails grabs it from the cache; otherwise, it runs the block and makes your API request.
It's important to note that low-level caching does nothing in development by default, because Rails falls back to the null store unless caching is toggled on, and the in-memory store it switches to isn't shared across processes, so you may still want redis or memcached for development too. Also, make sure the file tmp/caching-dev.txt exists; you can run rails dev:cache or just touch it to create it.
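For reference, the caching section Rails generates in config/environments/development.rb looks roughly like this (Rails 5+), which is why low-level caching silently does nothing until you toggle it on:
# excerpt from config/environments/development.rb (inside the configure block)
if Rails.root.join("tmp/caching-dev.txt").exist?
  config.cache_store = :memory_store       # caching on, but per-process only
else
  config.cache_store = :null_store         # caching off: fetch always runs the block
end
# Toggle the file (and therefore caching) with: bin/rails dev:cache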
More on Rails caching

Refresh data with API every X minutes

Ruby on Rails 4.1.4
I made an interface to a Twitch gem to fetch information about the current stream: mainly whether it is online or not, but also things like the current title and the game being played.
Since the website has a lot of traffic, I can't make a request every time a user walks in, so instead I need to cache this information.
Cached information is stored as a class variable @@stream_data inside the class Twitcher.
I've made a rake task to update this using cron jobs, calling Twitcher.refresh_stream, but naturally that does not run within my active process (the one every visitor connects to) but in a separate process instead. So the @@stream_data in the actual app is always empty.
Is there a way to run code, within my currently running rails app, every X minutes? Or a better approach, for that matter.
Thank you for your time!
This sounds like a good use case for caching:
Rails.cache.fetch("stream_data", expires_in: 5.minutes) do
fetch_new_data
end
If the data is in the cache and has not expired, it is returned without executing the block; if not, the block is used to populate the cache.
The default cache store just keeps things in memory, so it doesn't fix your problem: you'll need to pick a cache store that is shared across your processes. Both redis and memcached (via the dalli gem) are popular choices.
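For example, a sketch of pointing Rails at memcached via the dalli gem (host and port are illustrative):
# config/environments/production.rb (or development.rb) excerpt
Rails.application.configure do
  # memcached is shared across processes, so a rake task, cron job and the
  # web workers all see the same cached value.
  config.cache_store = :mem_cache_store, "localhost:11211"
end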
Check out Whenever (basically a Ruby interface to cron) to invoke something on a regular schedule.
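A sketch of what that Whenever schedule might look like, assuming the poster's Twitcher.refresh_stream writes into the shared cache rather than a class variable:
# config/schedule.rb
every 5.minutes do
  runner "Twitcher.refresh_stream"
end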
I actually had a similar problem using Google Analytics. Google Analytics requires an API key for each request, but the key expires every hour, and requesting a new key for every Google Analytics call made each request very slow.
So what I did was make another class variable, @@expires_at. In every method that made a request to Google Analytics, I would check @@expires_at.past?. If it was true, I would refresh the API key and set @@expires_at = 45.minutes.from_now.
You can do something like this.
def method_that_needs_stream_data
  renew_data if @@expires_at.past?
  # use @@stream_data
end

def renew_data
  # renew @@stream_data here
  @@expires_at = 5.minutes.from_now
end
Tell me how it goes.

Caching (optimizing) strategy with API live stream on Rails

So I built a website that uses the Twitch.tv API, which is a gaming live stream site. The requests are long and slow, and I would like to cache them somehow. The problem is that there are a lot of dynamic attributes, such as whether a stream is still online or how many viewers it has. Since traffic to my website is low at the moment, expiring the cache early isn't going to help much. I also have a page that lists all the live streams and makes a request to see whether each stream is online, so even if no one is online it still takes a while to load. Is there any way to retrieve the API data faster without caching?
Here is the Twitch.tv API doc.
Since you don't own the Twitch.tv API, unfortunately I would say there is really nothing you can do to make their calls faster.
The good news is that you can cache the calls you make to them, which will make things appear faster to your users.
The way to cache the calls is to create a key and cache the JSON returned from the API. For the key I would just use the URL you are calling. Then give the cached value an expiration time of a few minutes, and when it expires make another API call to re-populate the cache.
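A minimal sketch of that approach, assuming Net::HTTP and Rails.cache are available; the endpoint URL and the five-minute TTL are illustrative:
require "net/http"
require "json"

def cached_twitch_json(url)
  # Key the cache entry on the URL itself; re-fetch once the entry expires.
  body = Rails.cache.fetch("twitch:#{url}", expires_in: 5.minutes) do
    Net::HTTP.get(URI(url))
  end
  JSON.parse(body)
end

# cached_twitch_json("https://api.twitch.tv/kraken/streams/somechannel")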
Also, I'd look at Varnish (https://www.varnish-cache.org/), which does HTTP caching really well. It could work well for you, and it has the concept of a grace period that tries to hide the expensive calls made when the cache expires.

Best Way to log API Calls, per minute / per hour

We are using reverse geocoding in a Rails web service and have run into quota problems when using the Google reverse geocoder through Geokit. We are also implementing the SimpleGeo service, and I want to be able to track how many requests per minute/hour we are making.
Any suggestions for tracking our reverse-geocoding calls?
Our code will look something like the following. Would you do any of these?
Add a custom logger and process in the background daily
Use a super-fantastic gem that I don't know about that does quotas and rating easily
Insert a record into the database for each call and run queries there.
Note: I don't need the data in real time; I just want to know, on an hourly basis, what our usual and maximum requests per hour are (and total monthly requests).
def use_simplegeo(lat, lng)
  SimpleGeo::Client.set_credentials(SIMPLE_GEO_OAUTHTOKEN, SIMPLE_GEO_OAUTHSECRET)
  # maybe do logging/tracking here?
  nearby_address = SimpleGeo::Client.get_nearby_address(lat, lng)

  located_location = LocatedLocation.new
  located_location.city = nearby_address[:place_name]
  located_location.county = nearby_address[:county_name]
  located_location.state = nearby_address[:state_code]
  located_location.country = nearby_address[:country]

  return located_location
end
Thanks!
The first part here is not answering the question you are asking, but it may be helpful if you haven't considered it before.
Have you looked at not doing your reverse geocoding on your server (i.e. through Geokit) but instead having it done by the client? In other words, some JavaScript loaded into the user's browser making Google geocoder API calls on behalf of your service.
If your application could support this approach, then it has a number of advantages:
You get around the quota problem because your distributed users each have their own daily quota and don't consume yours
You don't expend server resources of your own doing this
If you would still like to log your geocoder queries and you are concerned about the performance impact on your primary application database, then you might consider one of the following options:
Just create a separate database (or databases) for logging (which is write-intensive) and write to it synchronously. It could be relational, but MongoDB or Redis might work as well.
Log to the file system (with a custom logger) and then cron those logs in batches into structured, queryable storage later. The storage could be external, such as Amazon's S3, if that works better. (A minimal sketch of such a logger follows this list.)
Just write a record into SimpleGeo each time you do a geocode and add custom metadata to those records to tie them back to your own model(s).
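Here is a minimal sketch of the custom-logger option mentioned above, assuming ActiveSupport is loaded; the log path, field names, and helper are illustrative:
# e.g. in an initializer
GEOCODE_LOG = ActiveSupport::Logger.new(Rails.root.join("log", "geocoder_calls.log"))

def log_geocode_call(provider, lat, lng)
  # One line per call; a nightly cron job can batch these into a database.
  GEOCODE_LOG.info("#{Time.now.utc.iso8601} provider=#{provider} lat=#{lat} lng=#{lng}")
end

# Inside use_simplegeo, at the "maybe do logging/tracking here?" comment:
# log_geocode_call("simplegeo", lat, lng)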

Why would you upload assets directly to S3?

I have seen quite a few code samples/plugins that promote uploading assets directly to S3. For example, if you have a user object with an avatar, the file upload field would load directly to S3.
The only way I see this being possible is if the user object is already created in the database and your S3 bucket + path is something like
user_avatars.domain.com/some/id/partition/medium.jpg
But then if you had an image tag that tried to access that URL when an avatar was not uploaded, it would yield a bad result. How would you handle checking for existence?
Also, it seems like this would not work well for most has_many associations. For example, if a user had many songs/MP3s, where would you store those and how would you access them?
Also, your validations will be shot.
I am having trouble thinking of situations where direct upload to S3 (or any cloud storage) is a good idea, and I was hoping people could either clarify proper use cases or tell me why my logic is incorrect.
Why pay for storage/bandwidth/backups/etc. when you can have somebody in the cloud handle it for you?
S3 (and other Cloud-based storage options) handle all the headaches for you. You get all the storage you need, a good distribution network (almost definitely better than you'd have on your own unless you're paying for a premium CDN), and backups.
Allowing users to upload directly to S3 takes even more of the bandwidth load off of you. I can see the tracking concerns, but S3 makes it pretty easy to handle that situation. If you look at the direct upload methods, you'll see that you can force a redirect on a successful upload.
Amazon will then pass the following to the redirect handler: bucket, key, etag
That should give you what you need to track the uploaded asset after success. Direct uploads give you the best of both worlds. You get your tracking information and it unloads your bandwidth.
Check this link for details: Amazon S3: Browser-Based Uploads using POST
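As an illustration, a hypothetical Rails controller for the success_action_redirect target could record those parameters; the Upload model and the route are placeholders, not from the original answer:
# routes.rb: get "/s3_uploads/callback", to: "s3_uploads#callback"
class S3UploadsController < ApplicationController
  def callback
    # S3 appends bucket, key and etag to the redirect URL after a successful upload.
    Upload.create!(bucket: params[:bucket], key: params[:key], etag: params[:etag])
    redirect_to root_path, notice: "Upload received"
  end
end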
If you are hosting your Rails application on Heroku, the reason could very well be that Heroku doesn't allow file-uploads larger than 4MB:
http://docs.heroku.com/s3#direct-upload
So if you would like your users to be able to upload large files, this is the only way forward.
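For a rough idea of the setup, here is a sketch using the modern aws-sdk-s3 gem (which postdates these answers); the bucket name, key prefix, and redirect URL are placeholders:
require "aws-sdk-s3"

bucket = Aws::S3::Resource.new(region: "us-east-1").bucket("my-uploads-bucket")
post = bucket.presigned_post(
  key: "uploads/${filename}",
  success_action_redirect: "https://example.com/s3_uploads/callback",
  content_length_range: 0..(50 * 1024 * 1024)  # allow files up to ~50 MB
)

# post.url    -> the form's action attribute
# post.fields -> hidden inputs (policy, signature, etc.) to render in the form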
Remember how web servers work.
Unless you're using an async web setup like you could achieve with Node.js or Erlang (just two examples), every upload request your web application serves ties up an entire process or thread while the file is being uploaded.
Imagine that you're uploading a file that's several megabytes in size. Most internet users don't have tremendously fast uplinks, so your web server spends a lot of time doing nothing. While it's doing all of that nothing, it can't service any other requests, which means your users start to get long delays and/or error responses from the server, which means they start using some other website to get the same thing done. You can always run more processes and threads, but each of those costs additional memory, which eventually means additional money.
By uploading straight to S3, in addition to the bandwidth savings that Justin Niessner mentioned and the Heroku workaround that Thomas Watson mentioned, you let Amazon worry about that problem. You can have a single-process webserver effectively handle very large uploads, since it punts that actual functionality over to Amazon.
So yeah, it's more complicated to set up, and you have to handle the callbacks to track things, but if you deal with anything other than really small files (and even in those cases), why cost yourself more money?
