Storing infrequently used data sets in S3 - ruby-on-rails

I want a solution for storing data that is not accessed often, for example analytical data that I may rarely need to read directly, instead referring only to counter columns stored in records in a local Postgres database.
I’ve looked at Amazon SimpleDB and DynamoDB, but I think S3 might be the best solution (for now). I like DynamoDB, but I think it’s overkill for what I need. I’ve also looked into Amazon’s Big Data solutions, which I feel are almost certainly overkill. I’ve heard of companies using S3 to store infrequently accessed data sets (rather than the traditional use of storing media).
What should I use? Is S3 a horribly bad idea? If S3 is a good idea, how would I use it to store data sets? I’ve used S3, carrierwave and fog a lot for storing media, but not for data.
Example data
I want to store all the responses from emails that are sent by Mandrill. This will eventually accumulate a lot of data and I may not use the data for a long time, but I want to store it regardless.
I’m thinking about combining the emails into a set and grouping them by day. And I can store the counters (which I will access frequently) in a single record in my local postgres database (one record per day). I won’t be able to search through the raw email data without loading the day, but I don’t think I need to.
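Since I already use fog for media, a day's worth of responses could be serialized and pushed to S3 as a single object. A minimal sketch, assuming the fog-aws gem, a hypothetical bucket name, and a responses variable holding the day's Mandrill data:

require 'fog/aws'
require 'json'
require 'date'

storage = Fog::Storage.new(
  provider:              'AWS',
  aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
  region:                'us-east-1'
)

# Hypothetical bucket; one object per day keeps retrieval simple.
archive = storage.directories.get('mandrill-response-archive')
archive.files.create(
  key:    "responses/#{Date.today.iso8601}.json",
  body:   responses.to_json,   # `responses` = the day's Mandrill email data
  public: false
)

# The Postgres record keeps only the per-day counters; a day can be pulled
# back later with: archive.files.get("responses/#{date.iso8601}.json").body

The per-day key layout mirrors the grouping described above, so each S3 object lines up with one counter record in Postgres.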

Related

When fetching data using an api, is it best to store that data on another database, or is it best to keep fetching that data whenever you need it? [duplicate]

This question already has an answer here: Caching calls to an external API in a rails app (1 answer). Closed 6 years ago.
I'm using the TMDB API to fetch information such as film titles and release years, but I'm wondering whether I need to create an extra database to store all this information locally, rather than keep having to call the API for it. For example, should I create a Film model and call:
film.title
and by doing so accessing a local database with the title stored on it, or do I call:
Tmdb::Movie.detail(550).title
and by doing so making another call to the api?
Having dealt with a large Rails application that made service calls to about a dozen other applications, I can say caching is your best bet. The problem with the database solution is keeping it up to date. The problem with making the calls every time is that it's too slow. There is a middle ground. For this you want Ruby on Rails low-level caching:
Sometimes you need to cache a particular value or query result instead of caching view fragments. Rails' caching mechanism works great for storing any kind of information.
The most efficient way to implement low-level caching is using the Rails.cache.fetch method. This method does both reading and writing to the cache. When passed only a single argument, the key is fetched and the value from the cache is returned. If a block is passed, the result of the block will be cached under the given key and returned.
An example that is pertinent to your use case:
class TmdbService
  def self.movie_details(id)
    Rails.cache.fetch("tmdb.movie.details.#{id}", expires_in: 4.hours) do
      Tmdb::Movie.detail id
    end
  end
end
You can then configure your Rails application to use memcached or the database for the cache; it doesn't matter which. The point is you want this cached data to expire at some point to ensure you are getting up-to-date information.
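For example (the memcached address below is only an assumption; any Rails cache store works), the store could be configured in production and the service called wherever the title is needed:

# config/environments/production.rb -- assumes the dalli gem is bundled
config.cache_store = :mem_cache_store, 'localhost:11211'

# Anywhere in the app: the first call hits TMDB, later calls read from the
# cache until the 4-hour expiry set in the fetch block above.
TmdbService.movie_details(550).title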
This is a big decision to make. If the amount of data you get through the API is not huge, you can store all of it in your database. That way you will get the data much faster and your application will keep working even when the API is down.
If the amount of data is huge and you don't have the resources to store all of it, you should at least store the most important data in your database as a cache.
If you do not store any data of your own, you are dependent on the source of the data, and that source can have downtime.
The problem with storing data on your side is that it changes and you need to synchronize. Even so, it is still good to store the data on your side as a cache, to get results faster, and to synchronize it periodically.
Calls to a local database are much faster than calls to external APIs. I would expect a local database to return within a few milliseconds, whereas an API call will probably take hundreds of milliseconds. And local calls are less likely to be affected by network issues or downtime.
Therefore I would always cache the result of an API call in a local database and occasionally update the local version with a newer version from the API.
But in the end it depends on your requirements: Do you need real-time data, or is a cached version okay? How often do you need the data, and how often is it updated? How fast is the API, and is latency an issue? Does the API have a rate limit (a maximum number of requests per unit of time)?
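As a rough sketch of that cache-in-the-database approach (the Film model, its tmdb_id/title/synced_at columns, and the 24-hour staleness window are all assumptions, not anything the API requires):

# Assumes a films table with tmdb_id, title and synced_at columns.
class Film < ApplicationRecord
  STALE_AFTER = 24.hours

  # Returns the locally stored film, refreshing it from TMDB only when the
  # row is missing or older than STALE_AFTER.
  def self.fetch(tmdb_id)
    film = find_or_initialize_by(tmdb_id: tmdb_id)
    if film.synced_at.nil? || film.synced_at < STALE_AFTER.ago
      details = Tmdb::Movie.detail(tmdb_id)
      film.update!(title: details.title, synced_at: Time.current)
    end
    film
  end
end

With this in place, Film.fetch(550).title reads from Postgres most of the time and only goes out to the API when the local row has gone stale.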

How can I store images in my heroku postgres DB

I've found some similar questions and a good link on the topic, such as this one (Easiest way to store images in database on Heroku?), where the top answer is to re-think using the DB and instead go with S3.
What I need is to store 120 images of roughly 100KB each. That's it. There is no dynamic aspect. It's not like I have 10,000 users and each one needs to store their profile picture. None of that. I just need to store 120 100KB images. The number of images will never change, neither growing nor shrinking. Three years from now, the same 120 images will be all there is.
For these reasons, I want to store them in my DB. S3 is overkill: it could cost something, and implementing that solution would take extra time and effort. In my DB, it's a measly 12MB, which is a fraction of a fraction of the DB's total size.
How can I store the images? What datatype should I use and how can I upload them into my DB?
Just store them as bytea fields in a table. Insert them using a script in whatever your preferred language is and its database adapter. e.g. using Python with psycopg2 you'd open('filename','rb') the file, then pass it as a query parameter to the execute method when doing your INSERT.
For images that small there's no point using pg_largeobject and the lo wrapper, which can be useful for really big files. The only real advantage of that is that you can use lo_import to read the files directly into the database.
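The answer above describes the Python/psycopg2 route; in a Rails app on Heroku the same idea works with a plain ActiveRecord binary column, which maps to bytea on Postgres. A sketch, where the Image model, column names, and the glob path are illustrative assumptions:

# Migration (adjust the Rails version tag to your app):
class CreateImages < ActiveRecord::Migration[6.1]
  def change
    create_table :images do |t|
      t.string :name
      t.binary :data   # stored as bytea on Postgres
    end
  end
end

# One-off import, e.g. from a rake task:
Dir.glob(Rails.root.join('lib', 'assets', 'originals', '*.png')).each do |path|
  Image.create!(name: File.basename(path), data: File.binread(path))
end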

What is the most efficient way to store temp data for processing?

I am writing an application in Ruby that collects a huge amount of data from API calls and stores it in a file. After that it processes the records one by one. I was wondering if there is a better way to achieve the same thing.
Note: I want to store all of the records locally and then process them one by one, because they may change during the API calls.
I would look at storing the information in an in-memory key/value store (such as memcached or redis). If you use an in-memory key/value store, you can update information based on subsequent API calls rather than having multiple records in a file which represent the same data, just with different values.
Keep in mind, however, if your data is significantly large, you may run out of memory. That said, if you are into the gigabytes of data, the way you have implemented your solution may be the best route to take.
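A minimal sketch of that key/value approach with the redis gem (the key naming, the records collection, and the process method are assumptions, and a Redis server is assumed to be running locally):

require 'redis'
require 'json'

redis = Redis.new   # defaults to localhost:6379

# As each API call returns, store every record under its own key; a later
# call that returns a changed version of the same record simply overwrites it.
records.each do |record|                     # `records` = one API call's output
  redis.set("record:#{record['id']}", record.to_json)
end

# Afterwards, walk the keys and process the final version of each record.
redis.scan_each(match: 'record:*') do |key|
  process(JSON.parse(redis.get(key)))        # `process` is your own handling
end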

How is data loaded from a remote database stored locally (in memory)

What is the best way for data loaded from a remote database to be stored locally on iOS? (You don't need to provide any code; I just want to know the best way conceptually.)
For instance, take Twitter for iOS as an example. When it loads the tweets, does it just pull the tweet data from the remote database and store them in a local database on the iPhone? Or would it be better if the data is just stored locally as an array of objects or something similar?
See, I'm figuring that to be able to use/display the data from the remote database (for instance, in a dynamic table view), the data would have to be stored as objects anyway, so maybe they should just be stored as objects in an array. However, when researching this, I did see a lot of articles about local databases, so I was thinking maybe it's more efficient to load the remote data as a table, store it in a local database, and display data directly from the local table, or something similar.
Which one would require more overhead: storing the data as an array of Tweet objects or as a local database of tweets?
What do you think would be the best way to store remote data locally (in memory) for an app that loads data the way Twitter for iOS does?
I suppose this raises a prerequisite question: when data from a remote database is downloaded, is it usually loaded as a database table (a result set) and therefore stored as one?
Thanks!
While it's very easy to put the fetched data right into your array and use it from there, you would likely benefit from using a local database for two main reasons: scalability and persistence.
If you are hoping to download and display a large amount of data, it may be unsafe to try to store it in memory all at once. It would be more scalable to download whatever data you need, store it in a local database, and then fetch only the relevant objects you need to display.
If you download the data and only store it in an array, that data will have to be re-fetched from the remote database and re-parsed on next load of your app/view controller/etc before anything can be displayed. Instead, create a local database in which to store the downloaded data, allowing it to be readily available to display on next load while new data is fetched from your remote source.
While there is some initial overhead in creating your database, the scalability and persistence it provides are more than enough to warrant it. Storing the state of a remote database in a local database is an extremely common practice.
However, if you do not mind the user having to wait for the data to be fetched on every load, or the amount of data being fetched is relatively small, or the data is incredibly simple, you will save time and effort by skipping the local database.

Updating an existing Memcached record

I have an application that needs to perform multiple network queries, each of which returns 100 records.
I'd like to keep all the results (several thousand or so) together in a single Memcached record named according to the user's request.
Is there a way to append data to a Memcached record, or do I need to read it and write it back each time, combining the old results with the new ones in my application?
Thanks!
P.S. I'm using Rails 3.2
There's no way to append anything to a memcached key. You'd have to read it in and out of storage every time.
Redis does allow this sort of operation, however; as rubish points out, it has a native list type that allows you to push new data onto it. Check out the Redis list documentation for information on how to do that.
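A short sketch of the Redis list approach with the redis gem (the key name, user_request, and batch variables are assumptions):

require 'redis'
require 'json'

redis = Redis.new
key   = "results:#{user_request}"   # one list per user request

# Each network query appends its batch of ~100 records to the same list.
redis.rpush(key, batch.to_json)     # `batch` = one query's results
redis.expire(key, 3600)             # optional TTL so old result sets age out

# Read the whole accumulated set back in one call when needed.
all_records = redis.lrange(key, 0, -1).flat_map { |chunk| JSON.parse(chunk) }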
You can write a class that emulates a list in memcached (which is actually what I did), but appending to a record isn't an atomic operation, so it will generate errors that accumulate over time (at least in memcached). Besides, it will be very slow.
As pointed out, Redis has native lists, but a list can be emulated in any NoSQL / key-value storage solution.
