What is the most efficient way to store temp data for processing? - ruby-on-rails

I am writing an application in Ruby which collects a huge amount of data from API calls and stores it in a file. After that it processes the records one by one. I was wondering if there is a better way to achieve the same?
Note: I want to store all of the records locally and process them one by one, because they may change during the API calls.

I would look at storing the information in an in-memory key/value store (such as memcached or redis). If you use an in-memory key/value store, you can update information based on subsequent API calls rather than having multiple records in a file which represent the same data, just with different values.
Keep in mind, however, that if your data set is significantly large, you may run out of memory. That said, if you are into the gigabytes of data, the way you have implemented your solution may be the best route to take.
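For example, here is a minimal sketch with the redis gem, assuming each API record carries a stable "id" field (the key scheme is made up); a repeated API call for the same record then simply overwrites the stale value instead of adding a duplicate:

require "json"
require "redis"

redis = Redis.new

# Upsert: a later API call for the same id overwrites the old value.
def store_record(redis, record)
  redis.set("record:#{record['id']}", record.to_json)
end

# Afterwards, process the records one by one.
redis.scan_each(match: "record:*") do |key|
  record = JSON.parse(redis.get(key))
  # ... process record ...
end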

Related

Xodus disable transactions

I'm going to use Xodus for storing time-series data (100-500 million rows are inserted daily).
I saw that Xodus was creating and deleting a lot of .xd files in the background. I read about the log-structured design, but I don't clearly understand whether a file is created on each transaction commit. Does each file represent a snapshot of the whole database? Is there any way to disable transactions (I don't need them)?
Can I get any performance benefit by sharding my data between different stores? I could store every metric in a separate store instead of using one store with a multikey. For now I'm creating a separate store for each day.
The .xd files don't actually represent particular transactions. The files are ordered, so they can be thought of as an infinite log of records. Each transaction writes its changes plus some meta information that makes it possible to retrieve/search the saved data. Each .xd file has a maximum size, and when it is reached, a new file is created.
It is not possible to disable transactions.
Basically, sharding your data between different stores gives better performance; at the least, the smaller the stores are, the faster and more smoothly the GC works in the background. The way you shard your data defines the way you can retrieve it. If the data in different shards is completely decoupled, then it is even better to store the shards in different environments, not in stores of a single environment. This will also isolate the data in different shards physically, not only logically.

Singletons vs Core Data

This question is not about the technical problem, but rather the approach.
I know two more or less common approaches to store the data received from the server in your app:
1) Using managers, data holders, etc. to store the data. They are most often some kind of singleton and are used to store the models received from the server (e.g. the array of posts/places/users). Singletons are needed to be able to access the data from any screen. I think the majority of apps use this approach.
2) Using Core Data (or maybe Realm) as in-memory storage. This approach avoids singletons, but, I guess, it is a bit more complex (and crash-prone) to maintain and support.
How do you store data and why?
P.S. Any answers would help. But big "thank you" for detailed ones, with reasons.
The reason people opt to use Core Data/Realm/Shark or any other iOS ORM is mainly to persist data between runs of the app.
Currently there are two ways of doing this. For single values and very small objects (not that I encourage it) you can use UserDefaults to persist between app launches. For an approach closer to a database (in fact, Core Data and SharkORM are built on top of SQLite), you need to use an ORM.
Using a manager to store an array of data models will only persist those models for the lifetime of the app. For example, when the user force-quits the app, restarts their device, or, in some circumstances, when iOS terminates your app, all that data is lost permanently. This is because it is stored in RAM, which is volatile, rather than in a database on the disk itself.
Using a database layer even if you don't specifically require persistence between launches can have its advantages, though; for instance, SharkORM allows you to execute raw SQL queries on your objects if you don't want to use the built-in query builder. This can be useful to quickly pull the model you are interested in rather than iterating through a local array.
In reply to your question, how do I store data?
Well, I use a combination of all three. Say, for instance, I called an API for some data that I wanted to display to the user there and then; I would use a manager instance with an array to hold the data models.
But on the flip side, if I wanted to store that data for later, or if I needed to execute a complex query on it, I would store it on disk using Shark.
If, however, I just wanted to store whether or not the user had seen my onboarding flow, I would just persist a boolean value into UserDefaults.
I hope this is detailed enough for you.
Core Data isn't strictly "in-memory". You can load objects into your data model and save them into their context; they may then actually be on disk and out of main memory, and they can easily be brought back via fetch requests.
Singletons, on the other hand, do typically stay in main memory all the time until the user terminates the app. If you have larger objects that you are storing in some data structure (e.g. full resolution images when all you really needed was a thumbnail), this can be quite a resource hog.

When fetching data using an api, is it best to store that data on another database, or is it best to keep fetching that data whenever you need it? [duplicate]

This question already has an answer here:
Caching calls to an external API in a rails app
I'm using the TMDB API in order to fetch information such as film titles and release years, but I am wondering whether I need to create an extra database to store all this information locally, rather than keep having to use the API to get it. For example, should I create a film model and call:
film.title
and by doing so access a local database with the title stored in it, or do I call:
Tmdb::Movie.detail(550).title
and by doing so make another call to the API?
Having dealt with a large Rails application that made service calls to about a dozen other applications, I'd say caching is your best bet. The problem with the database solution is keeping it up to date. The problem with making the calls every time is that it's too slow. There is a middle ground. For this you want Ruby on Rails low-level caching:
Sometimes you need to cache a particular value or query result instead of caching view fragments. Rails' caching mechanism works great for storing any kind of information.
The most efficient way to implement low-level caching is using the Rails.cache.fetch method. This method does both reading and writing to the cache. When passed only a single argument, the key is fetched and value from the cache is returned. If a block is passed, the result of the block will be cached to the given key and the result is returned.
An example that is pertinent to your use case:
class TmdbService
  # Cache TMDB movie details for four hours to avoid repeated API calls.
  def self.movie_details(id)
    Rails.cache.fetch("tmdb.movie.details.#{id}", expires_in: 4.hours) do
      Tmdb::Movie.detail(id)
    end
  end
end
You can then configure your Rails application to use memcached or the database for the cache; it doesn't matter. The point is that you want this cached data to expire at some point to ensure you are getting up-to-date information.
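For instance, one plausible way to wire that up (the store choice and server address are just placeholders):

# config/environments/production.rb
config.cache_store = :mem_cache_store, "localhost:11211"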
This is a big decision to make. If the amount of data you get through the API is not huge, you can store all of it in your database. This way you will get the data much faster, and your application will work even when the API is down.
If the amount of data you get is huge and you don't have the resources to store all of it, you should at least store the most important data in your database as a cache.
If you do not store any data on your own, you are dependent on the source of the data, and it can have downtime.
The problem with storing data on your side is that the data can change and you then need to synchronize. Even in that case it is still good to store data on your side as a cache, to get results faster, and to synchronize the data periodically.
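One way to sketch that periodic synchronization in Rails, assuming a hypothetical Film model with tmdb_id and title columns (run it from cron or any job scheduler):

# Refresh every locally cached film from the TMDB API.
class FilmSyncJob < ApplicationJob
  queue_as :default

  def perform
    Film.find_each do |film|
      film.update(title: Tmdb::Movie.detail(film.tmdb_id).title)
    end
  end
end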
Calls to a local database are way faster than calls to external APIs. I would expect a local database to return within a few milliseconds, whereas an API will probably take hundreds of milliseconds. And local calls are less likely to be affected by network issues or downtime.
Therefore I would always cache the result of an API call in a local database and occasionally update the local version with a newer version from the API.
But in the end it depends on your requirements: Do you need real-time data, or is a cached version okay? How often do you need the data, and how often is it updated? How fast is the API, and is latency an issue? Does the API have a rate limit (a maximum number of requests per unit of time)?
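If a cached version is okay, a read-through variant of the same idea is to refresh a record only when it has gone stale. A sketch, assuming the same hypothetical Film model and treating anything older than 12 hours as stale:

class Film < ApplicationRecord
  STALE_AFTER = 12.hours

  # Returns the cached title, refreshing it from the API when stale.
  def fresh_title
    update(title: Tmdb::Movie.detail(tmdb_id).title) if updated_at < STALE_AFTER.ago
    title
  end
end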

Core Data vs NSFileManager

The problem:
I have been using my own cache system for some time now, built on NSFileManager. Normally the data I receive is JSON, and I just save the dictionary directly into the cache (in the Documents folder). When I need it back, I just go get it. Sometimes, when I feel it's better, I also keep an NSDictionary in the root folder with keys/values mapping to the path of a given resource. For example:
A resource about the weather in Geneve on 17/02/2013: I would have a key called GE_17_02_2013 whose value is the path to the NSDictionary with the information.
Normally I don't need to do any complex queries. But, from what I have been reading, when you have a lot of data you should stick with Core Data. In my case I normally do have a lot of data, but I have never actually felt the application going down or suffering in terms of performance. So my questions are:
1) In this case, where sometimes (the weather thing was just an example) I need to remove all the data (a Twitter feed, for example) and replace it with a completely new stream of data, is Core Data worth it? I think removing all the data and inserting (populating) it again is heavier than just storing the NSDictionary and replacing the old one.
2) Sometimes it would involve images, text files, etc., and NSFileManager handles them perfectly, so what advantages could Core Data bring in these cases?
P.S.: I just saw this post, where this kind of question is asked and numbers show which one is actually faster. Still, I would also like an empirical answer.
Core Data is worth using in every scenario you described. In fact, if an app stores more than preferences, you should probably use Core Data. Here are some reasons, among which you'll find answers to your own problems:
- It is definitely faster than the filesystem, even if you wipe everything out and write it again as you describe (so you don't benefit too much from caching). This is basically because you can coalesce your operations and only access the store when needed. So if you read, write and read, you can save only once; the rest is done in memory, which is, needless to say, very fast.
- Everything is versioned and you can migrate from one version to another easily (while keeping the content the user has on the device).
- 80% of your model operations come for free. For example, when something changes, you can override the willSave managed object method and notify your controllers.
- Cascade deletion makes it trivial to delete even very complex object structures.
- While it is a bad idea to keep images in the database, you can still keep them on the filesystem and have Core Data delete them automatically when the managed object that represents them is deleted.
- It is flexible; in fact, it is so flexible that you could migrate your project from using the local filesystem to using a server with very few modifications, by writing a custom data store.
- The Core Data designer basically creates the model objects for you. You don't need to create your own model classes (which you would have to if using the filesystem).
In this case ... is Core Data worth it?
Yes, to the extent that you need something more centrally managed than trying to draw up your own file-system schema. Core Data, and its lower-level cousin SQLite, are still the best choices for persistence that we have right now. Besides, the performance hit of using NSKeyed(Un)Archiver to keep serializing/deserializing a dictionary over and over again becomes increasingly noticeable with larger datasets.
I think removing all the data, and inserting (populating) it, is heavier than just storing the NSDictionary and replacing the old one.
Absolutely, yes. But, you don't have to think about cache turnover like that. If you imagine Core Data as a static model, you can use a cache layer to ferry data in and out of the store. Need that resource about the weather? Check the cache layer. If it's not in there, make the cache run a fetch request. Need to turn over the whole cache? Have the cache empty itself then run a request to mark every entity with some kind of flag to show they are invalid. The expensive deletion you're worrying about can be done by a background process when you see that all your new data has been safely interned in the cache.
Sometimes it would involve images, text files, etc., and NSFileManager handles them perfectly, so what advantages could Core Data bring in these cases?
Unfortunately, not many. For blobs of data (which is essentially what Core Data stores in these situations), storage and fetches to and from Core Data can quickly get costly. They can also take up noticeably more space on disk if they aren't compressed (which further decreases performance). If you need a faster alternative, use a store more suited to the task, like Tokyo Cabinet or LevelDB, and use the entities in the Core Data store as a kind of stand-in that would, say, contain the key to the resource in one of those key-value stores.

Updating an existing Memcached record

I have an application that needs to perform multiple network queries, each of which returns 100 records.
I'd like to keep all the results (several thousand or so) together in a single memcached record named according to the user's request.
Is there a way to append data to a memcached record, or do I need to read it and write it back and forth, combining the old results with the new ones in my application?
Thanks!
P.S. I'm using Rails 3.2
There's no way to append anything to a memcached key through Rails' cache store; you'd have to read the value in and write it back out every time.
Redis does allow this sort of operation, however; as rubish points out, it has a native list type that lets you push new data onto it. Check out the Redis list documentation for information on how to do that.
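A minimal sketch with the redis gem (the key scheme and sample batch are made up):

require "json"
require "redis"

redis = Redis.new
user_key = "results:42"  # one list per user request

# Each network query pushes its batch of records onto the list.
batch = [{ "id" => 1 }, { "id" => 2 }]  # stand-in for 100 API records
batch.each { |record| redis.rpush(user_key, record.to_json) }

# Later, read the combined result set back in one call.
all_records = redis.lrange(user_key, 0, -1).map { |raw| JSON.parse(raw) }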
You can write a class that'll emulate a list in memcached (which is actually what I did)... but appending to a record isn't an atomic operation, so it'll generate errors that accumulate over time (at least in memcached). Besides, it'll be very slow.
As pointed out, Redis has native lists, but a list can be emulated in any NoSQL/key-value storage solution.
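For illustration, a bare-bones version of that emulation with the dalli gem; the gap between the get and the set is exactly the non-atomicity this answer warns about, since a concurrent append in that window is silently lost:

require "dalli"

cache = Dalli::Client.new("localhost:11211")

# Emulated list append: read-modify-write. NOT atomic -- another
# client appending between the get and the set gets overwritten.
def append_to_list(cache, key, item)
  list = cache.get(key) || []
  list << item
  cache.set(key, list)
end

append_to_list(cache, "results", { "id" => 1 })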
