I'm going to use Xodus for storing time-series data (100-500 million rows are inserted daily).
I saw that Xodus was creating and deleting a lot of .xd files in the background. I read about the log-structured design, but I don't clearly understand whether a file is created on each transaction commit. Does each file represent a snapshot of the whole database? Is there any way to disable transactions (I don't need them)?
Can I get any performance benefit by sharding my data between different stores? I could store every metric in a separate store instead of using one store with a composite key. For now I'm creating a separate store for each day.
The .xd files don't correspond to particular transactions. The files are ordered, so they can be thought of as an infinite log of records. Each transaction writes its changes plus some meta information that makes it possible to retrieve/search the saved data. Each .xd file has a maximum size, and when it is reached a new file is created.
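If the number and size of the .xd files matters for your setup, the maximum file size can be tuned through EnvironmentConfig. A minimal sketch, assuming setLogFileSize() takes the maximum size of a single file in kilobytes (verify against your Xodus version):

```java
import jetbrains.exodus.env.Environment;
import jetbrains.exodus.env.EnvironmentConfig;
import jetbrains.exodus.env.Environments;

public class OpenEnvironment {
    public static void main(String[] args) {
        // Assumption: setLogFileSize() sets the maximum size of a single .xd file
        // in kilobytes (default 8192 KB); double-check against your Xodus version.
        EnvironmentConfig config = new EnvironmentConfig();
        config.setLogFileSize(32 * 1024);

        Environment env = Environments.newInstance("/path/to/timeseries", config);
        try {
            // ... transactions append records to the current .xd file; when it
            // reaches the configured size, the next file is started ...
        } finally {
            env.close();
        }
    }
}
```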
It is not possible to disable transactions.
Basically, sharding your data between different stores gives better performance; at the very least, the smaller the stores are, the faster and smoother the GC works in the background. The way you shard your data defines the way you can retrieve it. If the data in different shards is completely decoupled, then it is even better to store the shards in different environments, not in stores of a single environment. This also isolates the data in different shards physically, not only logically.
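For example, with one environment per day, writes could look roughly like the sketch below. This is only an illustration: the directory layout, store name, and choice of bindings are assumptions, not recommendations.

```java
import jetbrains.exodus.ByteIterable;
import jetbrains.exodus.bindings.DoubleBinding;
import jetbrains.exodus.bindings.LongBinding;
import jetbrains.exodus.env.Environment;
import jetbrains.exodus.env.Environments;
import jetbrains.exodus.env.Store;
import jetbrains.exodus.env.StoreConfig;

public class DailyShards {

    // One environment (one directory of .xd files) per day, so a whole day
    // can later be dropped or archived by removing its directory.
    static Environment openDay(String baseDir, String day) {
        return Environments.newInstance(baseDir + "/" + day);
    }

    static void writePoint(Environment env, long timestampMillis, double value) {
        env.executeInTransaction(txn -> {
            Store store = env.openStore("points", StoreConfig.WITHOUT_DUPLICATES, txn);
            ByteIterable key = LongBinding.longToEntry(timestampMillis);
            ByteIterable val = DoubleBinding.doubleToEntry(value);
            store.put(txn, key, val);
        });
    }

    public static void main(String[] args) {
        Environment env = openDay("/data/metrics", "2016-04-05");
        try {
            writePoint(env, System.currentTimeMillis(), 42.0);
        } finally {
            env.close();
        }
    }
}
```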
We are planning to build an operational data store (ODS) for front-end users' data extraction requirements.
As far as I know Kimball's approach to building an ODS/DW, it should hold the data for the complete time period, not just a rolling time period.
The reason being that there could be a need to extract older data from the ODS/DW.
So I need your thoughts on this. How should I approach it?
I would create a snapshot table that holds the values for the rolling period for each day, and filter on the client side which snapshot to display.
Once the period is over, the final values can be stored in the permanent data mart.
Kimball's approach for a data warehouse would be to load transactional data into the warehouse if you can, because it is more flexible in terms of how it can be rolled up. Certainly at the ODS stage you wouldn't want to 'pre-aggregate' your data if there could be a need to get hold of older data.
If you store both the transactional data and pre-aggregated versions of it (in aggregate fact tables, with indexes/views or a cube, or just by filtering on the report side as the other answer suggests), you can get the best of both worlds.
(Note: Kimball's approach in fact does not require an ODS: they're fine if you want to build one, but their focus is on the dimensionally modelled data warehouse.)
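Purely to illustrate "keep the transactional grain and derive the rollups from it" (nothing Kimball-specific, just the idea in Java with invented names):

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class Rollup {
    // One transactional fact row; the fields are invented for illustration.
    record Sale(LocalDate day, String product, double amount) {}

    public static void main(String[] args) {
        List<Sale> facts = List.of(
                new Sale(LocalDate.of(2016, 4, 1), "widget", 10.0),
                new Sale(LocalDate.of(2016, 4, 1), "widget", 15.0),
                new Sale(LocalDate.of(2016, 4, 2), "widget", 7.5));

        // Daily totals are always re-derivable from the detail rows, so keeping
        // the detail preserves every rollup you might need later.
        Map<LocalDate, Double> dailyTotals = facts.stream()
                .collect(Collectors.groupingBy(Sale::day, TreeMap::new,
                        Collectors.summingDouble(Sale::amount)));

        System.out.println(dailyTotals); // {2016-04-01=25.0, 2016-04-02=7.5}
    }
}
```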
By using both the synchronous=OFF and journal_mode=MEMORY options, I am able to reduce the time for updates from 15 ms to around 2 ms, which is a major performance improvement. These updates happen one at a time, so many other optimizations (like wrapping a bunch of them in a single transaction) are not applicable.
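For concreteness, the two options are just PRAGMA statements issued on the connection. A minimal sketch using the sqlite-jdbc driver (on iOS the same statements would go through sqlite3_exec() or whatever wrapper is in use):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FastButNonDurable {
    public static void main(String[] args) throws Exception {
        // Assumption: the sqlite-jdbc driver is on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:example.db");
             Statement stmt = conn.createStatement()) {
            // Trade durability for speed: no fsync, rollback journal kept in RAM only.
            stmt.execute("PRAGMA synchronous = OFF");
            stmt.execute("PRAGMA journal_mode = MEMORY");

            stmt.execute("CREATE TABLE IF NOT EXISTS kv (id INTEGER PRIMARY KEY, value TEXT)");
            stmt.execute("UPDATE kv SET value = 'x' WHERE id = 1"); // the ~2 ms single-row update
        }
    }
}
```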
According to the SQLite documentation, the DB can go 'corrupt' in the worst case if there is a power outage of some type. However, isn't the worst thing that can happen that the data is lost, or possibly part of a transaction is lost (which I guess is a form of corruption)? Is it really possible for arbitrary corruption to occur with either of these options? If so, why?
I am not using any transactions, so partially written data from transactions is not a concern, and I can handle losing data once in a blue moon. But if 'corruption' means that all the data in the DB can be randomly changed in an unpredictable way, that would be a strong reason not to use these options.
Does anyone know what the real worst-case behavior would be on iOS?
Tables are organized as B-trees with the rowid as the key.
If some writes get lost while SQLite is updating the tree structure, the entire table might become unreadable.
(The same can happen with indexes, but those could be simply dropped and recreated.)
Data is organized in pages (typically 1 KB or 4 KB). If some page update gets lost while a tree is being reorganized, all the data in those pages (i.e., some random rows from the table with nearby rowid values) might become corrupted.
If SQLite needs to allocate a new page, and that page contains plausible data (e.g., deleted data from the same table), and the writing of that page gets lost, then you have incorrect data in the table, without the ability to detect it.
How much data can a column of Mnesia store? Is there any limit on it, or can we store as much as we want? Any pointers? (The table is a disc_only_copy.)
As with any potentially large data set (in terms of total entries, not total volume of bytes) the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table was a painfully naive decision that has to be overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferrable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
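For instance, a partition key derived from time and channel might look like the sketch below (Java only to keep one language across these examples; every name is made up):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class ChatPartitions {
    // Hypothetical scheme: one table (or Mnesia table name) per channel per month.
    static String partitionName(String channel, long epochMillis) {
        LocalDate day = Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
        return String.format("chat_%s_%d_%02d", channel, day.getYear(), day.getMonthValue());
    }

    public static void main(String[] args) {
        // Both writes and reads compute the same name, which answers the
        // "how do I locate the data later?" question by construction.
        System.out.println(partitionName("general", System.currentTimeMillis()));
    }
}
```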
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, it's not that hard to imagine simply plugging everything into the database. But if it's a chat system that can have large files attached to a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to them in the database. If I were creating a movie archive I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I were using Postgres (which can actually stand up to that sort of abuse... but think about the new awkwardness of database dumps and backups, and the massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if it's silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
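As a purely hypothetical illustration of "write the API you wish you could use" (Java instead of Erlang only for consistency with the other snippets here; every name is invented):

```java
import java.time.Instant;
import java.util.List;

// A first draft of the storage API, written before deciding how to partition
// anything. The queries you find yourself writing (all by channel and time
// range below) hint at the partitions you actually need.
interface ChatArchive {
    void append(String channel, String user, Instant at, String message);

    List<String> messagesInChannel(String channel, Instant from, Instant to);

    long messageCount(String channel, Instant from, Instant to);
}
```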
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database, which means it is not designed to store large amounts of data. When you find yourself asking this question, it means you should look at another ejabberd backend.
I am writing an application in Ruby which collects a huge amount of data from API calls and stores it in a file. After that it processes the records one by one. I was wondering if there is a better way to achieve this?
Note: I want to store all of the records locally and then process them one by one, because they may change during the API calls.
I would look at storing the information in an in-memory key/value store (such as memcached or redis). If you use an in-memory key/value store, you can update information based on subsequent API calls rather than having multiple records in a file which represent the same data, just with different values.
Keep in mind, however, if your data is significantly large, you may run out of memory. That said, if you are into the gigabytes of data, the way you have implemented your solution may be the best route to take.
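A rough sketch of the key/value idea using Redis through the Jedis client (Java only to keep one language across these examples; the Ruby redis gem exposes the same set/get calls):

```java
import redis.clients.jedis.Jedis;

public class ApiCache {
    public static void main(String[] args) {
        // Assumption: a local Redis instance and the Jedis client on the classpath.
        try (Jedis redis = new Jedis("localhost", 6379)) {
            // Keying each record by its API identifier means a later API call
            // simply overwrites the stale value instead of appending a duplicate.
            redis.set("record:12345", "{\"status\":\"pending\"}");
            redis.set("record:12345", "{\"status\":\"complete\"}"); // update in place

            String latest = redis.get("record:12345");
            System.out.println(latest); // only the most recent version is kept
        }
    }
}
```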
The first time, I make a network call to SQL Server to get the table data, which has 14 fields, mostly image blobs. The table has over 200,000 records.
Can we store the 200,000 records in the device's local database using Core Data?
What is the best way to handle the images: store them in local files, in the DB, or load them remotely?
The app should work offline.
Please suggest the best possible way to meet the above requirements.
Storing 200k records in Core Data is not a problem in itself, as long as you do the initial import of those records correctly. Make sure you implement your update-or-insert properly, otherwise your users will have to wait for time proportional to N^2. Apple suggests a nice implementation for this: https://developer.apple.com/library/mac/documentation/cocoa/conceptual/coredata/articles/cdimporting.html
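The linked pattern boils down to: sort the incoming IDs, fetch the matching local records sorted the same way, and walk both lists in a single pass instead of issuing one fetch per record. Here is that core idea stripped down to plain Java collections (no Core Data specifics; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FindOrCreate {
    // existing: records already stored locally, iterated in ascending id order.
    // incoming: ids from the server, which must also be sorted ascending.
    // Returns the ids that need inserting; every other incoming id is an update.
    static List<Long> idsToInsert(TreeMap<Long, String> existing, List<Long> incoming) {
        List<Long> toInsert = new ArrayList<>();
        Iterator<Map.Entry<Long, String>> it = existing.entrySet().iterator();
        Map.Entry<Long, String> current = it.hasNext() ? it.next() : null;
        for (Long id : incoming) {                               // one pass over both lists
            while (current != null && current.getKey() < id) {
                current = it.hasNext() ? it.next() : null;       // skip local rows below id
            }
            if (current == null || !current.getKey().equals(id)) {
                toInsert.add(id);                                // not found locally -> insert
            }                                                    // found -> update in place
        }
        return toInsert;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> existing = new TreeMap<>();
        existing.put(1L, "old");
        existing.put(3L, "old");
        System.out.println(idsToInsert(existing, List.of(1L, 2L, 3L, 4L))); // [2, 4]
    }
}
```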
Then, once you have the local data, you probably need to fine-tune the batch size of your fetch requests, but that's a good idea to do even if you don't have 200k records.
As for the images, never ever store them in Core Data as binary blobs. Always store them as normal files on disk and store their path in Core Data to access them later on.