What is the storage capacity of a Mnesia database? - erlang

Some places state 2GB, period. Others state that it depends upon the number of nodes.

Quite large if your question is "what's the storage capacity of an mnesia database made up of a huge number of disc_only_copies tables" - you're largely limited by available disk space.
An easier question to answer is the maximum capacity of a single mnesia table of each type. ram_copies tables are limited by available memory. disc_copies tables are limited by their dets backend (Hakan Mattsson on Mnesia) - this limit is 4Gb of data at the moment.
So the simple answer is that a simple disc_copies table can store up to 4Gb of data before it runs into problems. (Mnesia doesn't actually crash if you exceed the on-disk size limit - the ram_copies portion of the table continues running, so you can repair this at runtime by deleting data or making other arrangements.)
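To make the distinction between table types concrete, here is a minimal sketch (the table names and attributes are invented for illustration, and it assumes mnesia is already started with a disc schema on this node) showing how the storage type is chosen per table, per node, at creation time:

    -module(table_types_sketch).
    -export([create_tables/0]).

    create_tables() ->
        %% RAM only: bounded by available memory, no durability.
        {atomic, ok} = mnesia:create_table(cache_tab,
            [{ram_copies, [node()]}, {attributes, [key, value]}]),
        %% RAM + disk: reads are served from RAM, updates also go to disk.
        {atomic, ok} = mnesia:create_table(store_tab,
            [{disc_copies, [node()]}, {attributes, [key, value]}]),
        %% Disk only: backed by dets, so subject to the dets file size limit.
        {atomic, ok} = mnesia:create_table(archive_tab,
            [{disc_only_copies, [node()]}, {attributes, [key, value]}]),
        ok.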
However if you consider other mnesia features, then the answer is more complicated.
local_content tables. If the table is a local_content table, then it can have different contents on each node in the mnesia cluster, so the capacity of the table is 4Gb * <number of nodes>.
fragmented tables. Mnesia supports user-configurable table partitioning or sharding using table fragments. In this case you can effectively distribute and redistribute the data in your table over a number of primitive tables. These primitive tables can each have their own configuration - say one ram_copies table and the rest disc_only_copies tables. These primitive tables have the same size limits as mentioned earlier, and now the effective capacity of the fragmented table is 4Gb * <number of fragments>. (Sadly if you fragment your table, you then have to modify your table access code to use mnesia:activity/4 instead of mnesia:write and friends, but if you plan this in advance it's manageable - there's a short sketch of this below.)
external copies. If you like living on the extreme bleeding edge, you could apply the mnesiaex patches to mnesia and store your table data in an external system such as Amazon S3 or Tokyo Cabinet. In this case the capacity of the table is limited by the backend storage.
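To illustrate the local_content and fragmentation points above, a rough sketch (all names invented, single node assumed). Note that the fragmented table is accessed through mnesia:activity/4 with the mnesia_frag access module:

    -module(frag_sketch).
    -export([setup/0, write/2, read/1]).

    setup() ->
        %% local_content: each node in the cluster stores its own contents.
        {atomic, ok} = mnesia:create_table(node_stats,
            [{disc_copies, [node()]}, {local_content, true},
             {attributes, [key, value]}]),
        %% Fragmented table: eight primitive tables, one disc_only copy each.
        {atomic, ok} = mnesia:create_table(big_tab,
            [{frag_properties, [{n_fragments, 8},
                                {node_pool, [node()]},
                                {n_disc_only_copies, 1}]},
             {attributes, [key, value]}]),
        ok.

    write(Key, Value) ->
        %% mnesia_frag hashes the key and routes the write to the right fragment.
        mnesia:activity(transaction,
                        fun() -> mnesia:write({big_tab, Key, Value}) end,
                        [], mnesia_frag).

    read(Key) ->
        mnesia:activity(transaction,
                        fun() -> mnesia:read(big_tab, Key) end,
                        [], mnesia_frag).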

TL;DR: the storage capacity of a Mnesia database is limited only* by available RAM.
* Assuming you use table types ram_copies or disc_copies. Also, if you store a lot of data in a disc_copies table, it needs to be read from disk at startup, which might increase startup time beyond what's acceptable.
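If that startup load time matters, the usual pattern is to wait explicitly for the large disc_copies tables to finish loading before serving traffic. A small sketch (the table name and timeout are placeholders):

    -module(startup_sketch).
    -export([start/0]).

    start() ->
        ok = mnesia:start(),
        %% Block until the disc_copies table has been loaded into RAM,
        %% or give up after five minutes; very big tables may need longer.
        case mnesia:wait_for_tables([store_tab], 300000) of
            ok -> ok;
            {timeout, Tabs} -> {error, {still_loading, Tabs}};
            {error, Reason} -> {error, Reason}
        end.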
This answer contradicts the two existing answers when it comes to tables of type disc_copies. Let me first get a few general points out of the way:
A mnesia table of type ram_copies is only limited by available RAM (except if you're on a 32-bit machine). Data is stored in an ETS table.
A mnesia table of type disc_only_copies is stored in a Dets table. Dets tables are limited to 2 GB, because of limits in the file format.
The obvious way to circumvent that limit is to create more tables, possibly through table fragmentation.
The schema is also stored in a Dets table, so the information describing all existing tables is also limited to 2 GB. You are likely to run into other limits before you hit that one, though.
A mnesia table of type disc_copies is stored both in RAM and on disk, so it is limited by available RAM - and perhaps something else?
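Whatever the theoretical limits, it is easy to check how large a table actually is at runtime. A small sketch (if I remember the docs correctly, the memory figure is in words for ets-backed tables and in bytes for dets-backed ones):

    -module(footprint_sketch).
    -export([table_footprint/1]).

    table_footprint(Tab) ->
        Records = mnesia:table_info(Tab, size),         %% number of records
        Memory  = mnesia:table_info(Tab, memory),       %% words (ets) or bytes (dets)
        Storage = mnesia:table_info(Tab, storage_type), %% ram_copies | disc_copies | disc_only_copies
        {Tab, Storage, Records, Memory}.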
I'm going to try to show below that there is no specific limit imposed by Mnesia on the size of a disc_copies table. Note however that many Erlang programmers believe that disc_copies tables are limited to 2 GB. That is stated in the accepted answer to this question, which at the time of writing outscores this answer by a factor of 7.
disc_copies moved from dets to disk_log in 2001
It is commonly believed that disc_copies tables are backed by Dets tables. As far as I can tell, this was the case until Erlang/OTP R7B-4 (released on 30th September 2001). From the README:
-- mnesia -----------------------------------------------------------------
OTP-3712 - Speed/load improvements disc_copies tables are not
implemented with dets anymore.
Look at the diff for more details, in particular mnesia_lib.erl and mnesia_loader.erl.
Sources supporting dets and a 2 / 4 GB limit
archelaus's answer draws from http://erlang.org/~hakan/mnesia_consumption.txt, which explains that disc_copies tables reside in ets and dets tables. However, looking at the index for the directory, we see that this document is dated 1999:
[TXT] mnesia_consumption.txt 26-Oct-1999 10:57 10k
It makes sense that it would say this, as it was written two years before the change.
Ray Boosen's answer draws from the Erlang FAQ:
11.5 How much data can be stored in Mnesia?
Dets uses 32 bit integers for file offsets, so the largest possible mnesia table (for now) is 4Gb.
In practice your machine will slow to a crawl way before you reach this limit.
The FAQ has been saying that since at least January 2001 (see the earliest copy in the Wayback Machine). That means that this FAQ entry dates from before the switch to disk_log, and hasn't been updated for a long time. (Anyway, the Dets table size limit is 2 GB, not 4 GB.) I submitted a pull request for the FAQ.
Sources supporting higher limits
The Learn You Some Erlang chapter on Mnesia says:
ram_copies
This option makes it so all data is stored exclusively in ETS, so memory only. Memory should be limited to a theoretical 4GB (and practically around 3GB) for virtual machines compiled on 32 bits, but this limit is pushed further away on 64 bits virtual machines, assuming there is more than 4GB of memory available.
disc_only_copies
This option means that the data is stored only in DETS. Disc only, and as such the storage is limited to DETS' 2GB limit.
disc_copies
This option means that the data is stored both in ETS and on disk, so both memory and the hard disk. disc_copies tables are not limited by DETS limits, as Mnesia uses a complex system of transaction logs and checkpoints that allow to create a disk-based backup of the table in memory.
I'm not sure when this was written, but the text above exists in the earliest Wayback Machine copy, dated April 2012.
In a post on erlang-questions titled "beating mnesia to death (was RE: Using 4Gb of ram with Erlang VM)", dated 7th November 2005, Ulf Wiger writes:
On a 16 GB machine, you can:
run 6 million simultaneous processes (through use of erlang:hibernate, I was actually able to run 20 million - spawn time: 6.3 us, message passing time: 5.3 us, and I had 1.8 GB to spare.)
populate mnesia with at least 12 GB of data, but think through how you want to represent it, since the 64-bit word size blows things up a bit.
keep a 10 GB+ disc_copy table in mnesia. The load times and log dump cost seem acceptable (10 minutes to load, dumping takes a while but runs in the background quite nicely.)
Conclusions
The confusion seems to stem from missing or outdated information in official sources:
The Mnesia documentation doesn't mention any table size limits
The Erlang FAQ says that Mnesia is subject to a 4 GB Dets size limit, but this answer was written before the dets to disk_log change
The only other document on the erlang.org domain is Håkan Mattsson's document, dating from before the dets to disk_log change
LYSE seems to be the first "authoritative" source that mentions disc_copies tables not being subject to the Dets table size limit.

As per the documentation, this is 4GB. See section 11.5 of the Erlang FAQ:
http://erlang.org/faq/mnesia.html

Related

Queries performances on ADLS gen 2

I'm trying to migrate our "old school" database (mostly time series) to an Azure Data Lake.
So I took a random table (10 years of data, 200m records, 20Gb), copied the data into a single csv file AND also took the same data and created 4000 daily files (in monthly folders).
On top of those 2 sets of files, I created 2 external tables... and I'm getting pretty much the same performance for both of them. (?!?)
No matter what I'm querying, whether I'm looking for data on a single day (thus in a single small file) or summing over the whole dataset... it basically takes 3 minutes, no matter if I'm looking at a single file or the daily files (4000). It's as if the whole dataset had to be loaded into memory before doing anything?!?
So is there a setting somewhere that I could change to avoid having to load all the data when it's not required? It could literally make my queries 1000x faster.
As far as I understand, indexes are not possible on external tables. Creating a materialized view would defeat the purpose of using a Lake.
Full disclosure; I'm new to Azure Data Storage, I'm trying to see if it's the correct technology to address our issue.
Best practice is to use Parquet format, not CSV. It is a columnar format optimized for OLAP-like queries.
With Synapse Preview, you can then use the SQL on-demand engine (serverless technology) when you do not need to provision a DW cluster; you will be charged per TB of scanned data.
Or you can spin up a Synapse cluster and ingest your data into the DW using the COPY command (it is in preview as well).

Erlang - Will more fragments in Mnesia mean more performance?

I have a table in mnesia and read that the size limit of a table is only 4GB. I read that to store more data in a single table, fragmentation has to be done in mnesia. Also, when using a table without fragmentation, I noticed that the CPU usage is high (disc_only_copies), though I'm not sure why.
I wanted to know if adding more fragments will improve mnesia performance and reduce the CPU usage, or whether it is just a way to store more data in a single table?
You didn't specify what kind of table you use:
disc_only_copies: uses DETS and is limited to 2GB (don't use this!)
ram_copies: only in RAM (ETS table), limited to < 4 GB on 32-bit machines; much larger is possible on 64-bit Erlang VMs, limited by available memory
disc_copies: in RAM and in a transaction log on disk, so it doesn't have the DETS limitations, but the RAM limitations remain; if you have enough RAM and use a 64-bit VM you are fine
for more details see: LYSE on mnesia table types
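For what it's worth, fragmentation mainly lets you spread data over more (and smaller) primitive tables, possibly on more nodes; whether that helps CPU usage depends on where the load goes. A hedged sketch of adding fragments to an existing table (single node assumed, counts are illustrative):

    -module(add_frags_sketch).
    -export([more_fragments/2]).

    more_fragments(Tab, HowMany) ->
        %% Activate fragmentation if the table was created without it;
        %% this aborts harmlessly if the table is already fragmented.
        _ = mnesia:change_table_frag(Tab, {activate, []}),
        %% Add fragments one at a time; existing records are redistributed
        %% over the new fragments as they are added.
        [{atomic, ok} = mnesia:change_table_frag(Tab, {add_frag, [node()]})
         || _ <- lists:seq(1, HowMany)],
        ok.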

Neo4j Huge database query performance configuration

I am new to Neo4j and graph databases. That said, I have around 40000 independent graphs uploaded into a neo4j database using batch insertion, and so far everything went well. My current database folder size is 180Gb; the problem is querying, which is too slow. Just counting the number of nodes takes forever. I am using a server with 1TB RAM and 40 cores, so I would like to load the entire database into memory and perform queries on it.
I have looked into the configurations but am not sure what changes I should make to cache the entire database in memory, so please suggest the properties I should modify.
I also noticed that most of the time Neo4j is using only one or two cores. How can I increase that?
I am using the free version for a university research project and am therefore unable to use the High-Performance Cache. Is there an alternative in the free version?
My Solution:
I added more graphs to my database and now my database size is 400GB with more than a billion nodes. I followed Stefan's comments, used the Java APIs to access my database, and moved my database to a RAM disk. It takes 3 hours to walk through all the nodes and collect information from each node.
RAM disk and Java APIs gave a big boost in performance.
Counting nodes in a graph is a global operation that obviously needs to touch each and every node. If caches are not populated (or not configured according to your dataset), the speed of your hard disc is the most influential factor.
To speed up things, be sure to have caches configured efficiently, see http://neo4j.com/docs/stable/configuration-caches.html.
With current versions of Neo4j, a Cypher query traverses the graph in single threaded mode. Since most graph applications out there are concurrently used by multiple users, this model saturates the available cores.
If you want to run a single query multithreaded, you need to use the Java API.
In general Neo4j community edition has some limitation in scaling for more than 4 cores (due to a more performant lock manager implementation in Enterprise edition). Also the HPC (high performance cache) in Enterprise edition reduces the impact of full garbage collections significantly.
What Neo4j version are you using?
Please share your current config (conf/* and data/graph.db/messages.log). You can use the personal edition of Neo4j Enterprise.
What kinds of use cases do you want to run?
Counting all nodes is probably not your main operation (there are ways in the Java API that make it faster).
For efficient multi-core usage, run multiple clients or write Java code that utilizes more cores during traversal with thread pools.

What size is considered 'big' for the tbl_version table in TFS_Main database

We are currently experiencing significant waits in the TFS database, and are trying to understand if these are a consequence of the size of the tbl_Version version history table in the database.
Currently this table contains just over 20 million records, and is taking up approximately 6GB of storage space (total index space is just over 10GB). Looking at the queries that SQL Server is having to deal with, we have high PAGEIOLATCH_SH waits whenever this table is accessed. Obviously we don't have control over the queries being thrown at the database (all part of TFS).
Currently we have TFS on a virtual machine, and in essence want to understand whether we should (a) move to a physical machine, (b) attempt to reduce the size of tbl_Version, or (c) follow a combination of these.
In our organisation it will be non-trivial to move to a physical server, so I'd like to get a feel for whether our table sizes are 'normal' or not before making any such decision.
PAGEIOLATCH_SH typically indicates a wait for a page to be loaded from disk into memory. From the sounds of it, tbl_Version is not being kept in memory. There are 2 things you can do to improve the situation:
a. Get more RAM (not sure how much RAM you have on the server).
b. Get a faster disk subsystem.
In TFS 2010 we enable page compression if you have the Enterprise Edition of SQL Server. This should help with the problem.
Based on some 2007 stats from Microsoft: http://blogs.msdn.com/b/bharry/archive/2007/03/13/march-devdiv-dogfood-statistics.aspx probably not the biggest.
But MS (as documented on that blog) had done some DB tuning; I believe this is included in TFS 2010, but for earlier versions you'll probably need to talk to MS directly.
Caveat: We're using TFS 2008.
We're currently sitting with about 9GB of data (18GB index) with 31M rows. This is after about a year and a half of usage in an IS shop with 50-60 active developers.
Part of our problem, which we still need to address, is large binaries stored in the version control system. The answer to my question here may provide some information as to whether or not there are a few major offenders that are causing the size of that table to be bigger than you want.

Where are tables in Mnesia located?

I'm trying to compare Mnesia with more traditional databases.
As I understand it, tables in Mnesia can be located as follows (see Memory consumption in Mnesia):
ram_copies - tables are stored in ets, so no durability as in ACID.
disc_copies - tables are stored in ets and dets, so the table cannot be bigger than the available memory? And if the table is fragmented, the database cannot be bigger than the available memory?
disc_only_copies - tables are stored in dets, so there is no caching in memory and performance is worse. And the size of the table is limited by the dets size limit, or the table has to be fragmented.
So if I want the performance of doing reads from RAM and the durability of writes to disc, then the size of the tables is very limited compared to a traditional RDBMS like MySQL or PostgreSQL.
I know that Mnesia isn't meant to replace traditional RDBMSs, but can it be used as a big RDBMS, or do I have to look for another database?
The server I will use is a VPS with limited amount of memory, around 512MB, but I want good database performance.
Are disc_copies and the other types of tables in Mnesia as limited as I have understood? Can't the database be partially in memory with a full copy on disc?
The storage capacity of the Mnesia database for the different types of tables has been discussed in this previous SO question:
What is the storage capacity of a Mnesia database?
where a great answer is already available.
Obviously (but I guess you've already seen it) the official doc is available at:
http://www.erlang.org/doc/man/mnesia.html
Also, reading from the Mnesia FAQ:
11.5 How much data can be stored in Mnesia?
Dets uses 32 bit integers for file offsets, so the largest possible mnesia table (for now) is 4Gb.
In practice your machine will slow to a crawl way before you reach this limit.
Finally, Mnesia tables can be fragmented. This is discussed here and there.
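In case it's useful, here is a small sketch (no particular table assumed) of how to ask Mnesia directly where, and how, a table is kept from the point of view of the current node:

    -module(where_sketch).
    -export([where_is/1]).

    where_is(Tab) ->
        [{storage,  mnesia:table_info(Tab, storage_type)},   %% ram_copies | disc_copies | disc_only_copies | unknown
         {read_at,  mnesia:table_info(Tab, where_to_read)},  %% node this node reads the table from
         {write_to, mnesia:table_info(Tab, where_to_write)}].%% nodes holding a copy that receives writes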
These are my 2p.
