working on an erlang project using mnesia (some tables ram copies, some tables disk copies, some tables both). in an attempt to optimize a certain read (ram table), i used the ets lookup rather than the mnesia dirty_read i had been using, and timed both versions of the routine. the ets lookup was significantly faster than the mnesia dirty_read.
my question is whether there is some 'gotcha' or 'catch' to reading an mnesia table using ets vs mnesia (there must be, otherwise there is no reason for the slower mnesia read to exist). if it makes any difference, i don't need and am not using any "distrubuted" or "nodes." in other words, i am and will only be using a single node on a single computer.
mnesia:dirty_read does a rpc call even if the table is local. Also it checks for the current activity context and maintains it even for dirty lookups. This will result in the extra time required for the lookup.
In your case (where there is only one node with local mnesia), direct ets lookup should work but not recommended as it will be implementation dependent. The better would be to use mnesia:ets(Fun,[, Args]).
Related
I was going through the article, https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs which says, "If separate read and write databases are used, they must be kept in sync". One obvious benefit I can understand from having separate read replicas is that they can be scaled horizontally. However, I have some doubts:
It says, "Updating the database and publishing the event must occur in a single transaction". My understanding is that there is no guarantee that the updated data will be available immediately on the read-only nodes because it depends on when the event will be consumed by the read-only nodes. Did I get it correctly?
Data must be first written to read-only nodes before it can be read i.e. write operations are also performed on the read-only nodes. Why are they called read-only nodes? Is it because the write operations are performed on these nodes not directly by the data producer application; but rather by some serverless function (e.g. AWS Lambda or Azure Function) that picks up the event from the topic (e.g. Kafka topic) to which the write-only node has sent the event?
Is the data sharded across the read-only nodes or does every read-only node have the complete set of data?
All of these have "it depends"-like answers...
Yes, usually, although some implementations might choose to (try to) update read models transactionally with the update. With multiple nodes you're quickly forced to learn the CAP theorem, though, and so in many CQRS contexts, eventual consistency is just accepted as a feature, as the gains from tolerating it usually significantly outweigh the losses.
I suspect the bit you quoted anyway refers to transactionally updating the write store with publishing the event. Even this can be difficult to achieve, and is one of the problems event sourcing seeks to solve.
Yes. It's trivially obvious - in this context - that data must be written before it can be read, but your apps as consumers of the data see them as read-only.
Both are valid outcomes. Usually this part is less an application concern and is more delegated to the capabilities of your chosen read-model infrastructure (Mongo, Cosmos, Dynamo, etc).
I'm researching graph databases for a work project. Since our data is highly connected, it appears that a graph database would be a good option for us.
One of the first graph DB options I've run into is neo4j, and for the most part, I like it. However, I have one question about neo4j to which I cannot find the answer: Can I get neo4j to store the entire graph in-memory? If so, how does one configure this?
The application I'm designing needs to be lightning-fast. I can't afford to wait for the db to go to disk to retrieve the data I'm searching for. I need the entire DB to be held in-memory to reduce the query time.
Is there a way to hold the entire neo4j DB in-memory?
Thanks!
Further to Bruno Peres' answer, if you want to run a regular server instance, Neo4j will load the entire graph into memory when resources are sufficient. This does indeed improve performance.
The Manual has a chapter on configuring memory.
The page cache portion holds graph data and indexes - this is configured via the dbms.memory.pagecache.size property in neo4j.conf. If it is large enough, the whole graph will be stored in memory.
The heap space portion is for query execution, state management, etc. This is set via the dbms.memory.heap.initial_size and
dbms.memory.heap.max_size properties. Generally these two properties should be set to the same value, so that the whole heap is allocated on startup.
If the sole purpose of the server is to run Neo4j, you can allocate most of the memory to the heap and page cache, leaving enough left over for operating system tasks.
Holding Very Large Graphs In Memory
At Graph Connect in San Francisco, 2016, Neo4j's CTO, Jim Webber, in his typical entertaining fashion, gave details on servers that have a very large amount of high performance memory - capable of holding an entire large graph in memory. He seemed suitably impressed by them. I forget the name of the machines, but if you're interested, the video archive should have details.
Neo4j isn't designed to hold the entire graph in main memory. This leaves you with a couple of options. You can either play around with the config parameters (as Jasper Blues already explained in more details) OR you can configure Neo4j to use RAMDisk.
The first option probably won't give you the best performance as only the cache is held in memory.
The challenge with the second approach is that everything is in-memory which means that the system isn't durable and the writes are inefficient.
You can take a look at Memgraph (DISCLAIMER: I'm the co-founder and CTO). Memgraph is a high-performance, in-memory transactional graph database and it's openCypher and Bolt compatible. The data is first stored in main memory before being written to disk. In other words, you can choose to make a tradeoff between write speed and safety.
Hi I am learning Erlang.
I read from http://learnyousomeerlang.com/ets
Erlang has something called ETS (Erlang Term Storage) tables. ETS tables are an efficient in-memory database included with the Erlang virtual machine. [...]
My question is: The Erlang term data stored in ETS tables - Where are they stored? Are they temporarily stored in my computer's memory? If I restart my application, will they disappear?
ETS are RAM based and will disapear when the owner process terminates.
DETS are disk-based version of ETS. Being "disk-only" they are slow.
For more advanced usage, you should take a look at Mnesia, the standard Erlang DBMS.
The documentation has some basic comparisons between those three options.
What strategy does Mnesia use to define which nodes will store replicas of particular table?
Can I force Mnesia to use specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual. You're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though may be wrong) that none besides the source code.
In terms of documenation the whole Erlang distribution is hardly the leader
in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (hence their number). A replica may be then:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
It's possible to reconfigure the replication strategy when the system is running, though to do it dynamically (based on a node-down event for example) you would have to come up with the solution yourself.
The Mnesia system events could be used to discover a situation when a node goes down; given you know what tables were stored on that node you could check the number of their online replicas based on the nodes which were still online and then perform a replication if needed.
I'm not aware of any application/library which already manages this kind of stuff and it seems like a quite an advanced (from my point of view, at least) endeavor to make one.
However, Riak is a database which manages data distribution among it's nodes transparently from the user and is configurable with respect to the options you mentioned. That may be the way to go for you.
I try to compare Mnesia with more traditional databases.
As I understand it tables in Mnesia can be located to (see Memory consumption in Mnesia):
ram_copies - tables are stored in ets, so no durability as in ACID.
disc_copies - tables are located to ets and dets, so the table can not be bigger than the available memory? And if the table are fragmented, the database can not be bigger than the available memory?
disc_only_copies - tables are located dets, so no caching in memory and worse performance. And the size of the table are limited to the size of dets or the table has to be fragmented.
So if I want the performance of doing reads from RAM and the durability of writes to disc, then the size of the tables are very limited compared to a traditional RDBMS like MySQL or PostgreSQL.
I know that Mnesia aren't meant to replace traditional RDBMS:s, but can it be used as a big RDBMS or do I have to look for another database?
The server I will use is a VPS with limited amount of memory, around 512MB, but I want good database performance.
Are disc_copies and the other types of tables in Mnesia so limited as I have understood? CanĀ“t the database be partially on memory and a full copy on disc?
The storage capacity of the Mnesia database for the different types of tables has been discussed in this previous SO question:
What is the storage capacity of a Mnesia database?
where a great answer is already available.
Obviously (but I guess you've already seen it) the official doc is available at:
http://www.erlang.org/doc/man/mnesia.html
Also, reading from the Mnesia FAQ:
11.5 How much data can be stored in Mnesia?
Dets uses 32 bit integers for file
offsets, so the largest possible
mnesia table (for now) is 4Gb.
In practice your machine will slow to
a crawl way before you reach this
limit.
Finally, Mnesia tables can be fragmented. This is discussed here and there.
These are my 2p.