Controlled data sharding in MongoDB - ruby-on-rails

I am new to MongoDB and have only a basic understanding of its sharding concepts. I was wondering whether it is possible to control the distribution of data yourself, for example so that a specific subset of the records is stored on one particular shard?
This will be used together with a Rails app.

You can turn off the balancer to stop auto balancing:
sh.setBalancerState(false)
If you know the range of the key you are sharding on, you can also pre-split your data ranges onto the desired servers (see the pre-splitting example in the docs). Management of the shards is done via the JavaScript shell, not from your Rails application.
You should take care that no shard gets a disproportionate share of the load (becomes "hot"); that is why automatic balancing is enabled by default. Monitoring, such as the free MMS service, will help you keep an eye on that.
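For illustration only (the database, collection, and key names are made up), a pre-split from the mongo shell might look like this:

```
// Stop the balancer so it doesn't move the chunks back
sh.setBalancerState(false)

// Split the collection at chosen shard-key boundaries, creating
// chunks [MinKey, 1000), [1000, 2000) and [2000, MaxKey)
sh.splitAt("mydb.mycoll", { user_id: 1000 })
sh.splitAt("mydb.mycoll", { user_id: 2000 })
```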

The decision to shard is a complex one that you should put a lot of thought into.
There's a lot to learn about sharding, and much of it is non-obvious. I'd suggest reviewing the information at the following links:
Sharding Introduction
Sharding Overview
FAQ
In the context of a sharded cluster, a chunk is a contiguous range of shard-key values assigned to a particular shard. By default, chunks are 64 megabytes (unless configured otherwise). When a chunk grows beyond the configured chunk size, a mongos splits it into two chunks. MongoDB chunks are logical: the data within them is NOT physically located together.
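To make the "contiguous range" idea concrete, here is a hypothetical sketch in plain JavaScript (not actual MongoDB code) of how routing metadata maps a shard-key value to the chunk that owns it:

```javascript
// Hypothetical sketch: chunks are contiguous, non-overlapping,
// half-open ranges [min, max), each assigned to one shard.
function findChunk(chunks, key) {
  // At most one range can match, since ranges do not overlap.
  return chunks.find(c => key >= c.min && key < c.max) || null;
}

const chunks = [
  { min: 0,   max: 100, shard: "shard0000" },
  { min: 100, max: 200, shard: "shard0001" },
];

console.log(findChunk(chunks, 42).shard);  // "shard0000"
```

A mongos performs essentially this lookup (against the config servers' metadata) to route each operation to the right shard.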
As I've mentioned, the balancer moves the chunks around; however, you can also do this manually. The balancer decides to re-balance and requests a chunk migration when there is a large enough difference (a minimum of 8) between the number of chunks on each shard. The actual move is coordinated between the "From" and "To" shards; when it finishes, the original chunk is removed from the "From" shard and the config servers are informed.
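A manual migration is issued through the admin database; the namespace and shard name below are illustrative:

```
// Move the chunk that contains user_id 1500 to a named shard
db.adminCommand({
  moveChunk: "mydb.mycoll",
  find: { user_id: 1500 },
  to: "shard0001"
})
```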
Quite a lot of people also pre-split, which helps with their migration. See here for more information.
In order to see documents split between two shards, you'll need to insert enough documents to fill up several chunks on the first shard. If you haven't changed the default chunk size, you'd need to insert a minimum of 512MB of data in order to see data migrated to a second shard. It's often a good idea to test this, which you can do by setting your chunk size to 1MB and inserting 10MB of data. Here is an example of how to test this.
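A small-scale sketch of that test in the mongo shell (the collection name and document shape are made up; `db.settings.save` is the 2.x-era way to change the chunk size):

```
// Lower the chunk size to 1 MB (stored in the config database)
use config
db.settings.save({ _id: "chunksize", value: 1 })

// Insert roughly 10 MB of ~1 KB documents, then see how they spread
use test
for (var i = 0; i < 10000; i++) {
    db.demo.insert({ _id: i, payload: new Array(1024).join("x") })
}
db.demo.getShardDistribution()
```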

Tag Aware Sharding (http://www.mongodb.org/display/DOCS/Tag+Aware+Sharding), available in v2.2, probably addresses your requirement.
Check out Kristina Chodorow's blog post too for a nice example: http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
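As a sketch of tag-aware sharding (the shard, tag, and namespace names are invented for the example), you tag a shard and then pin a shard-key range to that tag; the balancer keeps matching chunks on the tagged shard:

```
// Tag a shard, then bind a half-open shard-key range to that tag
sh.addShardTag("shard0000", "EU")
sh.addTagRange("mydb.users", { country: "DE" }, { country: "DF" }, "EU")
```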

Why do you want to split the data yourself when MongoDB does it automatically for you? You can point your Rails application layer at a mongos instance, so that mongos routes any CRUD operation to the shard where the data resides. This is achieved using the config servers.

Related

How does Flink scale for hot partitions?

If I have a use case where I need to join two streams or aggregate some kind of metrics from a single stream, and I use keyed streams to partition the events, how does Flink handle the operations for hot partitions where the data might not fit into memory and needs to be split across partitions?
Flink doesn't do anything automatic regarding hot partitions.
If you have a consistently hot partition, you can manually split it and pre-aggregate the splits.
If your concern is about avoiding out-of-memory errors due to unexpected load spikes for one partition, you can use a state backend that spills to disk.
If you want more dynamic data routing / partitioning, look at the Stateful Functions API or the Dynamic Data Routing section of this blog post.
If you want auto-scaling, see Autoscaling Apache Flink with Ververica Platform Autopilot.

Queries performances on ADLS gen 2

I'm trying to migrate our "old school" database (mostly time series) to an Azure Data Lake.
So I took a random table (10 years of data, 200M records, 20GB), copied the data into a single CSV file, and also split the same data into 4000 daily files (in monthly folders).
On top of those two sets of files I created two external tables... and I'm getting pretty much the same performance for both of them. (?!?)
No matter what I query, whether I'm looking for data on a single day (thus in a single small file) or summing over the whole dataset, it takes about 3 minutes, regardless of whether I'm pointing at the single file or the 4000 daily files. It's as if the whole dataset had to be loaded into memory before doing anything ?!?
So is there a setting somewhere that I could change to avoid loading all the data when it's not required? It could literally make my queries 1000x faster.
As far as I understand, indexes are not possible on external tables, and creating a materialized view would defeat the purpose of using a lake.
Full disclosure: I'm new to Azure Data Lake Storage, and I'm trying to see if it's the correct technology to address our issue.
Best practice is to use the Parquet format rather than CSV: it is a columnar format optimized for OLAP-style queries.
With Synapse (preview) you can then use the SQL on-demand engine (a serverless technology) when you do not need to provision a DW cluster, and you will be charged per TB of scanned data.
Or you can spin up a Synapse cluster and ingest your data into the DW using the COPY command (also in preview).
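As a sketch, a serverless SQL on-demand query over Parquet files looks roughly like this (the storage account and path are placeholders):

```sql
-- Query Parquet files in the lake directly, with no cluster provisioned;
-- only the columns and row groups needed are scanned
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/lake/mytable/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
```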

Hold entire Neo4j graph database in RAM?

I'm researching graph databases for a work project. Since our data is highly connected, it appears that a graph database would be a good option for us.
One of the first graph DB options I've run into is Neo4j, and for the most part, I like it. However, I have one question about Neo4j to which I cannot find the answer: can I get Neo4j to store the entire graph in memory? If so, how does one configure this?
The application I'm designing needs to be lightning-fast. I can't afford to wait for the db to go to disk to retrieve the data I'm searching for. I need the entire DB to be held in-memory to reduce the query time.
Is there a way to hold the entire neo4j DB in-memory?
Thanks!
Further to Bruno Peres' answer, if you want to run a regular server instance, Neo4j will load the entire graph into memory when resources are sufficient. This does indeed improve performance.
The Manual has a chapter on configuring memory.
The page cache portion holds graph data and indexes - this is configured via the dbms.memory.pagecache.size property in neo4j.conf. If it is large enough, the whole graph will be stored in memory.
The heap space portion is for query execution, state management, etc. This is set via the dbms.memory.heap.initial_size and dbms.memory.heap.max_size properties. Generally these two properties should be set to the same value, so that the whole heap is allocated on startup.
If the sole purpose of the server is to run Neo4j, you can allocate most of the memory to the heap and page cache, leaving enough left over for operating system tasks.
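For example, in neo4j.conf (the sizes are illustrative, for a machine with around 32 GB of RAM dedicated to Neo4j; size the page cache to hold your store files):

```
# neo4j.conf - illustrative sizing; tune to your store size and workload
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=20g
```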
Holding Very Large Graphs In Memory
At Graph Connect in San Francisco, 2016, Neo4j's CTO, Jim Webber, in his typical entertaining fashion, gave details on servers that have a very large amount of high performance memory - capable of holding an entire large graph in memory. He seemed suitably impressed by them. I forget the name of the machines, but if you're interested, the video archive should have details.
Neo4j isn't designed to hold the entire graph in main memory. This leaves you with a couple of options. You can either play around with the config parameters (as Jasper Blues already explained in more details) OR you can configure Neo4j to use RAMDisk.
The first option probably won't give you the best performance as only the cache is held in memory.
The challenge with the second approach is that everything is in-memory which means that the system isn't durable and the writes are inefficient.
You can take a look at Memgraph (DISCLAIMER: I'm the co-founder and CTO). Memgraph is a high-performance, in-memory transactional graph database and it's openCypher and Bolt compatible. The data is first stored in main memory before being written to disk. In other words, you can choose to make a tradeoff between write speed and safety.

Fetch data subset from gmond

This is in the context of a small data-center setup where the number of servers to be monitored is only in the double digits, and may grow only slowly to a few hundred (if at all). I am a Ganglia newbie and have just completed setting up a small Ganglia test bed (and have been reading and playing with it). A couple of things I realise:
gmetad supports interactive queries on port 8652, with which I can get subsets of the metric data, say the data of a particular metric family in a specific cluster
gmond seems to always return the whole dump of data for all metrics from all nodes in a cluster (on doing 'netcat host 8649')
In my setup, I don't want to use gmetad or RRD. I want to fetch data directly from the multiple gmond clusters and store it in a single data store. There are a couple of reasons not to use gmetad and RRD:
I don't want multiple data stores in the whole setup. I can have one dedicated machine fetch data from the multiple (few) clusters and store it.
I don't plan to use gweb as the data front end; the data from Ganglia will be fed into a different monitoring tool altogether. With this setup, I want to eliminate the latency that another layer of gmetad would add. That is, if gmetad polls say every minute and my management tool polls gmetad every minute, that adds up to two minutes of delay, which I feel is unnecessary for a relatively small/medium-sized setup.
There are a couple of problems with this approach for which I need help:
I cannot get filtered data from gmond. Is there some plugin that can help me fetch individual metric/metric-group information from gmond (since different metrics are collected at different intervals)?
The gmond output is very verbose text. Is there some other (hopefully binary) format that I can configure for the export?
Is my idea of eliminating gmetad/RRD completely a very bad idea? Has anyone tried this approach before? What should I be careful of in doing so, from a data-collection standpoint?
Thanks in advance.

What is Mnesia replication strategy?

What strategy does Mnesia use to define which nodes will store replicas of particular table?
Can I force Mnesia to use specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual: you're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though I may be wrong) that there are none besides the source code. In terms of documentation, the whole Erlang distribution is hardly the leader in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (hence their number). A replica may then be:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
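As an illustrative Erlang sketch (the table, attribute, and node names are all invented), replica placement is declared per table and can be changed at runtime:

```erlang
%% Create a table with replicas on chosen nodes: memory-only on one,
%% memory + disk on two others.
mnesia:create_table(user, [
    {attributes, [id, name]},
    {ram_copies,  ['a@host1']},
    {disc_copies, ['b@host2', 'c@host3']}
]).

%% Dynamically add a disk-only replica on a new node:
mnesia:add_table_copy(user, 'd@host4', disc_only_copies).
```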
It's possible to reconfigure the replication strategy while the system is running, though to do it dynamically (based on a node-down event, for example) you would have to come up with the solution yourself.
Mnesia's system events could be used to detect when a node goes down; given that you know which tables were stored on that node, you could check the number of their online replicas on the nodes still up, and then perform a replication if needed.
I'm not aware of any application/library that already manages this kind of thing, and it seems like quite an advanced endeavor to make one (from my point of view, at least).
However, Riak is a database which manages data distribution among its nodes transparently to the user, and it is configurable with respect to the options you mentioned. That may be the way to go for you.
