Note: I have done a little reformatting and added some additional information.
Please take a look at this: Question_Answer
I want to ask - with DSE 5.0 and the upcoming changes that were mentioned at C* Summit this year for 5.1 and 5.2, will the same advice be useful?
Our use case is:
The platform MUST be available at all times. (Cassandra)
The data must be searchable. (SOLR / Lucene)
The platform MUST provide analytics / Data Warehousing / BI etc (Graph / Spark)
All of that is possible in a single product offering thanks to DSE! Thank you DataStax!
But our amount of data stored and our transaction count are VERY modest.
Our specification is for 100 concurrent sessions within the application - which of course doesn't even translate to 100 concurrent DB requests / operations.
For the most part our application resembles an everyday enterprise CRUD application.
While not ridiculous, AWS instances aren't exactly free.
Having a separate cluster for each workload (with enough replication for continuous availability), will be a cost issue for us.
While I understand, a proof of concept can offer some help - but without a real workload / real users - passing through the services / applications - in ways that only a "production" system and rogue users : can really provide an insight for. The best you can do is "loaded" functional testing.
In short, we're a little stuck here from a platform perspective.
We're, initially, thinking of having:
2 data centres for geographic isolation
2 racks per DC
2 nodes per Rack
RF of 3
CL of local_quorum
If we find we're hitting performance issues, we can scale out - add an extra rack or extra nodes to the initial 2 racks.
As for V-nodes or number of tokens, we have no idea.
The documentation for DSE Search says V-nodes adds 30% overhead, so it sounds like you shouldn't use V-nodes, but then in a table in the documentation it also says to use 16 or 32. How can it be both?
If we can successfully run all workloads on a single node (our requirements are genuinely minimal), do we run with V-nodes (16 or 32) or do we run a single token?
Lastly, is there another alternative?
Can you have Nodes with different workloads in the same data centre? Where individual nodes are set up with RAM / CPU requirements for a specific workload?
Assuming our 4 node per data centre (as a starting place only - we have no idea whether or not you can successfully run Search on a single node / or Spark on a single node)
Node 1: Just Cassandra
Node 2 : Cassandra and Search
Node 3 : Cassandra and Graph
Node 4 : Cassandra and Spark
If Search needs 64GB RAM - so be it... but the Cassandra only node could well work with just 8 or 16.
So we can cater, in terms of CPU and memory per workload type - but still only have a single DC. (We'll have 2 for redundancy - but effectively it is a single DC installation : mirrored)
Thanks in advance for your help.
Vnodes adds an additional overhead for the scatter-gather part of the search solution. In some benchmarks that's been as high as 30%. Some customers are willing to live with that overhead and want to use vnodes due to the benefits of dynamic scaling.
If you have or are planning a small cluster - and won't need to scale it on the fly - then I would definitely recommend sticking with single tokens. The hidden benefit of that approach, is that your repairs will be slightly faster also. This helps with Search as you are reading at the equivalent of CL.ONE.
It is possible to run all the features on the same DC (Search, Analytics and now Graph) but you will find that the overheads go up. You will need larger nodes with more memory and cpu resources to cope with the processing load. I'd probably start with 128 Gb of ram and go from there. I guess if your load is really light you might get away with less. As with everything benchmarking at the scale you're intending to run is key.
As an aside I'm not totally clear on your intentions re RF. You kind of imply 2 nodes and RF=3. I'm guessing it's just phrasing, but if not - it's worth noting you want at least as many nodes as the RF for best coverage!
I want to run as many instances of Neo4J (using the Enterprise version) on a single VM as possible. What are the real minimum RAM requirements to fire up an instance?
Right now TaskManager is telling me that Java.exe is taking about 70,000K (70 Meg). Does that sound right?
I'm not worried about the performance, I just want to stuff as many instances as possible on a single box so people can do some low demand search of their graph.
One thing is what is recommended and second "it depends".
Neo4j is able to run on the Raspberry Pi. But you shouldn't expect great performance. Also I'm using AWS t2.micro for testing and it's enough.
The size is bound to fluctuate as the graph is loaded into memory to perform traversals and when it is paged back to the disk (When memory is running out).
If I may offer up a suggestion, you could run only one database instance and have unconnected graphs for each of your users. This would very likely be far more efficient in terms of server resources.
For example, If you have say (:Item) nodes which make up a graph for each user,
you could have them instead label them as (:Item-User1) with a unique prefix or postfix for each user.
Thus when you want to alter the query to run for each user you could just add that unique element and search the graph.
The Idea is to have a separate sub-graph for each user which is unconnected to other user's sub-graphs. Instead of having a separate database instance for each Individual user. As long as each user's sub-graph is unconnected from other user's sub-graphs there should be no security vulnerability where a user is given access to another user's data.
This way you could potentially have infinite number of users (within reason. quite possibly in the millions of users) each with their own sub graphs, with potentially no loss in performance, instead of the handful of database instances you could spin up on a single VM, which are likely to be competing for resources and choking out.
For neo4j 3.x, the documented minimum memory requirement is 2GB.
This is in the context of a small data-center setup where the number of servers to be monitored are only in double-digits and may grow only slowly to few hundreds (if at all). I am a ganglia newbie and have just completed setting up a small ganglia test bed (and have been reading and playing with it). The couple of things I realise -
gmetad supports interactive queries on port 8652 using which I can get metric data subsets - say data of particular metric family in a specific cluster
gmond seems to always return the whole dump of data for all metrics from all nodes in a cluster (on doing 'netcat host 8649')
In my setup, I dont want to use gmetad or RRD. I want to directly fetch data from the multiple gmond clusters and store it in a single data-store. There are couple of reasons to not use gmetad and RRD -
I dont want multiple data-stores in the whole setup. I can have one dedicated machine to fetch data from the multiple, few clusters and store them
I dont plan to use gweb as the data front end. The data from ganglia will be fed into a different monitoring tool altogether. With this setup, I want to eliminate the latency that another layer of gmetad could add. That is, gmetad polls say every minute and my management tool polls gmetad every minute will add 2 minutes delay which I feel is unnecessary for a relatively small/medium sized setup
There are couple of problems in the approach for which I need help -
I cannot get filtered data from gmond. Is there some plugin that can help me fetch individual metric/metric-group information from gmond (since different metrics are collected in different intervals)
gmond output is very verbose text. Is there some other (hopefully binary) format that I can configure for export?
Is my idea of eliminating gmetad/RRD completely a very bad idea? Has anyone tried this approach before? What should I be careful of, in doing so from a data collection standpoint.
Thanks in advance.
I'm trying to implement a cluster using Erlang as the glue that holds it all together. I like the idea that it creates a fully connected graph of nodes, but upon reading different articles online, it seems as though this doesn't scale well (having a max of 50 - 100 nodes). Did the developers of OTP impose this limitation on purpose? I do know that you can setup nodes to have explicit connections only as well as have hidden nodes, etc. But, it seems as though the default out-of-the-box setup isn't very scalable.
So to the questions:
If you had 5 nodes (A, B, C, D, E) that all had explicit connections such that A-B-C-D-E. Does Erlang/OTP allow A to talk directly to E or does A have to pass messages from B through D to get to E, and thus that's the reason for the fully connected graph? Again, it makes sense but it doesn't scale well from what I've seen.
If one was to try and go for a scalable and fault-tolerant system, what are your options? It seems as though, if you can't create a fully connected graph because you have too many nodes, the next best thing would be to create a tree of some kind. But, this doesn't seem very fault-tolerant because if the root or any parent of children nodes dies, you would lose a significant portion of your cluster.
In looking into supervisors and workers, all of the examples I've seen apply this to processes on a single node. Could it be applied to a cluster of nodes to help implement fault-tolerance?
Can nodes be part of several clusters?
Thanks for your help, if there is a semi-recent website or blogpost (roughly 1-year old) that I've missed, I'd be happy to look at those. But, I've scoured the internet pretty well.
Yes, you can send messages to a process on any remote node in a cluster, for example, by using its process identifier (pid). This is called location transparency. And yes, it scales well (see Riak, CouchDB, RabbitMQ, etc).
Note that one node can run hundred thousands of processes. Erlang has proven to be very scalable and was built for fault tolerance. There are other approaches to build bigger, e.g. SOA approach of CloudI (see comments). You also could build clusters that use hidden nodes if you really really need to.
At the node level you would take a different approach, for example, build identical nodes that are easy to replace if they fail and the work is taken over by the remaining nodes. Check out how Riak handles this (look into riak_core and check the blog post Introducing Riak Core).
Nodes can leave and enter a cluster but cannot be part of multiple clusters at the same time. Connected nodes share one cluster cookie which is used to identify connected nodes. You can set the cookie while the VM is running (see Distributed Erlang).
Read http://learnyousomeerlang.com/ for greater good.
The distribution protocol is about providing robustness, not scalability. What you want to do is to group your cluster into smaller areas and then use connections, which are not distribution in Erlang but in, say, TCP sessions. You could run 5 groups of 10 machines each. This means the 10 machines have seamless Pid distribution: you can call a pid on another machine. But distributing to another group means you can't seamlessly address the group like that.
You generally want some kind of "route reflection" as in BGP.
1) I think you need a direct connection between nodes to communicate between processes. This does, however, mean that you don't need persistent connections between all the nodes if two will never communicate (say if they're only workers, not coordinators).
2) You can create a not-fully-connected graph of erlang nodes. The documentation is hard to find, and comes with problems - you disable the global system which handles global names in the cluster, so you have to do everything by locally registered names, or locally registered names on remote nodes. Or just use Pids, as they work too. To start an erlang node like this, use erl ... -connect_all false .... I hope you know what you're up to, as I couldn't trust myself to do that.
It also turns out that a not-fully-connected graph of erlang nodes is a current research topic. The RELEASE Project is currently working on exactly that, and have come up with a concept of S-groups, which are essentially fully-connected groups. However, nodes can be members of more than one S-group and nodes in separate s-groups don't have to be fully connected but can establish the connections they need on demand to do direct node-to-node communication. It's worth finding presentations of theirs because the research is really interesting.
Another thing worth pointing out is that several people have found that you can get up to 150-200 nodes in a fully-connected cluster. Do you really have a use-case for more nodes than that? Surely 150-200 incredibly beefy computers would do most things you could throw at them, unless you have a ridiculous project to do.
3) While you can't start processes on a different node using gen_server:start_link/3,4, you can certainly call servers on a foreign node very easily. It seems that they've overlooked being able to start servers on foreign nodes, but there's probably good reason for it - such as a ridiculous number of error cases.
4) Try looking at hidden nodes, and at having a not-fully-connected cluster. They should allow you to group nodes as you see fit.
TL;DR: Scaling is hard, let's go shopping.
There are some good answers already, so I'm trying to be simple.
1) No, if A and E are not connected directly, A cannot talk to E. The distribution protocol runs on direct TCP connection - no routing included.
2) I think a tree structure is good enough - trade-offs always exist.
3) There's no 'supervisor for nodes', but erlang:monitor_node is your friend.
4) Yes. A node can talk to nodes from different 'clusters'. In the local node, use erlang:set_cookie(OtherNode, OtherCookie) to access a remote node with a different cookie.
1)
yes. they talk to each other
2) 3) and 4)
Generally speaking, when building a scalable and fault tolerant system, you would want, or more over, need to divide the work load to different "regions" or "clusters". Supervisor/Worker model has this envisioned thus the topology. What you need is a few processes coordinating work between clusters and all workers within one single cluster will talk to each other to balance out within group.
As you can see, with this topology, the "limitation" is not really a limitation as long as you divide your tasks carefully and in a balanced fashion. Personally, I believe a tree like structure for supervisor processes is not avoidable in large scale systems, and this is the practice I'm following. Reasons are vary but boils down to scalability, fault tolerance as fall back policy implementation, maintenance need and portability of the clusters.
So in conclusion,
2) use a tree-like topology for your supervisors. let workers explicitly connect to each other and talk within their own domain with the supervisors.
3) while this is the native designed environment, as I presume, I'm pretty sure a supervisor can talk to a worker on a different machine. I would not suggest this as fault tolerance can be hell in remote worker scenario.
4) you should never let a node be part of two different cluster at the same moment. You can switch it from one cluster to another though.
We're trying to scale up HBase writes on a cluster using Thrift. (Our HBase application is in Python, and hence needs Thrift.)
Despite increasing the number of nodes in the cluster, we are seeing the same write speeds.
First off, is the recommended strategy to run Thrift on:
1. The client?
2. The HBase master?
3. HBase region servers?
If on #1 or #2, will the client or HBase master take care of splitting the requests to the various region servers? It doesn't appear to in our case.
If #3, then I have to modify the client to write to the specific region servers, and randomize the writes. I can do this, but it seems to defeat the purpose of using HBase.
Any other tips on read/write scaling (especially with Thrift) are greatly appreciated.
In HBase to gain performance with node increase you should have a decent "rowkey" distribution. As long as you have "hot spots" (a very busy region server) in your cluster you would not gain anything from increasing your cluster size. checkout the article on row key design to start with.
If you don't need to read right away (if you are comfortable with async writes) you can check asynch hbase client from stumbleupon for performance gain.
I found the answer at these two questions, it looks like we'll go with #3 (write to the specific region servers, and randomize the writes):
Is it better to send data to hbase via one stream or via several servers concurrently?
HBase Thrift: how to connect to remote HBase master/cluster?