What are the variables that determine the number of nodes required in Hyperledger Indy Setup?

As part of one of my project implementations, I have to use Hyperledger Indy to provide digital identity to users. The estimated number of users on the platform is 20k. I am stuck on determining how many nodes are required to run Hyperledger Indy efficiently in production.

The number of nodes depends on the number of faulty nodes you want to permit. To tolerate f faulty nodes in the system, you need 3f+1 total nodes. So if you are OK with at most 2 nodes failing at any one time, you need f=2, or 7 total nodes. The practical limit has been found to be f=8, or 25 total nodes. That balances a robust network with the speed of consensus.
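As a quick illustration, the 3f+1 rule can be checked directly (a minimal Python sketch; the function name is just illustrative):

    def nodes_required(f):
        """Minimum number of validator nodes needed to tolerate f faulty
        nodes under BFT-style consensus: 3f + 1."""
        return 3 * f + 1

    for f in (1, 2, 8):
        print(f, "faults ->", nodes_required(f), "nodes")
    # 1 faults -> 4 nodes
    # 2 faults -> 7 nodes
    # 8 faults -> 25 nodes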
How that relates to the number of users on the platform is a different question. It depends on (a) how many issuers, and (b) how much revocation is happening.
If there is only one issuer and no revocation, there would only be 3 extra transactions beyond the genesis transactions. Not many...

Related

scale capability of volttron

I am trying VOLTTRON for a project solution and want to know its long-term capability. The project is to control/monitor ~100k devices, and possibly millions if things go well.
What is the biggest scale of VOLTTRON usage in a real scenario? How many devices can one node accommodate if, say, the host machine has high specs?
What will constrain VOLTTRON later on (constraints in terms of database / server resources / network)?
I am not hoping for an exact value; I just want to get a sense of the capability range.
Thanks,
There are several drivers for how well VOLTTRON scales for a single VOLTTRON instance.
In no particular order:
Network and device communication speed. (Are your devices on a serial connection? BACnet devices behind an MSTP router?)
Frequency of data collection. (10 seconds?, 1 minute? 5 minutes? 15 minutes?)
How close together (time-wise) does data from different devices need to be?
Frequency of commands issued/ number of commands issued.
Machine specs
Often we see the bottleneck being the network for device communication. This will drive the rate at which you can communicate with devices. For collection, a mid-level PC is overkill in most situations.
In the field our users have been able to scrape 1.5K+ BACnet devices in less than 15 minutes with a single node. Many of these devices were on an MSTP trunk, which would be the major limiting factor. If these were TCP BACnet devices the rate of data acquisition would be much higher.
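As a rough back-of-envelope check, you can estimate how many devices fit in a scrape interval from the average per-device poll time. The poll times below are illustrative assumptions rather than measurements, and this is plain arithmetic, not VOLTTRON configuration:

    def devices_per_interval(interval_s, avg_poll_ms):
        """Devices a single collector can poll per interval if polls run sequentially."""
        return (interval_s * 1000) // avg_poll_ms

    # ~500 ms per device on a slow MSTP trunk, 15-minute interval
    print(devices_per_interval(15 * 60, 500))  # -> 1800
    # ~50 ms per device over TCP BACnet raises that considerably
    print(devices_per_interval(15 * 60, 50))   # -> 18000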
There are parameters to tune the rate of data collection for a specific node. It is common to tweak these values to find the optimal rate of collection after initial platform configuration.
The kind of scaling you are looking for will require using multiple VOLTTRON instances. It is common to have multiple collection boxes for an installation. Usually these instances will gather data for some number of devices (based on your scenario) and either send those values directly to a database or forward them to another central instance of the platform that will submit the data on the remote nodes' behalf. Numbers for some real deployments can be found here: https://volttron.org/sites/default/files/publications/VOLTTRON%20Scalability-update-final.pdf
There are several database options from MySQL to Mongo to SQLite. You will want to pick a central database based on your data collection needs (so not SQLite).

SOLR and VNodes and Tokens

Note: I have done a little reformatting and added some additional information.
Please take a look at this: Question_Answer
I want to ask - with DSE 5.0 and the upcoming changes that were mentioned at C* Summit this year for 5.1 and 5.2, will the same advice be useful?
Our use case is:
The platform MUST be available at all times. (Cassandra)
The data must be searchable. (SOLR / Lucene)
The platform MUST provide analytics / Data Warehousing / BI etc (Graph / Spark)
All of that is possible in a single product offering thanks to DSE! Thank you DataStax!
But our amount of data stored and our transaction count are VERY modest.
Our specification is for 100 concurrent sessions within the application - which of course doesn't even translate to 100 concurrent DB requests / operations.
For the most part our application resembles an everyday enterprise CRUD application.
While not ridiculous, AWS instances aren't exactly free.
Having a separate cluster for each workload (with enough replication for continuous availability), will be a cost issue for us.
I understand that a proof of concept can offer some help, but without a real workload and real users passing through the services/applications, it can't give the kind of insight that only a "production" system and rogue users really provide. The best you can do is "loaded" functional testing.
In short, we're a little stuck here from a platform perspective.
We're, initially, thinking of having:
2 data centres for geographic isolation
2 racks per DC
2 nodes per Rack
RF of 3
CL of local_quorum
If we find we're hitting performance issues, we can scale out - add an extra rack or extra nodes to the initial 2 racks.
As for V-nodes or number of tokens, we have no idea.
The documentation for DSE Search says V-nodes add 30% overhead, so it sounds like you shouldn't use V-nodes, but then a table in the same documentation says to use 16 or 32. How can it be both?
If we can successfully run all workloads on a single node (our requirements are genuinely minimal), do we run with V-nodes (16 or 32) or do we run a single token?
Lastly, is there another alternative?
Can you have Nodes with different workloads in the same data centre? Where individual nodes are set up with RAM / CPU requirements for a specific workload?
Assuming our 4 nodes per data centre (as a starting place only; we have no idea whether you can successfully run Search or Spark on a single node):
Node 1: Just Cassandra
Node 2: Cassandra and Search
Node 3: Cassandra and Graph
Node 4: Cassandra and Spark
If Search needs 64 GB RAM, so be it... but the Cassandra-only node could well work with just 8 or 16.
So we can cater for CPU and memory per workload type, but still only have a single DC. (We'll have 2 for redundancy, but effectively it is a single-DC installation: mirrored.)
Thanks in advance for your help.
Vnodes add additional overhead for the scatter-gather part of the search solution. In some benchmarks that has been as high as 30%. Some customers are willing to live with that overhead and want to use vnodes for the benefits of dynamic scaling.
If you have or are planning a small cluster, and won't need to scale it on the fly, then I would definitely recommend sticking with single tokens. The hidden benefit of that approach is that your repairs will also be slightly faster. This helps with Search, as you are reading at the equivalent of CL.ONE.
It is possible to run all the features in the same DC (Search, Analytics and now Graph) but you will find that the overheads go up. You will need larger nodes with more memory and CPU resources to cope with the processing load. I'd probably start with 128 GB of RAM and go from there. I guess if your load is really light you might get away with less. As with everything, benchmarking at the scale you're intending to run is key.
As an aside, I'm not totally clear on your intentions regarding RF. You kind of imply 2 nodes and RF=3. I'm guessing it's just phrasing, but if not, it's worth noting you want at least as many nodes as the RF for best coverage!
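For reference, LOCAL_QUORUM is a simple function of the replication factor, so it is easy to check how the proposed layout behaves when a node is lost. This is plain arithmetic, not DataStax tooling, and the node counts are the ones proposed above:

    def local_quorum(rf):
        """Replicas that must respond for LOCAL_QUORUM: floor(RF/2) + 1."""
        return rf // 2 + 1

    rf = 3
    nodes_per_dc = 4  # 2 racks x 2 nodes, as proposed above
    assert nodes_per_dc >= rf       # need at least as many nodes as the RF
    print(local_quorum(rf))         # -> 2 replicas must answer
    print(rf - local_quorum(rf))    # -> 1 replica can be down and still meet quorum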

distributed storage: why is the redundant copy count 3 by default instead of 2?

In distributed storage, to avoid data disasters, we need multiple copies of data.
However, why is the total copy count 3 by default instead of 2?
Two copies would cut the storage requirement by a third (put another way, three copies costs 50% more than two).
What's the main reason of choosing 3 copies?
When using two copies of data and they differ, which version do you choose? The third acts as a tie-breaker.
As to why they would differ: if one computer were down for a bit, or even if the two simply couldn't talk to each other, their data would differ unless the system stopped accepting writes. With three computers, though, if one is down or separated from the others, the other two can still accept data without fear of the scenario in the first paragraph. (Unless you have correlated failures, which you should still plan for.)
Update. Generally you'll find that distributed algorithms use a quorum-based system for ensuring writes. In most of them it's a simple majority, meaning that at least floor(n/2)+1 of the nodes must have the value before it is durably written. After that, you are guaranteed that nothing can un-write the value, because you cannot get another floor(n/2)+1 nodes to oust the decision. In a two-node system floor(n/2)+1 = 2, so if one of the nodes goes down, you cannot accept a write anymore. But in a three-node system floor(n/2)+1 is still 2, so one node can go down and the system can still accept writes.
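A tiny sketch of that majority rule, comparing two-, three-, and five-node systems (illustrative only):

    def majority(n):
        """Smallest write quorum that overlaps any other majority: floor(n/2) + 1."""
        return n // 2 + 1

    for n in (2, 3, 5):
        q = majority(n)
        print(f"n={n}: quorum={q}, nodes that can be down={n - q}")
    # n=2: quorum=2, nodes that can be down=0
    # n=3: quorum=2, nodes that can be down=1
    # n=5: quorum=3, nodes that can be down=2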
Really it's a question of durability vs cost vs latency. The more nodes you throw at your system, the less likely you are to lose data. One node is fairly ephemeral; two nodes slightly less ephemeral. Three nodes is pretty good, and many systems stop there. But systems that need higher durability will require 5, 7, or 9 nodes.
I work on one of the most reliable systems on the internet, and we use 5 nodes in the quorum with up to 16 more nodes as hot backups. For us the cost is small compared to the required durability; we chose 5 nodes in the quorum for latency's sake, with the backups giving a little boost in durability and taking some read pressure off the quorum.
Because the cost increase is not that significant compared to the improvement in redundancy it buys.
Adding to Michael's answer in this question, three is chosen because it provides a very simple level of fault tolerance. This is called 't fault-tolerance', with t being 1: at most 1 of those data copies can go stale/corrupt/wrong without bringing down the system.
t is usually chosen beforehand as an SLA for the system in question, or via empirical evidence. Given a value of t, one needs 2t+1 copies to tolerate t stale or crashed copies (tolerating t Byzantine faults, where a copy can return arbitrarily wrong data, would require 3t+1).

What is the recommended hardware for the following neo4j setup?

I need to build and analyze a complex network using neo4j and would like to know what is the recommended hardware for the following setup:
There are three types of nodes.
There are three types of relationships.
At the steady state, the network will contain about 1M nodes of each type and about the same number of edges
Every day, about 500K relationships are updated and 100K nodes and edges are added. Approximately the same number of nodes/edges are also removed.
Network update will be done in daily batches and we can tolerate update times of 1-2 hours
Once the system is up, we will query the database for shortest paths between different nodes, not more than 500K times per day. We can live with batch queries.
Most probably, I'll use REST API
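To give a concrete flavour of that query workload, a shortest-path lookup over the transactional HTTP endpoint could look like the sketch below. The label and property names, credentials and endpoint path are illustrative assumptions and depend on the Neo4j version and data model:

    import requests

    # Hypothetical label/property names; this endpoint is the Neo4j 2.x/3.x
    # transactional Cypher endpoint and differs in later versions.
    URL = "http://localhost:7474/db/data/transaction/commit"
    QUERY = """
    MATCH (a:Entity {uid: $src}), (b:Entity {uid: $dst}),
          p = shortestPath((a)-[*..15]-(b))
    RETURN [n IN nodes(p) | n.uid] AS path
    """

    payload = {"statements": [{"statement": QUERY,
                               "parameters": {"src": "A-123", "dst": "B-456"}}]}
    resp = requests.post(URL, json=payload, auth=("neo4j", "password"))
    resp.raise_for_status()
    print(resp.json()["results"][0]["data"])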
I think you should take a look at Neo4j Hardware requirements.
For the server you're talking about, I think the first thing you'll need is plenty of bandwidth; if your requests have to complete quickly, it will be necessary.
Apart from that, a "normal" server should be enough:
8 or more cores
At least 24 GB RAM
At least 1 TB SSD storage (this one is important and expensive)
A good bandwidth (like 1Gbps)
By the way, it's not a programming question, so I think you should have asked Neo4j directly.
You can use the Neo4j hardware sizing calculator for a rough estimate of the hardware needs.

How scalable is distributed Erlang?

Part A:
Erlang has a lot of success stories about running concurrent agents e.g. the millions of simultaneous Facebook chats. That's millions of agents, but of course it's not millions of CPUs across a network. I'm having trouble finding metrics on how well Erlang scales when scaling is "horizontal" across a LAN/WAN.
Let's assume that I have many (tens of thousands) physical nodes (running Erlang on Linux) that need to communicate and synchronize small infrequent amounts of data across the LAN/WAN. At what point will I have communications bottlenecks, not between agents, but between physical nodes? (Or will this just work, assuming a stable network?)
Part B:
I understand (as an Erlang newbie, meaning I could be totally wrong) that Erlang nodes attempt to all connect to and be aware of each other, resulting in an N^2 point-to-point connection network. Assuming that part A won't just work with N in the tens of thousands, can Erlang be configured easily (using out-of-the-box config or trivial boilerplate, not writing a full implementation of grouping/routing algorithms myself) to cluster nodes into manageable groups and route system-wide messages through the cluster/group hierarchy?
We should specify that we are talking about the horizontal scalability of physical machines; that is the only problem here. The CPUs on one machine will be handled by one VM, no matter how many of them there are.
node = machine.
To begin, I can say that you get 30-60 nodes out of the box (a vanilla OTP installation) with any custom application written in Erlang on top of it. Proof: ejabberd.
~100-150 is possible with an optimized custom application. That means it has to be good code, written with knowledge of GC, the characteristics of the data types, message passing, etc.
Over 150 is still all right, but when we talk about numbers like 300 or 500 it will require optimizations and customizations of the TCP layer. Also, your app has to be aware of the cost of, for example, synchronous calls across the cluster.
The other thing is the DB layer. Mnesia (built-in), due to its features, will not be effective over 20 nodes (my experience; I may be wrong). Solution: just use something else: Dynamo-style DBs, a separate cluster of MySQLs, HBase, etc.
The most common technique for balancing the cost of building a high-quality application against scalability is a federation of clusters of ~20-50 nodes. So internally it is an efficient mesh of ~50 Erlang nodes, connected via any suitable protocol to N other 50-node clusters. To sum up, such a system is a federation of N Erlang clusters.
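To see why one huge full mesh does not scale while a federation of ~50-node clusters does, compare the link counts (plain arithmetic, not Erlang configuration):

    def full_mesh_links(n):
        """Point-to-point connections in a fully connected cluster of n nodes."""
        return n * (n - 1) // 2

    print(full_mesh_links(50))      # -> 1225 links inside one 50-node cluster
    print(full_mesh_links(10_000))  # -> 49995000 links if 10k nodes formed one mesh
    # 200 federated clusters of 50 nodes keep each mesh at 1225 links
    # (245000 in total) plus a comparatively small number of inter-cluster links.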
Distributed Erlang is designed to run in one data center. If you need more, geographically distant nodes, then use federations.
There are lots of config options, for example ones that do not connect all nodes to each other. That may be helpful; however, in a ~50-node cluster the Erlang overhead is not significant. You can also create a graph of Erlang nodes using 'hidden' connections, which do not join the full mesh, but then they also cannot benefit from being connected to all nodes.
The biggest problem I see in this kind of system is designing it as a master-less system. If you do not need that, everything should be OK.
