How can a checksum be different on different hosts running the same code? - checksum

Is there a way that a checksum calculated using MessageDigest (Java) could be different on 2 hosts? What factors could lead to this?
I checked the hardware and it's the same on both hosts.
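A hedged illustration (not part of the original question): MessageDigest itself is deterministic for identical input bytes, so differences between hosts usually come from the bytes being fed in, for example String.getBytes() without an explicit charset uses the platform-default encoding, and file reads can differ in line endings. A minimal sketch that pins the charset so the digest matches across hosts (the input string is hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        String input = "example payload"; // hypothetical input
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        // Pinning UTF-8 avoids host-specific platform-default encodings.
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // identical on every host for identical bytes
    }
}
```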

Related

How to improve random number generation in kubernetes cluster containers?

I'm seeing some issues with random number generation inside containers running in a kubernetes cluster (repeated values). It might be the lack of entropy inside the container, or it could be something else at a higher level, but I'd like to investigate the entropy angle, and I have a few questions I'm having trouble finding answers to.
The value of /proc/sys/kernel/random/entropy_avail is between 950 and 1050 across containers and nodes - is that good enough? rngtest -c 10000 </dev/urandom returns pretty good results (FIPS 140-2 successes: 9987, FIPS 140-2 failures: 13), but run against /dev/random it just hangs forever.
The entropy_avail values in containers seem to follow the values on the nodes. If I execute cat /dev/random >/dev/null on the node, entropy_avail also drops inside the containers running on that node, even though docker inspect doesn't indicate that the /dev/*random devices are bind-mounted from the node. So how do they relate? Can one container consume the entropy available to other containers on that node?
If entropy_avail around 1000 is something to be concerned about, what's the best way of increasing that value? It seems deploying a haveged daemonset would be one way (https://github.com/kubernetes/kubernetes/issues/60751). Is that the best/simplest way to go about it?
I'm having trouble finding the answers on Google, Stack Overflow, and in Kubernetes GitHub issues. I also got no response in the kubernetes-users Slack channel, so I'm hoping someone here can shed some light on this.
Proper pseudo-random number generation underpins all cryptographic operations, so any Kubernetes user should be interested in the answers.
Thanks in advance.
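As an aside, a rough sketch of my own (not from the thread) for monitoring this from inside a container: it reads the same kernel-wide entropy_avail counter the question refers to, which containers share with the host kernel, and draws bytes from the JVM's default SecureRandom, which on Linux is typically backed by /dev/urandom and does not block.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.SecureRandom;

public class EntropyCheck {
    public static void main(String[] args) throws Exception {
        // Containers share the host kernel's entropy pool, so this value tracks the node.
        String avail = Files.readAllLines(
                Paths.get("/proc/sys/kernel/random/entropy_avail")).get(0).trim();
        System.out.println("entropy_avail = " + avail);

        // The default SecureRandom on Linux is typically urandom-backed and non-blocking,
        // unlike direct reads from /dev/random.
        byte[] buf = new byte[32];
        new SecureRandom().nextBytes(buf);
        System.out.println("Drew " + buf.length + " random bytes without blocking");
    }
}
```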

Why would one choose many smaller machine types instead of fewer big machine types?

In a clustered high-performance computing framework such as Google Cloud Dataflow (or, for that matter, Apache Spark, Kubernetes clusters, etc.), I would think it's far more performant to have fewer really BIG machine types rather than many small machine types, right? As in, it's more performant to have 10 n1-highcpu-96 machines rather than, say, 120 n1-highcpu-8 machines, because:
the CPUs can use shared memory, which is far faster than network communication
if a single thread needs access to lots of memory for a single-threaded operation (e.g. sort), it has access to that greater memory on a BIG machine rather than on a smaller one
And since the price is the same (e.g. 10 n1-highcpu-96 machines cost the same as 120 n1-highcpu-8 machines), why would anyone opt for the smaller machine types?
As well, I have a hunch that with the n1-highcpu-96 machine type we'd occupy the whole host, so we don't need to worry about competing demands on the host from another VM belonging to another Google Cloud customer (e.g. contention in the CPU caches or motherboard bandwidth, etc.), right?
Finally, although I don't think Google Compute Engine VMs report the "true" CPU topology of the host system, if we do choose the n1-highcpu-96 machine type the reported CPU topology may be a touch closer to the truth, because the VM presumably occupies the whole host, so any programs running on that VM that try to take advantage of the topology (e.g. the NUMA-aware option in Java?) have a better chance of making the right decisions.
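For what it's worth, a small sketch (my own illustration, not part of the question or the answer below) of what a JVM can observe about the machine it runs on; on a VM that fills the whole host these numbers are more likely to line up with the physical topology, and HotSpot's NUMA awareness is opt-in via the -XX:+UseNUMA flag.

```java
public class TopologyProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // On an n1-highcpu-96 this should report 96 vCPUs; whether they map cleanly
        // onto physical sockets/NUMA nodes depends on what the hypervisor exposes.
        System.out.println("Available processors: " + rt.availableProcessors());
        System.out.println("Max heap (bytes):     " + rt.maxMemory());
        // Run with -XX:+UseNUMA to let the JVM allocate the heap with NUMA awareness.
    }
}
```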
Whether to choose many instances with smaller machine types or a few instances with big machine types will depend on many factors.
The VM sizes differ not only in the number of cores and RAM, but also in network I/O performance.
Instances with small machine types are limited in CPU and I/O power and are inadequate for heavy workloads.
Also, if you are planning to grow and scale, it is better to design and develop your application across several instances. Having small VMs gives you a better chance of having them distributed across physical servers in the datacenter that have the best resource situation at the time the machines are provisioned.
Having a larger number of small instances also helps to isolate fault domains. If one of your small nodes crashes, that only affects a small number of processes. If a large node crashes, multiple processes go down.
It also depends on the application you are running on your cluster and the workload. I would also recommend going through this link to see the sizing recommendation for an instance.

What are hardware requirements to run Hyperledger Fabric peer?

What are minimum hardware requirements to run a Hyperledger Fabric v1 peer?
It can run on a Raspberry Pi, so technically it does not need much if you aren't planning on doing much with it. However, to achieve the performance results you might expect, you'll need to strike the right balance of network, disk, and CPU speed. Additionally, as the peer is essentially managing a database, you'll need to take into consideration the data storage needs over time.
You'll also need to consider factors such as the number of chaincode smart contracts, the number of expected channels, and the size of the network. In other words, the hardware requirements will really depend on many factors beyond simply what the peer (or orderer) process requires to minimally function.
If you are merely interested in running a development/test cluster of 4 peer nodes, an orderer, and a CA, keep in mind that this can all be easily handled on a MacBook Pro with 16 GB of memory, and with slightly less ease at 8 GB. You can use that as a yardstick for cloud instances to run a development/test cluster.
Finally, there's a LOT of crypto processing, so you will want to consider hardware crypto acceleration to yield the optimal performance.

Bosun HA and scalability

I have a small bosun setup; it's collecting metrics from numerous services, and we are planning to scale these services in the cloud.
This will mean more data coming into bosun, which affects its load, efficiency, and ability to scale.
I am afraid of losing data due to network overhead or in case of failures.
I am looking for any performance benchmark reports for bosun, or any inputs on benchmarking/testing bosun for scale and HA.
Also, any inputs on good practices to be followed to scale bosun will be helpful.
My current thinking is to run numerous bosun binaries as a cluster, backed by a distributed opentsdb setup.
I am also wondering whether it is worthwhile to run some bosun instances as plain 'collectors' of scollector data (with the bosun -n command), and others just to evaluate the alerts.
The problem with this approach is that the same alerts might be triggered from multiple bosun instances (those running without the -n option). Is there a better way to de-duplicate the alerts?
The current best practices are:
Use https://godoc.org/bosun.org/cmd/tsdbrelay to forward metrics to opentsdb. This gets the bosun binary out of the "critical path". It should also forward the metrics to bosun for indexing, and can duplicate the metric stream to multiple data centers for DR/Backups.
Make sure your hadoop/opentsdb cluster has at least 5 nodes. You can't do live maintenance on a 3 node cluster, and hadoop usually runs on a dozen or more nodes. We use Cloudera Manager to manage the hadoop cluster, and others have recommended Apache Ambari.
Use a load balancer like HAProxy to split the /api/put write traffic across multiple instances of tsdbrelay in an active/passive mode. We run one instance on each node (with tsdbrelay forwarding to the local opentsdb instance) and direct all write traffic at a primary write node (with multiple secondary/backup nodes).
Split the /api/query traffic across the remaining nodes, pointed directly at opentsdb (no need to go through the relay), in an active/active mode (aka round robin or hash-based routing). This improves query performance by balancing queries across the non-write nodes.
We only run a single bosun instance in each datacenter, with the DR site using the read-only flag (any failover would be manual). Bosun really isn't designed for HA yet, but in the future it may allow two nodes to share a redis instance to provide active/active or active/passive HA.
By using tsdbrelay to duplicate the metric streams you don't have to deal with opentsdb/hbase replication and instead can setup multiple isolated monitoring systems in each datacenter and duplicate the metrics to whichever sites are appropriate. We have a primary and a DR site, and choose to duplicate all metrics to both data centers. I actually use the DR site daily for Grafana queries since it is closer to where I live.
You can find more details about production setups at http://bosun.org/resources including copies of all of the haproxy/tsdbrelay/etc configuration files we use at Stack Overflow.

Machine type for google cloud dataflow jobs

I noticed there is an option that allows specifying a machine type.
What is the criteria I should use to decide whether to override the default machine type?
In some experiments I saw that throughput is better with smaller instances, but on the other hand jobs tend to experience more "system" failures when many small instances are used instead of a smaller number of default instances.
Thanks,
G
Dataflow will eventually optimize the machine type for you. In the meantime, here are some scenarios I can think of in which you might want to change the machine type.
If your ParDo operation needs a lot of memory, you might want to change the machine type to one of the high-memory machines that Google Compute Engine provides.
Optimizing for cost and speed: if your CPU utilization is less than 100%, you could probably reduce the cost of your job by picking a machine with fewer CPUs. Alternatively, if you increase the number of machines and reduce the number of CPUs per machine (so the total number of CPUs stays approximately constant), you can make your job run faster while costing approximately the same.
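For reference, a minimal sketch (assuming the Apache Beam Java SDK with the Dataflow runner; option names may differ in other SDK versions) of overriding the worker machine type and worker count from pipeline options:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MachineTypeExample {
    public static void main(String[] args) {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        // Trade CPUs per worker against worker count while keeping total CPUs similar.
        options.setWorkerMachineType("n1-highmem-8"); // e.g. for memory-hungry ParDos
        options.setMaxNumWorkers(20);

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here ...
        p.run();
    }
}
```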
Could you please elaborate on what type of system failures you are seeing? A large class of failures (e.g. VM interruptions) are probabilistic, so you would expect to see a larger absolute number of failures as the number of machines increases. However, failures like VM interruptions should be fairly rare, so I'd be surprised if you noticed an increase unless you were using an order of magnitude more VMs.
On the other hand, it's possible you are seeing more failures because of resource contention due to the increased parallelism of using more machines. If that's the case we'd really like to know about it to see if this is something we can address.
