Sharing the metrics in cache of two nodes within a graphite cluster - monitoring

I have a graphite cluster with 2 nodes under an ELB. Both of them share the same NFS to store the metrics. I don't have a problem accessing metrics that are already written to the NFS. The problem arises when node 1 has some metrics in its cache that have not yet been written to the NFS and node 2 tries to access those metrics. One solution that I have in mind is to include the IPs of both servers in local_settings.py:
#########################
# Cluster Configuration #
#########################
#CLUSTER_SERVERS = ["10.x.x.1:80", "10.x.x.2:80"]
Is there any other way or a better solution to access the cache in node 1 from node 2 under the same ELB ?

Graphite uses files on disk to resolve globs (e.g. '*') in metric names. If a metric has not yet been written to disk, it will not be visible in Graphite.
Adding CLUSTER_SERVERS will not help, because those entries should point to other graphite-web instances, not caches. You can add both caches to CARBONLINK_HOSTS, e.g.
CARBONLINK_HOSTS = ['10.x.x.1:7002', '10.x.x.2:7002']
but I doubt that helps, because of what I said above.
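For reference, the corresponding lines in local_settings.py would look roughly like this; the IPs come from the question and 7002 is carbon's default cache query port, so treat it as a sketch rather than a drop-in config:
# local_settings.py on both graphite-web nodes (sketch)
CARBONLINK_HOSTS = ["10.x.x.1:7002", "10.x.x.2:7002"]
# each carbon-cache also has to accept cache queries on an interface the other
# node can reach (CACHE_QUERY_PORT / CACHE_QUERY_INTERFACE in carbon.conf)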

Related

Duplicate entries in prometheus

I'm using the prometheus plugin for Jenkins in order to pass data to the prometheus server and subsequently have it displayed in grafana.
With the default setup I can see the metrics at http://:8080/prometheus
But in the list I also find some duplicate entries for the same job
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api/com.xxxxxx.yyy:yyy-web",repo="NA",} 217191.0
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api",repo="NA",} 526098.0
Both entries refer to the same Jenkins job spring_api, but the metrics have different values. Why do I see two entries for the same metric?
Possibly one is a subset of the other.
In the Kubernetes world you will have the resource consumption for each container in a pod, and the pod's overall resource usage.
Suppose I query the metric "container_cpu_usage_seconds_total" for {pod="X"}.
Pod X has 2 containers so I'll get back four metrics.
{pod="X",container="container1"}
{pod="X",container="container2"}
{pod="X",container="POD"} <- some weird "pause" image with very low usage
{pod="X"} <- sum of container1 and container2
There might also be a discrepancy where the metric with no container label is greater than the sum of the per-container consumption. That might be some "not accounted for" overhead, like pod DNS lookups or something. I'm not sure.
I guess my point is that Prometheus will often use combinations of labels, and omissions of labels, to show how a metric is broken down.
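If you want to see that breakdown for yourself, here is a small Python sketch against the Prometheus HTTP API; the prometheus-host URL and the pod name are placeholders, and the exact label conventions can vary between cAdvisor/kubelet versions:
import requests

PROM = "http://prometheus-host:9090"  # placeholder URL

def instant_query(expr):
    # /api/v1/query evaluates a PromQL expression at the current instant
    r = requests.get(PROM + "/api/v1/query", params={"query": expr})
    r.raise_for_status()
    return r.json()["data"]["result"]

# sum over the real containers, excluding the "POD" pause container and the
# aggregate series that carries no container label at all
per_container = instant_query(
    'sum(container_cpu_usage_seconds_total{pod="X", container!="", container!="POD"})')

# the pod-level series that has no container label
pod_level = instant_query(
    'container_cpu_usage_seconds_total{pod="X", container=""}')

print(per_container)
print(pod_level)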

Can Telegraf combine/add value of metrics that are per-node, say for a cluster?

Let's say I have some software running on a VM that emits two metrics, which are fed through Telegraf to be written into InfluxDB. Let's say the metrics are the number of successfully handled HTTP requests (S) and the number of failed HTTP requests (F) on that VM. However, I might configure three such VMs, each emitting those 2 metrics.
Now I would like to have computed metrics which are the sum of S from each VM and the sum of F from each VM, stored as new metrics at various instants of time. Is this something that can be achieved using Telegraf? Or is there a better, more efficient, more elegant way?
Kindly note that my knowledge of Telegraf and InfluxDB is theoretical, as I've only recently started reading up about them, so I have not actually tried any of the above yet.
This isn't something telegraf would be responsible for.
With Influx 1.x, you'd use a TICKScript or Continuous Queries to calculate the sum and inject the new sampled value.
Roughly, this would look like:
CREATE CONTINUOUS QUERY "sum_sample_daily" ON "database"
BEGIN
SELECT sum(*) INTO "daily_measurement" FROM "measurement" GROUP BY time(1d)
END
CQ docs
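If you drive InfluxDB 1.x from Python, the same continuous query can be registered and its output read back through the influxdb client library; the database, measurement and host names below are the placeholders from the example above, not anything your setup necessarily uses:
from influxdb import InfluxDBClient  # InfluxDB 1.x Python client

client = InfluxDBClient(host="localhost", port=8086, database="database")

# register the continuous query once; InfluxDB then runs it server-side
client.query(
    'CREATE CONTINUOUS QUERY "sum_sample_daily" ON "database" '
    'BEGIN SELECT sum(*) INTO "daily_measurement" FROM "measurement" '
    'GROUP BY time(1d) END')

# later, read back the rolled-up totals summed across all VMs
result = client.query('SELECT * FROM "daily_measurement"')
print(list(result.get_points()))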

Private Ethereum maxpeer

Steps
I have created a private node and used --maxpeers of 1 (network id = 1223123341).
Added user X's node via admin.addPeer(enode of user X) successfully (same network id and genesis).
Based on my understanding, --maxpeers will limit the nodes that can connect from the network to 1 node only (user X's node).
Question - if user X's node updates his --maxpeers to 5 and gives the network id and genesis file to other nodes, does that mean there can now be 5 nodes that can connect to this network? Who controls maxpeers in a private network (e.g. network id = 1223123341)?
If you want to avoid 51% attacks, you should consider running permissioned chains. One option is to keep the genesis block of your Proof-of-Work or Proof-of-Stake network private, but you would have to share it with every participant in the network, and you cannot know whether it gets leaked at some point. And if it does, there is no way to stop other users from participating.
Another option is to use Proof-of-Authority networks. Both Geth and Parity support that. This allows only strictly defined nodes to seal blocks and everyone else can just use the network, but not change the set of rules defined by the authorities.
Note: I work for Parity.
The --maxpeers option controls the number of peers for that particular instance. So, yes, if node 1 has --maxpeers=1 and node 2 has --maxpeers=5, you will not be limited to just 2 nodes in the network. Nodes don't all need to know about every other node either, so node 2 may be peers with nodes 3-7 and not know anything about node 1 (in other words, with the example you provided, the total number of nodes could be even more than 5).
AFAIK, there is no configuration to limit the total number of nodes in a network, and I don't see why you would want one. You are given enough control at the node level.
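To see that the limit really is per node, you can ask each node over JSON-RPC how many peers it is currently connected to; a minimal Python sketch, assuming the HTTP RPC endpoints below (placeholders) are enabled on each geth instance:
import requests

def rpc(endpoint, method, params=None):
    # minimal JSON-RPC call against a geth node
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    return requests.post(endpoint, json=payload).json()["result"]

nodes = ["http://10.0.0.1:8545", "http://10.0.0.2:8545"]  # placeholder endpoints

for endpoint in nodes:
    # net_peerCount returns a hex string such as "0x1"
    count = int(rpc(endpoint, "net_peerCount"), 16)
    print(endpoint, "is connected to", count, "peers")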

ML-Engine with GPUs workers errors

Hi, I am using ML Engine with a custom tier made up of a complex_m master, four workers each with a GPU, and one complex_m as parameter server.
The model is training a CNN. However, there seems to be trouble with the workers.
This is an image of the logs https://i.stack.imgur.com/VJqE0.png.
The master still seems to be working because session checkpoints are being saved; however, this is nowhere near the speed it should be.
With complex_m workers, the model works. It just gives a "waiting for the model to be ready" message in the beginning (I assume until the master initializes the global variables, correct me if I am wrong) and then works normally. With GPUs, however, there seems to be a problem with the task.
I didn't use the tf.device() function anywhere; in the cloud I thought the device is set automatically if a GPU is available.
I followed the Census example and loaded the TF_CONFIG environment variable.
tf.logging.info('Setting up the server')
tf_config = os.environ.get('TF_CONFIG')

# If TF_CONFIG is not available run local
if not tf_config:
    return run('', True, *args, **kwargs)

tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')

# If cluster information is empty run local
if job_name is None or task_index is None:
    return run('', True, *args, **kwargs)

cluster_spec = tf.train.ClusterSpec(cluster)
server = tf.train.Server(cluster_spec,
                         job_name=job_name,
                         task_index=task_index)

# Wait for incoming connections forever
# Worker ships the graph to the ps server
# The ps server manages the parameters of the model.
if job_name == 'ps':
    server.join()
    return
elif job_name in ['master', 'worker']:
    return run(server.target, job_name == 'master', *args, **kwargs)
Then I used tf.train.replica_device_setter before defining the main graph.
As a session I am using tf.train.MonitoredTrainingSession; this should handle the initialization of variables and checkpoint saving. I do not know why the workers are saying that the variables are not initialized.
Variables to be initialized are all variables: https://i.stack.imgur.com/hAHPL.png
Optimizer: AdaDelta
I appreciate the help!
In the comments, you seem to have answered your own question (using cluster_spec in replica_setter). Allow me to address the issue of throughput of a cluster of CPUs vs. a cluster of GPUs.
GPUs are fairly powerful. You'll typically get higher throughput by getting a single machine with many GPUs rather than having many machines each with a single GPU. That's because the communication overhead becomes a bottleneck (the bandwidth and latency to main memory on the same machine is much better than communicating with a parameter server on a remote machine).
The reason for the GPUs being slower than CPUs may be due to the extra overhead of GPUs needing to copy data from main memory to the GPU and back. If you're doing a lot of parallelizable computation, then this copy is negligible. Your model may be doing too little on the GPU and the overhead may swamp the actual computation.
For more information about building high performance models, see this guide.
In the meantime, I recommend using a single machine with more GPUs to see if that helps:
{
"scaleTier": "CUSTOM",
"masterType": "complex_model_l_gpu",
...
}
Just beware that you'll have to modify your code to assign ops to the right GPUs, probably using towers.
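To illustrate what "towers" means in practice, here is a rough TF 1.x sketch of splitting a batch across the GPUs on one machine and averaging the gradients; the toy linear model, the 4-GPU count and the tensor shapes are placeholders standing in for your CNN, not something taken from your code:
import tensorflow as tf

NUM_GPUS = 4  # assumption: the single-machine multi-GPU tier suggested above

def tower_loss(features, labels):
    # stand-in model: one linear layer built with tf.get_variable so that
    # reuse_variables() below shares the weights across towers; your CNN goes here
    w = tf.get_variable('w', [784, 10])
    b = tf.get_variable('b', [10], initializer=tf.zeros_initializer())
    logits = tf.matmul(features, w) + b
    return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

features = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])
optimizer = tf.train.AdadeltaOptimizer()

# split one input batch into one shard ("tower") per GPU
feature_shards = tf.split(features, NUM_GPUS)
label_shards = tf.split(labels, NUM_GPUS)

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            loss = tower_loss(feature_shards[i], label_shards[i])
            # every tower after the first reuses the same variables
            tf.get_variable_scope().reuse_variables()
            tower_grads.append(optimizer.compute_gradients(loss))

# average the per-tower gradients and apply them once
averaged = []
for grads_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grads_and_vars])
    averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(averaged)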

Why does Riak store all my documents on only one node? n_val is equal to 3

I have built a 5-node cluster using Riak 2.0pre11 on EC2 servers. I installed Riak, got it working, then repeated the same actions on 4 more servers using a bash script. At that point I used riak-admin cluster join riak@node1.example.com on nodes 2 thru 5 to form a cluster.
Using the Python Riak client I wrote a script to send 10,000 documents to Riak. That works fine, and I wrote another script to retrieve a doc, which also worked fine. Other than specifying the use of protobufs I haven't specified any other options when storing keys. I stored all the docs via a connection to node 1.
However, Riak seems to be storing all 3 replicas on the same node; in other words, the storage used on node 1 is about 3x the size of the original HTML docs.
The script connected to node 1 and that is where all docs are stored. I changed the script to connect to node 2 and sent 10,000 more, which also all ended up on node 1. I used the command du -h /data/riak/bitcask to verify the aggregate stored size of the objects. On nodes 2 thru 4 there are only a few K, which is the overhead of an empty Bitcask datastore.
For each document I specified a key similar to this:
http://www.example.com/blogstore/007529.html4787somehash4787947:2014-03-12T19:14:32.887951Z
The first part of every key is identical (testing); only the .html name and the ISO 8601 timestamp differ. Is it possible that I have somehow subverted the hashing function?
Basically I used a default config. What could be wrong? Since Riak 2.0 uses a different config format, here is a fragment of the generated config for riak-core in the old format:
{riak_core,
[{enable_consensus,false},
{platform_log_dir,"/var/log/riak"},
{platform_lib_dir,"/usr/lib/riak/lib"},
{platform_etc_dir,"/etc/riak"},
{platform_data_dir,"/var/lib/riak"},
{platform_bin_dir,"/usr/sbin"},
{dtrace_support,false},
{handoff_port,8099},
{ring_state_dir,"/datapool/riak/ring"},
{handoff_concurrency,2},
{ring_creation_size,64},
{default_bucket_props,
[{n_val,3},
{last_write_wins,false},
{allow_mult,true},
{basic_quorum,false},
{notfound_ok,true},
{rw,quorum},
{dw,quorum},
{pw,0},
{w,quorum},
{r,quorum},
{pr,0}]}]}
If the Bitcask directory only grows on a single node, it sounds like the nodes might not be communicating. Please run riak-admin member-status to verify that all nodes in the cluster are active.
Once you have issued the riak-admin cluster join <node> commands on all the nodes joining the cluster, you also need to run riak-admin cluster plan to verify that the plan is correct before committing it with riak-admin cluster commit. These commands are described in greater detail here.
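Once the plan has been committed, a quick sanity check from the application side is to write a key through one node and read it back through another; a rough sketch with the Python Riak client, where the host names, the 8087 protobuf port and the bucket/key are placeholders:
import riak

# write through node 1
writer = riak.RiakClient(protocol='pbc', host='node1.example.com', pb_port=8087)
writer.bucket('blogstore').new('sanity-check-key', data={'ok': True}).store()

# read the same key back through node 2; if the cluster plan was committed and
# handoff has finished, this succeeds even though node 2 never saw the write
reader = riak.RiakClient(protocol='pbc', host='node2.example.com', pb_port=8087)
print(reader.bucket('blogstore').get('sanity-check-key').data)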

Resources