I am monitoring a multi-agent Flume setup with Ganglia. I am able to produce metrics, which I can view as JSON data in the browser. Can anyone suggest how to view these metrics as graphs?
When starting the Flume agent, provide the following Java options:
-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=<host:port of ganglia monitor daemon>
Depending on your Flume installation, these can be set in flume-env.sh (in my installation the file is in /etc/flume/conf).
Note that you cannot run HTTP and Ganglia monitoring at the same time.
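A minimal sketch of what the flume-env.sh change could look like, assuming the Ganglia monitoring daemon (gmond) runs on a host named ganglia-host with its default port 8649 (both are placeholders):

# flume-env.sh -- append the Ganglia monitoring flags to the agent's Java options
export JAVA_OPTS="$JAVA_OPTS -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=ganglia-host:8649"

The Ganglia web frontend can then render the reported counters as graphs.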
I am trying to use JMX Exporter with my Kotlin code to expose metrics to Prometheus so that I can show them in Grafana. I have gone through many articles trying to understand how it could be done, and found the two links below useful; I am trying to achieve this using those.
https://github.com/prometheus/jmx_exporter
https://www.openlogic.com/blog/monitoring-java-applications-prometheus-and-grafana-part-1
What I have done so far: I created a folder 'prometheus-jmx' in the root directory and added the mentioned JAR and config.yml file to that folder. Then I added the line below to my Dockerfile.
CMD java -Xms128m -Xmx256m -javaagent:./jmx_prometheus_javaagent-0.12.0.jar=8080:./config.yml -Dconfig.file=config/routing.conf -cp jagathe-jar-with-dependencies.jar:./* com.bmw.otd.agathe.AppKt
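For reference, the config.yml passed to the javaagent can be as simple as a catch-all rule; a minimal sketch (the exporter applies its defaults for anything not specified):

# config.yml -- expose every JMX MBean attribute the agent can find
rules:
  - pattern: ".*"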
My Prometheus instance is running in my OpenShift cluster along with my application. I can scrape metrics for my other applications/deployments, like Jenkins and SonarQube, without any modifications to Prometheus's deployment.yml.
My application is now running properly on OpenShift, and from the application's pod I can scrape metrics using the command below.
curl http://localhost:portNumber
But in the Prometheus UI I can't see any JVM or JMX related metrics.
Can someone please tell me what I am doing wrong? Any kind of help would be appreciated.
After trying many things, I came to know that I needed to expose the port of my application's container so that Prometheus (and other deployments) could reach it. After exposing the port, I could see my application under Targets in Prometheus and could scrape all the JMX and JVM metrics. Hope this helps someone in the future.
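A minimal sketch of that change, assuming the javaagent port 8080 from the CMD above (use whatever port you bound the exporter to):

# Dockerfile: declare the port the jmx_prometheus javaagent listens on
EXPOSE 8080

Depending on how Prometheus discovers targets in your cluster, the pod or service definition also needs to reference the same port (for example a containerPort entry, or a prometheus.io/port annotation if annotation-based scraping is configured); that is what ultimately makes the application show up under Targets.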
I am trying to set up Flume on an edge node. I have checked many blogs but haven't gotten much clarity, as most of them refer to a single-node cluster. Can someone suggest whether it is a good idea to set this up on an edge node, or should it be on a server where HDFS or a worker node (DataNode) runs? If the edge node is the right place, what would the configuration to set it up look like?
As suggested by Viren, in a production environment you should configure Flume on an edge node only. It's not that you can't do it on the NameNode server, but it should be avoided for performance reasons.
If this is a production environment, it's a good idea to avoid the NameNode server(s), ResourceManager server(s), JournalNodes and DataNodes. That leaves you with the edge node.
The process would be to:
1) Install the Hadoop client.
2) Install Flume.
3) Configure Flume in a flume.conf file (or whatever name you want to give it). You can find many sample configurations online; a minimal sketch is shown after this list.
4) Set the monitoring type to http for a quick check of the performance data.
5) Open the ports used by the sources and sinks.
6) Start the agent.
7) Check the agent log to see that all components started.
8) Try sending some sample data and check that it reaches the destination.
9) Debug any failures.
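For step 3, a minimal sketch of a flume.conf, assuming a single agent named a1 with a netcat source, a memory channel and an HDFS sink (all names, ports and paths here are illustrative):

# flume.conf -- agent a1: netcat source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs:///tmp/flume/events
a1.sinks.k1.channel = c1

The agent would then be started with something like flume-ng agent -c /etc/flume/conf -n a1 -f flume.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=34545, which also covers step 4: the counters become available as JSON on port 34545.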
Let me know if you need more information.
I am currently trying to break into data engineering, and I figured the best way to do this was to get a basic understanding of the Hadoop stack (I played around with the Cloudera QuickStart VM and went through the tutorial) and then try to build my own project. I want to build a data pipeline that ingests Twitter data, stores it in HDFS or HBase, and then runs some sort of analytics on the stored data. I would also prefer to use real-time streaming data, not historical/batch data. My data flow would look like this:
Twitter Stream API --> Flume --> HDFS --> Spark/MapReduce --> Some DB
Does this look like a good way to bring in my data and analyze it?
Also, how would you guys recommend I host/store all this?
Would it be better to have one instance on AWS EC2 for Hadoop to run on, or should I run it all in a local VM on my desktop?
I plan to have only a one-node cluster to start.
First of all, Spark Streaming can read from Twitter, and in CDH, I believe that is the streaming framework of choice.
Your pipeline is reasonable, though I might suggest using Apache NiFi (which is in the Hortonworks HDF distribution) or StreamSets, which, from what I understand, is easy to install in CDH.
Note that these run completely independently of Hadoop. Hint: Docker works great with them. HDFS and YARN are really the only complex components that I would rely on a pre-configured VM for.
Both NiFi and StreamSets give you a drag-and-drop UI for hooking Twitter up to HDFS and an "other DB".
Flume can work, and a single pipeline is easy, but it just hasn't matured to the level of the other streaming platforms. Personally, I like a Logstash -> Kafka -> Spark Streaming pipeline better, for example because Logstash configuration files are nicer to work with (the Twitter input plugin is built in). And Kafka works with a bunch of tools.
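To illustrate that point, a rough sketch of a Logstash pipeline that reads the Twitter stream and writes it to a Kafka topic (the credentials, broker address and topic name are all placeholders):

# logstash.conf -- Twitter stream into a Kafka topic
input {
  twitter {
    consumer_key       => "YOUR_CONSUMER_KEY"
    consumer_secret    => "YOUR_CONSUMER_SECRET"
    oauth_token        => "YOUR_ACCESS_TOKEN"
    oauth_token_secret => "YOUR_ACCESS_TOKEN_SECRET"
    keywords           => ["hadoop", "spark"]
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    topic_id          => "tweets"
  }
}

From Kafka, Spark Streaming (or whatever else consumes the topic) can then write to HDFS or another DB.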
You could also try out Kafka with Kafka Connect, or use Apache Flink for the whole pipeline.
Primary takeaway: you can bypass Hadoop here, or at least end up with something like this:
Twitter > Streaming Framework > HDFS
                            ... > Other DB
                            ... > Spark
Regarding running locally or not, as long as you are fine with paying for idle hours on a cloud provider, go ahead.
I have recently performed a migration to Google Cloud Platform, and I really like it.
However, I can't find a way to monitor the memory usage of the Dataproc VM instances. As you can see in the attached screenshot, the console provides utilization info about CPU, disk and network, but not about memory.
Without knowing how much memory is being used, how can I tell whether extra memory is needed?
By installing the Stackdriver agent in GCE VMs, additional metrics like memory can be monitored. Stackdriver also offers alerting and notification features. Note, however, that agent metrics are only available for premium-tier accounts.
See this answer for Dataproc VMs.
Memory monitoring without the agent is currently only available for the E2 machine family; other instance types such as N1, N2, ... are not covered and still need the agent.
See the latest documentation for what is supported: https://cloud.google.com/monitoring/api/metrics_gcp#gcp-compute
Alternatively, you can use the /proc/meminfo virtual file to get information about current memory usage. A simple bash script can read the usage from /proc/meminfo, run periodically as a cron job, and send an alert email if the memory usage exceeds a given threshold.
See this link: https://pakjiddat.netlify.app/posts/monitoring-cpu-and-memory-usage-on-linux
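A minimal sketch of such a script, assuming a mail command is available on the VM (the threshold and alert address are placeholders):

#!/usr/bin/env bash
# check_mem.sh -- alert when memory usage exceeds a threshold; run from cron
THRESHOLD=90   # percent used at which to alert

# MemTotal and MemAvailable are reported in kB in /proc/meminfo
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
used_pct=$(( (total_kb - avail_kb) * 100 / total_kb ))

if [ "$used_pct" -ge "$THRESHOLD" ]; then
    echo "Memory usage is ${used_pct}% on $(hostname)" | mail -s "Memory alert" admin@example.com
fi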
The most up-to-date answer here.
How to see memory usage in GCP?
Install the agent on your virtual machine. Takes less than 5 minutes.
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh
sudo apt-get update
sudo apt-get install stackdriver-agent
The snippet above should install the most recent version of the agent, but for an up-to-date guide you can always refer to
https://cloud.google.com/monitoring/agent/installation#joint-install.
After it is installed, within a minute or two you should see the additional metrics in the Monitoring section of GCP:
https://console.cloud.google.com/monitoring
Explanation: why is it invisible by default?
Metrics such as CPU or memory usage can be collected in different places. For instance, CPU usage is a piece of information that the host (the machine whose software runs your virtual machine) can collect on its own.
The thing with memory usage and virtual machines is that it is managed by the guest operating system (the operating system of your virtual machine). The host cannot really know how much is used; all it can see of the memory given to that virtual machine is a stream of bytes.
That's why the idea is to install an agent inside the virtual machine that collects the metrics from the inside and ships them somewhere they can be interpreted. There are many agents available out there, but Google promotes its own, the Monitoring Agent, and it integrates well with the entire GCP suite.
The agent metrics page may be useful:
https://cloud.google.com/monitoring/api/metrics_agent
You'll need to install stackdriver. See: https://app.google.stackdriver.com/?project="your project name"
The stackdriver metrics page will provide some guidance. You will need to change the "project name" (e.g. sinuous-dog-133823) to suit your account:
https://app.google.stackdriver.com/metrics-explorer?project=sinuous-dog-133823&timeSelection={"timeRange":"6h"}&xyChart={"dataSets":[{"timeSeriesFilter":{"filter":"metric.type=\"agent.googleapis.com/memory/bytes_used\" resource.type=\"gce_instance\"","perSeriesAligner":"ALIGN_MEAN","crossSeriesReducer":"REDUCE_NONE","secondaryCrossSeriesReducer":"REDUCE_NONE","minAlignmentPeriod":"60s","groupByFields":[],"unitOverride":"By"},"targetAxis":"Y1","plotType":"LINE"}],"options":{"mode":"COLOR"},"constantLines":[],"timeshiftDuration":"0s","y1Axis":{"label":"y1Axis","scale":"LINEAR"}}&isAutoRefresh=true
This REST call will get you the memory usage (the metric agent.googleapis.com/memory/bytes_used). You will need to modify the parameters to suit your project name (e.g. sinuous-dog-133823) and adjust the other params to suit your needs:
GET /v3/projects/sinuous-cat-233823/timeSeries?filter=metric.type="agent.googleapis.com/memory/bytes_used" resource.type="gce_instance"& aggregation.crossSeriesReducer=REDUCE_NONE& aggregation.alignmentPeriod=+60s& aggregation.perSeriesAligner=ALIGN_MEAN& secondaryAggregation.crossSeriesReducer=REDUCE_NONE& interval.startTime=2019-03-06T20:40:00Z& interval.endTime=2019-03-07T02:51:00Z& $unique=gc673 HTTP/1.1
Host: content-monitoring.googleapis.com
authorization: Bearer <your token>
cache-control: no-cache
Postman-Token: 039cabab-356e-4ee4-99c4-d9f4685a7bb2
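A rough curl equivalent of that request, using gcloud for the access token (the project name and time interval are placeholders; -G makes curl URL-encode the parameters into the query string):

curl -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT/timeSeries" \
  --data-urlencode 'filter=metric.type="agent.googleapis.com/memory/bytes_used" resource.type="gce_instance"' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN' \
  --data-urlencode 'aggregation.alignmentPeriod=60s' \
  --data-urlencode 'interval.startTime=2019-03-06T20:40:00Z' \
  --data-urlencode 'interval.endTime=2019-03-07T02:51:00Z'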
VM memory metrics are not available by default; they require the Cloud Monitoring agent.
The UI you are showing is Dataproc's, which already has the agent installed but disabled by default, so you don't have to reinstall it. To enable the Cloud Monitoring agent for Dataproc clusters, set --properties dataproc:dataproc.monitoring.stackdriver.enable=true when creating the cluster. Then you can monitor VM memory and create alerts in the Cloud Monitoring UI (it is not integrated with the Dataproc UI yet).
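For example, a sketch of the create command with that property (the cluster name and region are placeholders):

gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties=dataproc:dataproc.monitoring.stackdriver.enable=true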
Also see this related question: Dataproc VM memory and local disk usage metrics
Note that the above is now out of date, as Stackdriver is now a legacy agent. It has been replaced by the Ops Agent; please read the latest GCP articles about migrating to the Ops Agent.
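For reference, the equivalent quick install of the newer Ops Agent currently looks roughly like this (check the GCP documentation for the up-to-date commands):

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install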
What is the common practice for getting metrics from services running inside Docker containers, using tools like collectd or InfluxDB's Telegraf?
These tools are normally configured to run as agents in the system and get metrics from localhost.
I have read the collectd docs, and some plugins allow getting metrics from remote systems, so I could have, for example, an NGINX container and then a collectd container to gather its metrics. But isn't there a simpler way?
Also, I don't want to use Supervisor or similar tools to run more than one process per container.
I am thinking about this in conjunction with a system like DC/OS or Kubernetes.
What do you think?
Thank you for your help.