Increasing Prometheus storage retention - monitoring

Hi, I have a Prometheus server installed on my AWS instance, but the data is being removed automatically after 15 days. I need to keep data for a year, or at least several months. Is there anything I need to change in my Prometheus configuration?
Or do I need an extension like Thanos? I am new to Prometheus, so please be easy on the answers.

Edit the prometheus.service file:
vi /etc/systemd/system/prometheus.service
Add "--storage.tsdb.retention.time=1y" below the "ExecStart=/usr/local/bin/prometheus \" line.
So the config will look like below for 1 year of data retention.
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.external-url=http://34.89.26.156:9090 \
--storage.tsdb.retention.time=1y
[Install]
WantedBy=multi-user.target
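After editing the unit file, reload systemd and restart Prometheus so the new flag takes effect (a minimal sketch, assuming the service is installed as above):
sudo systemctl daemon-reload
sudo systemctl restart prometheus
# the new flag should now appear in the running command line
ps aux | grep 'storage.tsdb.retention'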

There's the --storage.tsdb.retention.time flag that you can set when you start Prometheus. It defines how long data is kept in the time-series database (TSDB). The default is 15 days.
So, to increase the retention time to a year, you should be able to set this to something like:
--storage.tsdb.retention.time=1y
# or
--storage.tsdb.retention.time=365d
See the Prometheus documentation.
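To double-check which retention value the running server actually picked up, the flags endpoint of the HTTP API reports it (a sketch; adjust host and port to your setup):
curl -s http://localhost:9090/api/v1/status/flags | python3 -m json.tool | grep retention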

Adding the below to the deployment YAML file allowed me to change the storage retention period:
image: 'your/image path'
args:
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=45d'
  - '--config.file=/etc/prometheus/prometheus.yml'
Thanks
Prashanth
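If Prometheus runs as a Kubernetes Deployment as in the snippet above, a sketch of rolling the change out (the manifest file name and Deployment name are placeholders for your own):
kubectl apply -f prometheus-deployment.yaml
kubectl rollout status deployment/prometheus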

On Debian you do not have to edit the systemd unit. You can simply add the arguments to
/etc/default/prometheus
like so:
# Set the command-line arguments to pass to the server.
ARGS="--storage.tsdb.retention.time=60d"

Related

How to delete a tag in InfluxDB Cloud or the InfluxDB Data Explorer?

How can I delete a tag in InfluxDB Cloud or the InfluxDB Data Explorer? For example, the tag scripts. Thanks.
influx delete --bucket default \
  --start '1970-01-01T00:00:00Z' \
  --stop $(date +"%Y-%m-%dT%H:%M:%SZ") \
  --predicate "host=\"some-hostname\""
This will delete the data points for the host tag with the name "some-hostname".
I've not found a way to completely remove a tag.
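To check that the points are actually gone, you can run a quick Flux query afterwards (a sketch, assuming the influx CLI v2 and the same bucket and tag as above):
influx query 'from(bucket: "default")
  |> range(start: 0)
  |> filter(fn: (r) => r.host == "some-hostname")
  |> count()'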

How to schedule a Docker container in GCP

I have 5 tasks in my project that need to be run periodically. Some of these tasks are run on a daily basis, some on a weekly basis.
I am trying to containerize each task in a Docker image. Here is one illustrative example:
FROM tensorflow/tensorflow:2.7.0
RUN mkdir /home/MyProject
COPY . /home/MyProject
WORKDIR /home/MyProject/M1/src/
RUN pip install pandas numpy
CMD ./task1.sh
There is a list of Python scripts that need to be run by the task1.sh file referenced above. This is not a server application or anything similar; the container runs task1.sh, which runs all the Python scripts defined in it one by one, and the entire process finishes within minutes. The same process is supposed to be repeated 24 hours later.
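(For completeness, a sketch of pushing such an image to a registry that GCP services can pull from; PROJECT_ID and the image name are placeholders.)
docker build -t gcr.io/PROJECT_ID/task1:latest .
docker push gcr.io/PROJECT_ID/task1:latest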
How can I schedule such Docker containers in GCP? Are there different ways of doing it? Which one is comparably simpler if there are multiple solutions?
I am not a DevOps expert by any means. All the examples I find in the documentation are for server applications that run all the time, not for a case like mine where the image needs to be run just once, periodically. This topic is quite daunting for a beginner in this domain like myself.
ADDENDUM:
Looking at Google's documentation for cronjobs in GKE on the following page:
https://cloud.google.com/kubernetes-engine/docs/how-to/cronjobs
I find the following cronjob.yaml file:
# cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 100
  suspend: false
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo "Hello, World!"
          restartPolicy: OnFailure
It is stated that this cronjob prints the current time and a string once every minute.
But it is documented in a way with the assumption that you deeply understand what is going on on the page, in which case you would not need to read the documentation!
Let's say that I have an image that I would like to run once every day, and the name of my image is, say, my_image.
I assume that I am supposed to change the following part for my own image.
containers:
- name: hello
  image: busybox
  args:
  - /bin/sh
  - -c
  - date; echo "Hello, World!"
It is a total mystery what these names and arguments mean.
name: hello
I suppose it is just a user selected name and does not have any practical importance.
image: busybox
Is this busybox the base image? If not, what is that? It says NOTHING about what this busybox thing is and where it comes from!
args:
- /bin/sh
- -c
- date; echo "Hello, World!"
And based on the explanation on the page, this is the part that prints the date and the "Hello, World!" string to the screen.
Ok... So, how do I modify this template to create a cronjob out of my own image my_image? This documentation does not help at all!
I will answer your comment here, because the second part of your question is too long to answer in a comment.
Don't be afraid, it's the Kubernetes API definition. You declare what you want to the control plane, and it is in charge of making your wishes happen!
# cronjob.yaml
apiVersion: batch/v1                 # The API that you call
kind: CronJob                        # The type of object/endpoint in that API
metadata:
  name: hello                        # The name of your job definition
spec:
  schedule: "*/1 * * * *"            # Your scheduling; change it to "0 10 * * *" to run your job every day at 10:00 AM
  concurrencyPolicy: Allow           # config stuff, deep dive later
  startingDeadlineSeconds: 100       # config stuff, deep dive later
  suspend: false                     # config stuff, deep dive later
  successfulJobsHistoryLimit: 3      # config stuff, deep dive later
  failedJobsHistoryLimit: 1          # config stuff, deep dive later
  jobTemplate:                       # Your execution definition
    spec:
      template:
        spec:
          containers:
          - name: hello              # Custom name of your container. Only to help you in case of debug, logs, ...
            image: busybox           # Image of your container, can be gcr.io/projectID/myContainer for example
            args:                    # Args to pass to your container. You also have the "entrypoint" definition to change if you want. The entrypoint is the binary to run and that will receive the args
            - /bin/sh
            - -c
            - date; echo "Hello, World!"
            # You can also use "command" to run the command with the args directly. In fact it's WHAT you start in your container to perform the job.
          restartPolicy: OnFailure   # Config in case of failure.
You have more details on the API definition here.
Here is the API definition of a container, with all the possible values to customize it.
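Putting that together for your case, here is a minimal sketch of the same template adapted to run my_image once a day at 10:00 (the registry path gcr.io/YOUR_PROJECT/my_image is a placeholder; since your Dockerfile's CMD already runs task1.sh, no args override is needed):
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-image-daily
spec:
  schedule: "0 10 * * *"        # every day at 10:00
  concurrencyPolicy: Forbid     # do not start a new run while the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: my-image
            image: gcr.io/YOUR_PROJECT/my_image   # placeholder registry path
          restartPolicy: OnFailure
EOF
# check that the CronJob exists and watch its Jobs appear
kubectl get cronjobs
kubectl get jobs --watch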

Restart wurstmeister/kafka-docker after server.properties changes due to ssl config

According to the docs, the SSL config of wurstmeister/kafka-docker is to be done in the server.properties file, as follows:
listeners=PLAINTEXT://host.name:port,SSL://host.name:port
# The following is only needed if the value is different from ``listeners``, but it should contain
# the same security protocols as ``listeners``
advertised.listeners=PLAINTEXT://host.name:port,SSL://host.name:port
and
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
ssl.truststore.password=test1234
Source: https://docs.confluent.io/3.0.0/kafka/ssl.html#configuring-kafka-brokers
I also followed the rest of the docs, so I have SSL and port 9093 configured:
listeners=PLAINTEXT://:9092,SSL://:9093
advertised.listeners=PLAINTEXT://localhost:9092,SSL://localhost:9093
After I've done that, I have tried to stop and start the server again:
docker stop wurstmeister_kafka_1
docker start wurstmeister_kafka_1
and also
docker restart wurstmeister_kafka_1
But when I inspect with docker ps, I do not see port 9093 being bound:
λ docker ps
CONTAINER ID   IMAGE                           COMMAND                  CREATED      STATUS         PORTS                                                NAMES
b6c5685414ec   wurstmeister/kafka:latest       "start-kafka.sh"         3 days ago   Up 6 minutes   0.0.0.0:9092->9092/tcp                               wurstmeister_kafka_1
ded10e44873a   wurstmeister/zookeeper:latest   "/bin/sh -c '/usr/sb…"   3 days ago   Up 3 days      22/tcp, 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp   wurstmeister_zookeeper_1
and the following command, openssl s_client -debug -connect localhost:9093 -tls1, reported errors:
λ openssl s_client -debug -connect localhost:9093 -tls1
20024:error:0200274D:system library:connect:reason(1869):../openssl-1.1.1a/crypto/bio/b_sock2.c:110:
20024:error:2008A067:BIO routines:BIO_connect:connect error:../openssl-1.1.1a/crypto/bio/b_sock2.c:111:
20024:error:0200274D:system library:connect:reason(1869):../openssl-1.1.1a/crypto/bio/b_sock2.c:110:
20024:error:2008A067:BIO routines:BIO_connect:connect error:../openssl-1.1.1a/crypto/bio/b_sock2.c:111:
connect:errno=0
How can I restart the Docker container so that the changes in server.properties take effect? If that's not the right approach, then what is?
Docker doesn't preserve file changes within the image.
You either have to volume mount your own server.properties over the one in the container, or see if the environment variables allow you to update the configuration during the startup of the image (similar to the confluentinc/kafka image)
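A minimal sketch of the volume-mount approach (the config path inside the image and the zookeeper hostname are assumptions; verify the path first, e.g. with docker exec wurstmeister_kafka_1 find / -name server.properties, and note that port 9093 must also be published for it to show up in docker ps):
docker run -d --name kafka \
  -p 9092:9092 -p 9093:9093 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  -v "$(pwd)/server.properties:/opt/kafka/config/server.properties" \
  wurstmeister/kafka:latest
With docker-compose, the equivalent goes under the service's ports: and volumes: keys, followed by docker-compose up -d to recreate the container.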

check_disk not generating alerts: nagios

I am new to Nagios.
I am trying to configure the "check_disk" service for one host, but I am not getting the expected results.
I should get emails when disk usage goes beyond 80%.
So, there is already a service defined for this task with multiple hosts, as below:
define service{
use local-service ; Name of service template to use
host_name localhost, host1, host2, host3, host4, host5, host6
service_description Root Partition
check_command check_local_disk!20%!10%!/
contact_groups unix-admins,db-admins
}
The Issue:
Further, I tried to test a single host, i.e. "host2". The current usage of host2 is as follows:
# df -h /
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-rootvol   94G   45G   45G  50% /
So to get instant emails, I wrote another service as below, with warning set to <60% free and critical set to <40% free.
define service{
use local-service
host_name host2
service_description Root Partition again
check_command check_local_disk!60%!40%!/
contact_groups dev-admins
}
But I still do not receive any emails for it.
Where is it going wrong?
The "check_local_disk" command is defined as below:
define command{
command_name check_local_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}
Your command definition is currently set up to check only your Nagios server's disk, not the remote hosts (such as host2). You need to define a new command to execute check_disk on the remote host via NRPE (Nagios Remote Plugin Executor).
On Nagios server, define the following:
define command {
command_name check_remote_disk
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk -a $ARG1$ $ARG2$ $ARG3$
register 1
}
define service{
use generic-service
host_name host1, host2, host3, host4, host5, host6
service_description Root Partition
check_command check_remote_disk!20%!10%!/
contact_groups unix-admins,db-admins
}
Restart the Nagios service.
On the remote host:
Ensure you have NRPE plugin installed.
Instructions for Ubuntu: http://tecadmin.net/install-nrpe-on-ubuntu/
Instructions for CentOS / RHEL: http://sharadchhetri.com/2013/03/02/how-to-install-and-configure-nagios-nrpe-in-centos-and-red-hat/
Ensure there is a command defined for check_disk on the remote host. This is usually included in nrpe.cfg, but commented-out. You'd have to un-comment the line.
Ensure you have the check_disk plugin installed on the remote host. Mine is located at: /usr/lib64/nagios/plugins/check_disk
Ensure that allowed_hosts field of nrpe.cfg includes the IP address / hostname of your Nagios server.
Ensure that dont_blame_nrpe field of nrpe.cfg is set to 1 to allow command line arguments to NRPE commands: dont_blame_nrpe=1
If you made any changes, restart the nrpe service.
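Once NRPE is set up, you can verify the whole chain from the Nagios server before waiting for the scheduler (a sketch; the check_nrpe path varies by distribution, /usr/lib64/nagios/plugins and /usr/local/nagios/libexec are common locations):
# 1. Is NRPE reachable at all? This should print the NRPE version.
/usr/lib64/nagios/plugins/check_nrpe -H host2
# 2. Run the actual disk check through NRPE with the same thresholds as the service.
/usr/lib64/nagios/plugins/check_nrpe -H host2 -c check_disk -a 60% 40% /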

Nagios Percona Monitoring Plugin

I was reading a blog post on Percona Monitoring Plugins and how you can somehow monitor a Galera cluster using pmp-check-mysql-status plugin. Below is the link to the blog demonstrating that:
https://www.percona.com/blog/2013/10/31/percona-xtradb-cluster-galera-with-percona-monitoring-plugins/
The commands in this tutorial are run on the command line. I wish to try these commands in a Nagios .cfg file, e.g. monitor.cfg. How do I write the services for the commands used in this tutorial?
This was my attempt, and I cannot figure out the best parameters to use for check_command in the service. I suspect that is where the problem is.
So inside my /etc/nagios3/conf.d/monitor.cfg file, I have the following:
define host{
use generic-host
host_name percona-server
alias percona
address 127.0.0.1
}
## Check for a Primary Cluster
define command{
command_name check_mysql_status
command_line /usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
}
define service{
use generic-service
hostgroup_name mysql-servers
service_description Cluster
check_command pmp-check-mysql-status!wsrep_cluster_status!==!str!non-Primary
}
When I run Nagios and go to monitor it, I get this message in the Nagios dashboard:
status: UNKNOWN; /usr/lib/nagios/plugins/pmp-check-mysql-status: 31:
shift: can't shift that many
Have you verified that
/usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
works fine on the command line on the target host? I suspect there's a shell-escaping issue with the ==.
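For reference, a quick way to do that check on the target host (a sketch; the plugin path may differ on your system):
# Run the plugin by hand and print the exit code Nagios would see
# (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
/usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
echo "exit code: $?"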
Does this work well for you? /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_flow_control_paused -w 0.1 -c 0.9
