I have a few RDS servers I'd like to monitor for insufficient disk space. For simplicity's sake, I'd prefer to use my current monitoring system rather than an AWS solution like CloudWatch.
I've been reading the documentation, and the closest I found was describe-db-instances, which gives the allocated storage but not the space left or the amount of storage used:
"SecondaryAvailabilityZone": "us-east-1a",
"ReadReplicaDBInstanceIdentifiers": [],
"AllocatedStorage": 100,
...
How do I query a specific RDS DB instance for the amount of free space left or used?
The right tool is the CloudWatch CLI:
aws cloudwatch get-metric-statistics \
--metric-name FreeStorageSpace \
--start-time 2017-02-27T23:00:00Z \
--end-time 2017-02-28T23:00:00Z \
--period 3600 \
--namespace AWS/RDS \
--statistics Average \
--dimensions Name=DBInstanceIdentifier,Value=<DB-NAME>
<DB-NAME> and the metric name FreeStorageSpace can be found using:
aws cloudwatch list-metrics
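If you just want a single number for your monitoring system, you can narrow the listing with --namespace and let the CLI's --query option (JMESPath) pull out the latest datapoint. A rough sketch along the lines of the command above (FreeStorageSpace is reported in bytes):
aws cloudwatch list-metrics --namespace AWS/RDS --metric-name FreeStorageSpace
aws cloudwatch get-metric-statistics \
  --metric-name FreeStorageSpace \
  --namespace AWS/RDS \
  --statistics Average \
  --period 3600 \
  --start-time 2017-02-27T23:00:00Z \
  --end-time 2017-02-28T23:00:00Z \
  --dimensions Name=DBInstanceIdentifier,Value=<DB-NAME> \
  --query 'sort_by(Datapoints, &Timestamp)[-1].Average' \
  --output text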
I have two questions:
Docker Hub is updating its image pull and retention policy from 1 November 2020, and in particular restricting image pull requests for free user accounts. I want to measure how many pull requests (GETs) for images and manifests have been made by a free user account. Is there a Docker Hub API that exposes this metadata?
How can I verify that a pulled image was downloaded by a particular user?
Since Docker is updating its policy on image pull rates (and other resources) for anonymous and authenticated users from 2 November 2020, it now returns headers (metadata) that let you measure the rate limit and the remaining quota.
See below:
Get Bearer Token:
$ TOKEN=$(curl "https://auth.docker.io/token?service=registry-1.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
Get manifest to measure assigned and left quota:
$ curl -v -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest 2>&1 | grep RateLimit
## Output:
< RateLimit-Limit: 200;w=21600
< RateLimit-Remaining: 199;w=21600
# w -> time in seconds
# RateLimit-Limit -> Assigned Pulls Quota for w seconds
# RateLimit-Remaining -> Remaining Pulls Quota for w seconds
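If you want this as a single check for scripting or monitoring, here is a small sketch (assuming jq is installed, and reusing the same token request as above) that prints only the remaining pull count; note that the check itself issues a manifest GET:
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry-1.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -s -v -o /dev/null -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest 2>&1 \
  | grep -i '^< ratelimit-remaining' \
  | cut -d' ' -f3 | cut -d';' -f1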
Update:
See the official blog post from Docker: https://www.docker.com/blog/checking-your-current-docker-pull-rate-limits-and-status/
Hi, I have a Prometheus server installed on my AWS instance, but the data is being removed automatically after 15 days. I need to keep data for a year (or at least several months). Is there anything I need to change in my Prometheus configuration?
Or do I need an extension like Thanos? I am new to Prometheus, so please be easy on the answers.
Edit the prometheus.service file
vi /etc/systemd/system/prometheus.service
add "--storage.tsdb.retention.time=1y" below to "ExecStart=/usr/local/bin/prometheus \" line.
So the config will look like bellow for 1 year of data retention.
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.external-url=http://34.89.26.156:9090 \
--storage.tsdb.retention.time=1y
[Install]
WantedBy=multi-user.target
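After saving the unit file, reload systemd and restart Prometheus so the new retention flag takes effect:
sudo systemctl daemon-reload
sudo systemctl restart prometheus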
There's the --storage.tsdb.retention.time flag that you can set when you start Prometheus. It defines how long data is kept in the time-series database (TSDB). The default is 15 days.
So, to increase the retention time to a year, you should be able to set this to something like:
--storage.tsdb.retention.time=1y
# or
--storage.tsdb.retention.time=365d
See the Prometheus documentation.
Adding the below to the deployment YAML file allowed me to change the storage retention period:
image: 'your/image path'
args:
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=45d'
  - '--config.file=/etc/prometheus/prometheus.yml'
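If you want to confirm the new argument actually reached the running pod, you can inspect the container args; a quick sketch, assuming the pods carry an app=prometheus label (adjust the selector and namespace to match your deployment):
kubectl get pods -l app=prometheus \
  -o jsonpath='{.items[0].spec.containers[0].args}'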
On Debian you do not have to edit the systemd config. You can simply add the arguments to
/etc/default/prometheus
like so:
# Set the command-line arguments to pass to the server.
ARGS="--storage.tsdb.retention.time=60d"
I didn't configure the project and I get this error whenever I run my job 'The network default doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.'
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions)

p_options = PipelineOptions()
google_cloud_options = p_options.view_as(GoogleCloudOptions)
google_cloud_options.region = 'europe-west1'
google_cloud_options.project = 'my-project'
google_cloud_options.job_name = 'rim'
google_cloud_options.staging_location = 'gs://my-bucket/binaries'
google_cloud_options.temp_location = 'gs://my-bucket/temp'
p_options.view_as(StandardOptions).runner = 'DataflowRunner'
p_options.view_as(SetupOptions).save_main_session = True
p_options.view_as(StandardOptions).streaming = True
p_options.view_as(WorkerOptions).subnetwork = 'regions/europe-west1/subnetworks/test'
p = beam.Pipeline(options=p_options)
I tried specifying --network 'test' on the command line, since it is not the default configuration.
It looks like your default firewall rules were modified, and Dataflow detected this and prevented your job from launching. Could you verify that the firewall rules in your project were not modified? Please take a look at the documentation here. You will also find a command there to restore the firewall rules:
gcloud compute firewall-rules create [FIREWALL_RULE_NAME] \
--network [NETWORK] \
--action allow \
--direction ingress \
--target-tags dataflow \
--source-tags dataflow \
--priority 0 \
--rules tcp:1-65535
Pick a name for the firewall rule, and provide a network name. Then pass in the network name with --network when you launch the Dataflow job. If you have a network named 'default', Dataflow will try to use that automatically, so you won't need to pass in --network. If you've deleted that network, you may wish to recreate it.
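For reference, --network is an ordinary worker pipeline option in the Python SDK, so if your script builds its options from the command line you can pass it at launch time. A rough sketch, where my_pipeline.py is a placeholder for your script and the other values mirror the question's setup:
# my_pipeline.py is a placeholder for your pipeline script
python my_pipeline.py \
  --runner DataflowRunner \
  --project my-project \
  --region europe-west1 \
  --temp_location gs://my-bucket/temp \
  --network test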
As of now (up to Apache Beam version 2.19.0), there is no provision in Dataflow to set a network tag on its worker VMs.
Instead, when creating the firewall rule, we should add a target tag for dataflow:
gcloud compute firewall-rules create FIREWALL_RULE_NAME \
--network NETWORK \
--action allow \
--direction DIRECTION \
--target-tags dataflow \
--source-tags dataflow \
--priority 0 \
--rules tcp:12345-12346
See this link for more details: https://cloud.google.com/dataflow/docs/guides/routes-firewall
I am setting up a Neo4j HA cluster according to the documentation (https://neo4j.com/docs/operations-manual/current/tutorial/highly-available-cluster) but Neo4j does not seem to be applying the HA configuration.
I can browse to the Neo4j browser and my database is active, but :play sysinfo shows 'High Availability' = 'Not enabled' and Cluster = 'No cluster'.
I also cannot usefully telnet to the clustering port (5001); the connection is closed immediately:
ubuntu@ip-172-0-31-71:~$ telnet localhost 5001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.
ubuntu@ip-172-0-31-71:~$
The HA configuration in Neo4j is:
# Unique server id for this Neo4j instance
# can not be negative id and must be unique
ha.server_id=2
# List of other known instances in this cluster
ha.initial_hosts=neo4j-prod-1:5001,neo4j-prod-2:5001,neo4j-prod-3:5001
# HA - High Availability
# SINGLE - Single mode, default.
dbms.mode=HA
I'm starting the container with the following command which routes communication on the clustering port (5001) to the container:
/usr/bin/docker run \
--publish=7474:7474 --publish=7687:7687 --publish=5001:5001 \
--volume=/var/lib/neo4j/data:/data \
--volume=/var/lib/neo4j/logs:/logs \
--volume=/var/lib/neo4j/conf:/conf \
neo4j:3.0
It looks like Neo4j is not loading the HA configuration - where should I look next?
HA does not work in Neo4j Community Edition.
To enable HA, run Enterprise Edition with:
/usr/bin/docker run \
--publish=7474:7474 --publish=7687:7687 --publish=5001:5001 \
--volume=/var/lib/neo4j/data:/data \
--volume=/var/lib/neo4j/logs:/logs \
--volume=/var/lib/neo4j/conf:/conf \
neo4j:3.0-enterprise
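Once the enterprise image is running, the checks from the question should start passing, for example:
# the clustering port should now accept and hold a connection
telnet localhost 5001
:play sysinfo in the browser should then report High Availability as enabled.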
I have an issue using Docker swarm.
I have 3 replicas of a Python web service running on Gunicorn.
The issue is that when I restart the swarm service after a software update, an old running task is killed, then a new one is created and started. But in the short period when the old task has already been killed and the new one hasn't fully started yet, requests are already being routed to the new instance, which isn't ready, resulting in 502 Bad Gateway errors (I proxy to the service from nginx).
I use the --update-parallelism 1 --update-delay 10s options, but this doesn't eliminate the issue, only slightly reduces the chances of getting the 502 error (because there are always at least 2 replicas running, even if one of them might still be starting up).
So, following what I've proposed in comments:
Use the HEALTHCHECK feature of the Dockerfile (see the docs). Something like:
HEALTHCHECK --interval=5m --timeout=3s \
CMD curl -f http://localhost/ || exit 1
Knowing that Docker Swarm does honor this healthcheck during service updates, it's relatively easy to get a zero-downtime deployment.
But as you mentioned, you have a resource-intensive health check and need larger healthcheck intervals.
In that case, I recommend customizing your healthcheck so that the first probe does a real check immediately, and subsequent probes only do the real check when current_minute % 5 == 0, while Docker still invokes the healthcheck every 30s:
HEALTHCHECK --interval=30s --timeout=3s \
  CMD /healthcheck.sh
healthcheck.sh
#!/bin/bash
CURRENT_MINUTE=$(date +%M)
INTERVAL_MINUTE=5

do_healthcheck() {
    curl -f http://localhost/ || exit 1
}

# Always do a real check on the very first probe after the container starts
if [ ! -f /tmp/healthcheck.first.run ]; then
    do_healthcheck
    touch /tmp/healthcheck.first.run
    exit 0
fi

# Run the real check only on minutes that are a multiple of $INTERVAL_MINUTE
# (the 10# prefix forces base 10 so minutes like "08" are not parsed as octal)
[ $((10#$CURRENT_MINUTE % INTERVAL_MINUTE)) -eq 0 ] && do_healthcheck

exit 0
Remember to COPY the healthcheck.sh to /healthcheck.sh in your image (and chmod +x it).
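To verify the script is being picked up, you can watch the container's health state and the rolling update directly; a quick check, where the container and service names are placeholders:
# health status of a single container (starting | healthy | unhealthy)
docker inspect --format '{{.State.Health.Status}}' <container-id>
# watch the rolling update replace tasks one at a time
docker service ps <service-name>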
There are some known issues (e.g. moby/moby #30321) with rolling upgrades in Docker Swarm in the current 17.05 and earlier releases (and it doesn't look like all the fixes will make it into 17.06). These issues will result in connection errors during a rolling upgrade like the ones you're seeing.
If you have a true zero-downtime deployment requirement and can't solve this with a client-side retry, then I'd recommend putting some kind of blue/green switch in front of your swarm and doing the rolling upgrade to the non-active set of containers until Docker finds solutions to all of these scenarios.