My influxdb measurement have 24 Field Keys and 5 tag keys.
I try to do 'select last(cpu) from mymeasurement', and found result :
When there is no client throwing data into it, it'll take around 2 seconds to got the result
But when I run 95 client throwing data (per 5 seconds) into it, the query will take more than 10 seconds before it show the result. is it normal ?
Note :
My system is a Centos7 VM in xenserver with 4 vcore CPU and 8 GB ram, the top command show 30% cpu while that clients throw datas.
Some ideas:
Check your vCPU configuration on other VMs running on the same host. Other VMs you might have that don't need the extra vCPUs should only be configured with one vCPU, for a latency boost.
If your DB server requires 4 vCPUs and your host already has very little CPU% used during queries, you might want to check the storage and memory configurations of the VM in case your server is slow due to swap partition use, especially if your swap partition is located on a Virtual Disk over the network via iSCSI or NFS.
It might also be a memory allocation issue within the VM and server application. If you have XenTools installed on the VM, try on a system without the XenTools installed to rule out latency issues related to the XenTools driver.
Related
I'm running a docker environment with roundabout 25 containers for my private use. My system has 32GB Ram.
Especially RStudio Server and JupyterLab often need a lot of memory.
So I limited the memory to for both container at 26 GB.
This works good as long not both application storing dataframes in memory. If now RServer stores some GB and Jupyter also is filling memory to the limit my system crashes.
Is there any way to configure that these two container together are allowed to use 26GB ram max.
Or a relative limit like Jupyter is allowed to use 90% of free memory.
As I'm working now with large datasets it can happen all the time that (because I forget to close a kernel or something else) the memory can increase to the limit and I want just the container to crash and not the whole system.
And I don't want lower the limit for Jupyter further as the biggest dataset on its own need 15 GB of memory.
Any ideas?
I have a two-node PostgreSQL cluster running on VMs where each VM runs both the pgpool service and a Postgres server.
due to insufficient memory configuration the Postgres server crashed, so I've bumped the VM memory and the changed Postgres memory config in the postgresql.conf file. since that memory changes the slave pgpool node detaches every night at a specific time, though when looking at node_exporter metrics regarding CPU, load, processes disk usage or memory didn't show any spikes or sudden changes.
the slave node detaching happened before but not day after day. I've stumbled upon this thread and read this part of the documentation about the failover but Since the Postgres server didn't crash and existing connections to the slave node were working (it kept serving existing connections but didn't take new ones) so network issues seemed irrelevant, especially after consulting with our OPS team on whether they noticed any abnormal network or DNS activity that could explain that. Unfortunately, they didn't notice any interesting findings.
I have pg_exporter, postgres_exporter and node_exporter on each node to monitor the server and VM behavior, what should I be looking for to debug this? what should I ask of our OPS team to check specifically? our pgpool log file only states the failure to access the other node but no exact reason, as the aforementioned docs say:
Pgpool-II does not distinguish each case and just decides that the
particular PostgreSQL node is not available if health check fails.
could it still be a network\DNS issue? and if so. how would I confirm this?
thnx for reading and taking your time to assist me in this conundrum
that was interesting
If summing the gist of it,
it was part of the OPS team infrastructure backups
Now the entire process went like that:
setting the ambiance:
we run on-prem on top of VMWare vCenter cluster backing up on the infra side with VMWare VM snapshot and Veeamm VM backup where the vmdk files\ESXi datastores reside on a NetApp storage based on NFS.
when checking node exporter metrics in Node Exporter Full Dashboard I saw network traffic drop in the order of up to 2 packets per second for about 5 to 15 minutes consistently through the last few months, increasing dramatically in phenomenon length in the last month (around the same time late at night).
Rough illustration:
After checking again with our OPS team they indicated it could be the host configurations\Veeam Backups.
It turns out that because the storage for the VMs (including the one that runs the Veeam backup) is attached via network and not directly on the ESXi hosts, the final snapshot saved\consolidated at that late-night time -
node detaches every night at a specific time
With the way NFS works with disk locking (limiting IOPs to existing data) along with the high IOPs requirements from the Veeam backup cause the server to hang\freeze and sometimes on rare occasions even a VM restart. here's the quote from the Veeam issue doc:
The snapshot removal process significantly lowers the total IOPS that can be delivered by the VM because of additional locks on the VMFS storage due to the increase in metadata updates
the snapshot removal process will easily push that into the 80%+ mark and likely much higher. Most storage arrays will see a significant latency penalty once IOP's get into the 80%+ mark which will of course be detrimental to application performance.
This issue occurs when the target virtual machine and the backup appliance [proxy] reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up [during snapshot removal].
Obviously, that would interfere at the very least with Postgres functionality especially configured as a cluster with replication that requires a near-constant connection between the Postgres servers. A similar thread on SO using a different DB server
a solution is suggested including solving the issue presented in the last quote in this link, though for the time being, we removed the usage of Veeam backup for sensitive VMs until the solution can be verified locally (will update in the future if tried and fixed issue)
additional incidents documentation: similar issue case, suggested solution info from Veeam, third party site solution (around the same problem as a temp fix as I see it), Reddit thread acknowledging the issue and suggesting options
When I let my application output the available memory and number of cores on a Google Cloud Run instance using linux commands like "free -h", "lscpu" and "top" I always get the information that there are 2 GB of available memory and 2 cores, although I specified other capacities in my deployment. No matter, I set 1 GB, 2 GB and 4 GB of memory and 1, 2 or 4 CPUs the mentioned linux tools always show the same capacity.
Am I misunderstanding these tools or the Google Cloud Run concept, or is there something not working like it should?
The Cloud Run services run container on a non standard runtime environmen (named BORG internally at Google Cloud). It's possible that the low level info values are not relevants.
In addition, Cloud Run services run in a sandbox (gVisor) and system calls can be also filtered like that.
What did you look at with these test?
I performed tests to validate the multi-cpus capacity of Cloud Run and wrote an article about that. The multi cpu capacity is real!! Have a look on it.
Hie,
I have recently come across virtualization called containers.
this is regarding auto-scaling.
I have read that when current container is out of resources then it will add a new container to the host. How does adding new container saves in resources?
can any explain the scale of
scenario 1: container running all of the resources (leaving resources for host os)
vs
Scenario 2: the scale of running two containers running in the same
host (where the container is using half of the resources consumed in
the previous case)
if the scale is more in scenario 2 then can anyone explain how scale has increased by having two containers even through total resources are same?
Consider following scenario regarding workload of an application (Only CPU and memory are considered here for example) -
Normal Workload
It requires 4 Cores , 16 GB memory.
Maximum Workload
It requires 8 Cores, 32 GB Memory.
Assuming that the burst of activities (max workload) happens only 2 hours per day.
Case 1 - When application is not containerized
The application has to reserve 8 Cores and 32 GB of memory for it's working so that it could handle max workload with expected performance.
But the 4 Cores and 16 GB of memory is going to be wasted for 22 hours in a day.
Case 2 - When application is containerized
Lets assume that a container of 4 Cores and 16 GB memory is spawned. Then remaining resources, 4 Cores and 16 GB memory are available for other applications in cluster and another container of same configuration will be spawned for this application only for 2 hours in a day when max workload
is required.
Therefore the resources in cluster are used optimally when applications are containerized.
What if single machine does not have all the resources required for the application ?
In such cases, if application is containerized then containers/resources can be allocated from multiple machines in the cluster.
Increase fault tolerance
If application is running on single machine and the machine goes down then the whole application become unavailable. But in case of containerized application
running on different machines in cluster. If a machine fails then only few containers will not be available.
Regarding your question, if the application's workload is going to be uniform throughout the lifetime then there is no benefit in breaking the application in smaller containers in terms of scale. But you may still consider containerizing applications for other benefits. In terms of scale, an application is containerized only when there is varying workload or more workload is anticipated in future.
how scale has increased by having two containers even through total resources are same?
It usually doesn't. If you have a single threaded application, and you have a multi core host, then scaling multiple instances of the container in the same host will give you access to more cores by that application. This could also help with a multi threaded application if it's limited by internal resource contention and not using all of the cores. But for most multi threaded processes, scaling the application up on a single host will not improve performance.
What scaling does help with in an multi node environment is allowing the application to run in other hosts in the cluster that are not fully utilized. This is the horizontal scaling that most target with 12 factor apps. It allows apps deployed to cloud environments to scale out with more replicas/shards by adding more nodes rather than trying to find more resources for a single large node.
I have three node CouchDB cluster. It is running on windows. Each node has 16vcpu and 64GB RAM. I am fairly new to CouchDB and to nonrelational databases in general.
The cluster is running on windows. What I am struggling with is one of the nodes (which I am assuming is the coordinator) is using the page file about 120GB disk space while it has about 48GB free RAM available to it.
We increased the RAM from 32Gb to 64GB to help with the paging. Only to find out that, it is now using more of the page file since the page file is being currently managed by the Windows OS.
I would assume it would be paging once it used all the available RAM, but what we have is 120GB paging file while it has about 50GB free RAM.
Why is it using the page file which has less response time while it has free RAM available to it?
Wasn't it supposed to use unreserved RAM for disk caching of frequently accessed DB file blocks to speed up access? Why is it behaving this way?
Is there a CouchDB or Erlang Beam configuration setting that I should be looking at?