Does WebSphere MQ v7 guarantee the recovery of in-flight messages after failover to a standby queue manager?
If so, how is this accomplished?
Thanks
There are two primary types of standby instance which support this level of recovery. The first is a traditional hardware cluster such as PowerHA (HACMP), Veritas, MSCS and so forth. The other is a Multi-Instance Queue Manager (MIQM). Both are capable of running the queue manager on more than one server, with the data and log files occupying shared disk that is accessible to all instances.
In both cases, persistent messages which were committed prior to the termination of the primary QMgr will be recovered. The secondary QMgr assumes possession of the data and log files during the failover event. From the perspective of the failover node it is the same as if the QMgr were just starting up after a shutdown or crash; it just happens to now be running on a different server.
The main difference between a hardware cluster and an MIQM is that a hardware cluster fails over the IP address, and possibly non-MQ processes as well, whereas the MIQM recovers only the MQ processes and comes up on a different IP address. Applications with v7 clients can be configured with multi-instance connection details to allow for the multiple IP addresses.
So for these solutions in which the state of the QMgr and any in-flight messages is stored on shared disk, bringing the QMgr up with the same shared disk but on a different node recovers the state of the QMgr, including any in-flight messages.
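For a v7 client, those multi-instance connection details boil down to a comma-separated connection name list. A minimal sketch, assuming a hypothetical channel name and hostnames (requires a v7.0.1 or later client):

```
# Hypothetical channel and hosts; both instances listed in one connection list
export MQSERVER='APP.SVRCONN/TCP/hosta(1414),hostb(1414)'
```

If the first address is unreachable, the client tries the next name in the list, which is how it finds whichever instance is currently active.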
I have a two-node PostgreSQL cluster running on VMs where each VM runs both the pgpool service and a Postgres server.
Due to insufficient memory configuration, the Postgres server crashed, so I bumped the VM memory and changed the Postgres memory config in the postgresql.conf file. Since those memory changes, the slave pgpool node detaches every night at a specific time, even though node_exporter metrics for CPU, load, processes, disk usage, and memory didn't show any spikes or sudden changes.
The slave node detaching has happened before, but not day after day. I stumbled upon this thread and read this part of the documentation about the failover. Since the Postgres server didn't crash, and existing connections to the slave node kept working (it kept serving existing connections but didn't take new ones), network issues seemed irrelevant, especially after consulting our OPS team on whether they noticed any abnormal network or DNS activity that could explain it. Unfortunately, they didn't notice anything interesting.
I have pg_exporter, postgres_exporter, and node_exporter on each node to monitor the server and VM behavior. What should I be looking for to debug this? What should I ask our OPS team to check specifically? Our pgpool log file only states the failure to access the other node, with no exact reason, as the aforementioned docs say:
Pgpool-II does not distinguish each case and just decides that the
particular PostgreSQL node is not available if health check fails.
Could it still be a network/DNS issue? And if so, how would I confirm this?
Thanks for reading and taking the time to assist me with this conundrum.
Well, that was interesting.
To sum up the gist of it: the cause was part of the OPS team's infrastructure backups.
The entire process went like this:
Setting the scene:
We run on-prem on top of a VMware vCenter cluster, backed up on the infra side with VMware VM snapshots and Veeam VM backups, where the VMDK files/ESXi datastores reside on NetApp storage based on NFS.
When checking node exporter metrics in the Node Exporter Full dashboard, I saw network traffic drop to the order of 2 packets per second for about 5 to 15 minutes, consistently through the last few months, with the phenomenon growing dramatically longer in the last month (around the same time late at night).
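For example, with standard node_exporter metrics, this kind of drop shows up in a per-interface packet-rate query (the `device` filter is an assumption about your interface naming):

```
rate(node_network_receive_packets_total{device!="lo"}[5m])
```

Graphed over a few weeks, recurring nightly dips like the ones described above become easy to spot.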
After checking again with our OPS team, they indicated it could be the host configuration/Veeam backups.
It turns out that because the storage for the VMs (including the one that runs the Veeam backup) is attached via the network and not directly to the ESXi hosts, the final snapshot is saved/consolidated at that late-night time, which is exactly when the node detaches every night.
Given the way NFS works with disk locking (limiting IOPS to existing data), along with the high IOPS requirements of the Veeam backup, this causes the server to hang/freeze and, on rare occasions, even a VM restart. Here's the quote from the Veeam issue doc:
The snapshot removal process significantly lowers the total IOPS that can be delivered by the VM because of additional locks on the VMFS storage due to the increase in metadata updates
the snapshot removal process will easily push that into the 80%+ mark and likely much higher. Most storage arrays will see a significant latency penalty once IOP's get into the 80%+ mark which will of course be detrimental to application performance.
This issue occurs when the target virtual machine and the backup appliance [proxy] reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up [during snapshot removal].
Obviously, that would interfere at the very least with Postgres functionality, especially configured as a cluster with replication that requires a near-constant connection between the Postgres servers. A similar thread on SO describes the same problem with a different DB server.
A solution is suggested there, including solving the issue presented in the last quote in this link; for the time being, though, we removed the usage of Veeam backup for sensitive VMs until the solution can be verified locally (I will update in the future if it is tried and fixes the issue).
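As a stopgap while the backup schedule is being sorted out, pgpool's health check can also be made more tolerant of short I/O freezes, so that a brief snapshot consolidation doesn't immediately detach the node. A sketch with hypothetical values (these are standard pgpool-II parameters):

```
# pgpool.conf - tolerate brief freezes before declaring a node down
health_check_period      = 30   # seconds between health checks
health_check_timeout     = 20   # seconds before a single check fails
health_check_max_retries = 5    # failed checks tolerated before failover
health_check_retry_delay = 10   # seconds between retries
```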
Additional incident documentation: a similar issue case, suggested solution info from Veeam, a third-party site solution (a temporary fix around the same problem, as I see it), and a Reddit thread acknowledging the issue and suggesting options.
Currently, I am doing some R&D on the Thingsboard IoT platform. I am planning to deploy it in cluster mode.
When it is deployed, how do two Thingsboard servers communicate with each other?
This question came to mind because a particular device can send a message to one Thingsboard server (A), but the message might actually need to be transferred to another server (B), since a node on server B is processing that particular device's messages (as I understand it, Thingsboard nodes use a device hash to route messages).
How does the Kafka stream forward that message accordingly in a cluster?
I read the official documentation and did some googling, but couldn't find exact answers.
Thingsboard uses Zookeeper for service discovery.
Each Thingsboard microservice knows which other services run somewhere in the cluster.
All communication is performed through message queues (Kafka is a good choice).
Each topic has several partitions, and each partition is assigned to a respective node.
Messages for a device are hashed by originator ID and always pushed to the same partition number. There is no direct communication between nodes.
In case some nodes crash or are simply scaled up/down, Zookeeper fires a repartition event on each node, and existing partitions are reassigned according to the live node count. The device service follows the same logic.
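The hashing step above can be sketched in a few lines of Python (a simplified illustration, not Thingsboard's actual code; the hash function and partition count are assumptions):

```python
# Sketch: route each device's messages to a stable partition, the way a
# Kafka-backed cluster pins an originator to one partition.
import hashlib

NUM_PARTITIONS = 12  # assumption: the configured partition count for the topic

def partition_for(device_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the originator id -> constant partition number."""
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same device always lands on the same partition, so whichever node
# currently owns that partition processes all of that device's messages.
assert partition_for("device-42") == partition_for("device-42")
```

After a repartition event, nodes simply pick up a different subset of partition numbers; the hash itself never changes, so no node-to-node forwarding is needed.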
That is all the magic. Simple and effective. Hope it helps with the Thingsboard cluster architecture.
There is a legacy implementation (to an extent company proprietary, in Pascal and C with some Java macros) which processes TCP-socket-based requests from TCP client applications. It supports multiple client applications (around 5K) connecting over TCP sockets; however, it only supports a single socket connection to the backend (database). There are two instances of the server, so in total it supports 10K client applications over two TCP socket connections to the database. All database-related communication happens synchronously over the single socket connection. There are massive issues in this application, especially high RTT (Round Trip Time) and occasional outages due to back-pressure. We have an ops team for such issues; they mostly resolve them by restarting the server. We hardly have anyone on our team who knows the coding details of this application, and there is not much documentation. As this is a critical application, we cannot afford to mess with it, and we don't want to touch the code, at least for now. This has become even more critical due to a shift in business priorities: there is a need to add another 30K client applications from another business to this setup.
The task before us is to integrate it with another application based on a microservice architecture, with middleware using RabbitMQ. This is a customer-facing application sensitive to high QoS; we cannot afford outages and downtime in it. As part of this integration, there is a need to process request messages coming from the above legacy application over the TCP socket before passing them to the database. In other words, we want to introduce a component which would process the legacy application's requests before handing them over to the database. This additional processing is part of our client's request, and some of the processing requirements are very intensive and resource-hungry in terms of CPU cycles, memory, and socket I/O. As a result, there is a chance such processing may lead to server downtime and higher RTT. This layer of ours is very flexible; we can easily add more servers or replace faulty ones. But that doesn't sound very efficient for this integration, as we are limited by the legacy application's single socket connection, so in total, at most, we can only have 2 (+ 6 for the new 30K client applications) servers. This is our cause of concern.
I want to know what different options are available to address the high availability, scalability, and latency issues of such an integration. Especially with the limitation of a single TCP socket connection, how can we make this integration efficient, into something which can handle back-pressure, offer better application uptime, etc.?
We were thinking of leveraging RabbitMQ, a Layer 4 load balancer (like HAProxy or NGINX), IPVS, NAT, etc. But all of these lead toward making some changes in the legacy code (or are not very efficient techniques), which we don't want.
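One shape such a component could take, given the single-socket constraint, is a funnel: many client requests land on a bounded queue, and a single worker serializes them over the one backend connection, so the queue bound (rather than the fragile backend) is what absorbs back-pressure. A minimal in-memory sketch, with the backend call as a stand-in and all names hypothetical:

```python
# Sketch: serialize many client requests over a single backend "connection",
# with a bounded queue providing explicit back-pressure.
import queue
import threading

# Bounded queue: when full, submit() blocks instead of overwhelming the backend
REQUEST_QUEUE: queue.Queue = queue.Queue(maxsize=100)

def backend_call(request: str) -> str:
    """Stand-in for the legacy app's single synchronous database socket."""
    return f"processed:{request}"

def backend_worker() -> None:
    """The only consumer, preserving the one-connection constraint."""
    while True:
        request, reply_box = REQUEST_QUEUE.get()
        reply_box.put(backend_call(request))

def submit(request: str, timeout: float = 5.0) -> str:
    """Client-facing entry point; blocks (back-pressure) if the queue is full."""
    reply_box: queue.Queue = queue.Queue(maxsize=1)
    REQUEST_QUEUE.put((request, reply_box), timeout=timeout)
    return reply_box.get(timeout=timeout)

threading.Thread(target=backend_worker, daemon=True).start()
print(submit("SELECT 1"))  # -> processed:SELECT 1
```

Swapping the in-memory queue for RabbitMQ gives the same shape across processes: producers publish requests, and one consumer per legacy socket drains them, so the intensive processing can scale out on the queue side without touching the legacy code.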
I am working on a distributed application in which a set of logical nodes communicate with each other.
In the initial discovery phase, each logical node starts up and sends out a UDP broadcast packet to the network to inform the rest of the nodes of its existence.
With different physical hosts, this can easily be handled by agreeing on a port number and keeping track of UDP broadcasts received from other hosts.
My problem is: I need to be able to handle the case of multiple logical nodes on the same machine as well.
So in this case, it seems I cannot bind to the same port twice. How do I handle the node discovery case if there are two logical nodes on the same box? Thanks a lot in advance!
Your choices are:
Create a RAW socket and listen to all packets on a particular NIC; by looking at the content of each packet, the process can identify whether the packet is destined for itself. The problem with this is the sheer number of packets you would have to process. This is why the kernels of our operating systems bind sockets to processes, so that traffic gets distributed optimally.
Create a specialized service, i.e. a daemon that handles announcements of new processes available to execute work. When launched, each process announces its port number to the service. This is usually how it is done.
Use virtual IP addresses for each process you want to run, with each process binding to a different IP address. If you are running on a local network, this is the simplest way.
Define a range of ports and scan that range on all of the IP addresses you have defined.
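Option 2 can be sketched with a tiny local registry over loopback UDP: the daemon owns the single well-known port, while each logical node binds an ephemeral port of its own and announces it (a toy sketch; the port number and JSON message format are assumptions):

```python
# Sketch of option 2: one registry daemon owns the well-known port; each
# logical node binds an ephemeral port and announces it to the daemon.
import json
import socket
import threading
import time

REGISTRY_PORT = 50123  # assumption: the agreed-upon well-known port

def run_registry() -> None:
    """Daemon: collects (node_id, port) announcements, replies with the full list."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", REGISTRY_PORT))
    nodes: dict = {}
    while True:
        data, addr = sock.recvfrom(4096)
        announcement = json.loads(data)
        nodes[announcement["node_id"]] = announcement["port"]
        sock.sendto(json.dumps(nodes).encode(), addr)

def announce(node_id: str) -> dict:
    """Node: bind an ephemeral port (no clash), register it, get the peer list back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))   # port 0 = ephemeral, so two nodes never clash
    port = sock.getsockname()[1]
    sock.sendto(json.dumps({"node_id": node_id, "port": port}).encode(),
                ("127.0.0.1", REGISTRY_PORT))
    peers, _ = sock.recvfrom(4096)
    return json.loads(peers)

threading.Thread(target=run_registry, daemon=True).start()
time.sleep(0.2)               # give the daemon a moment to bind
print(announce("node-a"))     # e.g. {'node-a': 60123}
print(announce("node-b"))     # now lists both nodes
```

Only the daemon ever binds the well-known port, so any number of logical nodes can coexist on one box.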
I read in a forum that when implementing any application using AMQP it is necessary to use fewer queues. So would I be completely wrong to assume that, if I were cloning Twitter, I would have a unique, durable queue for each user signing up? It just seems the most natural approach, and if I shouldn't assign a unique queue to each user, how would one design something like that?
What is the most used approach for web messaging? I see RabbitHub and Rabbit WebHooks, but WebHooks doesn't seem to be a scalable solution. I am working with Rails, and my AMQP server is running as a daemon.
In RabbitMQ, queues are quite cheap. They're effectively lightweight Erlang processes, and you can run tens to hundreds of thousands of queues on a single commodity machine (i.e. my laptop). Of course, each will consume a bit of RAM, but queues not used recently will hibernate, so they'll consume as little memory as possible. In addition, if Rabbit runs low on memory for messages, it will page old messages to disk.
The above only applies to a single machine. RabbitMQ supports a form of lightweight clustering. When you join several Rabbit nodes into a cluster, each can see the queues and exchanges on the other nodes but each runs only its own queues. So, you'll be able to have even more queues! (to the limit of Erlang clusters, which is usually a few hundred nodes) So, a cluster forms a logical broker distributed over several machines; clients connect to it and use it transparently through any of the nodes.
That said, having a single durable queue for each user seems a bit strange: in AMQP, you cannot browse messages while they're on the queue; you may only get/consume messages, which takes them off the queue, and publish, which adds them to the end of the queue. So you can use AMQP as a message router, but you can't use it as a sort of message database.
Here is a thread that just talks about that: http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2009-February/003041.html
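To make the declare side concrete: a durable queue per user is one `queue_declare` each. A hedged sketch using the pika client (the queue-naming scheme, host, and user count are all hypothetical, and running it needs a live broker):

```python
# Sketch: one durable queue per user. The naming scheme is an assumption.
def user_queue_name(user_id: int) -> str:
    return f"user.{user_id}.inbox"

def declare_user_queues(n_users: int, host: str = "localhost") -> None:
    """Requires `pip install pika` and a running RabbitMQ broker."""
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    for user_id in range(n_users):
        # durable=True survives a broker restart; idle queues hibernate cheaply
        channel.queue_declare(queue=user_queue_name(user_id), durable=True)
    connection.close()
```

Per the caveat above, though, consumers would still have to drain each queue; the queue holds work in flight, not a browsable message history.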