Impossible (?) NullPointerException - Springframework RabbitMQ, Failed to invoke afterAckCallback - spring-amqp

I'm running a Java application that uses RabbitMQ Server 3.8.9, spring-amqp-2.2.10.RELEASE, and spring-rabbit-2.2.10.RELEASE.
My test case does something like the following:
Start the RabbitMQ Server
Start my Java application
Test and validate some functionality on my Java application
Gracefully stop my Java application
Gracefully stop the RabbitMQ Server
Repeat 1-6 a few more times
Everything looks fine except sometimes during one of the restarts about 10 minutes into it, I see the following error in my application's logs:
2021-02-05 12:52:46.498 UTC,ERROR,org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl,null,rabbitConnectionFactory23,runWorker():1149,Failed to invoke afterAckCallback
java.lang.NullPointerException: null
at org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl.lambda$doHandleConfirm$1(PublisherCallbackChannelImpl.java:1027) ~[spring-rabbit.jar:2.2.10.RELEASE]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
Further analysis doesn't point to anything specific. There are no errors in the RabbitMQ log files, no restarts of the RabbitMQ server, nothing weird in the RabbitMQ logs during the time stamp above.
The code in question:
https://github.com/spring-projects/spring-amqp/blob/v2.2.10.RELEASE/spring-rabbit/src/main/java/org/springframework/amqp/rabbit/connection/PublisherCallbackChannelImpl.java#L1027
My tests are automated and run as part of a CI pipeline. The issue is intermittent and I have had trouble reproducing it locally in my sandbox.
From what I can tell, the functionality of my Java application is unaffected.
Code that creates the RabbitMQ connection factory used everywhere:
final CachingConnectionFactory connectionFactory = new CachingConnectionFactory(HOST_NAME);
connectionFactory.setChannelCacheSize(1);
connectionFactory.setPublisherConfirms(true);
It seems like a concurrency problem, but I'm not so sure on how to get to the bottom of it. For the most part, we use the RabbitTemplate and other Spring facilities to connect to RabbitMQ.
Anyone in the Spring world with some knowledge in RabbitMQ care to chime in?
Thanks

The code you talk about is like this:
finally {
try {
if (this.afterAckCallback != null && getPendingConfirmsCount() == 0) {
this.afterAckCallback.accept(this);
this.afterAckCallback = null;
}
}
catch (Exception e) {
this.logger.error("Failed to invoke afterAckCallback", e);
}
}
There is really could be a race condition around that this.afterAckCallback property.
We may pass if() in one but then different thread makes this.afterAckCallback as null, so we fail with that NPE.
We have to copy its value to the local variable and then check and perform accept().
Feel free to raise a GitHub issue against Spring AMQP project: https://github.com/spring-projects/spring-amqp/issues
We have a race condition because we really call this doHandleConfirm() with its async logic from the loop in the processMultipleAck().

Related

Store Lock Issue on Neo4j

Getting following exception
java.lang.RuntimeException: Error starting
org.neo4j.kernel.EmbeddedGraphDatabase
at org.neo4j.kernel.InternalAbstractGraphDatabase.run(InternalAbstractGraphDatabase.java:335)
at org.neo4j.kernel.EmbeddedGraphDatabase.(EmbeddedGraphDatabase.java:59)
at org.neo4j.graphdb.factory.GraphDatabaseFactory.newDatabase(GraphDatabaseFactory.java:108)
at org.neo4j.graphdb.factory.GraphDatabaseFactory$1.newDatabase(GraphDatabaseFactory.java:95)
at org.neo4j.graphdb.factory.GraphDatabaseBuilder.newGraphDatabase(GraphDatabaseBuilder.java:176)
at org.neo4j.graphdb.factory.GraphDatabaseFactory.newEmbeddedDatabase(GraphDatabaseFactory.java:67)
at com.tpgsi.mongodb.dataPollingWithOplog.ORBCreateLink.main(ORBCreateLink.java:62)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component
'org.neo4j.kernel.StoreLockerLifecycleAdapter#5b6ca687' was
successfully initialized, but failed to start. Please see attached
cause exception.
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:513)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:115)
at org.neo4j.kernel.InternalAbstractGraphDatabase.run(InternalAbstractGraphDatabase.java:331)
... 6 more Caused by: org.neo4j.kernel.StoreLockException: Unable to obtain l file:
/home/aps/neo4j-community-2.2.3/CompleteTest/store_lock. Please ensure
no other process is using this database, and that the directory is
writable (required even for read-only access)
at org.neo4j.kernel.StoreLocker.checkLock(StoreLocker.java:78)
at org.neo4j.kernel.StoreLockerLifecycleAdapter.start(StoreLockerLifecycleAdapter.java:44)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:507)
... 8 more Caused by: java.nio.channels.OverlappingFileLockException
at sun.nio.ch.SharedFileLockTable.checkList(FileLockTable.java:255)
at sun.nio.ch.SharedFileLockTable.add(FileLockTable.java:152)
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1087)
at java.nio.channels.FileChannel.tryLock(FileChannel.java:1154)
at org.neo4j.io.fs.StoreFileChannel.tryLock(StoreFileChannel.java:135)
at org.neo4j.io.fs.FileLock.wrapFileChannelLock(FileLock.java:38)
at org.neo4j.io.fs.FileLock.getOsSpecificFileLock(FileLock.java:99)
at org.neo4j.io.fs.DefaultFileSystemAbstraction.tryLock(DefaultFileSystemAbstraction.java:85)
at org.neo4j.kernel.StoreLocker.checkLock(StoreLocker.java:74)
I have one process which will create graph and on completion of this i have one more process to create few more relationship on top of if. But i am getting above exception while running the second process after completion of first process. I checked the data directory of Neo4j is not been used by any process but still getting lock issue.
I ran once a piece of code to create the graph.
static GraphDatabaseFactory dbFactory = null;
static GraphDatabaseService graphdb = null;
static{
dbFactory =new GraphDatabaseFactory();
graphdb = dbFactory.newEmbeddedDatabase(com.tpgsi.mongodb.dataPollingWithOplog.CommonConstants.NEO4J_DATA_DIRECTORY);
}
try{
Transaction tx = graphdb.beginTx();
try
{
// creating Node and relationships
tx.success();
} catch (Exception e) {
tx.failure();
e.printStackTrace();
} finally {
tx.close();
}
}
catch(Exception e)
{
e.printStackTrace();
}
I have created graphdb object as global variable and using that everywhere. Only transaction, i am closing. I am not using registerShutDownHook() and shutdown function for graphdb object. The reason i am not using these functions are because, i am running this in storm environment with multiple executors and if we will shutdown this then for every thread i have to create it again which is also not good.
I am thinking not shutting down the graphdb could be the reason.
If i am running the same code with One executor it works fine but with multiple executor getting Lock issue.
Can anybody tell me what i have to do to get ride of it.
A graph database instance is thread-safe so you can use it across all bolts, just make it accessible as a global variable.
Only one GDB at a time can access a store-directory.
Otherwise create a service that your storm-bolts access/use via another protocol e.g. http or binary.
I faced the same issue. But my mistake is I executed the program while the server is running. If I stop the server, then the program executed successfully.
I was executing my code with multiple worker and executor in storm environment. As multiple worker were creating multiple graphdb install and i was getting store lock exception. I changed the number of worker to 1 and to create graphdb object i have written singleton class to ensure i am creating only one graphdb object at a time and it's a gloabal variable.

Quartz jobs failing after MySQL db errors

On a working Grails 2.2.5 system, we're occasionally losing connection to the MySQL database, for reasons that are not relevant here. The majority of the system recovers perfectly well from the outage. But any Quartz jobs (using Quartz plugin 0.4.2) are typically failing to run again after such an outage. This is a typical message which appears in the log at the point the job should run:
2015-02-26 16:30:45,304 [quartzScheduler_Worker-9] ERROR core.ErrorLogger - Unable to notify JobListener(s) of Job to be executed: (Job will NOT be executed!). trigger= GRAILS_JOBS.quickQuoteCleanupJob job= GRAILS_JOBS.com.aire.QuickQuoteCleanupJob
org.quartz.SchedulerException: JobListener 'sessionBinderListener' threw exception: Already value [org.springframework.orm.hibernate3.SessionHolder#593a9498] for key [org.hibernate.impl.SessionFactoryImpl#c8488d7] bound to thread [quartzScheduler_Worker-9] [See nested exception: java.lang.IllegalStateException: Already value [org.springframework.orm.hibernate3.SessionHolder#593a9498] for key [org.hibernate.impl.SessionFactoryImpl#c8488d7] bound to thread [quartzScheduler_Worker-9]]
at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1868)
at org.quartz.core.JobRunShell.notifyListenersBeginning(JobRunShell.java:338)
at org.quartz.core.JobRunShell.run(JobRunShell.java:176)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)
Caused by: java.lang.IllegalStateException: Already value [org.springframework.orm.hibernate3.SessionHolder#593a9498] for key [org.hibernate.impl.SessionFactoryImpl#c8488d7] bound to thread [quartzScheduler_Worker-9]
at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1866)
... 3 more
What do I need to do to make things more robust, so that the Quartz jobs recover as well?
By default, a Quartz job will get a session bound to it. Disable that session binding and let your service handle the transaction / session. That's what we do and when we get our DB connections back up, jobs still work.
To disable session binding in your job, add :
def sessionRequired = false

Grails Quartz Plugin Freezes on the 8th execution

Environment: Grails 2.0.3, Quartz plugin 1.0-RC2
I have a simple quartz job that reads a value from the database. On the 8th execution, the Job freezes while reading from the database. There is also a web page that retrieves the value from the DB. Once the Job gets into the waiting state, attempting to read the value through the web page also freezes.
Environment: Grails 2.2.0, Quartz plugin 1.0-RC5
I ran into the same problem using quartz-1.0-RC5.
As a workaround I replaced the SessionBinderJobListener class with the one from quartz-0.4.2 (changed only the package to the new one) and the job runs again without any problem. So it looks like the persistenceInterceptor bean does not close the connections or return them to the pool. Maybe there is a problem in org.codehaus.groovy.grails.orm.hibernate.support.HibernatePersistenceContextInterceptor with flush and destroy.
If org.quartz.threadPool.threadCount is much less than maxActive in dataSource properties, the problem does not appear (perhaps each job thread already got its connection) or it will only take longer.
The default size of the datasource connection pool is 8, so you're probably not properly closing the connections to return them to the pool.
I'm seeing the same thing with Quartz plugin version 1.0.1. On the 8th execution both the Job and Tomcat workers freeze. Used withSession and called Hibernate session.disconnect() in the finally {} block of the job. That did the trick.
def execute() {
def hsession
try {
DomainObject.withSession { ses ->
hsession = ses
....
}
} catch(Exception e) {
//log it etc.
} finally {
hsession?.disconnect()
}
}

Quartz.net not firing on remote server

I've implemented quartz.net in windows service to run tasks. And everything works fine on local workstation. But once it's deployed to remote win server host, it just hangs after initialization.
ISchedulerFactory schedFact = new StdSchedulerFactory();
// get a scheduler
var _scheduler = schedFact.GetScheduler();
// Configuration of triggers and jobs
var trigger = (ICronTrigger)TriggerBuilder.Create()
.WithIdentity("trigger1", "group1")
.WithCronSchedule(job.Value)
.Build();
var jobDetail = JobBuilder.Create(Type.GetType(job.Key)).StoreDurably(true)
.WithIdentity("job1", "group1").Build();
var ft = _scheduler.ScheduleJob(jobDetail, trigger);
Everything seems to be standard. I have private static pointer to scheduler, logging process stops right after jobs are initialized and added to scheduler. Nothing else happens after.
I'd appreciate any advices.
Thanks.
PS:
Found some strange events in event viewer mb according quartz.net:
Restart Manager - Starting session 2 - ‎2012‎-‎07‎-‎09T15:14:15.729569700Z.
Restart Manager - Ending session 2 started ‎2012‎-‎07‎-‎09T15:14:15.729569700Z.
Based on your question and the additional info you gave in comments, I would guess there is something going wrong in the onStart method of your service.
Here are some things you can do to help figure out and solve the problem:
Place the code in your onStart method in a try/catch block, and try to install and start the service. Then check windows logs to see if it was installed correctly, started correctly, etc.
The fact that restart manager is running leads me to believe that your service may be dependent on a process which is already in use. Make sure that any dependencies of your service are closed before installing it.
This problem can also be caused by putting data-intense or long running operations in your onStart method. Make sure that you keep this kind of code out of onStart.
I had a similar problem to this and it was caused by having dots/periods in the assembly name e.g. Project.Update.Service. When I changed it to ProjectUpdateService it worked fine.
Strangely it always worked on the development machine. Just never on the remote machine.
UPDATE: It may have been the length of the service that has caused this issue. By removing the dots I shortened the service name. It looks like the maximum length is 25 characters.

ServiceController seems to be unable to stop a service

I'm trying to stop a Windows service on a local machine (the service is Topshelf.Host, if that matters) with this code:
serviceController.Stop();
serviceController.WaitForStatus(ServiceControllerStatus.Stopped, timeout);
timeout is set to 1 hour, but service never actually gets stopped. Strange thing with it is that from within Services MMC snap-in I see it in "Stopping" state first, but after a while it reverts back to "Started". However, when I try to stop it manually, an error occurs:
Windows could not stop the Topshelf.Host service on Local Computer.
Error 1061: The service cannot accept control messages at this time.
Am I missing something here?
I know I am quite late to answer this but I faced a similar issue , i.e., the error: "The service cannot accept control messages at this time." and would like to add this as a reference for others.
You can try killing this service using powershell (run powershell as administrator):
#Get the PID of the required service with the help of the service name, say, service name.
$ServicePID = (get-wmiobject win32_service | where { $_.name -eq 'service name'}).processID
#Now with this PID, you can kill the service
taskkill /f /pid $ServicePID
Either your service is busy processing some big operation or is in transition to change the state. hence is not able to accept anymore input...just think of it as taking more than it can chew...
if you are sure that you haven't fed anything big to it, just go to task manager and kill the process for this service or restart your machine.
I had exact same problem with Topshelf hosted service. Cause was long service start time, more than 20 seconds. This left service in state where it was unable to process further requests.
I was able to reproduce problem only when service was started from command line (net start my_service).
Proper initialization for Topshelf service with long star time is following:
namespace Example.My.Service
{
using System;
using System.Threading.Tasks;
using Topshelf;
internal class Program
{
public static void Main()
{
HostFactory.Run(
x =>
{
x.Service<MyService>(
s =>
{
MyService testServerService = null;
s.ConstructUsing(name => testServerService = new MyService());
s.WhenStarted(service => service.Start());
s.WhenStopped(service => service.Stop());
s.AfterStartingService(
context =>
{
if (testServerService == null)
{
throw new InvalidOperationException("Service not created yet.");
}
testServerService.AfterStart(context);
});
});
x.SetServiceName("my_service");
});
}
}
public sealed class MyService
{
private Task starting;
public void Start()
{
this.starting = Task.Run(() => InitializeService());
}
private void InitializeService()
{
// TODO: Provide service initialization code.
}
[CLSCompliant(false)]
public void AfterStart(HostControl hostStartedContext)
{
if (hostStartedContext == null)
{
throw new ArgumentNullException(nameof(hostStartedContext));
}
if (this.starting == null)
{
throw new InvalidOperationException("Service start was not initiated.");
}
while (!this.starting.Wait(TimeSpan.FromSeconds(7)))
{
hostStartedContext.RequestAdditionalTime(TimeSpan.FromSeconds(10));
}
}
public void Stop()
{
// TODO: Provide service shutdown code.
}
}
}
I've seen this issue as well, specifically when a service is start pending and I send it a stop programmatically which succeeds but does nothing. Also sometimes I see stop commands to a running service fail with this same exception but then still actually stop the service. I don't think the API can be trusted to do what it says. This error message explanation is quite helpful...
http://technet.microsoft.com/en-us/library/cc962384.aspx
I run into a similar issue and found out it was due to one of the services getting stuck in a state of start-pending, stop pending, or stopped.
Rebooting the server or trying to restart services did not work.
To solve this, I run the Task Manager in the server and in the "Details" tab I located the services that were stuck and killed the process by ending the task. After ending the task I was able to restart services without problem.
In brief:
1. Go to Task Manager
2. Click on "Detail" tab
3. Locate your service
4. Right click on it and stop/kill the process.
That is it.
I know it was opened while ago, but i am bit missing the option with Windows command prompt, so only for sake of completeness
Open Task Manager and find respective process and its PID i.e PID = 111
Eventually you can narrow down the executive file i.e. Image name = notepad.exe
in command prompt use command TASKKILL
example: TASKKILL /F /PID 111 ; TASKKILL /F /IM notepad.exe
I had this exact issue internally when starting and stopping a service using PowerShell (Via Octopus Deploy). The root cause for the service not responding to messages appeared to be related to devs accessing files/folders within the root service install directory via an SMB connection (looking at a config file with notepad/explorer).
If the service gets stuck in that situation then the only option is to kill it and sever the connections using computer management. After that, service was able to be redeployed fine.
May not be the exact root cause, but something we now check for.
I faced the similar issue. This error sometimes occur because the service can no longer accept control messages, this may be due to disk space issues in the server where that particular service's log file is present.
If this occurs, you can consider the below option as well.
Go to the location where the service exe & its log file is located.
Free up some space
Kill the service's process via Task manager
Start the service.
I just fought this problem while moving code from an old multi partition box to a newer single partition box. On service stop I was writing to D: and since it didn't exist anymore I got a 1061 error. Any long operation during the OnStop will cause this though unless you spin the call off to another thread with a callback delegate.

Resources