We are running a three-node Neo4j causal cluster (deployed on a Kubernetes cluster), and the leader seems to have trouble replicating transactions to its followers. We are seeing the following error/warning appear in the debug.log:
2019-04-09 16:21:52.008+0000 WARN [o.n.c.c.t.TxPullRequestHandler] Streamed transactions [868842--868908] to /10.0.31.11:38968 Connection reset by peer
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at java.lang.Thread.run(Thread.java:748)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:110)
In our application we seem to catch this error as:
Database not up to the requested version: 868969. Latest database version is 868967
The errors occur when we apply WRITE loads to the cluster using an asynchronous worker process that reads chunks of data from a queue and pushes them into the database.
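For context, the worker is conceptually shaped like this (an illustrative Java sketch only; the queue type, node label, query and chunk shape are placeholders, not our actual code):

import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.Values;

public class WriteWorker implements Runnable {

    private final Driver driver;                                   // shared, long-lived driver
    private final BlockingQueue<List<Map<String, Object>>> queue;  // chunks of rows read from the queue

    public WriteWorker(Driver driver, BlockingQueue<List<Map<String, Object>>> queue) {
        this.driver = driver;
        this.queue = queue;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                List<Map<String, Object>> chunk = queue.take();
                try (Session session = driver.session()) {
                    // one explicit write transaction per chunk, routed to the leader
                    session.writeTransaction(tx -> tx.run(
                            "UNWIND $rows AS row MERGE (n:Item {id: row.id}) SET n += row",
                            Values.parameters("rows", chunk)));
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}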
We have looked into the obvious culprits:
Network bandwidth limits are not being reached
No obvious peaks in CPU / memory
No other Neo4j exceptions (specifically, no OOMs)
We have unbound/rebound the cluster and performed a validity check on the database(s) (they are all fine)
We tweaked causal_clustering.pull_interval to 30s, which seems to improve performance but does not alleviate this issue (see the example configuration below)
We have removed resource constraints on the database to rule out Kubernetes bugs that might induce throttling (we were not reaching actual CPU limits); this also did nothing to alleviate the issue
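For completeness, the pull-interval tweak corresponds to the following lines in neo4j.conf. The pull interval is what we actually changed; the catch-up inactivity timeout is only an idea we have not tried yet, and it assumes that setting exists in our Neo4j version:

# neo4j.conf on the cluster members
causal_clustering.pull_interval=30s
# assumption: a longer catch-up inactivity timeout (if the setting is available
# in this Neo4j version) might keep slow transaction-pull streams from being cut off
causal_clustering.catch_up_client_inactivity_timeout=10m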
Related
We are deploying several services in Docker Swarm (16 services in total) on a single master node. Most of these services are developed with Quarkus; some of them are compiled in native mode, others are compiled for the JVM because of their dependencies.
While the services are in use everything works fine, but if they sit idle for more than 15 minutes we start to receive this message:
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.net.SocketInputStream.read(SocketInputStream.java:140)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.jboss.resteasy.client.jaxrs.engines.ManualClosingApacheHttpClient43Engine.invoke(ManualClosingApacheHttpClient43Engine.java:302)
It does not matter whether the service is compiled natively or for the JVM; the problem is reproduced in all of them.
This message appears when trying to consume an API that is also built with Quarkus; it is consumed through Swarm's internal network, using the service name to resolve its location.
To consume the API we build a jar in which we implement the service interfaces as described in the guide https://quarkus.io/guides/rest-client, additionally using org.eclipse.microprofile.faulttolerance for retries.
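For reference, a minimal sketch of the client pattern we follow (the interface name, path, config key and retry values below are illustrative, not our real code):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;

// Hypothetical client interface; our real interfaces follow the same pattern
// from https://quarkus.io/guides/rest-client.
@Path("/status")
@RegisterRestClient(configKey = "status-api")
public interface StatusClient {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @Retry(maxRetries = 3, delay = 500) // retry on transient failures such as "Connection reset"
    String getStatus();
}

The base URL is configured in application.properties with something like status-api/mp-rest/url=http://status-service:8080, where status-service is resolved through Swarm's internal DNS (the names here are placeholders).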
It seems that, because there is no traffic on the socket for a while, the connection is dropped.
I found a few possible solutions here: Apache HttpClient throws java.net.SocketException: Connection reset if I use it as singletone
I have "mariadb" set to 127.0.0.1 in my /etc/hosts file and sidekiq occasionally throws errors such as:
Mysql2::Error::ConnectionError: Unknown MySQL server host 'mariadb' (16)
The VM is not under significant load or anything like that.
Later edit: seems other gems have trouble resolving hosts too:
WARN -- : Unable to record event with remote Sentry server (Errno::EBUSY - Failed to open TCP connection to XXXX.ingest.sentry.io:443 (Device or resource busy - getaddrinfo)):
Anyone have any idea why that may happen?
I figured this out a couple of weeks ago but wanted to be sure before posting an answer.
I still can't explain the exact mechanism of this issue, but it was caused by fail2ban.
I had it running in a container, polling the httpd logs and blocking the tremendous number of bots scraping my sites.
I also increased the maximum number of file handles and inotify watches:
fs.file-max = 131070
fs.inotify.max_user_watches = 65536
As soon as I got rid of fail2ban and increased the inotify watches, the errors disappeared.
Obviously fail2ban goes on the "do not touch" list because of this, and we've rolled out a 404/403/500 handler at the application layer that pushes unknown IPs to Cloudflare.
Although this is probably an edge case I'm leaving this here in hope it helps someone at some point.
Apparently for no specific reason, and with nothing in the neo4j logs, our application is getting this:
2019-01-30 14:15:08,715 WARN com.calenco.core.content3.ContentHandler:177 - Unable to acquire connection from the pool within configured maximum time of 60000ms
org.neo4j.driver.v1.exceptions.ClientException: Unable to acquire connection from the pool within configured maximum time of 60000ms
at org.neo4j.driver.internal.async.pool.ConnectionPoolImpl.processAcquisitionError(ConnectionPoolImpl.java:192)
at org.neo4j.driver.internal.async.pool.ConnectionPoolImpl.lambda$acquire$0(ConnectionPoolImpl.java:89)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.neo4j.driver.internal.util.Futures.lambda$asCompletionStage$0(Futures.java:78)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
The neo4j server is still running and answering requests made through either its web browser console or the cypher-shell CLI. Also, restarting our application re-acquires the connection to neo4j with no issue.
Our application connects to neo4j once when it's started and then keeps that connection open for as long as it's running, opening and closing sessions against that connection as needed to fulfill the received requests.
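For illustration, this is roughly how the connection is managed (a simplified sketch; the URI, credentials and query are placeholders, and the explicit acquisition-timeout line is an assumption about how the 60000ms value could be configured, not necessarily how ours is set):

import java.util.concurrent.TimeUnit;

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Config;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.Values;

public class Neo4jConnection {

    // single driver created at application startup and kept for the lifetime of the app
    private final Driver driver;

    public Neo4jConnection(String uri, String user, String password) {
        Config config = Config.build()
                // corresponds to the 60000ms pool-acquisition timeout seen in the log
                .withConnectionAcquisitionTimeout(60, TimeUnit.SECONDS)
                .toConfig();
        this.driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password), config);
    }

    public long countByName(String name) {
        // a short-lived session per request; closing it returns the connection to the pool
        try (Session session = driver.session()) {
            return session.run("MATCH (n {name: $name}) RETURN count(n) AS c",
                            Values.parameters("name", name))
                    .single().get("c").asLong();
        }
    }

    public void close() {
        driver.close();
    }
}

Closing each session (e.g. with try-with-resources as above) matters here: a session that is never closed keeps its pooled connection checked out.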
It's the second time in less than a month that we have seen the above exception thrown.
Any ideas?
Thanks in advance
I am running multiple Spring Boot servers, all connected to a Spring Boot Admin instance. Everything is running in the same Docker Swarm.
Spring Boot Admin keeps reporting on these "fake" instances that pop up and die. They are up for 1 second and then become unresponsive. When I clear them, they come back. The details for that instance show this error:
Fetching live health status failed. This is the last known information.
Request failed with status code 502
This is the same for all my APIs. This is causing us to get an inaccurate health reading of our services. How can I get Admin to stop reporting on these non-existent containers?
I've looked in all my nodes and can't find any containers (running or stopped) that match the unresponsive containers that Admin is reporting.
After upgrading to 2.3.2 I am getting the following error when starting up, and the Neo4j cluster fails to start:
2016-01-22 00:54:42.499+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase#483013b3' was successfully initialized, but failed to start. Please see attached cause exception. Starting Neo4j failed: Component
Caused by: java.lang.RuntimeException: Unknown replication strategy
at org.neo4j.kernel.ha.transaction.TransactionPropagator$1.getReplicationStrategy(TransactionPropagator.java:115)
at org.neo4j.kernel.ha.transaction.TransactionPropagator.start(TransactionPropagator.java:175)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
... 14 more
This issue seems to be related to the updates made to the ha.tx_push_strategy setting in conf/neo4j.properties. With this setting at ha.tx_push_strategy=fixed the error occurs. When choosing a more specific strategy, e.g. ha.tx_push_strategy=fixed_ascending, the error goes away and the cluster forms correctly.
The push strategy acts as a tie breaker: when transaction ids are equal, it determines which server id is pushed to next. The new strategies are fixed_descending and fixed_ascending. Although fixed_descending is the default for this version, fixed_ascending is the better choice because the election strategy uses ascending order when determining which instance is elected as the next master. Using fixed_ascending therefore reduces the chance of branched data in certain situations.
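In other words, the fix is to replace the now-ambiguous value in conf/neo4j.properties with an explicit strategy (the setting name and values are exactly those discussed above):

# conf/neo4j.properties
# "fixed" alone triggers "Unknown replication strategy" after the 2.3.2 upgrade
ha.tx_push_strategy=fixed_ascending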