How to check whether ruby-kafka retries work? - ruby-on-rails

The documentation says that the producer retries sending the message to the queue, up to max_retries times.
So I shut down Kafka and then ran my producer. I get this error:
Fetching cluster metadata from kafka://localhost:9092
[topic_metadata] Opening connection to localhost:9092 with client id MYCLIENTID
ERROR -- : [topic_metadata] Failed to connect to localhost:9092: Connection refused
DEBUG -- : Closing socket to localhost:9092
ERROR -- : Failed to fetch metadata from kafka://localhost:9092
Completed 500 Internal Server Error in 486ms (ActiveRecord: 33.9ms)
which makes sense. However, the retries never happen after that. I have read the docs inside out and I can't figure out how these retries are actually supposed to be triggered.
Here is my code:
def self.deliver_message(kafka, message, topic, transactional_id)
  producer = kafka.producer(idempotent: true,
                            transactional_id: transactional_id,
                            required_acks: :all,
                            max_retries: 5,
                            retry_backoff: 5)
  producer.produce(message, topic: topic)
  producer.deliver_messages
end
Link to the doc:
https://www.rubydoc.info/gems/ruby-kafka/Kafka/Producer#initialize-instance_method
Thank you in advance.

Whether a retry happens depends on the type of exception reported to the producer callback. According to the Callback docs, the following exceptions can occur during the callback:
The exception thrown during processing of this record. Null if no error occurred. Possible thrown exceptions include:
Non-Retriable exceptions (fatal, the message will never be sent):
InvalidTopicException
OffsetMetadataTooLargeException
RecordBatchTooLargeException
RecordTooLargeException
UnknownServerException
Retriable exceptions (transient, may be covered by increasing #.retries):
CorruptRecordException
InvalidMetadataException
NotEnoughReplicasAfterAppendException
NotEnoughReplicasException
OffsetOutOfRangeException
TimeoutException
UnknownTopicOrPartitionException
Shutting Kafka down completely looks more like a non-retriable exception.
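If the aim is to keep trying while the broker is unreachable, one option is to retry at the application level around the delivery call. The following is a minimal sketch in plain Ruby, not ruby-kafka's own retry mechanism. BrokerUnavailable and with_retries are hypothetical names introduced for illustration; in a real application you would rescue whichever exception your client version actually raises when the connection is refused.

```ruby
# Hypothetical stand-in for the error a client raises when the broker
# cannot be reached. Replace it with the exception your library raises.
class BrokerUnavailable < StandardError; end

# Runs the block and retries it up to max_retries times when error_class
# is raised, sleeping `backoff` seconds between attempts. The error is
# re-raised once the retries are exhausted.
def with_retries(max_retries:, backoff:, error_class: BrokerUnavailable)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue error_class
    raise if attempts > max_retries
    sleep backoff
    retry
  end
end

# Usage with a fake delivery that fails twice before succeeding.
fake_delivery = lambda do |attempt|
  raise BrokerUnavailable, "connection refused" if attempt < 3
  :delivered
end

outcome = with_retries(max_retries: 5, backoff: 0.01) do |attempt|
  fake_delivery.call(attempt)
end
puts "Delivery outcome: #{outcome}"
```

With max_retries: 5 the block runs at most six times (one initial attempt plus five retries), which makes it easy to observe in a test whether retries actually happen.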

Related

spring amqp rabbit max consumer connection retries

I am trying to establish the max number of retries from my app to rabbit broker.
I have the retry interceptor,
@Bean
public RetryOperationsInterceptor retryOperationsInterceptor() {
    return RetryInterceptorBuilder.stateless()
            .maxAttempts(CommonConstants.MAX_AMQP_RETRIES)
            .backOffOptions(500, 2.0, 3000)
            .build();
}
and this is used inside listener container,
container.setAdviceChain(new Advice[]{retryOperationsInterceptor()});
However, after a couple of retries, the consumer attempts connection all over again in an endless loop,
2017-02-21 15:03:12.229 WARN 9292 --- [nsumerThread_92] o.s.a.r.l.SimpleMessageListenerContainer : Consumer raised exception, processing can restart if the connection factory supports it. Exception summary: org.springframework.amqp.AmqpConnectException: java.net.ConnectException: Connection refused: connect
2017-02-21 15:03:12.229 INFO 9292 --- [nsumerThread_92] o.s.a.r.l.SimpleMessageListenerContainer : Restarting Consumer: tags=[{}], channel=null, acknowledgeMode=AUTO local queue size=0
2017-02-21 15:03:13.245 WARN 9292 --- [nsumerThread_93] o.s.a.r.l.SimpleMessageListenerContainer : Consumer raised exception, processing can restart if the connection factory supports it. Exception summary: org.springframework.amqp.AmqpConnectException: java.net.ConnectException: Connection refused: connect
2017-02-21 15:03:13.245 INFO 9292 --- [nsumerThread_93] o.s.a.r.l.SimpleMessageListenerContainer : Restarting Consumer: tags=[{}], channel=null, acknowledgeMode=AUTO local queue size=0
2017-02-21 15:03:13.261 ERROR 9292 --- [nsumerThread_83] o.s.a.r.l.SimpleMessageListenerContainer : Failed to check/redeclare auto-delete queue(s).
I want the app to fail and error out because of the lack of connectivity to the broker once a MAX_RETRY limit is reached.
Thanks for the help.
EDIT
As suggested by @artem-bilan, I ended up using a Component:
public class BrokerFailureEventListener implements ApplicationListener<ListenerContainerConsumerFailedEvent>
In this class's onApplicationEvent method I count the number of failures and then take appropriate action.
The producer side is a slightly different scenario. As explained by @artem-bilan, the application needs to take care of any issues itself. I explored netflix-hystrix, added a fallback method for the producing method, and will go that route. Thanks again.
Well, you have slightly misunderstood container.setAdviceChain(new Advice[]{retryOperationsInterceptor()});. It applies to business errors during message processing:
Business exception handling, as opposed to protocol errors and dropped connections, might need more thought and some custom configuration, especially if transactions and/or container acks are in use. Prior to 2.8.x, RabbitMQ had no definition of dead letter behaviour, so by default a message that is rejected or rolled back because of a business exception can be redelivered ad infinitum. To put a limit in the client on the number of re-deliveries, one choice is a StatefulRetryOperationsInterceptor in the advice chain of the listener. The interceptor can have a recovery callback that implements a custom dead letter action: whatever is appropriate for your particular environment.
Connection failures, in contrast, are handled differently:
In fact it loops endlessly trying to restart the consumer, and only if the consumer is very badly behaved indeed will it give up. One side effect is that if the broker is down when the container starts, it will just keep trying until a connection can be established.
What you need is ListenerContainerConsumerFailedEvent, which is emitted as:
private void logConsumerException(Throwable t) {
    if (logger.isDebugEnabled()
            || !(t instanceof AmqpConnectException || t instanceof ConsumerCancelledException)) {
        logger.warn(
                "Consumer raised exception, processing can restart if the connection factory supports it",
                t);
    }
    else {
        logger.warn("Consumer raised exception, processing can restart if the connection factory supports it. "
                + "Exception summary: " + t);
    }
    publishConsumerFailedEvent("Consumer raised exception, attempting restart", false, t);
}
So, you can listen for those events and stop your application when some condition is reached.

Umbraco Log Error Failed to load Xml from file

I get lots of errors in my Umbraco 7.2.6 log file, and sometimes the CMS breaks and I need to restart IIS or the browser. All of the errors relate to loading or saving the XML file: either a lock timeout or an aborted thread. I use the default Umbraco config on a simple shared hosting environment.
Most of the time it works well, so I don't expect it to be a file permission problem or something similar.
What is wrong?
2015-07-07 09:45:02,833 [10] INFO umbraco.BusinessLogic.Log - [T15/D3] Log scrubbed. Removed all items older than 2015-05-08 09:45:02
2015-07-07 09:45:40,726 [10] ERROR umbraco.content - [T8/D3] Failed to load Xml from file.
System.Threading.ThreadAbortException: Thread was being aborted.
at System.Threading.WaitHandle.WaitOneNative(SafeHandle waitableSafeHandle, UInt32 millisecondsTimeout, Boolean hasThreadAffinity, Boolean exitContext)
at System.Threading.WaitHandle.InternalWaitOne(SafeHandle waitableSafeHandle, Int64 millisecondsTimeout, Boolean hasThreadAffinity, Boolean exitContext)
at System.Threading.WaitHandle.WaitOne(Int32 millisecondsTimeout, Boolean exitContext)
at System.Threading.WaitHandle.WaitOne(Int32 millisecondsTimeout)
at Umbraco.Core.AsyncLock.Lock(Int32 millisecondsTimeout)
at umbraco.content.EnsureFileLock()
at umbraco.content.LoadXmlFromFile()
2015-07-07 09:45:40,741 [10] ERROR Umbraco.Web.WebServices.ScheduledPublishController - [T55/D3] Error executing scheduled task
System.Threading.ThreadAbortException: Thread was being aborted.
at umbraco.content.LoadXmlFromFile()
at umbraco.content.LoadXmlLocked(SafeXmlReaderWriter safeXml, Boolean& registerXmlChange)
at umbraco.content..ctor()
at umbraco.content.<.cctor>b__17()
at System.Lazy`1.CreateValue()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Lazy`1.LazyInitValue()
at System.Lazy`1.get_Value()
at Umbraco.Web.WebServices.ScheduledPublishController.Index()
2015-07-07 09:45:40,757 [10] ERROR Umbraco.Web.Scheduling.ScheduledPublishing - [T7/D3] An error occurred with the scheduled publishing. The base url used in the request was: http://acc.xxxxxxx.nl:80/umbraco/, see http://our.umbraco.org/documentation/Using-Umbraco/Config-files/umbracoSettings/#ScheduledTasks documentation for details on setting a baseUrl if this is in error
System.Net.WebException: The remote server returned an error: (400) Bad Request.
at System.Net.WebClient.UploadDataInternal(Uri address, String method, Byte[] data, WebRequest& request)
at System.Net.WebClient.UploadString(Uri address, String method, String data)
at System.Net.WebClient.UploadString(String address, String data)
at Umbraco.Web.Scheduling.ScheduledPublishing.PerformRun()
2015-07-07 09:46:14,905 [10] INFO Umbraco.Core.PluginManager - [T19/D3] Starting resolution types of Umbraco.Core.PropertyEditors.IParameterEditor
2015-07-07 12:55:10,801 [17] ERROR umbraco.content - [T18/D11] Failed to save Xml to file.
System.TimeoutException: Failed to enter the lock within timeout.
at Umbraco.Core.AsyncLock.Lock(Int32 millisecondsTimeout)
at umbraco.content.EnsureFileLock()
at umbraco.content.<SaveXmlToFileAsync>d__d.MoveNext()
2015-07-07 14:03:09,702 [9] ERROR umbraco.content - [T7/D3] Failed to load Xml from file.
System.TimeoutException: Failed to enter the lock within timeout.
at Umbraco.Core.AsyncLock.Lock(Int32 millisecondsTimeout)
at umbraco.content.EnsureFileLock()
at umbraco.content.LoadXmlFromFile()
This issue was related to http://issues.umbraco.org/issue/U4-6802 and is fixed in the Umbraco v7.2.8 release.

Cowboy websocket termination error

I am implementing a Cowboy websocket handler. Everything works fine, except that when the user closes the browser, websocket_terminate fires and the server generates the following error:
Error in process <0.298.0> on node 'ews_2@servername.com' with exit value: {function_clause,
[{cowboy_req,ensure_response,[{[]},204],[{file,"src/cowboy_req.erl"},{line,1112}]},
{cowboy_protocol,next_request,3,[{file,"src/cowboy_protocol.erl"},{line,545}]}]}
The code in websocket_terminate is:
websocket_terminate(Reason, Req, State) ->
    io:format("~nWebsocket connection termination~n"),
    ok.
Resolved: the problem was that Req was not being passed along correctly and got manipulated between the callbacks. Cowboy needs a proper Req parameter to be passed at the time of connection termination.

DotNetOpenAuth with Yahoo,AOL results in Timeout or NameResolutionFailure

I'm using DotNetOpenAuth 3.5.0.10357. When attempting to authorize using Yahoo as the provider (https://me.yahoo.com), a ProtocolException is often thrown at OpenIdRelyingParty.CreateRequest(Identifier). If another attempt is made immediately after the first one, the workflow behaves as expected. I've added an XRDS document as per this blog post, and when the Yahoo provider responds it seems to detect the file, as it no longer displays that verification message. All other providers work properly at all times, with the exception of AOL, which has the same issues. I've enabled logging and there seem to be two different causes: one is a timeout, the other a NameResolutionFailure, both from WebException.
Here is the log from the instance resulting in NameResolutionFailure:
HTTP GET https://me.yahoo.com/ WebException NameResolutionFailure from
https://me.yahoo.com/, no response available. Error while performing
discovery on: "https://me.yahoo.com/":
DotNetOpenAuth.Messaging.ProtocolException: Error occurred while
sending a direct message or getting the response. --->
System.Net.WebException: The remote name could not be resolved:
'me.yahoo.com' at System.Net.HttpWebRequest.GetResponse() at
DotNetOpenAuth.Messaging.StandardWebRequestHandler.GetResponse(HttpWebRequest
request, DirectWebRequestOptions options) --- End of inner
exception stack trace --- at
DotNetOpenAuth.Messaging.StandardWebRequestHandler.GetResponse(HttpWebRequest
request, DirectWebRequestOptions options) at
DotNetOpenAuth.Messaging.UntrustedWebRequestHandler.GetResponse(HttpWebRequest
request, DirectWebRequestOptions options) at
DotNetOpenAuth.Yadis.Yadis.Request(IDirectWebRequestHandler
requestHandler, Uri uri, Boolean requireSsl, String[] acceptTypes)
at DotNetOpenAuth.Yadis.Yadis.Discover(IDirectWebRequestHandler
requestHandler, UriIdentifier uri, Boolean requireSsl) at
DotNetOpenAuth.OpenId.UriDiscoveryService.Discover(Identifier
identifier, IDirectWebRequestHandler requestHandler, Boolean&
abortDiscoveryChain) at
DotNetOpenAuth.OpenId.RelyingParty.OpenIdRelyingParty.Discover(Identifier
identifier) at
DotNetOpenAuth.OpenId.RelyingParty.AuthenticationRequest.Create(Identifier
userSuppliedIdentifier, OpenIdRelyingParty relyingParty, Realm realm,
Uri returnToUrl, Boolean createNewAssociationsAsNeeded) Performing
discovery on user-supplied identifier: https://me.yahoo.com/ Filtering
and sorting of endpoints did not affect the list.
The following is the log from a timeout:
HTTP GET https://me.yahoo.com/ WebException Timeout from
https://me.yahoo.com/, no response available. Error while performing
discovery on: "https://me.yahoo.com/":
DotNetOpenAuth.Messaging.ProtocolException: Error occurred while
sending a direct message or getting the response. --->
System.Net.WebException: The operation has timed out at
System.Net.HttpWebRequest.GetResponse() at
DotNetOpenAuth.Messaging.StandardWebRequestHandler.GetResponse(HttpWebRequest
request, DirectWebRequestOptions options) --- End of inner
exception stack trace --- at
DotNetOpenAuth.Messaging.StandardWebRequestHandler.GetResponse(HttpWebRequest
request, DirectWebRequestOptions options) at
DotNetOpenAuth.Messaging.UntrustedWebRequestHandler.GetResponse(HttpWebRequest
request, DirectWebRequestOptions options) at
DotNetOpenAuth.Yadis.Yadis.Request(IDirectWebRequestHandler
requestHandler, Uri uri, Boolean requireSsl, String[] acceptTypes)
at DotNetOpenAuth.Yadis.Yadis.Discover(IDirectWebRequestHandler
requestHandler, UriIdentifier uri, Boolean requireSsl) at
DotNetOpenAuth.OpenId.UriDiscoveryService.Discover(Identifier
identifier, IDirectWebRequestHandler requestHandler, Boolean&
abortDiscoveryChain) at
DotNetOpenAuth.OpenId.RelyingParty.OpenIdRelyingParty.Discover(Identifier
identifier) at
DotNetOpenAuth.OpenId.RelyingParty.AuthenticationRequest.Create(Identifier
userSuppliedIdentifier, OpenIdRelyingParty relyingParty, Realm realm,
Uri returnToUrl, Boolean createNewAssociationsAsNeeded) Performing
discovery on user-supplied identifier: https://me.yahoo.com/ Filtering
and sorting of endpoints did not affect the list.
I'm using the default configuration settings. I'm guessing I can get around the timeout error by increasing the timeout setting, but I'm not sure how to deal with the name resolution error.
From the exception this doesn't look like a DotNetOpenAuth-specific problem to me. It looks like your DNS server is slow or you have a bad connection to it. I'd look into that problem. And yes, increasing the timeout will help you in a pinch.

ActiveResource EOFError on "slow" API

I'm seriously struggling to solve this one, any help would be appreciated!
I have two Rails apps, let's call them Client and Service, all very simple, normal REST interface - here's the basic scenario:
Client makes a POST /resources.json request to the Service
The Service runs a process which creates the resource and returns an ID to the Client
Again, all very simple, except that the Service processing is very time-intensive and can take several minutes. When that happens, an EOFError is raised on the Client exactly 60s after the request was made (no matter what ActiveResource::Base.timeout is set to), while the Service processes the request correctly and responds with 200/201. This is what we see in the logs (chronologically):
C 00:00:00: POST /resources.json
S 00:00:00: Received POST /resources.json => resources#create
C 00:01:00: EOFError: end of file reached
/usr/ruby1.8.7/lib/ruby/1.8/net/protocol.rb:135:in `sysread'
/usr/ruby1.8.7/lib/ruby/1.8/net/protocol.rb:135:in `rbuf_fill'
/usr/ruby1.8.7/lib/ruby/1.8/timeout.rb:62:in `timeout'
...
S 00:02:23: Response POST /resources.json, 201, after 143s
Obviously the service response never reached the client. I traced the error down to the socket level and recreated the scenario in a script, where I open a TCPSocket and try to retrieve data. Since I don't request anything, I shouldn't get anything back and my request should time out after 70 seconds (see full script at the bottom):
Timeout::timeout(70) { TCPSocket.open(domain, 80).sysread(16384) }
These were the results for a few domains:
www.amazon.com => Timeout after 70s
github.com => EOFError after 60s
www.nytimes.com => Timeout after 70s
www.mozilla.org => EOFError after 13s
www.googlelabs.com => Timeout after 70s
maps.google.com => Timeout after 70s
As you can see, some servers allowed us to "wait" for the full 70 seconds, while others terminated our connection, raising EOFErrors. When we did this test against our service, we (expectedly) got an EOFError after 60 seconds.
Does anyone know why this happens? Is there any way to prevent these errors or to extend the server-side timeout? Since our service continues "working" even after the socket is closed, I assume the connection must be terminated at the proxy level?
Every hint would be greatly appreciated!
PS: The full script:
require 'socket'
require 'benchmark'
require 'timeout'
def test_socket(domain)
  puts "Connecting to #{domain}"
  message = nil
  time = Benchmark.realtime do
    begin
      Timeout::timeout(70) { TCPSocket.open(domain, 80).sysread(16384) }
      message = "Successfully received data" # Should never happen
    # Rescue Timeout::Error first: on Ruby 1.9+ it is a StandardError,
    # so a bare rescue placed before it would swallow the timeout.
    rescue Timeout::Error
      message = "Controlled client-side timeout"
    rescue => e
      message = "Server terminated connection: #{e.class} #{e.message}"
    end
  end
  puts " #{message} after #{time.round}s"
end
test_socket 'www.amazon.com'
test_socket 'github.com'
test_socket 'www.nytimes.com'
test_socket 'www.mozilla.org'
test_socket 'www.googlelabs.com'
test_socket 'maps.google.com'
I know this is nearly a year old, but in case anyone else finds this, I wanted to add a possible culprit.
Amazon's ELB terminates idle connections after 60 seconds, so if you are running on EC2 behind ELB, ELB could be the server-side problem.
The only "documentation" I could find on this is https://forums.aws.amazon.com/thread.jspa?threadID=33427&start=50&tstart=50, but it's better than nothing.
Each server decides when to close the connection. It depends on the server side software and its settings. You can't control that.
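The two outcomes seen in the test script can be reproduced locally. The sketch below is only an illustration under assumptions: start_idle_closing_server and read_outcome are hypothetical helpers, and the sleep stands in for whatever idle timeout a real server or proxy applies. When the peer closes first, sysread raises EOFError; when the client's own timeout is shorter, Timeout::Error is raised instead.

```ruby
require 'socket'
require 'timeout'

# Hypothetical local server that accepts one connection, sends nothing,
# and closes it after idle_limit seconds (a stand-in for a proxy's idle timeout).
def start_idle_closing_server(idle_limit)
  server = TCPServer.new('127.0.0.1', 0) # port 0: let the OS pick a free port
  thread = Thread.new do
    client = server.accept
    sleep idle_limit
    client.close # the peer now sees end-of-file
  end
  [server, thread]
end

# Connects, waits for data that never arrives, and reports who gave up first.
def read_outcome(port, client_timeout)
  socket = TCPSocket.open('127.0.0.1', port)
  Timeout.timeout(client_timeout) { socket.sysread(16_384) }
  :data
rescue Timeout::Error
  :client_timeout
rescue EOFError
  :server_closed
ensure
  socket.close if socket
end

# Server idle limit (0.2s) shorter than the client timeout (2s).
server, thread = start_idle_closing_server(0.2)
puts read_outcome(server.addr[1], 2)
thread.join
server.close
```

Swapping the two durations makes the client time out first, which mirrors the domains in the results above that allowed the full 70 seconds.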
