Fine-grained control over what kind of failures trip a dapr circuit breaker - amazon-sqs

Question
Is it possible to further specify what kind of failures trip a dapr circuit breaker targeting a component (AWS SQS)?
Use case
I am sending emails via AWS SES. If you reach your sending limits, the AWS SDK throws an error (code 454). In this case, I want a circuit breaker to stop the processing of the queue and retry sending the emails later.
However, for other errors, e.g. an invalid email address, I don't want the circuit breaker to trip, as they are not transient. I would still like to send the message to the DLQ, though, so that I can manually examine these messages later (-> that's why I am still throwing here and not failing silently).
Simplified setup
I have a circuit breaker defined that trips when my snssqs-pubsub component has more than 3 consecutive failures:
circuitBreakers:
  pubsubCB:
    # Available variables: requests, totalSuccesses, totalFailures, consecutiveSuccesses, consecutiveFailures
    trip: consecutiveFailures > 3
    # ...
targets:
  components:
    snssqs-pubsub-emails:
      inbound:
        circuitBreaker: pubsubCB
In my application, I want to retry sending emails that failed because the AWS SES sending limit was hit:
try {
  await this.sendMail(options);
} catch (error) {
  if (error.responseCode === '454') {
    // This error should trip the circuit breaker
    throw new Error({ status: 429, message: 'Rate limited. Should be retried.' })
  } else {
    // This error should not trip the circuit breaker.
    // Because the status is 404, dapr puts the message straight into the DLQ and skips retries
    throw new NotFoundError({ status: 404 })
  }
}

You may not have a problem to worry about if you have a business case that does not violate the AWS Terms of Service: you can file a support ticket and get your SES service limits raised.
It does not appear that dapr retry policies support the customization you need, but .NET does.
If you don't want to process the message, then don't delete it. You can instead set the visibility timeout of the message in SQS so it stays hidden and is not processed again too quickly. Regardless, any exception thrown will end up in the DLQ.
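For illustration only, a rough sketch of that approach with the AWS SDK for JavaScript v3 (the region, queue URL, and receipt handle are placeholders, and dapr normally manages the receipt handle for you):

import { SQSClient, ChangeMessageVisibilityCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'eu-west-1' }); // placeholder region

// Instead of deleting the message, hide it for another 5 minutes so it is
// redelivered later rather than retried immediately.
async function snoozeMessage(queueUrl: string, receiptHandle: string) {
  await sqs.send(new ChangeMessageVisibilityCommand({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: 300, // seconds
  }));
}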

Related

How do I use Event Store DB client without continued memory usage growth?

I am using the Event Store client for .NET and I am struggling to find the correct way to use the client. When I register the client as a singleton in the .NET dependency injection container and run my application over an extended period of time, memory usage grows continuously with each subscription.
I create and register the client in the following way. A full minimal application that experiences the problem can be found here.
var esdbConnectionString = configuration.GetValue<string>("ESDB_CONNECTION_STRING", "esdb://admin:changeit@localhost:2113?tls=false");
var eventStoreClientSettings = EventStoreClientSettings.Create(esdbConnectionString);
var eventStoreClient = new EventStoreClient(eventStoreClientSettings);
services.AddSingleton(eventStoreClient);
My application has a high number of short streams over an extended period of time
To Reproduce
Steps to reproduce the behavior:
Register EventStoreClient as a singleton, as recommended in the documentation.
Subscribe to a very high number of streams over an extended time.
Cancel the CancellationToken sent into the stream subscription and let it be garbage collected.
Watch memory usage of service grow.
How I am creating and subscribing to streams:
var streamName = CreateStreamName();
var payload = new PingEvent { StreamNr = _currentStreamNumber };
var eventData = new EventData(Uuid.NewUuid(), typeof(PingEvent).Name, EventSerialization.SerializeEventData(payload));
await _client.AppendToStreamAsync(streamName, StreamState.Any, new[] { eventData });

var streamCancellationTokenSource = new CancellationTokenSource(TimeSpan.FromMinutes(30));
await _client.SubscribeToStreamAsync(streamName, FromStream.Start, async (sub, evnt, token) =>
    {
        if (evnt.Event.EventType == "PongEvent")
        {
            _previousStreamIsDone = true;
            streamCancellationTokenSource.Cancel();
        }
    },
    cancellationToken: streamCancellationTokenSource.Token);
Approaches attempted
Registering as Transient or Scoped
If I register the client as Transient or Scoped in .NET DI, it throws thousands of exceptions internally and causes multiple problems.
Manually handling lifetime of client
By having a singleton service that handles the lifetime of the client, I have attempted to dispose of the client every once in a while and create a new one, ensuring that only one instance of the client exists at a time. This results in the same problem as registering the service as Transient or Scoped.
I am using version 22.0.0 of the Event Store client in .NET 6 against Event Store Database 21.10.0. The problem happens both when running on Windows and in the standard aspnet:6.0 Linux Docker container.
By inspecting the results of these dotnet-dumps, the memory growth seems to be happening inside this HashSet of ActiveCalls in the gRPC client.
I am hoping to find a way of using the client that does not lead to memory growth.
In your reproduction the leaked calls are coming from the extra read that you are issuing while processing an event received on the subscription.
There is an open issue (https://github.com/EventStore/EventStore-Client-Dotnet/issues/219) at the moment to deal with this better, but currently if you issue a read but don't consume all the events and don't cancel the read, then the call remains open. In your case this is happening if the slave has managed to reply Pong before the master has issued the read that results from receiving its own Ping in the subscription. That read will then contain the Ping and the Pong, only the Ping is read, and the call remains open.
For now, if you cancel those reads by passing the cancellation token that you are cancelling into the ReadStreamAsync call in ReadFromStartOfStreamToEnd, it should resolve your problem.
In case it's helpful for you, you can see the number of Current Calls live rather than waiting a long time to see the effect on memory:
dotnet-counters monitor --counters "Grpc.Net.Client" -p <processid>

Deleted Scheduled Messages still sending

I am building a slack application that will schedule a message when someone posts a specific type of workflow in a channel.
It will schedule a message, and if someone from a specific group of users replies before it has sent, it will delete the scheduled message.
Unfortunately these messages are still sending, even though the list of scheduled messages is empty and the response when deleting the message is successful. I am also deleting the message within the 60-second limit that is noted in the API.
Scheduling the message gives me a success response, and if I use the list scheduled messages I get:
[
  {
    id: 'MESSAGE_ID',
    channel_id: 'CHANNEL_ID',
    post_at: 1620428096, // 2 minutes in the future for testing
    date_created: 1620428026,
    text: 'thread_ts: 1620428024.001300'
  }
]
Canceling the message:
async function cancelScheduledMessage(scheduled_message_id) {
  const response = await slackApi.post("/chat.deleteScheduledMessage", {
    channel: SLACK_CHANNEL,
    scheduled_message_id
  })
  return response.data
}
response.data returns { "ok": true }
If I use the list scheduled message API to retrieve what is scheduled I get an empty array []
However, the message will still send to the thread.
Is there something I am missing? I have the proper scopes set up and the API calls appear to be working.
If it helps, I am using AWS Lambda, and DynamoDB to store/retrieve the thread_ts and message IDs.
Thanks all.
For messages due in 5 minutes or less, chat.deleteScheduledMessage has a bug (as of November 2021) [1]. Although this API call may return OK, the actual message will still be delivered because of the bug.
Note that for messages within 60 seconds of delivery, this API does return a proper error code, as described in the documentation [2]. In the range between 60 seconds and roughly 5 minutes, the API call returns OK but fails behind the scenes.
Until this bug is fixed, the only thing one can do is delete only messages scheduled 5 minutes or more out (the exact threshold may vary, according to Slack), which is not ideal and may not be feasible for some applications; a sketch of such a guard follows the footnotes below.
[1] Private communication with Slack support.
[2] https://api.slack.com/methods/chat.deleteScheduledMessage
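As an illustration of that workaround (the helper name and the 5-minute threshold are assumptions based on [1], not a documented Slack guarantee), a guard around the cancelScheduledMessage function from the question could look like this:

// Assumed safety margin based on [1]; Slack may change the actual threshold.
const SAFE_DELETE_WINDOW_SECONDS = 5 * 60;

// post_at is the Unix timestamp (in seconds) returned when scheduling the message
async function cancelIfSafe(scheduled_message_id, post_at) {
  const now = Math.floor(Date.now() / 1000);
  if (post_at - now > SAFE_DELETE_WINDOW_SECONDS) {
    // Far enough out that chat.deleteScheduledMessage behaves as documented
    return cancelScheduledMessage(scheduled_message_id);
  }
  // Too close to post_at: deletion may report ok: true but the message can still
  // send, so handle it another way (e.g. post a follow-up correction instead).
  return { ok: false, reason: 'too_close_to_post_at' };
}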

Avoid sending message to DLQ/DLX

Our application is using the org.springframework.cloud spring-cloud-starter-stream-rabbit framework, and we are trying to avoid sending specific messages to the DLQ and also avoid retrying them. This behaviour should somehow be dynamic, because for the default messages, retries and the DLQ should still work.
According to this documentation:
Putting it All Together
And this useful post:
DLX in rabbitmq and spring-rabbitmq - some considerations of rejecting messages
It seems that ImmediateAcknowledgeAmqpException could be used in Spring AMQP to mark a message as acknowledged and not process it further. However, when we use this code:
@StreamListener(LogSink.INPUT)
public void handle(Message<Map<String, Object>> message) {
    if (message.getPayload().get("condition1").equals("abort")) {
        throw new ImmediateAcknowledgeAmqpException("error, we don't want to send this message to DLQ");
    }
    ...
}
The message is always sent to the DLQ.
Our current configuration:
spring.cloud.stream:
  bindings:
    log:
      consumer.concurrency: 10
      destination: log
      group: myGroup
      content-type: application/json
  rabbit.bindings:
    log:
      consumer:
        autoBindDlq: true
        republishToDlq: true
        transacted: true
Are we missing something? Is there any other alternative to avoid publishing to DLQ and requeuing?
republishToDlq does not look at that exception; it only applies when the exception is thrown to the container (method causeChainHasImmediateAcknowledgeAmqpException()).
Republishing subverts that logic since no exception is thrown to the container.
Please open an issue against the Rabbit binder; republishToDlq should honor that exception and discard the failed message.
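In the meantime, one workaround to try (an assumption based on the explanation above, not a verified binder feature) is to disable republishing so the exception is thrown to the container, where ImmediateAcknowledgeAmqpException is honored, while autoBindDlq still dead-letters ordinary rejected messages:

spring.cloud.stream:
  rabbit.bindings:
    log:
      consumer:
        autoBindDlq: true
        # With republishToDlq disabled, failures reach the listener container;
        # ImmediateAcknowledgeAmqpException in the cause chain then acks the message,
        # while other exceptions are rejected and dead-lettered to the DLQ by the broker.
        republishToDlq: false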

Timeout errors which sometimes happen on Azure SQL databases called from Azure WebJobs

We have a web application running on Azure that performs miscellaneous database maintenance tasks like creating databases, deleting unused databases, and so on. Everything is running on Azure SQL.
This application runs around the clock, and the maintenance tasks are performed every hour. Most of the time, everything goes well. However, the task sometimes ends with errors like these:
HTTP error GatewayTimeout : The gateway did not receive a response from ‘Microsoft.Sql’ within the specified time period
HTTP error ServiceUnavailable : The request timed out
SQLException : Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
SQLException : A connection was successfully established with the server, but then an error occurred during the pre-login handshake
It seems like the database is not reachable when this happens.
We'd be glad if someone could help us to debug the issue.
thank you in advance.
There are transient errors and other types of errors that are particular to Azure SQL Database. Transient fault errors typically manifest as one of the following error messages from your client programs:
•Database on server is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID. (Microsoft SQL Server, Error: 40613)
•An existing connection was forcibly closed by the remote host.
•System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: Session Provider, error: 19 - Physical connection is not usable)
•A connection attempt to a secondary database failed because the database is in the process of reconfiguration and is busy applying new pages while in the middle of an active transaction on the primary database.
Because of these errors and others explained here, it is necessary to implement retry logic in applications that connect to Azure SQL Database.
public void HandleTransients()
{
    var connStr = "some database";
    var _policy = RetryPolicy.Create<SqlAzureTransientErrorDetectionStrategy>(
        retryCount: 3,
        retryInterval: TimeSpan.FromSeconds(5));

    using (var conn = new ReliableSqlConnection(connStr, _policy))
    {
        // Do SQL stuff here.
    }
}
More about how to create retry logic here.
Throttling is also a cause of timeouts. The following query may help you understand the impact of your workload on the Azure SQL database.
SELECT
(COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats
--service level objective (SLO) of 99.9% <= go to next tier

How to explicitly acknowledge/fail Amazon SQS FIFO queue from the listener without throwing an exception?

My application only listens to a certain queue; the producer is a 3rd-party application. I receive the messages, but sometimes, based on some logic, I need to send a fail message to the producer so that the message is resent to my listener until I decide to consume and acknowledge it. My current implementation just throws a custom exception, but this is not a clean solution. Can anyone help me send a FAIL to the producer without throwing an exception?
My JMS Listener Factory settings:
@Bean
public DefaultJmsListenerContainerFactory jmsListenerContainerFactoryForQexpress(SQSErrorHandler errorHandler) {
    SQSConnectionFactory connectionFactory = SQSConnectionFactory.builder()
            .withRegion(RegionUtils.getRegion(StaticSystemConstants.getQexpressSqsRegion()))
            .withAWSCredentialsProvider(new ClasspathPropertiesFileCredentialsProvider(StaticSystemConstants.getQexpressSqsCredentials()))
            .build();
    DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
    factory.setConnectionFactory(connectionFactory);
    factory.setDestinationResolver(new DynamicDestinationResolver());
    factory.setConcurrency("3-10");
    factory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
    factory.setErrorHandler(errorHandler);
    return factory;
}
My Listener Settings:
@JmsListener(destination = StaticSystemConstants.QUEXPRESS_ORDER_STATUS_QUEUE, containerFactory = "jmsListenerContainerFactoryForQexpress")
public void receiveQExpressOrderStatusQueue(String text) throws JSONException {
    LOG.debug("Consumed QExpress status {}", text);
    // here I need to decide whether to acknowledge or fail
    ...
    if (success) {
        updateStatus();
    } else {
        // TODO: I need to replace this with an explicit FAIL message
        throw new CustomException("Not right time to update status");
    }
}
Please share your experience with this. Thank you!
SQS -- internally speaking -- is fully asynchronous and completely decouples the producer from the consumer.
Once the producer successfully hands off a message to SQS and receives the message-id in response, the producer only knows that SQS has received and committed the message to its internal storage and that the message will be delivered to a consumer at least once.¹ There is no further feedback to the producer.
A consumer can "snooze" a message for later retry by simply not deleting it (see setSessionAcknowledgeMode docs) or by actively resetting the visibility timeout on the message instead of deleting it, which triggers SQS to leave the message in the in flight status until the timer expires, at which point it will again deliver the message for the consumer to retry.
Note, too, that a single SQS queue can have multiple producers and/or multiple consumers, as long as all the producers ask for and consumers provide identical services, but there is no intrinsic concept of which consumer or which producer. There is no consumer-to-producer backwards communication channel, and no mechanism for a producer to inquire about the status of an earlier message -- the design assumption is that once SQS has received a message, it will be delivered,² so no such mechanism should be needed.
¹at least once. Unless the queue is a FIFO queue, SQS will typically deliver the message exactly once, but there is not an absolute guarantee that the message will not be delivered more than once. Because SQS is a massive, distributed system that stores redundant copies of messages, it is possible in some edge case conditions for messages to be delivered more than once. FIFO queues avoid this possibility by leveraging stronger internal consistency guarantees, at a cost of reduced throughput of 300 TPS.
²it will be delivered assuming of course that you actually have a consumer running. SQS does not block the producer, and will allow you to enqueue an unbounded number of messages waiting for a consumer to arrive. It accepts messages from producers regardless of whether there are currently any consumers listening. The messages are held until consumed or until the MessageRetentionPeriod (default 4 days, max 14 days) timer expires for each message, whichever comes first.

Resources