AlwaysOn - cluster lease timeouts and PREEMPTIVE_HADR_LEASE_MECHANISM - timeout

We recently installed some WSUS updates + SQL 2012 SP3 (yes, all tested without a problem in UAT :) and since then AO and the cluster have been having a few issues. It seems that the cluster's lease is timing out and I am unable to figure out why. This results in a short blip and lost connectivity.
Any help would be appreciated!
AlwaysOn Extended Events:
availability_group_lease_expired; state: LeaseExpired; Timestamp: 2016-06-12 04:58:40.34
availability_replica_state_change: current state: Resolving_Normal; previous_state: Primary_Normal;Timestamp: 2016-06-12 04:58:40.34
..
availability_replica_state_change: current state: Primary_Normal; previous_state: Primary_Pending;Timestamp: 2016-06-12 04:58:52.96
SQL Log:
Date: 12/06/2016 04:58:40; Error: 19421, Severity: 16, State: 1.
SQL Server hosting availability group did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.
Date: 12/06/2016 04:58:40; Error: 19407, Severity: 16, State: 1.
The lease between availability group and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
Date: 12/06/2016 04:58:40
AlwaysOn: The local replica of availability group is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
Cluster log (do not ask me why it's -1h, the date on all nodes is OK):
2016/06/12-03:58:40.587 INFO [RCM] rcm::RcmApi::FailResource: (AlwaysOn)
2016/06/12-03:58:40.588 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'AlwaysOn', gen(3) result 0/0.
2016/06/12-03:58:40.588 INFO [RCM] Res AlwaysOn: Online -> ProcessingFailure( StateUnknown )
2016/06/12-03:58:40.588 INFO [RCM] TransitionToState(AlwaysOn) Online-->ProcessingFailure.
2016/06/12-03:58:40.588 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (AlwaysOn, Online --> Pending)
2016/06/12-03:58:40.588 ERR [RCM] rcm::RcmResource::HandleFailure: (AlwaysOn)
2016/06/12-03:58:40.588 INFO [RCM] resource AlwaysOn: failure count: 1, restartAction: 2 persistentState: 1.
2016/06/12-03:58:40.588 INFO [RCM] numDependents is zero, auto-returning true
2016/06/12-03:58:40.588 INFO [RCM] Greater than restartPeriod time has elapsed since first failure of AlwaysOn, resetting failureTime and failureCount.
2016/06/12-03:58:40.588 INFO [RCM] Will queue immediate restart (500 milliseconds) of AlwaysOn after terminate is complete.
2016/06/12-03:58:40.588 INFO [RCM] Res AlwaysOn: ProcessingFailure -> WaitingToTerminate( DelayRestartingResource )
2016/06/12-03:58:40.588 INFO [RCM] TransitionToState(AlwaysOn) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].
2016/06/12-03:58:40.588 INFO [RCM] Res AlwaysOn: [WaitingToTerminate to DelayRestartingResource] -> Terminating( DelayRestartingResource )
2016/06/12-03:58:40.588 INFO [RCM] TransitionToState(AlwaysOn) [WaitingToTerminate to DelayRestartingResource]-->[Terminating to DelayRestartingResource].
2016/06/12-03:58:40.588 ERR [RES] SQL Server Availability Group <AlwaysOn>: [hadrag] Lease Thread terminated
2016/06/12-03:58:40.588 ERR [RES] SQL Server Availability Group <AlwaysOn>: [hadrag] The lease is expired. The lease should have been renewed by 2016/06/12-03:58:30.348
2016/06/12-03:58:40.588 INFO [RES] SQL Server Availability Group: [hadrag] Stopping Health Worker Thread
2016/06/12-03:58:40.588 INFO [RES] SQL Server Availability Group: [hadrag] Health worker was asked to terminate
Something odd - SQL wait times from last 12h:
Wait type                          Wait time (ms)   % of total wait
PREEMPTIVE_HADR_LEASE_MECHANISM    80,183,360       39.09%
PREEMPTIVE_SP_SERVER_DIAGNOSTICS   80,183,265       39.09%
HADR_CLUSAPI_CALL                  40,534,655       19.76%
Dodgy update somewhere? Let me know if you have any hints.
Thanks in advance,
Tomasz

1) Try rebooting your server.
2) If the server is unresponsive or CPU utilization reaches 100%, you can see these errors, because the lease may not be renewed in time.

Flutter app w/ Firebase Auth and RevenueCat's IAP not connecting

So I'm facing a really odd problem and don't really know where to look.
I have an app built in Flutter. I'm using Bitrise for CI/CD. I had the configuration set to use Xcode 11 until yesterday. I didn't notice that it needed changing until I got the message saying I needed to build using the iOS 14 SDK. I upgraded the stack and was able to publish a build to TestFlight.
I set the version to use Xcode 12.5 (latest). The app would ask me to log in again after re-opening, and I would get an error saying in-app purchases could not be loaded. Error code: 21004
This only happens on TestFlight builds. Local builds using Xcode 12.5 handle IAP and Firebase Auth just fine. I can't attach a debugger to a TestFlight build.
Since the only difference is Xcode, I decided to go back to 12.4. That seems to have solved the IAP problem, but the app is still logged out every time it opens. The saved user profile in the local DB is just fine; it's just Firebase that thinks it's logged out.
The error logs are unhelpful. The only thing I've managed to find is
default 13:45:20.643023-0600 photoanalysisd PLAccountStore accountDidChange, clearing cached properties.
default 13:45:20.643750-0600 assetsd PLAccountStore accountDidChange, clearing cached properties.
default 13:45:20.644216-0600 backboardd [1.26.0] +[ASEProcessing shouldEnhanceWidth:height:destinationWidth:destinationHeight:]: src={ 1242w x 2688h }, dest={ 1240w x 2683h }, aseFunctionOnYesOffNo=1
default 13:45:20.645602-0600 fitcored [1031] <private>::<private>
default 13:45:20.645884-0600 assetsd photoStreamAccountSettingsChanged
default 13:45:20.646817-0600 assetsd Clearing cached PLPhotoSharingHelper state
default 13:45:20.646867-0600 ptpd PLAccountStore accountDidChange, clearing cached properties.
default 13:45:20.647514-0600 wifid -[WiFiAccountStoreManager _updateIsManagedAppleIDAndNotify:]_block_invoke: No change Current (Non-Managed Account)
default 13:45:20.648268-0600 accountsd "<private> (<private>) received"
default 13:45:20.648443-0600 appstored elided platform fast path for key: re6Zb+zwFKJNlkQTUeT+/w
default 13:45:20.648542-0600 accountsd "<private> (<private>) received"
default 13:45:20.648829-0600 accountsd "<private> (<private>) received"
default 13:45:20.649444-0600 appstored AMSMescal: [IAPF9A5435D/[MY APP]:2_1] Skipping mescal - unabled to locate data to sign
error 13:45:20.649481-0600 appstored AMSURLRequest: [IAPF9A5435D/[MY APP]:2_1] Failed to add mescal header. Error: Error Domain=AMSErrorDomain Code=204 "Bag Value Missing" UserInfo={NSLocalizedDescription=Bag Value Missing, NSLocalizedFailureReason=The bag does not contain signed-actions nor did anyone register a default value. <AMSBagNetworkDataSource: 0x104e18ff0; profile: appstored; version: 1; sandbox: 1>}
default 13:45:20.649516-0600 appstored AMSURLSession: [IAPF9A5435D/[MY APP]:2_1] Preparing request: <AMSURLRequest: 0x1051bcd90> { URL: https://sandbox.itunes.apple.com/WebObjects/MZFinance.woa/wa/inAppPendingTransactions?guid=00008030-001A20E83C78802E }
default 13:45:20.649678-0600 appstored AMSURLSession: [IAPF9A5435D/[MY APP]:2_1] Task created: LocalDataTask <B3B892EF-AF0E-4185-AFE5-A28937F04BD8>.<2> Session: <__NSURLSessionLocal: 0x10762d2b0>
default 13:45:20.649717-0600 appstored Task <B3B892EF-AF0E-4185-AFE5-A28937F04BD8>.<2> resuming, timeouts(60.0, 604800.0) QOS(0x19) Voucher <private>
default 13:45:20.650280-0600 appstored [Telemetry]: Activity <nw_activity 12:2 [8861910E-6C9C-4029-8C30-FB15A400CDE9] (reporting strategy default)> on Task <B3B892EF-AF0E-4185-AFE5-A28937F04BD8>.<2> was not selected for reporting
default 13:45:20.650562-0600 appstored Task <B3B892EF-AF0E-4185-AFE5-A28937F04BD8>.<2> {strength 0, tls 4, ct 0, sub 0, sig 1, ciphers 0, bundle 1, builtin 0}
default 13:45:20.650599-0600 appstored Task <B3B892EF-AF0E-4185-AFE5-A28937F04BD8>.<2> now using Connection 2319
default 13:45:20.654134-0600 fitcored [1031] <private>::<private>
default 13:45:20.655855-0600 appstored Task <B3B892EF-AF0E-4185-AFE5-A28937F04BD8>.<2> sent request, body S 592
default 13:45:20.656338-0600 CommCenter #I querying acount name
default 13:45:20.656497-0600 accountsd "<private> (<private>) received"
default 13:45:20.657079-0600 backboardd [1.26.0] +[ASEProcessing shouldEnhanceWidth:height:destinationWidth:destinationHeight:]: src={ 1242w x 2688h }, dest={ 1240w x 2683h }, aseFunctionOnYesOffNo=1
default 13:45:20.658111-0600 fitcored Storefront matches cache: <private>
I'm at a loss. Does anyone have any ideas?
So, after having tons of fun debugging this,
I've opened a bug with Firebase:
https://github.com/FirebaseExtended/flutterfire/issues/5955

Solace Guaranteed Message Publish issue

We encountered an error when publishing Guaranteed Messages. It reports:
[ERR] {"message":"Guaranteed Message Window Closed","name":"OperationError","subcode":22,"level":"error","stack":"OperationError: Guaranteed Message Window Closed\n
at PublisherFSM.prepareAdMessageAndSend (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:21530:13)\n
at MessagePublisher.prepareAdMessageAndSend (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:20866:22)\n
at SessionFSM.prepareAndSendMessage (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:26483:49)\n
at Session.validateAndSendMessage (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:29161:22)\n
at Session.send (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:28463:10)\n
We know this happens when the client has too many unacknowledged messages pending (the default window size is 50): https://solace.com/blog/understanding-guaranteed-message-publish-window-sizes-and-acknowledgement/. According to the log, the application receives the acknowledgement more than 1 second after it publishes to Solace.
So what is the root cause that prevents the Solace server from acknowledging so many of the client's messages? We checked the client application log and found only the Solace-related error below:
at PublisherFSM.process (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:17780:11)
at PublisherFSM.processEvent (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:17841:15)
at State.onEntry (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:2201:17)
at BufferSMFClient.rxDataBuffer (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:33020:10)
at CorrelatedRequest.respRecvdCallback (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:21257:77)
at SessionFSM.handleADCtrlMessage (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:25872:26)
ERROR solclientjs: Uncaught exception in {PublisherFSM: PublisherFSM} while processing {PublisherFSMEvent: MessagePublisherFailed}: TypeError: Cannot read property 'nextCorrelationTag' of null
at PublisherFSM.<anonymous> (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:17847:49)
at State.handleOpenFlowResponse (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:21120:22)
at TcpRawTransport.onData (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:34168:20)
at SessionFSM.getCorrelationTag (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:25777:28)
at SessionFSM.handleSMFMessage (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:26116:23)
at BufferSMFClient._rxSmfCB (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:26352:41)
at BufferSMFClient._rxDataCB (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:33074:16)
at State.sendOpenFlow (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:21244:47)
at State.onEntry (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:21262:14)
at State.processReactionResult (/home/vcap/deps/0/node_modules/solclientjs/lib/solclientjs-debug.js:9001:18)
We use solclientjs 10.2.1 and Solace VMR 9.0.0.17.
The only workaround we have is to restart the client application, but the unacknowledged messages are then lost.
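One way to avoid hitting the "window closed" error is to cap the number of in-flight messages on the application side and only send when a slot is free. Below is a minimal sketch of that idea in Java. It is not the solclientjs API: `WindowedPublisher`, `trySend` and `onAck` are hypothetical names, and the window size of 50 is only the default mentioned above.

```java
import java.util.concurrent.Semaphore;

// Hypothetical application-side guard; it does not talk to a broker.
public class WindowedPublisher {
    private final Semaphore window;

    public WindowedPublisher(int windowSize) {
        // One permit per message that may be awaiting acknowledgement.
        this.window = new Semaphore(windowSize);
    }

    // Returns true if a slot was free, meaning the caller may publish now.
    // Returns false if the window is full; the caller should queue or back off.
    public boolean trySend() {
        return window.tryAcquire();
    }

    // Call once per acknowledgement received for a published message.
    public void onAck() {
        window.release();
    }

    public int freeSlots() {
        return window.availablePermits();
    }

    public static void main(String[] args) {
        WindowedPublisher publisher = new WindowedPublisher(50);
        System.out.println("can send: " + publisher.trySend());
        System.out.println("free slots: " + publisher.freeSlots());
    }
}
```

This only stops the application from overrunning the window. It does not explain why the acknowledgements are slow in the first place.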

Check active connection timeout in Tomcat JDBC Connection Pool

We have a connection to a PostgreSQL database that is configured with the Tomcat connection pool. The problem is that once a connection becomes active, it never goes back to idle.
When I start my microservice it has 0 active connections and 10 idle ones. After one hour of work there are 7 active and 3 idle. After the weekend there were 100 active; it had reached the limit and the service was down.
Is there any way to configure the Tomcat connection pool to check the state of active connections and close them if they are stuck?
It looks like your application is leaking connections. Hibernate's c3p0 pool provides facilities for detecting leaks; there are two parameters to configure:
5
true
With these set, it will print a stack trace for connections that stay active too long and close them.
This is not recommended under high load. If you are using another pool, look for a similar setting.
We have HTTP timeouts inside our cluster, and it seems these cause the connection leak. I investigated and found that the connection always remains active.
The solution for me was to enable removal of abandoned connections.
private DataSource configureDataSource(String url, String user, String password, String driverClassName) {
    DataSource ds = DataSourceBuilder.create()
            .url(url)
            .username(user)
            .password(password)
            .driverClassName(driverClassName)
            .build();
    // This cast only works when the pool that was built is the Tomcat JDBC pool.
    org.apache.tomcat.jdbc.pool.DataSource configuredDataSource = (org.apache.tomcat.jdbc.pool.DataSource) ds;
    // some other configurations here
    // ...
    // Treat a connection as abandoned after 300 seconds in the active state...
    configuredDataSource.getPoolProperties()
            .setRemoveAbandonedTimeout(300);
    // ...and let the pool close such connections.
    configuredDataSource.getPoolProperties()
            .setRemoveAbandoned(true);
    return configuredDataSource;
}

@Bean(name = "qaDataSource")
public JdbcTemplate getQaJdbcTemplate() {
    DataSource ds = configureDataSource(qaURL, qaUsername, qaPassword, qaDriverClassName);
    return new JdbcTemplate(ds);
}
The RemoveAbandoned and RemoveAbandonedTimeout settings mean that a connection that stays in the active state longer than the timeout will be closed. If you add this to your code, make sure the timeout is longer than the maximum query execution time for your service.
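Removing abandoned connections treats the symptom. The leak itself usually comes from a code path that fails (for example on a timeout) before the connection is closed. The sketch below shows the difference between a manual close and try-with-resources. It assumes nothing about your code: `FakeConnection` is a hypothetical stand-in for a pooled connection and `ACTIVE` plays the role of the pool's active counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionLeakSketch {
    // Stand-in for the pool's "active connections" metric.
    static final AtomicInteger ACTIVE = new AtomicInteger();

    // Hypothetical pooled connection: close() hands it back to the pool.
    static final class FakeConnection implements AutoCloseable {
        FakeConnection() { ACTIVE.incrementAndGet(); }
        @Override public void close() { ACTIVE.decrementAndGet(); }
    }

    // Leaky: if the work throws, close() is never reached.
    static void leakyCall(boolean fail) {
        FakeConnection c = new FakeConnection();
        if (fail) throw new IllegalStateException("simulated timeout");
        c.close();
    }

    // Safe: try-with-resources closes the connection even when the work throws.
    static void safeCall(boolean fail) {
        try (FakeConnection c = new FakeConnection()) {
            if (fail) throw new IllegalStateException("simulated timeout");
        }
    }

    // Runs a call from a clean counter and reports how many connections stayed active.
    static int activeAfter(Runnable call) {
        ACTIVE.set(0);
        try {
            call.run();
        } catch (IllegalStateException expected) {
            // The simulated failure is expected; only the counter matters here.
        }
        return ACTIVE.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + activeAfter(() -> leakyCall(true)));
        System.out.println("safe:  " + activeAfter(() -> safeCall(true)));
    }
}
```

With JdbcTemplate the connection handling is done for you, so a leak is more likely in code that obtains connections directly from the DataSource.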

housekeeper [deleted 105926 hist/trends, 0 items, 0 events, 0 sessions, 0 alarms, 0 audit items in 3.718012 sec, idle for 1 hour(s)]

My Zabbix server logs the following:
# sudo tail -f /var/log/zabbix/zabbix_server.log
housekeeper [deleted 105926 hist/trends, 0 items, 0 events, 0 sessions, 0 alarms, 0 audit items in 3.718012 sec, idle for 1 hour(s)]
and after this it fails to send the alerts:
5243:20171213:180658.517 Failed sending data to the peer: DATA failed: 550
5243:20171213:180702.182 Failed sending data to the peer: DATA failed: 550
5243:20171213:180705.812 Failed sending data to the peer: DATA failed: 550
Can you help me understand why this occurs and how to fix it?
I solved this.
The cause was that I had configured too many emails per person: each user
received more than 3 alerts for a problem plus an OK alert. I reduced this
to a single alert per problem per person.
When the **Zabbix alerter process exceeds 75%**, this error occurs:
Zabbix alerter processes more than 75% busy
and Zabbix is not able to send alerts to all recipients.

Timeout errors which happens sometimes on Azure SQL Databases called in Azure Web jobs

We have a web application running on Azure that performs miscellaneous database maintenance tasks like creating databases, deleting unused databases, and so on. Everything is running on Azure SQL.
This application runs around the clock, and the maintenance tasks are performed every hour. Most of the time, everything goes well. However, the task sometimes ends with errors like these:
HTTP error GatewayTimeout : The gateway did not receive a response from ‘Microsoft.Sql’ within the specified time period
HTTP error ServiceUnavailable : The request timed out
SQLException : Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
SQLException : A connection was successfully established with the server, but then an error occurred during the pre-login handshake
It seems like the database is not reachable when this happens.
We'd be glad if someone could help us to debug the issue.
thank you in advance.
Some errors, including transient ones, are particular to Azure SQL Database. Transient fault errors typically manifest as one of the following error messages from your client programs:
•Database on server is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID of
•Database on server is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID of . (Microsoft SQL Server, Error: 40613)
•An existing connection was forcibly closed by the remote host.
•System.Data.Entity.Core.EntityCommandExecutionException: An error occurred while executing the command definition. See the inner exception for details. ---> System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: Session Provider, error: 19 - Physical connection is not usable)
•A connection attempt to a secondary database failed because the database is in the process of reconfiguration and it is busy applying new pages while in the middle of an active transaction on the primary database.
Because of these errors and others, it is necessary to build retry logic into applications that connect to Azure SQL Database.
public void HandleTransients()
{
    var connStr = "some database";
    // Retry up to 3 times, waiting 5 seconds between attempts.
    var _policy = RetryPolicy.Create<SqlAzureTransientErrorDetectionStrategy>(
        retryCount: 3,
        retryInterval: TimeSpan.FromSeconds(5));
    using (var conn = new ReliableSqlConnection(connStr, _policy))
    {
        // Do SQL stuff here.
    }
}
More about how to create a retry logic here.
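If you are not using a library that provides a retry policy, the same idea can be hand-rolled. Here is a minimal sketch in Java with a fixed retry interval, mirroring the retryCount / retryInterval parameters above. `RetrySketch` and `withRetry` are hypothetical names, and deciding which exceptions count as transient is left to the caller through the `isTransient` predicate (checked exceptions such as SQLException would need to be wrapped first).

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RetrySketch {
    // Runs the action, retrying up to retryCount times when the failure is
    // classified as transient, and sleeping retryIntervalMillis between attempts.
    public static <T> T withRetry(Supplier<T> action,
                                  Predicate<RuntimeException> isTransient,
                                  int retryCount,
                                  long retryIntervalMillis) {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Give up on non-transient errors or when retries are exhausted.
                if (attempt >= retryCount || !isTransient.test(e)) {
                    throw e;
                }
                try {
                    Thread.sleep(retryIntervalMillis);
                } catch (InterruptedException interrupted) {
                    // Preserve the interrupt flag and surface the original failure.
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withRetry(() -> {
            // Fail twice with a simulated transient error, then succeed.
            if (++calls[0] < 3) throw new IllegalStateException("simulated transient error");
            return "ok";
        }, e -> e instanceof IllegalStateException, 3, 100L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A fixed interval keeps the sketch short; for real workloads a growing interval between attempts is a common refinement.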
Throttling is also a cause of timeouts. The following query may help you understand the impact of workloads on the Azure SQL database.
SELECT
    (COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent'
    ,(COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent'
    ,(COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats
-- service level objective (SLO) of 99.9% <= go to next tier
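To make the arithmetic in the query explicit: each "fit percent" is the fraction of samples that stayed at or below the 80% threshold, i.e. (total samples - samples over threshold) / total samples. A small Java sketch of the same calculation, with made-up sample values and a hypothetical `FitPercent` class:

```java
public class FitPercent {
    // Fraction of samples that are NOT above the threshold,
    // matching (COUNT - SUM(CASE WHEN value > threshold ...)) / COUNT in the query.
    public static double fitPercent(double[] samples, double threshold) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int over = 0;
        for (double s : samples) {
            if (s > threshold) {
                over++;
            }
        }
        return (samples.length - over) / (double) samples.length;
    }

    public static void main(String[] args) {
        // Made-up avg_cpu_percent samples: one of four exceeds 80.
        double[] cpu = {10, 90, 50, 70};
        System.out.println(fitPercent(cpu, 80)); // 3 of 4 samples fit
    }
}
```

Going by the comment in the query, a result below 0.999 for any of the three metrics would be a hint to look at the next service tier.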
