Thursday 20 February 2020

Availability Group not synchronizing

There was an availability group failover in the night.  Some sort of glitch – glitch happens.  The Primary flicked over to the Secondary and the Secondary became the Primary, no problem.
The old Primary became the Secondary
BUT…

Databases on the New Secondary are marked as Not Synchronizing

[Screenshot: the AG dashboard showing the databases on the new Secondary as Not Synchronizing]

The error log says: A connection timeout has occurred on a previously established connection to availability replica with id [5FA3CBDB-5504-43E2-8697-2C713110307B].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

Well, the replica has certainly transitioned.  Other SQL Server instances on the same nodes were unaffected.
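If you want to put a name to the GUID in that error message, the catalog view will tell you which replica it is - something like this (a sketch; the GUID is just the one from the log entry above):

-- Map the replica_id from the error message to a server name
SELECT replica_server_name,
       availability_mode_desc,
       failover_mode_desc
FROM   sys.availability_replicas
WHERE  replica_id = '5FA3CBDB-5504-43E2-8697-2C713110307B';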


A bit of googling finds this article.
Microsoft recommends restarting the service on the secondary replica.
The problem occurs because of an internal race condition.
What on earth does that mean?  Ah - a timing problem, two things happening in an unlucky order; a bit like a deadlock.

And another article.
Resolve Secondary Replica Database Not Synchronizing
You cannot ONLINE a secondary replica availability group database. Since a secondary replica availability group database is really in an ongoing restoring state, it cannot be set Online like the primary replica. Remember, it is already in the Online role, according to the DMVs.

Solution
Restart the SQL Server instance hosting the secondary replica. This will initiate recovery on all databases including availability group databases, and so long as SQL Server can access all the database files, the availability group database affected should recover successfully and resume synchronization with the primary replica. 
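It is easy enough to confirm what the article describes from the DMVs. A sketch along these lines, run on the new Secondary, shows each affected database reported as ONLINE for database state but NOT SYNCHRONIZING for synchronization state:

-- How the local replica sees its AG databases (run on the new Secondary)
SELECT ag.name                         AS ag_name,
       DB_NAME(drs.database_id)        AS database_name,
       drs.database_state_desc,        -- expect ONLINE, even on the secondary
       drs.synchronization_state_desc, -- this is where NOT SYNCHRONIZING shows up
       drs.synchronization_health_desc,
       drs.is_suspended
FROM   sys.dm_hadr_database_replica_states AS drs
JOIN   sys.availability_groups AS ag
       ON ag.group_id = drs.group_id
WHERE  drs.is_local = 1;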

OK, that all seems pretty clear.

Next question - can we do it now?  In the working day?  

Well, if we don't, we have lost our Availability Group resilience - and that's not a Good Thing.

If you look at the picture above, we have an estimated recovery time of 0 seconds.  I've added a few extra columns to my Dashboard - there's a list of them here.  Now that seems too good to be true.  It's been down for a few hours - are you seriously expecting me to believe that there is no data waiting to go across to the Secondary?  Pshaw! I say to you.
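As far as I can tell, that estimated recovery time figure is essentially the redo queue divided by the redo rate, straight out of the DMV. A sketch of the same calculation, run on the Primary (sizes in KB, rates in KB/sec):

-- Rough equivalent of the dashboard's extra columns, run on the Primary
SELECT DB_NAME(drs.database_id)         AS database_name,
       drs.log_send_queue_size,         -- KB of log not yet sent to the secondary
       drs.redo_queue_size,             -- KB of log received but not yet redone
       drs.redo_rate,                   -- KB per second
       1.0 * drs.redo_queue_size / NULLIF(drs.redo_rate, 0) AS est_recovery_secs
FROM   sys.dm_hadr_database_replica_states AS drs
WHERE  drs.is_local = 0;                -- the secondary's rows, as seen from the primary

When the secondary has stopped reporting, those queue columns can come back NULL or stale, which is probably why the dashboard's 0 seconds looked too good to be true.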
One of my colleagues thinks that Dave has mentioned this issue before – but he isn’t sure.  Dave is on holiday - because of course he is.  Whenever you need someone who knows stuff, Sod's Law mandates that they are not around.  
So we are a bit scared of restarting the services on the secondary – is it going to impact the primary as it catches up with the backlog of log waiting to be transferred?
We decide to wait until everyone goes home for the night. 
I stop the Agent service.
I stop the SQL Server service.
I start the SQL Server service.
I start the Agent service.
I go and have a look at the Dashboard.

[Screenshot: the AG dashboard after the restart]

The two small databases have already sorted themselves out.  
The two big ones estimate 20 minutes and 10 minutes to recover, but it quickly becomes obvious that this is an over-estimate: the actual time taken was about 3 minutes for the larger and 1 minute for the smaller.  There doesn't appear to have been any impact on the Primary, but it had all fixed itself before I could check properly.
When Dave returns from his hols he confirms that restarting the services on the secondary will fix this; it takes seconds, he says.  He reports that it sometimes happens when he manually fails over, e.g. to apply patches.  Out of hours, of course.
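For reference, Dave's manual failover is presumably just the standard planned failover command, run on the secondary that is about to become the primary - a sketch, with [MyAG] standing in for whatever the availability group is actually called:

-- Planned manual failover, run on the secondary being promoted
-- ([MyAG] is a made-up name - substitute the real availability group)
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

That only works cleanly when the target secondary is a synchronized, synchronous-commit replica; anything else needs the forced version with its data-loss warning, which is a different conversation.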
So with hindsight, I would feel much more confident about restarting the service on the Secondary node, and would do it in the working day rather than run the risk of losing data.
