Thursday, 20 February 2020

Availability Group not synchronizing

There was an Availability Group failover in the night.  Some sort of glitch - glitch happens.  The Primary flicked over to the Secondary and the Secondary became the Primary, no problem.
The old Primary became the Secondary.
BUT…

Databases on the New Secondary are marked as Not Synchronizing




The error log says: A connection timeout has occurred on a previously established connection to availability replica with id [5FA3CBDB-5504-43E2-8697-2C713110307B].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

Well, the replica has certainly transitioned.  Other instances on the nodes were unaffected.


A bit of googling finds this article.
Microsoft recommends restarting the service on the secondary replica.
The problem occurs because of an internal race condition.
 What on earth does that mean?  Ah - it's a bit like a deadlock.  

And another article.
Resolve Secondary Replica Database Not Synchronizing: you cannot ONLINE a secondary replica availability group database.  Since a secondary replica availability group database is really in an ongoing restoring state, it cannot be set Online like the primary replica.  Remember, it is already in the Online role, according to the DMVs.

Solution
Restart the SQL Server instance hosting the secondary replica. This will initiate recovery on all databases including availability group databases, and so long as SQL Server can access all the database files, the availability group database affected should recover successfully and resume synchronization with the primary replica. 

OK, that all seems pretty clear.

Next question - can we do it now?  In the working day?  

Well, if we don't, we have lost our Availability Group resilience - that's not a Good Thing.

If you look at the picture above, we have an estimated recovery time of 0 seconds.  I've added a few extra columns to my Dashboard - there's a list of them here.  Now that seems too good to be true.  It's been down for a few hours; are you seriously expecting me to believe that there is no data waiting to go across to the Secondary?  Pshaw! I say to you.
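Incidentally, the numbers behind those dashboard columns come from the DMVs, so you can sanity-check them yourself.  Something along these lines (a rough sketch, run on the Primary):

SELECT  ar.replica_server_name,
        DB_NAME(drs.database_id)        AS database_name,
        drs.synchronization_state_desc,
        drs.log_send_queue_size,        -- KB of log waiting to be sent to the secondary
        drs.redo_queue_size,            -- KB of log waiting to be redone on the secondary
        drs.redo_rate                   -- KB/sec; redo queue divided by redo rate is roughly the estimated recovery time
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;

If the send and redo queues really are empty, then the 0 seconds estimate is believable - Pshaw or no Pshaw.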
One of my colleagues thinks that Dave has mentioned this issue before – but he isn’t sure.  Dave is on holiday - because of course he is.  Whenever you need someone who knows stuff, Sod's Law mandates that they are not around.  
So we are a bit scared of restarting the services on the secondary – is it going to impact on the primary as it catches up with the backlog of stuff to be transferred? 
We decide to wait until everyone goes home for the night. 
I stop the agent service.  
I stop the SQL Server service
I start the SQL Server service.  
I start the Agent service
I go and have a look at the Dashboard



The two small databases have already sorted themselves out.  
The two big ones estimate 20 minutes and 10 minutes to recover, but it quickly becomes obvious that this is an over-estimate.  The actual time taken to recover was about 3 mins for the larger, 1 min for the smaller.  There doesn't appear to have been any impact on the Primary, but it fixed itself before I could check.  
When Dave returns from his hols he confirms that restarting the services on the secondary will fix this; it takes seconds, he says.  He reports that it sometimes happens when he manually fails over, e.g. to apply patches.  Out of hours, of course.
So with hindsight, I would feel much more confident about restarting the service on the Secondary node, and would do it in the working day rather than run the risk of losing data.

Friday, 29 March 2019

Shrinking that Database

Rule 1 - don't shrink that database!

It's a bad thing to do.  Brent Ozar says so.  And I believe him.  Really, it's a bad idea.

Imagine the stock control database.
You buy widgets, you sell widgets
100 widgets into stock
80 widgets out
No need to free up the disk space, because the day after, you are going to add another hundred widgets.
And SQL Server doesn't free up the space.  It just sits there waiting for more widgets.

There's a but coming.

But suppose you need to free up space?  In my situation, I purged the database of two million old records into an archive DB.  Now I have an archive DB taking up 150 GB.  The original DB contains only 10,000 records (the last 15 days) but it still takes up 150GB, with 140GB free.  There's no room on the disk, and the Infrastructure guys aren't keen on giving me any more because it is an old physical box and it would mean them actually doing some work for a change.  
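A quick way to see that gap between allocated and used space is sp_spaceused - the database name below is a placeholder, obviously:

USE DatabaseName
GO
-- with no arguments this reports the overall database size and the unallocated (free) space inside the files
EXEC sp_spaceused
GO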

No problem, I can do this

I set up a maintenance plan:



Doing it as a maintenance plan lets me specify a time out for each step, which saves me the task of coding it. 

Step 1 - Shrink it in manageable chunks
Rather than try and shift all that 140GB of empty space in one go, I want to shift a bit at a time, rather like a batch process.  Anil Kumar sets out a way to do it, and I unashamedly stole his code.
How big a chunk?  Doesn't really matter - 512MB seemed like a good figure.  The point is that it shrinks, then shrinks, then shrinks again.  If it fails for some reason, e.g. hits the time out, then you have at least done a few of the chunks.
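I won't reproduce Anil's code here, but the shape of it is something like this rough sketch (the file name, target and chunk size are placeholders, not his figures):

USE DatabaseName
GO
DECLARE @target_mb  int = 20480;   -- the size we eventually want the file to be, in MB
DECLARE @chunk_mb   int = 512;     -- how much to shrink per pass
DECLARE @current_mb int;

SELECT @current_mb = size / 128    -- size is stored in 8KB pages, so /128 gives MB
FROM sys.database_files
WHERE name = N'FileName';

WHILE @current_mb > @target_mb
BEGIN
    SET @current_mb = CASE WHEN @current_mb - @chunk_mb < @target_mb
                           THEN @target_mb
                           ELSE @current_mb - @chunk_mb END;
    DBCC SHRINKFILE (N'FileName', @current_mb);   -- shrink down to the next chunk boundary
END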

Step 2 - Release Space
Just in case it hasn't gone already - belt and braces:
USE DatabaseName
GO
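-- TRUNCATEONLY releases the free space at the end of the file back to the operating system without moving any pages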
DBCC SHRINKFILE (N'FileName' , 0, TRUNCATEONLY)
GO

Step 3 - Smart Rebuild
All that shrinking is going to screw up your fragmentation levels something 'orrible.  
Bad, really bad, for performance.
If Brent Ozar catches you, he will not be a happy bunny.
So we need to Rebuild or Reorganize the indexes.  
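If you want to see the damage first, a quick check against the DMVs looks roughly like this (a sketch, current database only):

SELECT  OBJECT_NAME(ps.object_id)  AS table_name,
        i.name                     AS index_name,
        ps.avg_fragmentation_in_percent,
        ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ps
JOIN sys.indexes i ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE i.name IS NOT NULL           -- skip heaps
ORDER BY ps.avg_fragmentation_in_percent DESC;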

All clear?

Right then.  So I fire it off, it runs, step 1 times out as expected, the other two steps run, all well and good.  

Except that the database hasn't shrunk.  Not one MB less.  



I spent a lot of time with my friend Google, and I see lots of people are having similar issues, hence this blog, which I hope will help others.  

I set up a bare minimum job to shrink the database a little bit:

USE DatabaseName
GO
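-- the target size is given in MB, so 180176 is roughly 176GB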
DBCC SHRINKFILE (N'FileName' , 180176)
GO

I set it running, and went home.  Came in the next day, and it had succeeded - two hours 45 minutes, and it shaved 6.5GB off the database!  (That IS a little bit, when you have a 190GB database.)  And that is a much longer time than I allowed for in my timeout period.  I reduced the target and ran it again - again it succeeded, this time removing 10GB in 03:37:40.

I'm not sure why the shrink-in-manageable-chunks code didn't work; it's on my list to troubleshoot.  But in the meantime I can live with a rough and ready shrink that works.


Monday, 7 January 2019

Understanding Unique Constraints

I'm looking at a database from a vendor, who shall be Nameless.

It contains a table called tblConfig.  OK, I know some people hate to see tables with the prefix tbl, but I can see the point of it.  That isn't my issue.

I'm reviewing indexes, with a view to getting rid of duplicates.  There really is no point in having two indexes doing the same thing - SQL Server will only use one of them, but it has to keep both updated.
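For the review I was listing each index's key columns side by side with something like the query below - a sketch, nothing vendor-specific about it - so that duplicates jump out:

SELECT  t.name       AS table_name,
        i.name       AS index_name,
        i.is_unique,
        STUFF((SELECT ', ' + c.name
               FROM sys.index_columns ic
               JOIN sys.columns c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
               WHERE ic.object_id = i.object_id
                 AND ic.index_id  = i.index_id
                 AND ic.is_included_column = 0
               ORDER BY ic.key_ordinal
               FOR XML PATH('')), 1, 2, '') AS key_columns
FROM sys.indexes i
JOIN sys.tables t ON t.object_id = i.object_id
WHERE i.type > 0                    -- ignore heaps
ORDER BY t.name, key_columns, i.name;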

And that's where I found it:

There is an index on that table called UniqueConstraint.  

There is another index, also called UniqueConstraint.  

One is unique, the other isn't, which is how the developer got away with it.  

The index UniqueConstraint, which is unique, indexes different fields from its namesake index UniqueConstraint, which is not unique.

There is also an index called UniqueConstraintReporting.  This duplicates the index UniqueConstraint which is not unique.  UniqueConstraintReporting is unique.

I've dropped the UniqueConstraint index which is not unique, leaving me with the unique indexes UniqueConstraint and UniqueConstraintReporting.

I hope that is all clear to everyone?








Tuesday, 23 October 2018

In Search of a Smarter Maintenance Plan

I wrote a blog post about Implementing Maintenance Plans a while back.  In it I suggested that if you have time to rebuild the indexes once a week, you should go ahead and do it. 

But what if you DON'T have time?

Indexes

I suggested rebuilding the indexes once a week.  And the bog-standard maintenance plan rebuilds every index, whether it needs it or not.  But if things are heavily fragmented, or there are a lot of indexes with millions of rows each, that might take an awfully long time.  I would expect heavily fragmented indexes to adversely affect performance, so I set up a Smart Rebuild / Reorganise job to run as part of the weekly maintenance plan.



Instead of the usual Rebuild Index task, grab the Execute T-SQL Statement task and add it to the maintenance plan.  You can restrict the run time with the Execution Time Out parameter - specifying 3600 seconds means that after an hour the step fails, so that it does not over-run the maintenance window.  Because it fails when it hits that time out, I have put this step last.  It will also send an error message email if you have it set up to do so.




Here's the script - I swiped it from Artemakis Artemiou - poli efcharisto (thank you very much)!
The job identifies all the indexes with fragmentation levels >15%
It starts with the worst and works down
Above 30% it rebuilds them
Between 15% and 30% it re-organises them - there's a rough sketch of that logic just below.
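To be clear, the sketch isn't Artemakis's script - it's just the same rebuild-or-reorganise idea, with my own variable names and the thresholds above:

DECLARE @table nvarchar(300), @index nvarchar(260), @frag float, @sql nvarchar(max);

DECLARE frag_cursor CURSOR FAST_FORWARD FOR
    SELECT  QUOTENAME(s.name) + '.' + QUOTENAME(t.name),
            QUOTENAME(i.name),
            ps.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ps
    JOIN sys.indexes i ON i.object_id = ps.object_id AND i.index_id = ps.index_id
    JOIN sys.tables  t ON t.object_id = ps.object_id
    JOIN sys.schemas s ON s.schema_id = t.schema_id
    WHERE ps.avg_fragmentation_in_percent > 15
      AND i.name IS NOT NULL                          -- skip heaps
    ORDER BY ps.avg_fragmentation_in_percent DESC;    -- worst first

OPEN frag_cursor;
FETCH NEXT FROM frag_cursor INTO @table, @index, @frag;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- rebuild above 30% fragmentation, re-organise between 15% and 30%
    SET @sql = N'ALTER INDEX ' + @index + N' ON ' + @table
             + CASE WHEN @frag > 30 THEN N' REBUILD;' ELSE N' REORGANIZE;' END;
    EXEC sp_executesql @sql;
    FETCH NEXT FROM frag_cursor INTO @table, @index, @frag;
END
CLOSE frag_cursor;
DEALLOCATE frag_cursor;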

If the database is badly fragmented, you might want to run this daily, at least until they are under control.  


Statistics
I also set up a job to update statistics.  Statistics can get out of date, and queries may be less efficient because of this. 

Updating statistics is a bit of a funny one.  If you rebuild the indexes, you get a nice new set of statistics.  But if you are not rebuilding the indexes, you have to rely on SQL Server to update them as it thinks fit, according to its algorithm.  Sod's Law says that it won't update the ones you want it to update.  The optimiser can ignore a perfectly good index because the statistics are out of date, and a query may take far longer to run because of it.
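If you want to see how stale things actually are before trusting the algorithm, something like this will show the last update date and how many rows have changed since (a quick sketch, not Bill's procedure):

SELECT  OBJECT_NAME(s.object_id)   AS table_name,
        s.name                     AS stats_name,
        sp.last_updated,
        sp.rows,
        sp.modification_counter    AS rows_modified_since_update
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.modification_counter DESC;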



As before, I added a T-SQL task to the Daily tasks,  last in the line.  
I've given it a time limit of 5400 seconds (an hour and a half, to save you getting your calculator out).
And it runs a stored procedure which I found here...
Thanks to Bill Graziano!



Note that Bill's script sets a time limit too, so I don't really need to set an execution timeout.

There's an alternative to the above - the well regarded and reputable scripts from Ola Hallengren.
A lot of people swear by them.







Wednesday, 17 October 2018

The Madness Continues


I have a script which identifies MDF files which do not appear in my list of databases.

The purpose of this is to find databases, usually on development boxes, which have been detached and forgotten, but are taking up space on the disk.  A detached database is of course invisible, unless you go poking around looking for it.
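I won't bore you with the actual script, but the gist of the idea is something like this - the folder path and helper table are made up, and xp_dirtree is undocumented but widely used:

DECLARE @files TABLE (subdirectory nvarchar(260), depth int, [file] int);

-- list everything one level down in the data folder, files included
INSERT INTO @files
    EXEC master.sys.xp_dirtree N'F:\User Databases', 1, 1;

-- any .mdf on disk that sys.master_files has never heard of is a candidate orphan
SELECT f.subdirectory AS mdf_on_disk
FROM @files f
WHERE f.[file] = 1
  AND f.subdirectory LIKE N'%.mdf'
  AND NOT EXISTS (SELECT 1
                  FROM sys.master_files mf
                  WHERE mf.physical_name LIKE N'%' + f.subdirectory);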

What I usually do is:
i)  attach them again
ii) take a final backup just in case they actually are needed
     (not very likely, but sod's law says if you don't do this, they will suddenly become critical)
iii)  drop the thing

But today, I failed at the first hurdle:



Create file encountered operating system error 32

The process cannot access the file because it is being used by another process

SQL Server error 5123 tells me it's a problem attaching the database.

Huh?  It's a detached database file, isn't it?  
What could possibly be using the detached file?

Some utter maniac renamed the .ldf file as .mdf

name          fileid  filename                                  filegroup  size       maxsize        growth    usage
GenDraft      1       F:\User Databases\GenDraft_20180925.mdf   PRIMARY    40960 KB   Unlimited      8192 KB   data only
GenDraft_log  2       G:\Logs\GenDraft_20180925.mdf             NULL       12352 KB   2147483648 KB  4096 KB   log only


Thursday, 16 August 2018

Database Sizes and Growth Rates

I've been asked to write a few words about this thrillingly sexy topic.  Why?  Because I've been looking at some existing servers and finding quite a few issues which are Bill's Fault.

Microsoft thinks that a good initial size for a database is 1MB, and it should grow by 1MB every time it grows.   Logs should grow by 10%, which isn't utterly mad, but isn't a good idea either.   Since Microsoft sets these defaults, it's easy to set up a database.  Which is good.  But since they are lousy defaults, it's bad.

I suppose these figures are there because there is no right answer to the question  "How big should these settings be then, Mr Smartypants DBA?". 

So what's the problem?

Well, I recently found a database which had grown from 500MB to 11071MB in 1MB increments.  That's a LOT of growth events - over ten thousand of them.  Each time it grows, it has to stop and think about it.  And ten thousand growth events is a lot of hanging around, twiddling your thumbs, waiting for it to think.

Hindsight is a wonderful thing.  I set that database to grow in 512MB increments.  Because I now know how big it is, and how big it was when it started. 

In a perfect world, you would carefully calculate how big the database is likely to be after a year or two of adding data, and size it accordingly, plus maybe 10% for luck.  In the real world, you rely on SQL Server's capacity to grow when it needs to.

So what's wrong with growing by a percentage? 

First answer:  Small database, growing by 10% each time, not a big issue.  Big database, growing by 10% each time, maybe 10GB every time it grows, is going to take a long time!
Second answer: Log files are composed of Virtual Log Files (VLFs) and ideally you don't want huge numbers of them, all different sizes - which they will be if they keep growing by 10% at a time. 

And if you forced me to set a default growth rate without knowing anything about the database, I would choose 16MB for the data, 8MB for the logs.  Better than Microsoft's default values, anyway.  That's what I put in the MODEL database. 
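The change itself is a one-liner per file.  A sketch below - the names are made up, apart from model's own logical file names (modeldev / modellog on a default install):

-- the database that grew 1MB at a time: give it a sensible fixed increment instead
ALTER DATABASE MyDatabase MODIFY FILE (NAME = N'MyDatabase_data', FILEGROWTH = 512MB);
ALTER DATABASE MyDatabase MODIFY FILE (NAME = N'MyDatabase_log',  FILEGROWTH = 128MB);  -- log figure illustrative only

-- and the defaults I put in MODEL, so new databases inherit them
ALTER DATABASE model MODIFY FILE (NAME = N'modeldev', FILEGROWTH = 16MB);
ALTER DATABASE model MODIFY FILE (NAME = N'modellog', FILEGROWTH = 8MB);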

Friday, 10 August 2018

Login Failed on Named Instance



I have a SQL Server 2016 server – call it Gandalf

There are three instances – Default, Frodo, and Sam

Every day I check the error logs.  Every day I see that someone or something is trying to log in at 0200. 

Error: 18456, Severity: 14, State: 5. 
Login failed for user ‘agent_default'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

And on Sam, Login failed for user ‘agent_default' and Login failed for user ‘agent_Frodo'

Microsoft's syspolicy_purge_history job runs every night at 0200 on every instance.  It always succeeds.  But it is the only SQL Server agent job that runs at 0200.  So it must be that, surely?  I change it to run at 0215.  Next day, the error log says that our mysterious friend attempted to log in at 0200. 

Hmm, not that then.  Is it an automated scan from our white hat penetration testing server?  I wasted a lot of time trying to pin that down before finding confirmation that nope - it is not set up to look at Gandalf.

And then suddenly I noticed that the login attempt had happened at 0215, not 0200.  It IS syspolicy_purge_history!  But it is the job on the OTHER instance which is causing the problem.



1) When the job runs on the Frodo instance, there are no issues.
2) When the job runs on the Default instance, the error log on Frodo says the Default agent account failed to log in:
Error: 18456, Severity: 14, State: 5.
Login failed for user ‘agent_default'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]
3) When the job runs on the Sam instance, the error logs on Default and Frodo say:
Login failed for user ‘agent_default' and
Login failed for user ‘agent_Frodo'


If you check out that job, Step 3 is running a chunk of Powershell.  It gets the Instance name and runs a task to erase records (older than the specified number of days).

if ('$(ESCAPE_SQUOTE(INST))' -eq 'MSSQLSERVER') {$a = '\DEFAULT'} ELSE {$a = ''};
(Get-Item SQLSERVER:\SQLPolicy\$(ESCAPE_NONE(SRVR))$a).EraseSystemHealthPhantomRecords()

Except that it also looks at all the other instances.  And because I have a different service account for each instance, it fails to find a login on those instances - which is why the job succeeds on each instance, but the other instances record a failed login attempt!
It's a known bug with Powershell, apparently.  And that link gives me the Powershell code to fix it!

$applicationName = "SQLPS ($env:USERNAME@$env:COMPUTERNAME)"
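# connect only to the instance this job step is running on, then purge its phantom health records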
$SQLServerConnection = New-Object System.Data.SqlClient.SqlConnection
$SQLServerConnection.ConnectionString = "Data Source=$(ESCAPE_NONE(SRVR));Initial Catalog=master;Integrated Security=SSPI;Application Name=$applicationName"
$PolicyStoreConnection = New-Object Microsoft.SqlServer.Management.Sdk.Sfc.SqlStoreConnection($SQLServerConnection)
$PolicyStore = New-Object Microsoft.SqlServer.Management.Dmf.PolicyStore ($PolicyStoreConnection)
$PolicyStore.EraseSystemHealthPhantomRecords()

Just cut and paste that into Step 3 of the syspolicy_purge_history job, and the mysterious errors stop.