Saturday, 3 August 2013

Check your backups - or else!

You make backups.

You restore backups.

Not all of them, necessarily - but every now and then, restore a backup just to check that it's OK.  Find a location on a test server - maybe even on your own laptop, if it has enough grunt.  Restore it.  Check it - maybe run:

dbcc checkdb(vanilla) with no_infomsgs

It should think for a while and then say Command(s) completed successfully.

If it restores OK and DBCC finds no issues, then you can be reassured that things are probably OK with the backups for that database.  Drop it from the laptop - you don't need it anymore.  

Why "with no_infomsgs"?  If you omit this bit, DBCC Checkdb will come back with page after page of guff - this is OK, that's OK - this sort of thing:

DBCC results for 'Production.BillOfMaterials'.
There are 2679 rows in 20 pages for object "Production.BillOfMaterials".
DBCC results for 'sys.ifts_comp_fragment_610101214_4'.
There are 351 rows in 2 pages for object "sys.ifts_comp_fragment_610101214_4".
On and on it goes

And you have to read through it all  looking for anything which isn't OK.  Sling the stuff which is there "just for information" and life is much simpler.  

What if it is corrupt?
  1. Don't panic
  2. Don't detach the database
  3. Don't restart SQL
  4. Don't just run repair.
  5. Run an integrity check
  6. Afterwards, do a root-cause analysis
This is from Gail Shaw - read more of her words of wisdom here.  In fact save that link in your favourites, memorise it, get it tattooed upon your left buttock - whatever it takes.  You will need it on the day you find a problem.  I once found four corrupt databases.  The previous guy had been running this:

dbcc checkdb(Vanilla) with physical_only

According to Microsoft, this...
"Limits the checking to the integrity of the physical structure of the page and record headers and the allocation consistency of the database. This check is designed to provide a small overhead check of the physical consistency of the database, but it can also detect torn pages, checksum failures, and common hardware failures that can compromise a user's data."
It's quicker too.  But as I say, it missed four corrupt databases.  Gail's advice helped me recover all of them without any data loss.  So - thanks Gail.

Now most of the time you take backups to disk, but you can't keep everything on disk - in fact you probably don't keep more than three days' backup on disk.  More likely you use something like Symantec Backupexec to copy your backups from Disk to Tape overnight.  

You can keep every single backup ever made on tape if you want to, although it will need a lot of tapes and it's probably overkill.  A good compromise is to do something like this:
  • Keep every daily backup for a week
  • Keep your weekly backups for a month (a four week lunar month works for me).
  • Keep each monthly backup for a year.  
Ask yourself - will I ever need a backup that's older than a year?  Probably not, but it's up to you - you know your business, and maybe your business has a legal duty to keep stuff for seven years.  You can keep everything for ever if you want, but it will cost a lot of tapes and management.  You will almost certainly want to store them offsite - firms like Iron Mountain will provide this kind of service - they collect and drop off too.  Tell them Jack sent you (you never know).  

But here's the rub - every now and then you want to check that you can retrieve data from those tapes.  Pick a date - maybe the monthly backup for three months ago.  Pick a database.  And request that tape, get it copied to your test area, and restore the backup.  I tried that today.  Guess what?  there was some sort of problem with the tape.  I asked for Friday the 19th July, a couple of weeks ago now.  But what the tape actually contained was data from March.  For some reason (and we are frantically looking for that reason now) the backups were not overwriting the existing data.  

When we find the problem, it will probably be something so trivial it's not true - a missing comma in the command used, perhaps; maybe a worn patch on a $4.99 tape.  And now that we've found it, it won't go wrong again.  But it was found, and it will be fixed, only because I checked.  

So check.