Tuesday 9 August 2011

Data Quality - What could possibly go wrong?

Well, pretty much everything.  If you have enough data, there will be mistakes in it.  Try writing your own name down on a piece of paper.  In CAPITALS.  There's quite a good chance that you will spell your own name wrong - something about the unfamiliar capitals trips up the link between your brain and your fingers.  Now imagine that you are a data entry clerk typing in hundreds of names every day.  Some of them are Arabic, others Polish, some even Welsh.  Can you guarantee that every name is correct?

There are ways to help, of course.  Take addresses, now - most countries have a postal code system which can be used in automatic sorting machinery to help your letter get to the right place.  Many countries use a 4 or 5 digit numeric code - 90210 famously defines Beverly Hills in California, so typing an easy number into your address software automatically fills in the town (and it isn't Beverley Hills) and state.  And by reading the relevant postal address file, the software can check the spelling of the streets.  The postcode systems in the UK and the Netherlands can go right down to street level, sometimes even to individual premises.  But these things have to be kept up to date - the postal authorities periodically make changes to reflect what has happened out in the wild, and sometimes to fix mistakes.
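
To make the postcode idea concrete, here is a minimal Python sketch of the lookup.  The table is a hypothetical one-row stand-in for a real postal address file, and the function names are my own invention rather than any real address package's API.

```python
# Hypothetical in-memory stand-in for a postal address file.  A real one
# would be loaded from the postal authority's data and refreshed regularly.
POSTCODE_TABLE = {
    "90210": ("Beverly Hills", "CA"),
}


def fill_town_and_state(postcode):
    """Return (town, state) for a listed postcode, or None if it is not listed."""
    return POSTCODE_TABLE.get(postcode.strip())


def town_matches(postcode, typed_town):
    """True if the typed town agrees with the table, ignoring case and extra spaces."""
    entry = fill_town_and_state(postcode)
    if entry is None:
        return False
    return entry[0].lower() == " ".join(typed_town.split()).lower()
```

With this table, town_matches("90210", "Beverley Hills") comes back False, which is exactly the sort of misspelling you want flagged before the letter goes out.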

So your name and address data may contain errors - perhaps your older data even predates the current computer system and was transcribed from a Rolodex.  Check it.

Check the names and genders and titles as well, while you are at it.  Where you find blanks, there are a few things you can do:

  • Where there is a gender, you can derive a title (if Gender = F, then Title = Ms).  Note that while this is fairly risk-free for men, some women dislike being called Ms.  
  • And vice versa - where there is a title, you can derive a gender (if Title = Mr, then Gender = M).  
  • Where there is a forename, you can - usually - derive a title and gender (if Forename = Andrew, then Title = Mr and Gender = M).  This will still leave a number of names (e.g. Hilary) where it is not possible to determine the gender.  
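
The three rules above can be sketched in a few lines of Python.  Treat this as an illustration under assumptions, not a finished routine: the field names (title, gender, forename) are hypothetical, Mrs and Miss are my additions to the title table, and the forename table is deliberately tiny and leaves out ambiguous names such as Hilary.

```python
TITLE_FROM_GENDER = {"M": "Mr", "F": "Ms"}
# Mrs and Miss are assumed extras; only Mr and Ms appear in the rules above.
GENDER_FROM_TITLE = {"Mr": "M", "Ms": "F", "Mrs": "F", "Miss": "F"}
# Hypothetical lookup; ambiguous forenames are simply not listed.
GENDER_FROM_FORENAME = {"andrew": "M"}


def fill_blanks(record):
    """Return a copy of record with a blank title or gender derived where possible."""
    rec = dict(record)
    title = (rec.get("title") or "").strip()
    gender = (rec.get("gender") or "").strip().upper()
    forename = (rec.get("forename") or "").strip().lower()

    # Title -> gender, then forename -> gender, then gender -> title.
    if not gender and title:
        gender = GENDER_FROM_TITLE.get(title, "")
    if not gender and forename:
        gender = GENDER_FROM_FORENAME.get(forename, "")
    if not title and gender:
        title = TITLE_FROM_GENDER.get(gender, "")

    rec["title"] = title
    rec["gender"] = gender
    return rec
```

A record with only the forename Andrew comes back with Title = Mr and Gender = M, while a record with only the forename Hilary comes back with both fields still blank, ready for a human to look at.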

Maybe you don't have a first name at all, just the initials.  Is that a man or a woman?  If you don't know, how can you address a letter?  Suppose it says Mr J Smith, but then states the gender as F?  Something is wrong, but what?  You could of course classify such cases as "Unknown", but that might screw up your letters completely - “Dear Unknown Smith” is not going to win you much business for your widget factory.  “Dear Customer” might be acceptable.
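
One way to handle the Mr J Smith / F case is to refuse to guess and fall back to the neutral greeting.  The sketch below assumes that policy; the function names and the title table (including Mrs and Miss) are hypothetical.

```python
# Assumed title table; titles not listed here (Dr, say) imply no gender.
GENDER_FROM_TITLE = {"Mr": "M", "Ms": "F", "Mrs": "F", "Miss": "F"}


def title_conflicts_with_gender(title, gender):
    """True when the title implies one gender and the record states another."""
    expected = GENDER_FROM_TITLE.get(title)
    return expected is not None and bool(gender) and expected != gender


def salutation(title, surname, gender=""):
    """Build a greeting, falling back to 'Dear Customer' when the data is doubtful."""
    if not title or not surname or title_conflicts_with_gender(title, gender):
        return "Dear Customer"
    return f"Dear {title} {surname}"
```

So salutation("Mr", "Smith", "F") gives "Dear Customer" rather than a confident greeting that is wrong half the time.  You would probably also want to log the conflicting records so that somebody can fix them.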

Suppose you have a list of email addresses - you need to make sure that they are valid before sending off your mailshot.  Lots of things you can check here:

  • Check for spaces in field
  • Check that name is present
  • Check that address does not end in a full stop     
  • Check that suffix e.g. .co.uk is not missing
  • Check that suffix is not truncated
  • Check that @ symbol is present     
  • Check that @ symbol is not duplicated      
  • Check that there is no spurious full stop in the address

I found some SQL code on the internet to do this - you don't expect me to do any actual WORK, do you? (After writing that sentence, the boss made me rewrite my code in PL/SQL in order to do that useful job for an Oracle database - I should have kept my mouth shut.)  If you want a copy, let me know and I'll pass it on.  One big problem, of course - you can check all these things and have the most valid email address ever, but it's no good if your customer changed to a new ISP last month.
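
For anyone who wants to try the checks listed above without an Oracle database to hand, here is a rough Python sketch.  It is not the SQL or PL/SQL mentioned in this post, and two of the rules are my own guesses: "truncated suffix" is taken to mean a final part shorter than two characters, and "spurious full stop" to mean a doubled full stop or one sitting at the start of the address or next to the @.

```python
def email_problems(address):
    """Return a list of problems found; an empty list means no check failed."""
    problems = []
    if any(ch.isspace() for ch in address):
        problems.append("contains a space")

    at_count = address.count("@")
    if at_count == 0:
        problems.append("@ symbol missing")
        return problems  # the remaining checks need exactly one @
    if at_count > 1:
        problems.append("@ symbol duplicated")
        return problems

    name, domain = address.split("@")
    if not name:
        problems.append("name missing")
    if address.endswith("."):
        problems.append("ends in a full stop")

    # Assumed rule: the domain needs at least one full stop between two parts.
    if "." not in domain.strip("."):
        problems.append("suffix missing")
    # Assumed heuristic: a final part of one character is probably cut short.
    elif len(domain.rstrip(".").rsplit(".", 1)[1]) < 2:
        problems.append("suffix looks truncated")

    # Assumed meaning of "spurious": doubled, leading, or touching the @.
    if (".." in address or name.startswith(".") or name.endswith(".")
            or domain.startswith(".")):
        problems.append("spurious full stop")
    return problems
```

An address such as jane@example.co.uk returns an empty list, while jane@example.co.u returns ["suffix looks truncated"].  Note that a truncation like .co for .com sails straight through, because the code has no list of real suffixes to compare against.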

And while I think about it, don't forget that you need to comply with data protection legislation to hold all this stuff.  UK readers can find a handy checklist here, but there are similar rules for most countries.  You can get your data checked for people who have moved house, or people who have died (if you want to upset a bereaved relative, send a cheery letter to the recently deceased).  And if you know that someone has died, make sure that you don't contact them, especially not if you are going to write to Dear Mr Smith (Deceased).  
