Tuesday 13 December 2011

How to Identify Twins

I'm trying to find a list of twins in my source database - reason being, twins share a surname, a date of birth, and a postcode.  Unfortunately, those are exactly the criteria that the destination database uses to identify and reject duplicates <sigh>.  So it's handy to have a list of twins beforehand and do some manual jiggerypokery to make things work as desired. 

(In an ideal world, perhaps our skilled team of Oracle experts would modify the destination database so that it considered the forename as well, and said to itself "Hey, twins!  Maybe I won't reject this guy after all."  But that puts us into a world of development changes and testing and pain - it's easier and quicker to leave the damn thing alone and manually wrestle with the input data.)

So the first step is to identify surnames which occur more than once in the data.  Notice that I'm grouping by Surname and Date of Birth:

      Select mb1.surname
      From   tblbods    mb1
      Group BY    mb1.surname, mb1.dob
      having count(mb1.surname) > 1

Which gives us:

Surname
SMITH
JONES
FORTESCUE-SMYTHE
MURPHY
MCDONALD
PATEL


Actually it gives us an awful lot more than this - there are 200,000 records in the table.  But I thought you would get bored if I listed more, plus I have had to change all the names and dates of birth so as to anonymise the data. 

Second step is to do the same for dates of birth which occur more than once.

      Select mb2.dob
      From   tblbods    mb2
      Group BY    mb2.surname,      mb2.dob
      having count(mb2.DOB) > 1

DoB
10/05/1993
11/06/1996
11/06/2007
25/12/2005
11/07/2010
14/02/2001


And then finally I want to select data where the surname is in the list of surnames from Step 1, and the date of birth is in the list of DoBs from Step 2.


Select
      mb.BodID, mb.Title, mb.Forename, mb.Surname,
      mb.relationship, mb.sex as Gender, mb.DoB
from
      -- Table of individual members
      tblbods     mb
where  mb.surname in
(
      -- Step 1 - surnames which occur more than once
      Select mb1.surname
      From   tblbods    mb1
      Group BY    mb1.surname ,     mb1.dob
      having count(mb1.surname) > 1
)

and mb.dob in
(
      -- Step 2 - Dates of Birth which occur more than once
      Select mb2.dob
      From   tblbods    mb2
      Group BY    mb2.surname,      mb2.dob
      having count(mb2.DOB) > 1
)

And the end result is:

BodID
Title
Forename
Surname
relationship
Gender
DoB
523189
MISS      
MARIA
SMITH
Child
F
11/07/2010
523190
MISS      
ALANA
SMITH
Child
F
11/07/2010
531637
MR        
JASPER
FORTESCUE-SMYTHE
Child
M
14/02/2001
531638
MR        
SIMON
FORTESCUE-SMYTHE
Child
M
14/02/2001
466662
MASTER    
RAHUL
PATEL
Child
M
10/05/1993
466664
MISS      
SAMIRA
PATEL
Child
F
10/05/1993
519007
MASTER    
JOSEPH
MURPHY
Child
M
11/06/2007
519008
MISS      
ANNETTE
MURPHY
Child
F
11/06/2007
339731
MISS      
PATRICIA
JONES
Child
F
11/06/1996
339736
MISS      
IMOGEN
JONES
Child
F
11/06/1996
419768
MASTER    
WINSTON
MCDONALD
Child
M
25/12/2005
419769
MASTER    
DOUGLAS
MCDONALD
Child
M
25/12/2005


Simples!  Hope you find this useful.









No comments:

Post a Comment