Friday 3 December 2010

Extracting Different Bits Of A Single Field

So what do you do when some crazy person designs a database with a single field to hold three separate bits of data?  Of course it probably wasn’t designed as such – probably somebody using the database decided that it would be neat if they could record three bits of data about the incoming call.  Rather than asking the developers to add a couple of extra fields, some bright spark decided to use the Category field to contain two or three things. 

So – let me show you what I mean. 

Audit Approach - Audit Approach Manual - Other
NHS - Other NHS
NHS - Audit Responsibilities
NHS - Audit Responsibilities
NHS - Accounting - Financial instruments
Local Government - Housing Benefits
Local Government - Housing Benefits
Local Government - Housing Benefits
NHS - Audit Responsibilities
NHS - Audit Responsibilities
Local Government - IFRS Conversion - PPE and other non-current assets

So sometimes there are two categories, and sometimes three.  Fortunately they always use a hyphen between them. 

But now I have to create a report which reads in this single field and splits it into three – before the first hyphen, between the hyphens, and after the second hyphen (if there is one). 



Category 1

So, the first bit is easy:

select
       --the original single field
       ca_full_name,
      
      
       -- cat 1 - before first hyphen
       substring(ca_full_name, 1, charindex(' - ', ca_full_name) )  
as Category_1,

SubString() lets you read in a text field, select a start point, then select a finish point.  So if your text says “Antidisestablishmentarianism” you can say

select substring ('Antidisestablishmentarianism', 1,4)

and get the result “Anti”. 

But in this case we don’t know where to stop, so instead of specifying the end point as a number, we have to use the Charindex() function. 

select charindex ('i', 'Antidisestablishmentarianism')

This means look through the text until you find the first letter ‘i’. 
The answer is 4

So the following code says look through the name until you find space hyphen space

charindex(' - ', ca_full_name)

Okay so far?




Category 2

Here’s how to get the middle bit.  I’m indebted to Stephen Lasham, who lives in New Zealand and wrote this article back in 2004 (isn’t the Internet wonderful?).  The trouble with Charindex() is that it only finds the first occurrence of the thing you are looking for. 

Extracting a string from between two delimiting characters



       -- cat 2 - from first hyphen to second hyphen or end
       CASE
       -- when no second hyphen,
       WHEN CHARINDEX(' - ',(SUBSTRING(ca_full_name,
CHARINDEX(' - ',ca_full_name,1)+1,99))) = 0
       -- then extract from first hyphen to the max length of 99
       THEN LTRIM(RTRIM(SUBSTRING(ca_full_name,
CHARINDEX(' - ',ca_full_name,1)+3,99)))
      
       -- else extract from the first hyphen up to second hyphen
       ELSE LTRIM(RTRIM(SUBSTRING(
       /* <text>> */ ca_full_name,
       /* <start > */      CHARINDEX(' - ',ca_full_name,1) + 3,
       /* <length> */      CHARINDEX(' - ', SUBSTRING(ca_full_name,
                            CHARINDEX(' - ', ca_full_name, 1)
+1,    99)) - 2 ) ) )
       END    AS Category_2,

So this time, there are two steps.  If there isn’t a second hyphen i.e. there are only two categories in the field, start not at the beginning but at the first hyphen, then read to the end.  The Ltrim() and Rtrim() bits get rid of any leading or trailing spaces, so it looks nice. 

If there IS a second hyphen, Stephen’s code starts at the first hyphen, then reads to the second hyphen. 



Category 3

Okay, nearly there. 

--  cat 3 - after second hyphen (if present)
       CASE
       -- when no second hyphen,
       WHEN CHARINDEX(' - ',(SUBSTRING(ca_full_name,
CHARINDEX(' - ',ca_full_name,1)+1,99))) = 0
       -- then print nothing
       THEN (SUBSTRING(ca_full_name,0,0))
       ELSE
       -- read back from the end until you get to the hyphen
       reverse(substring(reverse(ca_full_name), 1,
charindex(' - ', reverse(ca_full_name))-1))
       END                   as Category_3
      

So, if there isn’t a second hyphen, then print nothing at all (start at 0, stop at 0).  Because there isn’t any third bit, see?

But if there is a third bit, what I’ve done is read it from the end using the Reverse() function

select reverse ('Antidisestablishmentarianism')

Which of course gives us msinairatnemhsilbatsesiditnA. 

The code reads in the text, reverses it, reads through it until it gets to the hyphen.  Then it takes the backwards text up to that point and reverses it again so that it comes out facing the right way.  So we end up with this:

Category_1
Category_2
Category_3
Local Government
Whole of Government Accounts
Other
Probation
Accounts and Accounting
Service concessions and leases
Probation
IFRS Conversion
Employee benefits
Local Government
Capital Finance
Financial Instruments
Local Government
Operational and Legal Environment
Other
Other
Trash

Foundation Trusts
Additional Services
Other
Ethics
Non-Audit Services
Other
Foundation Trusts
Accounting
Service concessions and leases
Probation
Audit Responsibilities

Local Government
IFRS Conversion
IFRS conversion
Local Government
Operational and Legal Environment
Other


All done!  I hope this is useful.  And just think – none of this would have been necessary with a better database design in the first place. 

2 comments:

  1. Hi Jack,

    Looking at your code, I feel there is a simpler way of doing it, what you are after is a split function much as exists in VB and C#, this would be able to pass back each data item between the "-" to the calling function. This could be implemented either as a Scalar or as a Table function, whichever is your preference.

    Regards, Steve

    ReplyDelete
  2. Thanks Steve. I actually found a function ready made to do that when I searched Google, but thought I would play about with other techniques. Then I mislaid the function... :o)

    ReplyDelete