Thursday, 4 June 2020

Extracting Addresses from FreeText

The Task

A recent task I was set required the extraction of Addresses from Freetext.  Not an easy task, but I was supprised how successful my final result was...  So supprised, that I thought I would share it.

This was a SQL Server based task.  SQL Server's native pattern matching is not sofisticatd enough for the task, so I opted to use Regular Expressions.  I could have used an SSIS package to process to data, but I instead opted to use CLR and a Table-Valued Function to ruturn the found addresses.

I'm not an expert with Regex, I generally have to relearn it everytime I use it, but I often get the pattern and results I want.

Finding addresses

The addresses I needed to find are all UK based, meaning I can use the postal code as a starting point for address extraction.  The post code is the most distinctive identifiable componet of an address and was a manditory requirment of the addresses I was to extract.  My approch was to identify valid formated post codes, and then calculate the remainer of the address from there.

Based on this, I was able to formulate a regular expression pattern to perform the address extraxtion which can be split into the following components:
  • Flat Number (optional)
  • AddressLines (hidden)
    • Door Number (optional)
    • AddressLines (optional)
  • Post Code (mandatory)
I used a hidden/un-named group to combine the door number and address lines regex.  This ensures that if a door number is found, it will control the length of the address lines.

Post Code Regex

Based on the information from wikipedia regarding uk post code formatting (wikipedia-Postcodes_in_the_United_Kingdom.)  The format is comprised of an outward code and an inward code seperated by a space.  I'm aware the space can often be omitted.

The outward code contains an area code of one or two alphabetical characters followed by a district code which is a single digit, then optionally a single alphanumeric character. 

The inward code is made up from the sector code which is a single digit followed by the unit code which is two alphabetical characters.

To locate a valid UK post code I ended up with the following regex:
(?<PostCode>(?<Area>[A-Z]{1,2})(?<District>\d\w?)\s*(?<Sector>\d)(?<Unit>[A-Z]{2}))
 

Address Lines Regex

To find the address lines, I'm looking for between 1 and 4 blocks of text seperated by comma which preceed the Post Code.

The regex I used for this was: 
(?<AddressLines>([A-Z-''\s]{2,}[,\.\s]){1,4})
 

Door Number Regex

To find the door number, I used an assumption that it must start with a digit, but could contain a leter (example 2b or 7a.) I also allowed the door number to be either space or comma seperated from the address.

To find the door number, I used regex:
(?<DoorNumber>\d+[,\s]+|\d\w[,\s]+)
 

Flat Number Regex

For the data I was working with, flat numbers always followed the word flat.  For that reason I was able to look explisitly for the literal word Flat (case-insensitive) and attempt to find a flat number in the following.  As with door number, I permitted a flat number to be suffixed with a letter.

The regex I used to find the Flat Numbers was:
(?<Flat>Flat\s?(?<FlatNo>\d\w*)\s?,?\s?)
 

Full Regex

So, all of the above regex components combined (with an unnamed group to combine Door Number and Address Lines) resulted in the following regex:
(?<Flat>Flat\s?(?<FlatNo>\d\w*)\s?,?\s?)?((?<DoorNumber>\d+[,\s]+|\d\w[,\s]+)?(?<AddressLines>([A-Z-''\s]{2,}[,\.\s]){1,4}))?\s?(?<PostCode>(?<Area>[A-Z]{1,2})(?<District>\d\w?)\s*(?<Sector>\d)(?<Unit>[A-Z]{2}))

I was able to perform the regex using a clr function and return a table containing the match result along with the components of the address.

Results

When given the task, I was not expecting to do well beyond post codes.  The post code matching was almost perfect except for invalid or typo's in the freetext.  However, I got some very impressive results.  Any address with a door or flat number extracted perfectly.  However, those without where hit and miss, sometimes I grabed to much text preceding the post code and sometime not enough.

Conclusion

I'm not a regex expert, I'm sure it could be improved or optimised.  I'm also sure I may have missed some posible symbols which are valid addresses or other posible formatting of flat and door umbers.  However, for the task and dataset I was working with, the extraction was far better than expected.

I do have a few ideas how it could be improved, however, I have meet my requirement and the project has completed.  So for now, this is it.

No comments:

Post a Comment