Ryan IT Pro: Extracting Addresses from FreeText

The Task

A recent task I was set required the extraction of Addresses from Freetext. Not an easy task, but I was supprised how successful my final result was... So supprised, that I thought I would share it.

This was a SQL Server based task. SQL Server's native pattern matching is not sofisticatd enough for the task, so I opted to use Regular Expressions. I could have used an SSIS package to process to data, but I instead opted to use CLR and a Table-Valued Function to ruturn the found addresses.

I'm not an expert with Regex, I generally have to relearn it everytime I use it, but I often get the pattern and results I want.

Finding addresses

The addresses I needed to find are all UK based, meaning I can use the postal code as a starting point for address extraction. The post code is the most distinctive identifiable componet of an address and was a manditory requirment of the addresses I was to extract. My approch was to identify valid formated post codes, and then calculate the remainer of the address from there.

Based on this, I was able to formulate a regular expression pattern to perform the address extraxtion which can be split into the following components:

Flat Number (optional)
AddressLines (hidden)

Door Number (optional)
AddressLines (optional)

Post Code (mandatory)

I used a hidden/un-named group to combine the door number and address lines regex. This ensures that if a door number is found, it will control the length of the address lines.

Post Code Regex

Based on the information from wikipedia regarding uk post code formatting (wikipedia-Postcodes_in_the_United_Kingdom.) The format is comprised of an outward code and an inward code seperated by a space. I'm aware the space can often be omitted.

The outward code contains an area code of one or two alphabetical characters followed by a district code which is a single digit, then optionally a single alphanumeric character.

The inward code is made up from the sector code which is a single digit followed by the unit code which is two alphabetical characters.

To locate a valid UK post code I ended up with the following regex:

(?<PostCode>(?<Area>[A-Z]{1,2})(?<District>\d\w?)\s*(?<Sector>\d)(?<Unit>[A-Z]{2}))

Address Lines Regex

To find the address lines, I'm looking for between 1 and 4 blocks of text seperated by comma which preceed the Post Code.

The regex I used for this was:

(?<AddressLines>([A-Z-''\s]{2,}[,\.\s]){1,4})

Door Number Regex

To find the door number, I used an assumption that it must start with a digit, but could contain a leter (example 2b or 7a.) I also allowed the door number to be either space or comma seperated from the address.

To find the door number, I used regex:

(?<DoorNumber>\d+[,\s]+|\d\w[,\s]+)

Flat Number Regex

For the data I was working with, flat numbers always followed the word flat. For that reason I was able to look explisitly for the literal word Flat (case-insensitive) and attempt to find a flat number in the following. As with door number, I permitted a flat number to be suffixed with a letter.

The regex I used to find the Flat Numbers was:

(?<Flat>Flat\s?(?<FlatNo>\d\w*)\s?,?\s?)

Full Regex

So, all of the above regex components combined (with an unnamed group to combine Door Number and Address Lines) resulted in the following regex:

(?<Flat>Flat\s?(?<FlatNo>\d\w*)\s?,?\s?)?((?<DoorNumber>\d+[,\s]+|\d\w[,\s]+)?(?<AddressLines>([A-Z-''\s]{2,}[,\.\s]){1,4}))?\s?(?<PostCode>(?<Area>[A-Z]{1,2})(?<District>\d\w?)\s*(?<Sector>\d)(?<Unit>[A-Z]{2}))

I was able to perform the regex using a clr function and return a table containing the match result along with the components of the address.

Results

When given the task, I was not expecting to do well beyond post codes. The post code matching was almost perfect except for invalid or typo's in the freetext. However, I got some very impressive results. Any address with a door or flat number extracted perfectly. However, those without where hit and miss, sometimes I grabed to much text preceding the post code and sometime not enough.

Conclusion

I'm not a regex expert, I'm sure it could be improved or optimised. I'm also sure I may have missed some posible symbols which are valid addresses or other posible formatting of flat and door umbers. However, for the task and dataset I was working with, the extraction was far better than expected.

I do have a few ideas how it could be improved, however, I have meet my requirement and the project has completed. So for now, this is it.

Ryan IT Pro

Thursday, 4 June 2020

Extracting Addresses from FreeText