The Task
A recent task I was set required the extraction of Addresses from Freetext. Not an easy task, but I was supprised how successful my final result was... So supprised, that I thought I would share it.
This was a SQL Server based task. SQL Server's native pattern matching is not sofisticatd enough for the task, so I opted to use Regular Expressions. I could have used an SSIS package to process to data, but I instead opted to use CLR and a Table-Valued Function to ruturn the found addresses.
I'm not an expert with Regex, I generally have to relearn it everytime I use it, but I often get the pattern and results I want.
Finding addresses
The addresses I needed to find are all UK based, meaning I can use the postal code as a starting point for address extraction. The post code is the most distinctive identifiable componet of an address and was a manditory requirment of the addresses I was to extract. My approch was to identify valid formated post codes, and then calculate the remainer of the address from there.
Based on this, I was able to formulate a regular expression pattern to perform the address extraxtion which can be split into the following components:
- Flat Number (optional)
- AddressLines (hidden)
- Door Number (optional)
- AddressLines (optional)
- Post Code (mandatory)
I used a hidden/un-named group to combine the door number and address lines regex. This ensures that if a door number is found, it will control the length of the address lines.
Post Code Regex
Based on the information from wikipedia regarding uk post code formatting (wikipedia-Postcodes_in_the_United_Kingdom.) The format is comprised of an outward code and an inward code seperated by a space. I'm aware the space can often be omitted.
The outward code contains an area code of one or two alphabetical characters followed by a district code which is a single digit, then optionally a single alphanumeric character.
The inward code is made up from the sector code which is a single digit followed by the unit code which is two alphabetical characters.
To locate a valid UK post code I ended up with the following regex:
(?<PostCode>(?<Area>[A-Z]{1,2})(?<District>\d\w?)\s*(?<Sector>\d)(?<Unit>[A-Z]{2}))
Address Lines Regex
To find the address lines, I'm looking for between 1 and 4 blocks of text seperated by comma which preceed the Post Code.
The regex I used for this was:
(?<AddressLines>([A-Z-''\s]{2,}[,\.\s]){1,4})
Door Number Regex
To find the door number, I used an assumption that it must start with a digit, but could contain a leter (example 2b or 7a.) I also allowed the door number to be either space or comma seperated from the address.
To find the door number, I used regex:
(?<DoorNumber>\d+[,\s]+|\d\w[,\s]+)
Flat Number Regex
For the data I was working with, flat numbers always followed the word flat. For that reason I was able to look explisitly for the literal word Flat (case-insensitive) and attempt to find a flat number in the following. As with door number, I permitted a flat number to be suffixed with a letter.
The regex I used to find the Flat Numbers was:
(?<Flat>Flat\s?(?<FlatNo>\d\w*)\s?,?\s?)
Full Regex
So, all of the above regex components combined (with an unnamed group to combine Door Number and Address Lines) resulted in the following regex:
(?<Flat>Flat\s?(?<FlatNo>\d\w*)\s?,?\s?)?((?<DoorNumber>\d+[,\s]+|\d\w[,\s]+)?(?<AddressLines>([A-Z-''\s]{2,}[,\.\s]){1,4}))?\s?(?<PostCode>(?<Area>[A-Z]{1,2})(?<District>\d\w?)\s*(?<Sector>\d)(?<Unit>[A-Z]{2}))
I was able to perform the regex using a clr function and return a table containing the match result along with the components of the address.
Results
When given the task, I was not expecting to do well beyond post codes. The post code matching was almost perfect except for invalid or typo's in the freetext. However, I got some very impressive results. Any address with a door or flat number extracted perfectly. However, those without where hit and miss, sometimes I grabed to much text preceding the post code and sometime not enough.
Conclusion
I'm not a regex expert, I'm sure it could be improved or optimised. I'm also sure I may have missed some posible symbols which are valid addresses or other posible formatting of flat and door umbers. However, for the task and dataset I was working with, the extraction was far better than expected.
I do have a few ideas how it could be improved, however, I have meet my requirement and the project has completed. So for now, this is it.
No comments:
Post a Comment