I have about 22,000 addresses. I’m trying to remove duplicate addresses. To do that I have to normalize each address, which includes the first name, last name (we want to send mailings to every person in the same house), street address 1 and address 2, city, state and zip.
In the street address I’m deleting a bunch of words that I don’t consider useful for standardization. removelist is the list of strings, matched within word boundaries, that I want to remove. removere is the regex constructed from removelist.
removelist=['APT\.?', 'Apartment', 'AVE\.?', 'AVENUE', 'CT\.?', 'COURT',
'DR\.?', 'DRIVE', 'LN\.?', 'LANE',
'RR\.?', 'ROUTE', 'RTE\.?' # Rural route
'ST\.?', 'STREET',
'NE', 'NORTHEAST', 'NW', 'NORTHWEST', 'SE', 'SOUTHEAST',
'SE', 'SOUTHWEST'
'Unit']
removere = '\b(' + '|'.join(removelist) + ')\b'
However the “\b” word boundary is coming out wrong in the removere regex. It is now:
(Pdb) p removere
'\x08(APT\\.?|Apartment|AVE\\.?|AVENUE|CT\\.?|COURT|DR\\.?|DRIVE|LN\\.?|LANE|RR\\.?|ROUTE|RTE\\.?ST\\.?|STREET|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SE|SOUTHWESTUnit)\x08'
The \b should not end up being “\x08” in my removere regex pattern.
What am I doing wrong here with the \b metacharacter? I’m familiar with regex and used it a lot in Perl, but I’m new to Python and its quirks.
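For reference, here is a minimal sketch of what seems to be going on, not a definitive fix. In an ordinary Python string literal, '\b' is the backspace character (\x08), so the regex engine never sees a word-boundary escape. A raw string (r'...') keeps the backslash. The debugger output also shows 'RTE\\.?ST\\.?' and 'SOUTHWESTUnit' fused together, because adjacent string literals with no comma between them are concatenated. The 'SW' entry, the re.IGNORECASE flag and the helper name strip_words are assumptions added for illustration.

```python
import re

# Raw strings keep the backslashes, so the regex engine sees \b and \.
removelist = [
    r'APT\.?', 'Apartment', r'AVE\.?', 'AVENUE', r'CT\.?', 'COURT',
    r'DR\.?', 'DRIVE', r'LN\.?', 'LANE',
    r'RR\.?', 'ROUTE', r'RTE\.?',  # rural route; the trailing comma matters
    r'ST\.?', 'STREET',
    'NE', 'NORTHEAST', 'NW', 'NORTHWEST', 'SE', 'SOUTHEAST',
    'SW', 'SOUTHWEST',  # assumption: the second 'SE' was meant to be 'SW'
    'Unit',
]
removere = r'\b(' + '|'.join(removelist) + r')\b'

# Assumption: ignore case so 'Apartment' and 'Unit' also match upper-case data.
pattern = re.compile(removere, re.IGNORECASE)

def strip_words(street):
    """Hypothetical helper: delete the listed words, then collapse whitespace."""
    return ' '.join(pattern.sub('', street).split())

print(repr(removere[:10]))
print(strip_words('4557 BOARD ST SW'))
```

One thing to watch: the closing \b cannot match between a period and a space, so for 'AVE.' this pattern matches only 'AVE' and leaves the period behind. Moving the optional period after the boundary, as in r')\b\.?', is one way around that.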
Main question: How do I use https://regex101.com to test a regex replacement for Python?
Thank you!
p.s. Any comments about my algorithm are welcome. I may have to rethink removing these words and instead standardize them, for example mapping “Apartment” and “Apt.” to just “Apt”.
Also in my town the addresses “455 Oak Ave NE” and “455 Oak Ave SE” are different houses.
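The standardizing alternative mentioned above could be sketched roughly like this. It maps long forms to one abbreviation and drops the trailing period, so directionals such as NE and SE survive and the two Oak Ave houses stay distinct. The mapping table, the abbreviation choices and the function name standardize are all assumptions for illustration, not an official postal standard.

```python
import re

# Hypothetical mapping from long forms to a single canonical abbreviation.
CANONICAL = {
    'APARTMENT': 'APT', 'AVENUE': 'AVE', 'COURT': 'CT', 'DRIVE': 'DR',
    'LANE': 'LN', 'STREET': 'ST',
    'NORTHEAST': 'NE', 'NORTHWEST': 'NW',
    'SOUTHEAST': 'SE', 'SOUTHWEST': 'SW',
}

# Raw strings so \b reaches the regex engine as a word boundary.
LONG_FORMS = re.compile(r'\b(' + '|'.join(CANONICAL) + r')\b')
ABBREV_DOT = re.compile(r'\b(APT|AVE|CT|DR|LN|RR|RTE|ST)\.')

def standardize(street):
    """Upper-case, replace long forms, strip the period after abbreviations."""
    street = street.upper()
    street = LONG_FORMS.sub(lambda m: CANONICAL[m.group(1)], street)
    street = ABBREV_DOT.sub(r'\1', street)
    return ' '.join(street.split())

print(standardize('123 Oak ave.'))
print(standardize('455 Oak Avenue NE'))
print(standardize('455 Oak Ave SE'))
```

Comparing the standardized strings, rather than strings with these words deleted, keeps “455 Oak Ave NE” and “455 Oak Ave SE” as separate entries.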
And over a year ago, in another job, I remember seeing that Oklahoma has a lot of weird addresses using rural routes etc. But I only have one address for OK in this list, and it’s not weird.
Some test data here:
testlist=['', 'Jim', 'Green', '4557 Board St SW', '', 'Grand Rapids', 'MI', '49548']
normaddr = normalizeaddr(testlist)
testlist=['', 'Tamy', 'Garland', '89 142nd St SE', '', 'Grand Rapids', 'MI', '49548']
normaddr = normalizeaddr(testlist)
testlist=['', 'John', 'Smith', '123 Oak ave.', '', 'Atlantic', 'IA']
normaddr1 = normalizeaddr(testlist)
testlist2 = ['', 'Dave', 'Jones', '457 Agate Rd', 'Apt 19', 'Detroit', 'MI', '45999']
normaddr2 = normalizeaddr(testlist2)
And raw data for your favorite regex site. Sorry, when the replace happens address data will be all upper case.
JIM GREEN 4557 BOARD ST SW
KELLY JONES 4557 BOARD ST SE
TAMY GARLAND 89 142ND ST SE
JOHN SMITH 123 OAK AVE BOARDMAN MI 49555
DAVE JONES 457 AGATE RD APT 29
KIM CASEY 457 AGATE RD APARTMENT 30
My first block of code above is in a function normalizeaddr().
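For checking a replacement outside a regex site, the sample lines can be run through re.sub locally, which is what a Python substitution test boils down to. On regex101 the rough equivalent, as far as I know, is to pick the Python flavor and the Substitution function, and to paste the pattern without the surrounding quotes. The shortened pattern and the helper name preview below are assumptions for illustration only.

```python
import re

# Shortened, hypothetical pattern; the raw string keeps \b as a word boundary.
pattern = re.compile(r'\b(?:APT|APARTMENT|AVE|ST|NE|NW|SE|SW)\b\.?')

samples = [
    'JIM GREEN 4557 BOARD ST SW',
    'KELLY JONES 4557 BOARD ST SE',
    'DAVE JONES 457 AGATE RD APT 29',
    'KIM CASEY 457 AGATE RD APARTMENT 30',
]

def preview(line):
    """Apply the substitution with an empty replacement and tidy whitespace."""
    return ' '.join(pattern.sub('', line).split())

# Print before and after for each sample line.
for line in samples:
    print(f'{line!r} -> {preview(line)!r}')
```

Running the same lines locally and on the site makes it easier to spot differences caused by string escaping rather than by the regex itself.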