How to test Python regex replace on regex101.com?

c-rob · March 21, 2024, 1:39pm

I have about 22,000 addresses. I’m trying to remove duplicate addresses. To do that I have to normalize each address, which includes the first name, last name (we want to send mailings to every person in the same house), street address 1 and address 2, city, state and zip.

In the street address I’m deleting a bunch of stuff that I consider not useful for standardization. removelist is the list of strings within word boundaries I want to remove. removere becomes the regex constructed from removelist.

    removelist=['APT\.?', 'Apartment', 'AVE\.?', 'AVENUE', 'CT\.?', 'COURT',
                   'DR\.?', 'DRIVE', 'LN\.?', 'LANE',
                   'RR\.?', 'ROUTE', 'RTE\.?' # Rural route
                   'ST\.?', 'STREET',
                   'NE', 'NORTHEAST', 'NW', 'NORTHWEST', 'SE', 'SOUTHEAST',
                   'SE', 'SOUTHWEST'
                   'Unit']
    removere = '\b(' + '|'.join(removelist) + ')\b'

However the “\b” word boundary is coming out wrong in the removere regex. It is now:

(Pdb) p removere
'\x08(APT\\.?|Apartment|AVE\\.?|AVENUE|CT\\.?|COURT|DR\\.?|DRIVE|LN\\.?|LANE|RR\\.?|ROUTE|RTE\\.?ST\\.?|STREET|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SE|SOUTHWESTUnit)\x08'

The \b should not end up being “\x08” in my removere regex pattern.

What am I doing wrong here with the \b metacharacter? I’m familiar with regex, I used it a lot in Perl. But I’m new to Python and its quirks.

Main question: How do I use https://regex101.com to test a regex replacement for Python?

Thank you!

p.s. Any comments about my algorithm are welcome. I may have to rethink about removing these words, and instead, for example, standardize “Apartment” and “Apt.” to just “Apt”.

Also in my town the addresses “455 Oak Ave NE” and “455 Oak Ave SE” are different houses.

And over a year ago, in another job, I remember seeing that Oklahoma has a lot of weird addresses using rural routes etc. But I only have one address for OK in this list and it’s not weird.

Some test data here:

testlist=['', 'Jim', 'Green', '4557 Board St SW', '', 'Grand Rapids', 'MI', '49548']
normaddr = normalizeaddr(testlist)
testlist=['', 'Tamy', 'Garland', '89 142nd St SE', '', 'Grand Rapids', 'MI', '49548']
normaddr = normalizeaddr(testlist)
testlist=['', 'John', 'Smith', '123 Oak ave.', '', 'Atlantic', 'IA']
normaddr1 = normalizeaddr(testlist)
testlist2 = ['', 'Dave', 'Jones', '457 Agate Rd', 'Apt 19', 'Detroit', 'MI', '45999']
normaddr2 = normalizeaddr(testlist2)

And raw data for your favorite regex site. Sorry, when the replace happens address data will be all upper case.

JIM GREEN 4557 BOARD ST SW
KELLY JONES 4557 BOARD ST SE
TAMY GARLAND 89 142ND ST SE
JOHN SMITH 123 OAK AVE BOARDMAN MI 49555
DAVE JONES 457 AGATE RD APT 29
KIM CASEY 457 AGATE RD APARTMENT 30

My first block of code above is in a function normalizeaddr().

MegaIng · March 21, 2024, 1:45pm

You need to use a raw string, because \b is a valid string escape in addition to being a valid regex metacharacter: r'\b(' instead of '\b('. As a general rule of thumb, when working with regex, always use raw strings.
I assume the data isn’t available to you in a more useful form? String processing on addresses is always going to run into an endless supply of edge cases, much like trying to do anything with names other than printing them. I.e. my general suggestion is: Don’t do whatever you are trying to do, although tbh I don’t quite understand your usecase.
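A minimal sketch of the raw-string fix, applied to the list from the question. Two changes here are assumptions rather than part of the original code: commas are added after the RTE and SOUTHWEST entries (without them Python concatenates the adjacent string literals, which is why the pdb output shows `RTE\\.?ST\\.?` and `SOUTHWESTUnit`), and the second 'SE' is assumed to be a typo for 'SW'. The entries are upper-cased because the question says the data is upper case when the replace happens, and `strip_words` is a hypothetical helper name.

```python
import re

# Raw strings keep every backslash intact for the re module to interpret.
removelist = [
    r'APT\.?', 'APARTMENT', r'AVE\.?', 'AVENUE', r'CT\.?', 'COURT',
    r'DR\.?', 'DRIVE', r'LN\.?', 'LANE',
    r'RR\.?', 'ROUTE', r'RTE\.?',  # rural route (comma added)
    r'ST\.?', 'STREET',
    'NE', 'NORTHEAST', 'NW', 'NORTHWEST', 'SE', 'SOUTHEAST',
    'SW', 'SOUTHWEST',  # assumption: the second 'SE' was meant to be 'SW'
    'UNIT',
]
removere = r'\b(' + '|'.join(removelist) + r')\b'


def strip_words(street):
    """Hypothetical helper: upper-case, drop listed words, collapse spaces."""
    cleaned = re.sub(removere, '', street.upper())
    return ' '.join(cleaned.split())


print(strip_words('4557 Board St SW'))     # 4557 BOARD
print(strip_words('457 AGATE RD APT 29'))  # 457 AGATE RD 29
```

Printing removere now shows a leading `\\b(` in its repr instead of `\x08(`.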

c-rob · March 21, 2024, 1:53pm

although tbh I don’t quite understand your usecase.

Use case. Scan through all addresses and remove duplicate addresses. This is for a mailing list.

A salesman may get several spreadsheets from different sources. I make sure the spreadsheets all have the same columns in the same order, then write a program to remove any duplicate addresses from all the spreadsheets I’m given, as best I can.

I don’t need 100% accuracy, 97% would be good, or in that area. If there are 900 duplicate addresses left out of 22,000 addresses after I run my program, the accuracy would be about 96%.
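The dedup step described above could be sketched roughly like this. It assumes each row is a list of already-parsed fields in the same column order as the test data in the first post; `normalize_key` and `dedupe` are hypothetical names, and the normalization shown (upper-casing and collapsing whitespace) is only a stand-in for whatever normalizeaddr() ends up doing.

```python
def normalize_key(row):
    """Build a hashable comparison key: upper-case, whitespace collapsed."""
    return tuple(' '.join(str(field).upper().split()) for field in row)


def dedupe(rows):
    """Keep the first row seen for each normalized key, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = normalize_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ['', 'Jim', 'Green', '4557 Board St SW', '', 'Grand Rapids', 'MI', '49548'],
    ['', 'JIM', 'GREEN', '4557  BOARD ST SW', '', 'Grand Rapids', 'MI', '49548'],
    ['', 'Tamy', 'Garland', '89 142nd St SE', '', 'Grand Rapids', 'MI', '49548'],
]
print(len(dedupe(rows)))  # 2
```

Using a set of keys keeps this to a single pass over the 22,000 rows, instead of comparing every row against every other row.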

MegaIng · March 21, 2024, 1:57pm

Look up the addresses with some service like openmaps or google (I haven’t actually investigated what would be available there) and check if they point to the same location. That is probably going to be the best you can do. Otherwise, sure, “random” string replacement will get you a lot of the way there.

steven.rumbalski · March 21, 2024, 2:04pm

I would suggest using a library like pyap or usaddress.

c-rob · March 21, 2024, 3:05pm

Both of these parse the address. This will be helpful on another project but not this one. I’ll make a note of these modules. On this project my data is already parsed. Thanks!

kknechtel · March 21, 2024, 4:46pm

I could dump many relevant Stack Overflow references on you for this, but even the popular/well-regarded ones aren’t super high quality for this topic, and there are a few moving parts to your specific confusion.

In Perl, regexes are built in to the language parsing. But in Python, they are provided by a library. There’s no =~ operator and no literal regex syntax.

As such, in Python, the escaping of string literals is handled completely separately from the escaping of special regex characters within a regex. Python processes source code like '\b' to figure out what characters are actually in the string, long before any re module functionality can process the characters in the string to figure out what the regex means.

In a Python string literal, the sequence \b means a character with Unicode code point 8 - a backspace control character. You may have seen this displayed in broken terminal environments as ^H - H because it’s the 8th letter of the alphabet. Older programmers may have a memory of writing ^H to signify text erasure (sort of like strikethrough) in plain-text media, such as USENET posts.
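A quick check that can be pasted into the REPL to see this for yourself (the variable names are only for illustration):

```python
literal = '\b'   # string escape: one backspace character
raw = r'\b'      # raw string: a backslash followed by the letter b

print(len(literal), ord(literal))  # 1 8
print(len(raw), raw == '\\b')      # 2 True
```

The re module only ever sees the characters that are in the string, so only the second form reaches it as a backslash plus b.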

It’s the same sort of thing as when you use \n to put a newline in a string. Of course, in the case of \n, the regex syntax also treats the \n sequence as special:

>>> import re
>>> re.match('\n', '\n')
<re.Match object; span=(0, 1), match='\n'>
>>> re.match('\\n', '\n')
<re.Match object; span=(0, 1), match='\n'>

In the first case, the regex got a pattern that actually has a newline in it, and matched it to a string that contains a newline. In the second case, the regex got a pattern that contains a single backslash followed by a lowercase n, and interpreted this regex syntax to mean to look for a newline character (and matched it to the same string).

But \b couldn’t possibly work like that, because the regex \b syntax doesn’t correspond to a text character; it corresponds to a more complex regex rule - and because the string \b syntax does correspond to a text character that regex doesn’t consider special.

>>> re.match('\b', '\b')
<re.Match object; span=(0, 1), match='\x08'>
>>> re.match('\\b', '\b')
>>>

In the first case, a literal backspace character matches against a literal backspace character. In the second case, a backslash and lowercase b are interpreted as the rule to match a word boundary, but the string doesn’t contain any word boundaries.
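To see the word-boundary rule actually fire, here is a small sketch against one of the sample lines from the question, comparing the raw-string pattern with the non-raw one (the names `hit` and `miss` are mine):

```python
import re

text = '4557 BOARD ST SW'

hit = re.search(r'\bST\b', text)   # regex word boundaries around ST
miss = re.search('\bST\b', text)   # literal backspace characters around ST

print(hit.span())  # (11, 13)
print(miss)        # None
```

The second search fails simply because the text contains no backspace characters.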

Perhaps you’re wondering why a newline character that you represented with \n in the code, still looks like \n in a string representation; but a backspace character represented with \b is represented back to you as \x08. The answer is simply convention. The string simply contains characters; it has no memory of what the original string literal looked like. (After all, a string can trivially be created without a string literal ever having been involved in the Python source at all: there are built-in string constants, and functions like chr.) So each character has a standard “normalized” form used when Python creates a representation of the string.
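For example, a backspace made with chr is indistinguishable from one written as a literal, and both get the same representation:

```python
from_literal = '\b'
from_chr = chr(8)

print(from_literal == from_chr)  # True
print(repr(from_chr))            # '\x08'
print(repr(chr(10)))             # '\n'
```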

We can easily print a table for reference:

for i in range(256):
    display = repr(chr(i))[1:-1]
    print(f'{display:>4}', end='')
    if (i % 16) == 15:
        print()

Which gives:

\x00\x01\x02\x03\x04\x05\x06\x07\x08  \t  \n\x0b\x0c  \r\x0e\x0f
\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
       !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
   P   Q   R   S   T   U   V   W   X   Y   Z   [  \\   ]   ^   _
   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~\x7f
\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f
\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f
\xa0   ¡   ¢   £   ¤   ¥   ¦   §   ¨   ©   ª   «   ¬\xad   ®   ¯
   °   ±   ²   ³   ´   µ   ¶   ·   ¸   ¹   º   »   ¼   ½   ¾   ¿
   À   Á   Â   Ã   Ä   Å   Æ   Ç   È   É   Ê   Ë   Ì   Í   Î   Ï
   Ð   Ñ   Ò   Ó   Ô   Õ   Ö   ×   Ø   Ù   Ú   Û   Ü   Ý   Þ   ß
   à   á   â   ã   ä   å   æ   ç   è   é   ê   ë   ì   í   î   ï
   ð   ñ   ò   ó   ô   õ   ö   ÷   ø   ù   ú   û   ü   ý   þ   ÿ

(I didn’t put any padding space between the individual representations, to avoid awkward line-wrapping on the forum.)

The rules are:

- soft hyphen and non-breaking space are represented with a hex code escape, so that they don’t get confused for a regular hyphen or space
- tab, newline and carriage return are represented with their familiar shortcuts
- other control characters (the C0 and C1 control characters, as well as the “rubout” character 0x7f) are represented with a hex code escape
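As far as I know, these rules line up with str.isprintable(): repr() escapes the characters for which it returns False, plus the backslash itself. A small sketch to check that over the same 256 code points (`is_escaped` is a hypothetical helper):

```python
def is_escaped(ch):
    """True when repr() shows ch as an escape sequence rather than itself."""
    return repr(ch)[1:-1] != ch


for ch in ['\xad', '\xa0', '\t', '\x7f', '\x85', 'A', '\\']:
    print(f'{ord(ch):#04x} escaped={is_escaped(ch)} printable={ch.isprintable()}')

# Every mismatch between "escaped" and "not printable" in the first 256
# code points; only the backslash is expected to show up here.
mismatches = [chr(i) for i in range(256)
              if is_escaped(chr(i)) == chr(i).isprintable()]
print(mismatches)
```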

The \x escape sequences support exactly two hex digits, so they can only be used for Unicode code points up to 255. Beyond that, \u escape sequences (using exactly four hex digits) are used for the Basic Multilingual Plane (first 65536 code points), and \U (with exactly eight hex digits, even though the first two must always be zero) beyond that. Python reports these back to you when the code point is unassigned or has some special purpose (that could be a whole other post). But most of the time, if you put one in a string literal, the corresponding representation will just show the actual character, just like the print output:

>>> '☺'
'☺'
>>> '\u263a'
'☺'
>>> print('\u263a')
☺
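A few more checks along the same lines. The private-use code point U+E000 is my own choice of example for a “special purpose” character that Python reports back as an escape:

```python
letter = '\x41'            # exactly two hex digits
smiley = '\u263a'          # exactly four hex digits
smiley_long = '\U0000263a' # exactly eight hex digits, same character
private_use = '\ue000'     # private-use code point, shown escaped by repr

print(letter == 'A')           # True
print(smiley == smiley_long)   # True
print(len(private_use))        # 1
print(repr(private_use))       # '\ue000'
```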

As for testing the replacement on regex101.com: sorry, I can’t help with this. I recommend taking the time to get familiar with how Python’s string syntax works, and then using the REPL for testing.
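A rough sketch of what testing a replacement in the REPL could look like, using a few of the raw data lines from the question. The pattern here is my own illustration of the “standardize instead of remove” idea from the p.s., not the original removere:

```python
import re

samples = [
    'JIM GREEN 4557 BOARD ST SW',
    'DAVE JONES 457 AGATE RD APT 29',
    'KIM CASEY 457 AGATE RD APARTMENT 30',
]

# Assumption: fold "APARTMENT", "APT" and "APT." into plain "APT".
pattern = re.compile(r'\b(?:APARTMENT|APT)\b\.?')

results = [pattern.sub('APT', line) for line in samples]
for before, after in zip(samples, results):
    print(f'{before!r} -> {after!r}')
```

Keeping the sample lines in a list makes it easy to rerun the same check after every change to the pattern.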