How do I replace a hex value in a string with something printable?

c-rob · May 22, 2024, 12:41pm

I have Python 3.11 on Windows 10. I’m still fairly new to Python.

I have a string with data I got from a website. The website uses an extended ascii character for the minus sign and I’d like to change that to a normal printable dash. The hex value of the odd character is \x2212.

But I’ve never seen any way to represent a hex value in a Python string so that I can use the .replace() method.

This is what I’ve tried:

def cleanhtmllist(mylist): 
    r'''Change extended ascii to normal characters.
    In the Wikipedia page there is an extended ascii character that must be changed to a minus sign. 
    In: list of strings
    Out: List of strings

    Change this, to this: 
    \x2212 (dec 8722)  -
    '''
    procname = str(inspect.stack()[0][3]) + ":"
    
    for l in mylist: 
        l = l.replace('\x2212', '-')
        newlist.append(l)
        
    return newlist

The function actually receives a list of strings which I have to clean up.

I’m also having a problem finding the extended non-printing characters in this code which calls the function. This regex never finds the non-printing characters.

# I didn't know how to represent the hex ctr 0x2212 in Python so I used \x2212. 
ind_row_data = ['Walmart', 'more stuff', 'something else',  '\x2212$220,500']
tstr = ', '.join(ind_row_data) # Change list to string.
if re.match(r'[^ -~]', tstr): # Find high ascii ctrs.
    print(f"{ind_row_data[0]} profit={profit}")
    print(f"Row has ext ascii: {ind_row_data}")
    ind_row_data = cleanhtmllist(ind_row_data)

So the cleanhtmllist() function is not running.

alicederyn · May 22, 2024, 12:50pm

'\u2212'

The “u” is for Unicode code point

alicederyn · May 22, 2024, 12:53pm

If you print this out, you’ll see it’s actually

'"12$220,500'

\x only looks at the next two characters, in this case 22, which gives you a double-quote.
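
A quick way to see this for yourself (just a sketch; the variable names `broken` and `fixed` are made up for illustration):

```python
# '\x' consumes exactly two hex digits, so '\x2212' is '\x22' + the text '12'.
broken = '\x2212$220,500'
# '\u' consumes exactly four hex digits, giving the single character U+2212.
fixed = '\u2212$220,500'

print(repr(broken))            # the string starts with a double quote, then "12"
print(len(broken), len(fixed)) # the broken one is two characters longer
print(hex(ord(fixed[0])))      # code point of the first character of the fixed string
```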

c-rob · May 22, 2024, 2:55pm

Ok, to check for high ascii characters I actually had to use hex codes in the regex expression like this:

if re.search(r'[^\x20-\x7e]', sourcestr): # Find high ascii ctrs.

My other method shown above does not work with Python.

And to replace characters I also had to use regex.

import inspect
import re

def cleanhtmllist(mylist):
    r'''Change extended ascii to normal characters.
    In the Wikipedia page there is an extended ascii character that must be changed to a minus sign (plain dash).
    In: list of strings
    Out: List of strings
    Change this, to this:
    U+2212 (dec 8722)  -
    U+2013 (dec 8211)  -
    '''
    procname = str(inspect.stack()[0][3]) + ":"
    newlist: list = []
    for l in mylist:
        l = re.sub('\u2212', '-', l)  # minus sign
        l = re.sub('\u2013', '-', l)  # en dash
        newlist.append(l)

    return newlist

So these both work now.

kknechtel · May 22, 2024, 8:38pm

Your first pattern does work in Python, if your test data actually contains the characters you’re trying to filter out:

>>> import re
>>> re.match(r'[^ -~]', '\u2212$220,500')
<re.Match object; span=(0, 1), match='−'>
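
One more thing that may have bitten the original snippet: re.match only tests at the very start of the string, while re.search scans the whole string. A rough sketch using the sample row from the question (with the escape corrected to \u2212; the names `row`, `joined` and `pattern` are just for illustration):

```python
import re

row = ['Walmart', 'more stuff', 'something else', '\u2212$220,500']
joined = ', '.join(row)
pattern = re.compile(r'[^ -~]')  # anything outside printable ASCII

# match() anchors at position 0, where 'W' is ordinary printable ASCII.
at_start = pattern.match(joined)
# search() keeps scanning and finds the minus sign later in the string.
anywhere = pattern.search(joined)

print(at_start)
print(anywhere)
```

So even with correct test data, the joined string needs re.search rather than re.match, because the odd character is not the first one.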

Also, there is no such thing as “high ascii”. ASCII only defines characters for byte values up to 0x7f. In a system that represents characters in some single-byte encoding, ASCII leaves the higher byte values undefined; what they mean is up to that encoding. The so-called “code page” encodings are those which map bytes with values 00…7f identically to ASCII, and 80…ff to some custom set of characters.

But bytes fundamentally are not characters; every such encoding is just that - an encoding - and such encodings, inherently and obviously, can only possibly represent 256 different characters.
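
To make the bytes-versus-characters point concrete, here is a small sketch. The choice of cp1252 and latin-1 is only for illustration; nothing in this thread says which encoding the website used.

```python
minus = '\u2212'

# One character, three bytes in UTF-8.
utf8_bytes = minus.encode('utf-8')

# cp1252 is a single-byte code page and has no slot for U+2212.
try:
    minus.encode('cp1252')
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# The same high byte means different characters in different encodings.
same_byte = b'\x96'
as_cp1252 = same_byte.decode('cp1252')   # en dash, U+2013
as_latin1 = same_byte.decode('latin-1')  # a C1 control character, U+0096

print(utf8_bytes, fits_cp1252, repr(as_cp1252), repr(as_latin1))
```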

We use more than that nowadays, which is why Unicode exists. It’s also, indirectly, why the attempted \x2212 sequences didn’t work: when Python sees \x in the string literal, it doesn’t keep scanning for hex digits - it looks for a specific number of them: two. To provide four digits, we use \u, and to provide eight (although this is overkill, as they must start with either 000 or 001), we use \U.
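
For reference, a short sketch of the equivalent spellings of the same character. The \N{...} form and chr() are not mentioned above, but both are standard Python:

```python
four_digits = '\u2212'        # \u takes exactly four hex digits
eight_digits = '\U00002212'   # \U takes exactly eight hex digits
by_name = '\N{MINUS SIGN}'    # look the character up by its Unicode name
by_number = chr(0x2212)       # build it from the code point at runtime

# \x still has its place for code points that fit in two hex digits.
e_acute = '\xe9'

print(four_digits == eight_digits == by_name == by_number)
print(e_acute == '\u00e9')
```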

No, using regex for the replacement is definitely not necessary. The ordinary replace method of strings works perfectly fine:

>>> '\u2212$220,500'.replace('\u2212', '-')
'-$220,500'

But if you want to make multiple single-character replacements, please also look into str.maketrans and str.translate.
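
A minimal sketch of that approach, assuming you want to fold the minus sign and the en dash from this thread to a hyphen. The em dash entry, the `DASHES` table name and the `clean_row` helper are my own additions for illustration:

```python
# Build the translation table once; keys are single characters.
DASHES = str.maketrans({
    '\u2212': '-',  # minus sign
    '\u2013': '-',  # en dash
    '\u2014': '-',  # em dash (assumed extra, not from the original data)
})

def clean_row(row):
    """Return a new list with dash-like characters folded to '-'."""
    return [s.translate(DASHES) for s in row]

cleaned = clean_row(['Walmart', '\u2212$220,500', '2019\u20132020'])
print(cleaned)
```

This makes one pass per string no matter how many characters are in the table, and it returns a new list instead of appending to one defined elsewhere.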

And if the goal is to make sure you end up with a string that is representable in ASCII, keep in mind that there are a lot more possible characters in the input.
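
If it helps, something like the following could be used to survey which non-ASCII characters actually show up before deciding what to replace. It is only a sketch; `non_ascii_report` is a hypothetical helper name and the sample text is made up:

```python
import unicodedata

def non_ascii_report(text):
    """Return sorted (code point, name) pairs for each distinct non-ASCII character."""
    found = {
        (f'U+{ord(ch):04X}', unicodedata.name(ch, '<unnamed>'))
        for ch in text
        if not ch.isascii()
    }
    return sorted(found)

report = non_ascii_report('\u2212$220,500 \u2013 caf\u00e9')
for code, name in report:
    print(code, name)
```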

cameron · May 22, 2024, 11:32pm

Aye.

But I myself have been in the “fold funky dashes to ASCII dashes (well,
minus sign)” camp: man pages. So much so that I’ve got a one line
“undash” shell script which runs:

 sed 's/−/-/g' ${1+"$@"}

which doubtless produces some mojibake.

Why? Because when I’m searching a man page for some option, eg “--foo” I
get …annoyed when a plain old /--foo fails because they’re em
dashes.
