Problem with really butchered text output

Hey guys, I am taking web elements and putting them into .txt files. The problem is that I can't always get a proper translation while web scraping (weird unicode), so I used encode(errors="ignore") and it turned the text into a bytes object, so now all the strings start with "b'".
But if I use decode, it doesn't work. ASCII seems to work, but it's not always working with my code (very complicated, a lot of variables).

So if I do a webdriver.get, then get an element, convert the element to utf-8, and write it to a file.txt,

what is the best way to clean up the code and remove any of the "b'" before the strings, or the janky unicode \x81238 type codes???

Hello,

This is as expected. It is doing what it is supposed to be doing. See definition below.

Maybe the default encoding on your system is something other than UTF-8? Have you tried explicitly telling the encode and decode functions which encoding you want to use?

Here is a simple test script. Note that you start with the original string, of type str. It is converted to raw bytes (type bytes) prior to sending. At the receiving end, the bytes are decoded to get back the original message as type str.

strVar = 'eggs and ham'
print('\nOriginal string: ', strVar)

encode_str = strVar.encode(encoding='utf-8')
print('After encoding: ', encode_str)

decode_str = encode_str.decode(encoding='utf-8')
print('After decoding: ', decode_str)

After running this test snippet, you should get:

Original string:  eggs and ham
After encoding:  b'eggs and ham'
After decoding:  eggs and ham

If you tell it which coding scheme that you want explicitly, there will be no ambiguity.
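To see the ambiguity this avoids, here is a minimal sketch (the variable names are only for illustration) that decodes the same bytes with the matching codec, a mismatched one, and ASCII:

```python
text = 'café'
raw = text.encode('utf-8')       # the é becomes two bytes: \xc3\xa9

right = raw.decode('utf-8')      # same codec on both sides: original text
wrong = raw.decode('latin-1')    # different codec: no error, but garbled text

print('utf-8  :', right)
print('latin-1:', wrong)

try:
    raw.decode('ascii')          # ASCII has no meaning for bytes above 0x7f
except UnicodeDecodeError as exc:
    print('ascii failed:', exc.reason)
```

The mismatched decode does not raise an error; it just quietly produces the wrong characters, which is one common source of "weird unicode" in output files.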

Those janky codes are how Python displays the bytes of special characters that fall outside the regular ASCII range once they have been encoded to type bytes.

print('\nSpecial character: Ä')
special = 'Ä'
sp_encoded = special.encode(encoding='utf-8')
print('After encoding: ', sp_encoded)
sp_decoded = sp_encoded.decode('utf-8')
print('After decoding: ', sp_decoded)

If you run the snippet, you get this:

Special character: Ä
After encoding:  b'\xc3\x84'
After decoding:  Ä

So, if you are getting byte strings, of any form, you should decode.
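As a rough sketch of that advice: wrapping a bytes object in str() keeps the b'...' wrapper, while decode() gives back the text. Calling decode() on something that is already a str fails, since str has no decode method in Python 3, so the hypothetical helper to_text below (not part of any library) only decodes when it is actually handed bytes:

```python
def to_text(value, encoding='utf-8'):
    """Return a str: decode bytes-like input, pass anything else through str()."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return str(value)


raw = 'Ä and eggs'.encode('utf-8')

print(str(raw))                  # str() on bytes keeps the b'...' wrapper and escapes
print(to_text(raw))              # decode() returns the original text
print(to_text('already text'))   # a str is passed through unchanged
```

If decoding sometimes complains about the wrong data type, checking type(value) at the point where it fails will likely show a str (or a list) rather than bytes.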

Encoding

An encoding is the set of rules for translating a string of Unicode characters to a sequence of raw bytes, and for extracting a string from a sequence of bytes. This translation back and forth between bytes and strings is defined by two terms: Encoding and Decoding.

• We encode from string to raw bytes.

• We decode from raw bytes to strings.

Encodings really only apply when text is stored or transferred externally, in files and other mediums. Text is translated to and from an encoding-specific format only when it is transferred to or from external text files, byte strings, or APIs with specific encoding requirements. Once in memory, though, strings have no encoding.
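For the file-writing case, this means the text can stay a str all the way through, and the encoding is named only when the file is opened. A minimal sketch, assuming the scraped text arrives as a Python str (the sample text and the temporary path are stand-ins, not taken from the original code):

```python
import os
import tempfile

scraped = 'Ä eggs and ham'   # stand-in for text taken from a web element

# Write to a throwaway location; in real code this would be your file.txt.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')

# No manual encode(): open() encodes the str to UTF-8 bytes on write.
with open(path, 'w', encoding='utf-8') as f:
    f.write(scraped)

# Reading with the same encoding decodes the bytes back to a str.
with open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

print(read_back == scraped)
```

Written this way there is no bytes object in your own code, so no b'...' prefix ends up in the file.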


In Notepad++ I am using UTF-8 encoding; it will error if the special characters are inside it now.
Also, when I encode, it may work fine, but if I try to decode, it sometimes gives me errors saying that it's the wrong data type to decode, even if I append str(), list(), strip() or read(), or iterate through it. The only thing I have found that works for some objects is ascii()?