UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-14: character maps to <undefined>

morningsashimi · December 31, 2021, 2:04pm

Hi! How can I solve "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-14: character maps to <undefined>"

for this code:
stockList = ts.get_stock_basics()

steven.daprano · January 1, 2022, 12:30am

Hi Jonathan,

Sorry, but you have not provided enough information to solve the problem.
I’m going to have to do a lot of guessing.

The absolute minimum we would need to see would be the entire
traceback, showing the full error starting with the line “Traceback” and
ending with the error message. But even that may not be enough.

What is “ts”? Where did it come from? What does it do?

Why is it using the ‘charmap’ codec? Can you tell it to use a different
encoding instead of ‘charmap’?

Are you reading data from a website? What encoding is the website using?
If you’re unsure, it is probably using UTF-8. Can you tell the mystery
ts object to use encoding=‘utf-8’?

You could try running your script on an Apple Mac or Linux instead of
Windows; that could fix the problem. (The ‘charmap’ error usually comes
from a legacy code page such as cp1252, which is typically the default
only on Windows.)
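
To help answer the encoding questions above, here is a minimal sketch for checking which encodings Python is actually using when the script runs. The helper name describe_text_encodings is made up for illustration; it only reads standard-library attributes and assumes Python 3.7 or later.

```python
import locale
import sys


def describe_text_encodings():
    """Collect the settings that affect how print() encodes text."""
    stdout = sys.stdout
    return {
        # Encoding used when print() writes to standard output.
        "stdout_encoding": getattr(stdout, "encoding", None),
        # False when output is redirected to a file or pipe (e.g. by an IDE).
        "stdout_is_console": bool(stdout is not None and stdout.isatty()),
        # Locale default used for files opened without an encoding argument.
        "preferred_encoding": locale.getpreferredencoding(False),
        # 1 when UTF-8 mode is enabled (Python 3.7+), otherwise 0.
        "utf8_mode": sys.flags.utf8_mode,
    }


info = describe_text_encodings()
for key, value in info.items():
    print(f"{key}: {value}")
```

If stdout_encoding reports something like cp1252, Chinese text passed to print() will fail in the way shown in the thread title.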

morningsashimi · January 1, 2022, 6:30am

I tried utf-8 in the code like this before: stockList = ts.get_stock_basics(encoding="utf-8"), but it gave me a different error message. I don’t know where to specify the encoding.

import pymysql
import sys
import tushare as ts

try:
    db = pymysql.connect(host="localhost", user="root", password="cixlerui83466", database="pythonstock", port=3306)
except pymysql.MySQLError:
    print("Error when connecting to DB.")
    sys.exit()
cursor = db.cursor()
stockList = ts.get_stock_basics()

# Table creation is commented out while debugging the call above.
# for code in stockList.index:
#     try:
#         createSql = 'CREATE TABLE stock_' + code + '( date varchar(255), open float, close float, high float, low float, vol int(11))'
#         cursor.execute(createSql)
#     except:
#         print('Error when Creating table for:' + code)

db.commit()
cursor.close()
db.close()

Traceback (most recent call last):
  File "D:\Users\jon_j\eclipse-workspace\HelloP8\src\CreateTable.py", line 11, in <module>
    stockList = ts.get_stock_basics()
  File "D:\Python310\lib\site-packages\tushare\stock\fundamental.py", line 52, in get_stock_basics
    print("\u672c\u63a5\u53e3\u5373\u5c06\u505c\u6b62\u66f4\u65b0\uff0c\u8bf7\u5c3d\u5feb\u4f7f\u7528Pro\u7248\u63a5\u53e3\uff1ahttps://waditu.com/document/2")
  File "D:\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-14: character maps to <undefined>

eryksun · January 1, 2022, 7:20am

In Python 3.6+, output to a Windows console file uses Unicode. All other files, however, default to using the process ANSI code page, which is typically a legacy encoding. The latter includes the standard files (i.e. stdin, stdout, stderr) when they’re redirected to something other than a console, such as a disk file or pipe. In your case the encoding of sys.stdout is code page 1252, maybe because the output is piped to Eclipse.

There are two environment variables that you can set to override the encoding of sys.stdout. To force just the standard files to use UTF-8, set PYTHONIOENCODING=utf-8. To use UTF-8 as the default in all cases, enable UTF-8 mode via PYTHONUTF8=1. To configure the latter for a single running instance of Python, use the command-line option -X utf8 instead of setting PYTHONUTF8=1.
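
As a rough way to see these settings take effect, the sketch below launches a child Python process with its stdout connected to a pipe, once with PYTHONIOENCODING=utf-8 and once with -X utf8. The child script and the helper name run_child are invented for this illustration, and it assumes Python 3.7 or later.

```python
import os
import subprocess
import sys

# Child program: report sys.stdout's encoding, then print some Chinese text.
# The escapes keep this snippet ASCII-only; they spell the first three
# characters of the tushare message.
CHILD = r"import sys; print(sys.stdout.encoding); print('\u672c\u63a5\u53e3')"


def run_child(extra_env=None, extra_args=()):
    """Run CHILD with stdout captured through a pipe; return the raw bytes."""
    env = dict(os.environ)
    # Start from a clean slate so only the setting under test applies.
    env.pop("PYTHONIOENCODING", None)
    env.pop("PYTHONUTF8", None)
    env.update(extra_env or {})
    cmd = [sys.executable, *extra_args, "-c", CHILD]
    result = subprocess.run(cmd, env=env, capture_output=True, check=True)
    return result.stdout


# Option 1: force only the standard streams to UTF-8.
via_env = run_child(extra_env={"PYTHONIOENCODING": "utf-8"}).decode("utf-8")
# Option 2: enable UTF-8 mode for this one interpreter instance.
via_flag = run_child(extra_args=["-X", "utf8"]).decode("utf-8")

# Show the encoding each child reported for its sys.stdout.
print(via_env.split()[0], via_flag.split()[0])
```

In an IDE such as Eclipse, the equivalent would presumably be to add the environment variable or interpreter argument to the run configuration, though where that setting lives depends on the IDE.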

steven.daprano · January 1, 2022, 7:39am

Jonathan said:

“I tried utf-8 in the code like this before: stockList =
ts.get_stock_basics(encoding= “utf-8”) . but gave me other error
message.”

Don’t be shy. Copy and paste the error message so we can see it.

“I don’t know how to put the encoding.”

What does the tushare documentation say? I can only find these, which
are all in Chinese:

You could try Google Translate.

steven.daprano · January 1, 2022, 7:45am

Looking at the source code here:

https://github.com/waditu/tushare/blob/093856995af0811d3ebbe8c179b8febf4ae706f0/tushare/stock/fundamental.py#L22

    
      
from lxml import etree
import re
import time
from pandas.compat import StringIO
from tushare.util import dateu as du
try:
    from urllib.request import urlopen, Request
except ImportError:
    from urllib2 import urlopen, Request

def get_stock_basics(date=None):
    """
        Get basic information on companies listed in Shanghai and Shenzhen
    Parameters
    date: date as YYYY-MM-DD; defaults to the previous trading day; historical data is currently only available from 2016-08-09 onward

    Return
    --------
    DataFrame
               code, stock code
               name, name

the get_stock_basics() function is hard-coded to use the GBK encoding:

text = text.decode('GBK')

so at this point I would say it is a bug in the tushare library.

eryksun · January 1, 2022, 9:55am

The get_stock_basics() call prints “本接口即将停止更新，请尽快使用Pro版接口：https://waditu.com/document/2”, which translates as “This interface will stop updating, please use the Pro version interface as soon as possible: https://waditu.com/document/2”. Here’s the source from PyPI:

def get_stock_basics(date=None):
    """
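        Get basic information on companies listed in Shanghai and Shenzhen
    Parameters
    date: date as YYYY-MM-DD; defaults to the previous trading day; historical data is currently only available from 2016-08-09 onward

    Return
    --------
    DataFrame
               code, stock code
               name, name
               industry, industry segment
               area, region
               pe, price-to-earnings ratio
               outstanding, tradable share capital
               totals, total share capital (in units of 10,000)
               totalAssets, total assets (in units of 10,000)
               liquidAssets, current assets
               fixedAssets, fixed assets
               reserved, capital reserve
               reservedPerShare, capital reserve per share
               eps, earnings per share
               bvps, net assets per share
               pb, price-to-book ratio
               timeToMarket, listing date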
    """
    print("本接口即将停止更新，请尽快使用Pro版接口：https://tushare.pro/document/2")
    wdate = du.last_tddate() if date is None else date
    wdate = wdate.replace('-', '')
    if wdate < '20160809':
        return None
    datepre = '' if date is None else wdate[0:4] + wdate[4:6] + '/'
    request = Request(ct.ALL_STOCK_BASICS_FILE%(datepre, '' if date is None else wdate))
    text = urlopen(request, timeout=10).read()
    text = text.decode('GBK')
    text = text.replace('--', '')
    df = pd.read_csv(StringIO(text), dtype={'code':'object'})
    df = df.set_index('code')
    return df

The hard-coded GBK decode() in the source is unrelated to the encoding error. One can reasonably assume that this is a known encoding for the text on the website that’s being accessed via urlopen().
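
To make the distinction concrete, here is a small sketch of what the decode step does. The sample bytes are invented stand-ins for what urlopen().read() might return, and the standard csv module is used in place of pandas to keep the example self-contained.

```python
import csv
import io

# Hypothetical sample standing in for the bytes returned by urlopen().read().
raw_bytes = "code,name\n600000,浦发银行\n".encode("gbk")

# Decoding converts bytes to str using the encoding the server is known to
# use. It never involves sys.stdout, so it cannot raise the 'charmap' error.
text = raw_bytes.decode("GBK")

# Parse the decoded text, much as the library does with pd.read_csv(StringIO(text)).
rows = list(csv.reader(io.StringIO(text)))

# ascii() escapes non-ASCII characters, so this print works on any console.
print(ascii(rows))
```

The error in the traceback is a UnicodeEncodeError, which can only come from turning str into bytes, that is, from the print() call rather than from decode().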

The problem is just the print() call and the encoding of sys.stdout. The Chinese text that it tries to print can’t be encoded with code page 1252, which is an 8-bit encoding for Western European alphabets. Refer to my previous post for a few simple ways to force sys.stdout to use UTF-8.
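
The failure can be reproduced without tushare or Eclipse. The sketch below uses an in-memory text stream configured for cp1252 as a stand-in for the redirected sys.stdout, then switches it to UTF-8 with TextIOWrapper.reconfigure() (Python 3.7+). Note that reconfigure() is an extra option not mentioned earlier in this thread; the environment variables remain the simpler fix.

```python
import io

MESSAGE = "本接口即将停止更新，请尽快使用Pro版接口：https://tushare.pro/document/2"

# Stand-in for a redirected sys.stdout that uses code page 1252.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="cp1252", newline="\n")

try:
    print(MESSAGE, file=stream)
    failure = None
except UnicodeEncodeError as exc:
    # The first 15 characters are Chinese and have no cp1252 mapping.
    failure = exc

# Switch the same stream to UTF-8 and try again.
stream.reconfigure(encoding="utf-8")
print(MESSAGE, file=stream)
stream.flush()
written = raw.getvalue().decode("utf-8")

# The exception text is plain ASCII, so it is safe to print anywhere.
print(failure)
```

In a real script, calling sys.stdout.reconfigure(encoding="utf-8") before ts.get_stock_basics() might have the same effect, assuming sys.stdout is an ordinary text stream and whatever reads the output (here, Eclipse) interprets it as UTF-8.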

steven.daprano · January 1, 2022, 2:33pm

Nicely spotted. Thanks Eryk Sun.