[Bug] UnicodeDecodeError on Windows with non-ASCII characters (GBK codec error) #209

@my5icol

Description

When running skill-seekers scrape on Windows (Chinese locale), the tool fails with a GBK codec error because several file operations don't specify encoding='utf-8'.

Error Message

Error: 'gbk' codec can't decode byte 0xac in position 206: illegal multibyte sequence

Environment

  • OS: Windows 10/11 (Chinese locale)
  • Python: 3.14
  • skill-seekers: 2.1.1

Root Cause

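On Chinese-locale Windows, Python's open() falls back to the system locale encoding, GBK (cp936), when no encoding is given. The following file operations in doc_scraper.py don't specify UTF-8 encoding, so UTF-8 files containing non-ASCII characters fail to decode: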

  1. load_config() function (~line 1390):
     with open(config_path, 'r') as f:  # Missing encoding='utf-8'
  2. check_existing_data() function (~line 1474):
     with open(f"{data_dir}/summary.json", 'r') as f:  # Missing encoding='utf-8'

Suggested Fix

Add encoding='utf-8' to all file open operations:

# load_config()
with open(config_path, 'r', encoding='utf-8') as f:
    config = json.load(f)

# check_existing_data()
with open(f"{data_dir}/summary.json", 'r', encoding='utf-8') as f:
    summary = json.load(f)
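
Since the fix needs to cover all file open operations, a quick scan may help find any that were missed. The following is only a rough sketch, not part of skill-seekers: the helper name find_open_without_encoding is hypothetical, and it only catches direct calls to the built-in open() (not Path.open(), io.open(), or aliased names).

```python
import ast


def find_open_without_encoding(source, filename="<string>"):
    """Return sorted line numbers of text-mode open() calls lacking encoding.

    Hypothetical helper for illustration; only direct calls to a name
    spelled `open` are inspected.
    """
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        if not (isinstance(node.func, ast.Name) and node.func.id == "open"):
            continue
        # encoding given as a keyword, or as the 4th positional argument
        if any(kw.arg == "encoding" for kw in node.keywords):
            continue
        if len(node.args) >= 4:
            continue
        # Locate the mode: 2nd positional argument or mode= keyword
        mode = node.args[1] if len(node.args) >= 2 else None
        for kw in node.keywords:
            if kw.arg == "mode":
                mode = kw.value
        # Binary mode never decodes text, so no encoding is needed
        if (
            isinstance(mode, ast.Constant)
            and isinstance(mode.value, str)
            and "b" in mode.value
        ):
            continue
        hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    import sys

    # Usage: python find_open.py path/to/doc_scraper.py
    for path in sys.argv[1:]:
        with open(path, "r", encoding="utf-8") as f:
            for lineno in find_open_without_encoding(f.read(), path):
                print(f"{path}:{lineno}: open() without encoding")
```

Python 3.10 and later can also report these calls at runtime: running with -X warn_default_encoding (or setting PYTHONWARNDEFAULTENCODING=1) emits an EncodingWarning wherever the locale default encoding is used implicitly.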

Additional Notes

This issue can affect any Windows user whose system default encoding is not UTF-8 (Chinese, Japanese, Korean locales, etc.) whenever the files being read contain non-ASCII characters.
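
The failure can be reproduced on any platform by passing encoding='gbk' explicitly, which simulates the default that open() picks up on Chinese-locale Windows. This is a minimal sketch: the function names and the sample JSON content are made up for illustration, and it assumes the files in question are stored as UTF-8.

```python
import json
import os
import tempfile


def load_json(path, encoding):
    """Read a JSON file in text mode with an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return json.load(f)


def reproduce():
    """Return (gbk_failed, utf8_data) for a UTF-8 file with non-ASCII text."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        # Write sample data as UTF-8; "\u20ac" is the euro sign,
        # whose UTF-8 encoding ends in byte 0xac.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"name": "\u20ac"}, f, ensure_ascii=False)

        # Simulate the Chinese-locale default: decoding as GBK fails.
        try:
            load_json(path, "gbk")
            gbk_failed = False
        except UnicodeDecodeError:
            gbk_failed = True

        # With the suggested fix, the same file loads correctly.
        utf8_data = load_json(path, "utf-8")
        return gbk_failed, utf8_data
    finally:
        os.remove(path)


if __name__ == "__main__":
    failed, data = reproduce()
    print("GBK decode failed:", failed)
    print("UTF-8 decode result:", data)
```

Until a fix is released, enabling Python's UTF-8 mode (python -X utf8, or the PYTHONUTF8=1 environment variable) makes open() default to UTF-8 and may serve as a workaround.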
