fix(shell): resolve Windows encoding issues for non-ASCII output #2423
tanzhenxin merged 20 commits into main
- Add Python scripts for generating large text output (2000 lines)
- Add Chinese progress bar shell script for testing terminal output
- Add Windows GBK encoding reproduction tests for qwen-code
These utilities help debug shell output handling and Windows encoding issues. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Add wrapCommandForWindowsEncoding() to prefix non-ASCII commands with chcp 65001 on Windows with non-UTF-8 codepages. Fix CP936 mapping from gb2312 to gbk (GBK is the correct superset). Update Windows encoding test scripts to demonstrate the fix. This ensures Chinese and other non-ASCII characters display correctly when running commands on Windows systems with GBK/CP936 codepage. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
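The CP936 mapping change can be pictured with a small lookup table. This is a minimal sketch, not the project's actual table: the function name `windowsCodePageToEncoding`, the table name, and every entry other than 936 are assumptions chosen for illustration.

```typescript
// Hypothetical code page table; only a handful of entries are shown.
const CODE_PAGE_TO_ENCODING: Record<number, string> = {
  932: 'shift_jis',
  936: 'gbk', // previously 'gb2312'; GBK is a superset of GB2312
  950: 'big5',
  1252: 'windows-1252',
  65001: 'utf-8',
};

// Returns the encoding label for a Windows code page, or null if unknown.
function windowsCodePageToEncoding(codePage: number): string | null {
  const encoding = CODE_PAGE_TO_ENCODING[codePage];
  return encoding === undefined ? null : encoding;
}
```

Mapping 936 to `gbk` matters because a decoder limited to GB2312 cannot represent characters that exist only in the GBK extension.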
Simplify wrapCommandForWindowsEncoding to always prefix with chcp 65001 on Windows with non-UTF-8 codepage, rather than only for non-ASCII commands. This ensures script files written by qwen-code are also interpreted correctly by cmd.exe. Add WINDOWS_UTF8_CODE_PAGE constant and update test script for PowerShell 5.1 compatibility. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Return forceUtf8Output flag from wrapCommandForWindowsEncoding to signal that output should be decoded as UTF-8 when we've switched the codepage to 65001. This ensures consistent encoding for both command input and output on Windows with non-UTF-8 system codepage. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
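A rough sketch of the approach described in this commit, under stated assumptions: the parameter list, the `WrappedCommand` type, and the `>nul &&` joiner are hypothetical. The commit message only confirms that the command is prefixed with `chcp 65001` and that a `forceUtf8Output` flag is returned.

```typescript
// Hypothetical return shape: the wrapped command plus a decoding hint.
interface WrappedCommand {
  command: string;
  forceUtf8Output: boolean;
}

function wrapCommandForWindowsEncoding(
  command: string,
  platform: string, // e.g. process.platform
  activeCodePage: number, // e.g. 936 for GBK
): WrappedCommand {
  const utf8CodePage = 65001;
  // Nothing to do off Windows, or when the console is already UTF-8.
  if (platform !== 'win32' || activeCodePage === utf8CodePage) {
    return { command, forceUtf8Output: false };
  }
  // Assumption: silence chcp's own status line so it does not pollute output.
  return {
    command: `chcp ${utf8CodePage} >nul && ${command}`,
    forceUtf8Output: true,
  };
}
```

The flag lets the caller skip encoding detection and decode the output as UTF-8, since the wrapper itself switched the code page.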
- Add automatic LF to CRLF conversion when writing .bat/.cmd files
- Handle null encoding detection gracefully in shellExecutionService
- Add comprehensive tests for CRLF conversion edge cases
This ensures Windows batch files work correctly, as cmd.exe requires CRLF line endings for proper parsing of multi-line constructs. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
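The LF to CRLF conversion could look roughly like the following. The helper names (`isBatchFile`, `toCrlf`, `prepareContentForWrite`) are hypothetical and not taken from the PR. The point of the sketch is that existing CRLF sequences must not be doubled.

```typescript
// True for .bat and .cmd paths, case-insensitively.
function isBatchFile(filePath: string): boolean {
  return /\.(bat|cmd)$/i.test(filePath);
}

// Converts every LF or CRLF line ending to CRLF.
// Matching an optional \r keeps existing CRLF from becoming \r\r\n.
function toCrlf(content: string): string {
  return content.replace(/\r?\n/g, '\r\n');
}

// Applies the conversion only to batch files; other files pass through.
function prepareContentForWrite(filePath: string, content: string): string {
  return isBatchFile(filePath) ? toCrlf(content) : content;
}
```

Because `toCrlf` is idempotent, running it on content that already uses CRLF leaves the content unchanged.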
Remove wrapCommandForWindowsEncoding and forceUtf8Output parameter, relying solely on getCachedEncodingForBuffer for encoding detection. This simplifies the shell execution flow by removing the chcp 65001 command prefixing approach. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Rename outputEncoding to detectedEncoding for clarity
- Add isValidUtf8 helper using TextDecoder in fatal mode
- Restructure detection: UTF-8 → chardet → system encoding
- Update tests to use non-UTF-8 bytes for accurate testing
This prevents misclassifying UTF-8 output as legacy codepages on systems where the system encoding (e.g., GBK) could also decode those bytes. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
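The detection order above (UTF-8, then chardet, then system encoding) might be sketched as follows. Only `isValidUtf8` and the use of `TextDecoder` in fatal mode come from the commit message. The `pickEncoding` function, the injected `statisticalDetect` callback standing in for chardet, and the final `'utf-8'` fallback are assumptions.

```typescript
// A fatal-mode decoder throws on any invalid UTF-8 byte sequence.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Stand-in for a statistical detector such as chardet; returns null on failure.
type Detector = (bytes: Uint8Array) => string | null;

function pickEncoding(
  bytes: Uint8Array,
  statisticalDetect: Detector,
  systemEncoding: string | null,
): string {
  // 1. Strict UTF-8 validation wins if it passes.
  if (isValidUtf8(bytes)) return 'utf-8';
  // 2. Statistical detection, 3. system encoding, then an assumed last resort.
  return statisticalDetect(bytes) ?? systemEncoding ?? 'utf-8';
}
```

Checking UTF-8 first is cheap and unambiguous: valid multi-byte UTF-8 is rarely produced by accident, whereas a permissive legacy decoder such as GBK will accept many UTF-8 byte sequences and produce garbage.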
Add comprehensive test plan and scripts for verifying encoding detection on Windows systems with non-UTF-8 codepages (e.g., GBK/CP936). This provides manual test cases for output decoding, file round-trips, PTY behavior, and edge cases like large output with late CJK content. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Prefix PowerShell commands with [Console]::OutputEncoding=UTF8
- Re-detect encoding on full buffer after streaming completes
- Move system encoding fallback into detectEncodingFromBuffer
This ensures CJK and other non-ASCII characters are correctly decoded on Windows systems with non-UTF-8 system codepages (e.g., GBK/CP936). Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
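A minimal sketch of the PowerShell prefixing step. The prefix string is the one quoted in the PR description. The function name `addPowerShellUtf8Prefix`, its parameters, and the way the shell is recognized by executable name are assumptions.

```typescript
// Prefix string as quoted in the PR description.
const POWERSHELL_UTF8_PREFIX =
  '[Console]::OutputEncoding=[System.Text.Encoding]::UTF8;';

function addPowerShellUtf8Prefix(
  command: string,
  platform: string, // e.g. process.platform
  shell: string, // shell executable name or full path
): string {
  // Assumption: identify PowerShell by its executable name.
  const isPowerShell = /(^|[\\/])(powershell|pwsh)(\.exe)?$/i.test(shell);
  if (platform !== 'win32' || !isPowerShell) return command;
  return POWERSHELL_UTF8_PREFIX + command;
}
```

With the prefix in place, PowerShell emits UTF-8 regardless of the system code page, so the decoder on the receiving side can validate the output as UTF-8 directly.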
Reduce test count from 8 to 7, remove script dependencies, and use system-created test files instead. Tests now focus on real-world scenarios: GBK dir listings, file content, PowerShell UTF-8 prefix, and late CJK detection after ASCII output. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…e tests
Add integration-tests/windows-encoding.test.ts with automated tests for:
- ASCII shell output
- PowerShell CJK output via UTF-8 prefix
- GBK directory listings and file content
- UTF-8 file round-trip with CRLF preservation
- GBK file edit preserving encoding
Remove obsolete manual test scripts from test-shell/ and test-windows-encoding/. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Move system encoding fallback from detectEncodingFromBuffer into getCachedEncodingForBuffer for clearer responsibility. Remove unused WINDOWS_UTF8_CODE_PAGE export and inline the value. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Update JSDoc to make clear that detectEncodingFromBuffer only performs chardet statistical detection and returns null on failure. Callers like getCachedEncodingForBuffer are responsible for providing fallback logic. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
This removes the Windows-specific encoding integration tests that are no longer needed after consolidating the encoding utilities. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
📋 Review Summary
This PR addresses Windows encoding issues for non-ASCII output through a multi-layered approach: PowerShell UTF-8 prefix injection, UTF-8-first encoding detection strategy, and CRLF line ending enforcement for .bat/.cmd files. The implementation is well-structured with comprehensive test coverage, though there are a few areas that could benefit from refinement.
🔍 General Feedback
🎯 Specific Feedback
🟡 High
🟢 Medium
🔵 Low
✅ Highlights
This removes an unnecessary external reference from the codebase. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Code Coverage Summary
CLI Package - Full Text Report
Core Package - Full Text Report
For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
- Make CRLF conversion for .bat/.cmd files Windows-only
- Extract PowerShell UTF-8 prefix into reusable function
- Replace custom UTF-8 validation with Node.js built-in isUtf8()
This ensures .bat/.cmd files are only converted on Windows where cmd.exe actually requires CRLF, and reduces code duplication for shell encoding. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
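For reference, the Node.js built-in mentioned here is `isUtf8` from `node:buffer` (available since Node 18.14). The sketch below shows how it separates UTF-8 from GBK bytes. The wrapper name `looksLikeUtf8` and the sample bytes are illustrative assumptions, not code from the PR.

```typescript
import { isUtf8 } from 'node:buffer';

// Thin wrapper (hypothetical name) around the built-in validator.
function looksLikeUtf8(output: Uint8Array): boolean {
  return isUtf8(output);
}

// "你好" encoded as UTF-8 (6 bytes).
const utf8Bytes = new Uint8Array([0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd]);
// "你好" encoded as GBK (4 bytes); 0xC4 is followed by a non-continuation byte,
// so the sequence is not valid UTF-8.
const gbkBytes = new Uint8Array([0xc4, 0xe3, 0xba, 0xc3]);

const utf8Result = looksLikeUtf8(utf8Bytes);
const gbkResult = looksLikeUtf8(gbkBytes);
```

Using the built-in avoids allocating a decoded string just to discover whether the bytes were valid.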
Refactor the vi.mock for 'os' to use a simpler direct mock object instead of the importOriginal pattern, making the test setup more concise. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Remove hardcoded REPLAY_TERMINAL_COLS/ROWS/SCROLLBACK constants
- Pass actual terminal dimensions to replayTerminalOutput()
- Increase scrollback buffer to 10000 for better output capture
This ensures terminal replay uses the actual terminal size instead of fixed dimensions, improving output accuracy. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Add needsUtf8Bom() to detect when UTF-8 BOM is needed based on file extension and system code page
- PowerShell 5.1 on non-UTF-8 Windows systems (e.g. GBK) requires BOM to read scripts correctly
- Remove default UTF8 encoding; undefined now triggers auto-detection
- Add tests for needsUtf8Bom() covering Windows/non-Windows scenarios
This ensures PowerShell scripts are written with UTF-8 BOM on systems that need it, fixing character encoding issues for non-ASCII content. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
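One way the BOM decision could be expressed is sketched below. The commit message confirms only the function name `needsUtf8Bom` and that it depends on file extension and system code page. The parameter list, the restriction to `.ps1`, and the `withBomIfNeeded` helper are assumptions.

```typescript
const UTF8_BOM = '\uFEFF';

function needsUtf8Bom(
  filePath: string,
  platform: string, // e.g. process.platform
  systemCodePage: number | null, // null when the code page is unknown
): boolean {
  // Assumption: only .ps1 scripts are treated as PowerShell here.
  const isPowerShellScript = /\.ps1$/i.test(filePath);
  return (
    platform === 'win32' &&
    isPowerShellScript &&
    systemCodePage !== null &&
    systemCodePage !== 65001
  );
}

// Hypothetical helper: prepend the BOM once, only when it is required.
function withBomIfNeeded(
  filePath: string,
  content: string,
  platform: string,
  systemCodePage: number | null,
): string {
  if (!needsUtf8Bom(filePath, platform, systemCodePage)) return content;
  return content.startsWith(UTF8_BOM) ? content : UTF8_BOM + content;
}
```

Without a BOM, PowerShell 5.1 reads a script using the system code page, so UTF-8 encoded CJK string literals are misread on a GBK system.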
Add documentation for encoding detection, default encoding settings, CRLF handling for batch files, and UTF-8 BOM for PowerShell scripts on Windows. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
TLDR
Fixes Windows encoding issues where shell output containing CJK and other non-ASCII characters was garbled. The solution forces UTF-8 output for PowerShell, improves encoding detection (UTF-8 first), and ensures CRLF line endings for .bat/.cmd files.
Dive Deeper
Problem
On Windows systems with non-UTF-8 codepages (e.g., GBK/CP936 in Chinese locales), shell output containing CJK characters was corrupted. Additionally, .bat/.cmd files written with LF line endings could break multi-line constructs, labels, and goto statements.
Solution
This PR implements a multi-layered fix:
- PowerShell UTF-8 prefix: Automatically prepend [Console]::OutputEncoding=[System.Text.Encoding]::UTF8; to PowerShell commands on Windows, ensuring CJK and other non-ASCII output is emitted as UTF-8 regardless of system codepage.
- UTF-8-first encoding detection: Changed getCachedEncodingForBuffer to try UTF-8 validation first before falling back to chardet statistical detection or system encoding. This handles the common case where modern tools output UTF-8 even on non-UTF-8 systems.
- CRLF for .bat/.cmd files: The file system service now automatically converts LF to CRLF when writing .bat or .cmd files, ensuring they work correctly with cmd.exe.
- Fixed codepage 936 mapping: Changed from gb2312 to gbk for Windows codepage 936, as GBK is the correct encoding name.
Files Changed
- packages/core/src/utils/systemEncoding.ts - UTF-8-first detection strategy
- packages/core/src/services/shellExecutionService.ts - PowerShell UTF-8 prefix
- packages/core/src/services/fileSystemService.ts - CRLF conversion for .bat/.cmd
Reviewer Test Plan
On Windows (non-UTF-8 codepage):
    echo 你好世界
On any platform:
    cd packages/core && npx vitest run src/utils/systemEncoding.test.ts src/services/shellExecutionService.test.ts src/services/fileSystemService.test.ts
Testing Matrix
Linked issues / bugs
Fixes #2129
Fixes #1950
🤖 Generated with Qwen Code