8000 Fix:read file encoding by aheizi · Pull Request #3554 · RooCodeInc/Roo-Code · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix:read file encoding #3554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

aheizi
Copy link
@aheizi aheizi commented May 13, 2025

Related GitHub Issue

#3555

Description

Currently, roo-code reads files by default according to utf-8. When the file encoding is GBK or others, it will cause garbled text problems

Test Procedure

manual testing

  1. Set VSCode's default encoding to GBK
  2. Let roo-code read this file and edit this file

Type of Change

  • 🐛 Bug Fix: Non-breaking change that fixes an issue.
  • New Feature: Non-breaking change that adds functionality.
  • 💥 Breaking Change: Fix or feature that would cause existing functionality to not work as expected.
  • ♻️ Refactor: Code change that neither fixes a bug nor adds a feature.
  • 💅 Style: Changes that do not affect the meaning of the code (white-space, formatting, etc.).
  • 📚 Documentation: Updates to documentation files.
  • ⚙️ Build/CI: Changes to the build process or CI configuration.
  • 🧹 Chore: Other changes that don't modify src or test files.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Code Quality:
    • My code adheres to the project's style guidelines.
    • There are no new linting errors or warnings (npm run lint).
    • All debug code (e.g., console.log) has been removed.
  • Testing:
    • New and/or updated tests have been added to cover my changes.
    • All tests pass locally (npm test).
    • The application builds successfully with my changes.
  • Branch Hygiene: My branch is up-to-date (rebased) with the main branch.
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Changeset: A changeset has been created using npm run changeset if this PR includes user-facing changes or dependency updates.
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

before:
image

image

after:

image image

Documentation Updates

Does this PR necessitate updates to user-facing documentation?

  • No documentation updates are required.
  • Yes, documentation updates are required. (Please describe what needs to be updated or link to a PR in the docs repository).

Additional Notes


Important

Introduces readFileSmart to handle various file encodings, replacing default UTF-8 reading in key tools, with tests and new dependencies added.

  • Behavior:
    • Introduces readFileSmart in readFileWithEncoding.ts to detect and decode file encodings, prioritizing UTF-8 and Chinese encodings.
    • Replaces fs.readFile with readFileSmart in applyDiffTool.ts, insertContentTool.ts, searchAndReplaceTool.ts, and DiffViewProvider.ts to handle non-UTF-8 encodings.
  • Tests:
    • Adds tests for readFileSmart, scoreText, getCandidateEncodings, and tryDecodeBuffer in readFileWithEncoding.test.ts.
  • Dependencies:
    • Adds chardet and iconv-lite to package.json for encoding detection and conversion.

This description was created by Ellipsis for e1df5c3. You can customize this summary. It will automatically update as commits are pushed.

Copy link
changeset-bot bot commented May 13, 2025

⚠️ No Changeset found

Latest commit: da3e39a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@hannesrudolph hannesrudolph moved this from New to PR [Draft/WIP] in Roo Code Roadmap May 14, 2025
@aheizi aheizi marked this pull request as ready for review May 14, 2025 03:11
@aheizi aheizi requested review from mrubens and cte as code owners May 14, 2025 03:11
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels May 14, 2025
@aheizi aheizi marked this pull request as draft May 14, 2025 03:43
@aheizi aheizi marked this pull request as ready for review May 16, 2025 15:51
@hannesrudolph hannesrudolph moved this from New to PR [Pre Approval Review] in Roo Code Roadmap May 20, 2025
@hannesrudolph hannesrudolph moved this from PR [Needs Review] to TEMP in Roo Code Roadmap May 26, 2025
@daniel-lxs daniel-lxs moved this from TEMP to PR [Needs Review] in Roo Code Roadmap May 27, 2025
Copy link
Collaborator
@daniel-lxs daniel-lxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @aheizi, Thank you for your contribution. I apologize we took so long to review your PR.

Looking at the whole flow, it seems like we're doing encoding detection twice - once with chardet and then again with our custom logic. Could we simplify this to just use chardet's result?

Thank you again for your contribution and patience, I'm looking forward to getting this PR ready for review.

/**
* Common text file extensions that should always be treated as text
*/
const alwaysTextExtensions = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this alwaysTextExtensions array is also defined in extract-text.ts but with a different format (dots vs no dots). Should we maybe centralize this list somewhere to avoid duplication?

bestEncoding = enc
}
} catch {
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe log these errors for debugging? Silent failures could make it hard to troubleshoot encoding issues later.


for (const enc of encodings) {
try {
const text = enc === "utf-8" || enc === "utf8" ? buffer.toString("utf8") : iconv.decode(buffer, enc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For large files, wouldn't decoding the entire buffer multiple times be slow? Have you considered just trusting chardet's detection and only falling back if that fails?

const shouldTryAll = alwaysTextExtensions.includes(ext)
const { text: bestText, score } = tryDecodeBuffer(buffer, encodings)

if (shouldTryAll || score > 0.05) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you arrive at the 0.05 threshold? Have you tested this with different types of files to see if this value works well across various scenarios?

I'm curious about this custom scoring system, would just trusting chardet's detection be enough in this case?

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap May 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working PR - Needs Preliminary Review size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Status: PR [Draft / In Progress]
Development

Successfully merging this pull request may close these issues.

3 participants
0