-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Fix:read file encoding #3554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix:read file encoding #3554
Conversation
|
# Conflicts: # src/integrations/editor/DiffViewProvider.ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @aheizi, Thank you for your contribution. I apologize we took so long to review your PR.
Looking at the whole flow, it seems like we're doing encoding detection twice - once with chardet and then again with our custom logic. Could we simplify this to just use chardet's result?
Thank you again for your contribution and patience, I'm looking forward to getting this PR ready for review.
/** | ||
* Common text file extensions that should always be treated as text | ||
*/ | ||
const alwaysTextExtensions = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this alwaysTextExtensions
array is also defined in extract-text.ts
but with a different format (dots vs no dots). Should we maybe centralize this list somewhere to avoid duplication?
bestEncoding = enc | ||
} | ||
} catch { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe log these errors for debugging? Silent failures could make it hard to troubleshoot encoding issues later.
|
||
for (const enc of encodings) { | ||
try { | ||
const text = enc === "utf-8" || enc === "utf8" ? buffer.toString("utf8") : iconv.decode(buffer, enc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For large files, wouldn't decoding the entire buffer multiple times be slow? Have you considered just trusting chardet's detection and only falling back if that fails?
const shouldTryAll = alwaysTextExtensions.includes(ext) | ||
const { text: bestText, score } = tryDecodeBuffer(buffer, encodings) | ||
|
||
if (shouldTryAll || score > 0.05) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you arrive at the 0.05 threshold? Have you tested this with different types of files to see if this value works well across various scenarios?
I'm curious about this custom scoring system, would just trusting chardet's detection be enough in this case?
Related GitHub Issue
#3555
Description
Currently, roo-code reads files by default according to utf-8. When the file encoding is GBK or others, it will cause garbled text problems
Test Procedure
manual testing
Type of Change
src
or test files.Pre-Submission Checklist
npm run lint
).console.log
) has been removed.npm test
).main
branch.npm run changeset
if this PR includes user-facing changes or dependency updates.Screenshots / Videos
before:

after:
Documentation Updates
Does this PR necessitate updates to user-facing documentation?
Additional Notes
Important
Introduces
readFileSmart
to handle various file encodings, replacing default UTF-8 reading in key tools, with tests and new dependencies added.readFileSmart
inreadFileWithEncoding.ts
to detect and decode file encodings, prioritizing UTF-8 and Chinese encodings.fs.readFile
withreadFileSmart
inapplyDiffTool.ts
,insertContentTool.ts
,searchAndReplaceTool.ts
, andDiffViewProvider.ts
to handle non-UTF-8 encodings.readFileSmart
,scoreText
,getCandidateEncodings
, andtryDecodeBuffer
inreadFileWithEncoding.test.ts
.chardet
andiconv-lite
topackage.json
for encoding detection and conversion.This description was created by
for e1df5c3. You can customize this summary. It will automatically update as commits are pushed.