-
Notifications
You must be signed in to change notification settings - Fork 37.4k
fix: Make TxidFromString() respect string_view length #30436
New issue
8000Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Make TxidFromString() respect string_view length #30436
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments
9b6c6c5
to
fcec222
Compare
fcec222
to
3d0aec1
Compare
735865a
to
d20fc9c
Compare
d20fc9c
to
c3a9c71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK c3a9c71 🐞
Show signature
Signature:
untrusted comment: signature from minisign secret key on empty file; verify via: minisign -Vm "${path_to_any_empty_file}" -P RWTRmVTMeKV5noAMqVlsMugDDCyyTSbA3Re5AkUrhvLVln0tSaFWglOw -x "${path_to_this_whole_four_line_signature_blob}"
RUTRmVTMeKV5npGrKx1nqXCw5zeVHdtdYURB/KlyA/LMFgpNCs+SkW9a8N95d+U4AP1RJMi+krxU1A3Yux4bpwZNLvVBKy0wLgM=
trusted comment: ACK c3a9c71c7077324bf0aa40f834f7537a14619340 🐞
0oe/WjDO2mSwVc5Blutf21rEvpYjnqCdQdO76x2UhqHObkUZghAnaWmbpcKtyptM6F8bw8dIl8IU3+TJAkcnBg==
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK c3a9c71
c3a9c71
to
4923d90
Compare
4923d90
to
c3a9c71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK c3a9c71
I think the changes are all good, but the PR and the most important commit "fix: Make TxidFromString() respect string length" (c3a9c71) are lacking a good description that actually says what the bug is.
I think they should say that currently TxidFromString
takes a string_view parameter, but internally it is calling the uint256S
function which expects a nul-terminated string. If TxidFromString
is called with a string that not contain \0
, the implementation will read the string and potentially uninitialized memory after it until it finds any character which is not a hex digit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK c3a9c71,
are lacking a good description that actually says what the bug is.
I agree, I spent quite a bit of time getting to the bottom of this and more documentation would have sped that up
but internally it is calling the uint256S function which expects a nul-terminated string
Maybe I'm being too pedantic (or wrong), but I'm not sure this is 100% correct? I don't think the uint256S(std::string_view str)
str
needs to be null-terminated. Rather, I think the problem is that by removing the string length data in TxidFromstring
(by passing std::string_view::data
instead of std::string_view
), uint256S
has to construct a std::string_view
from the const char*
without length information, which is dangerous (as you describe) when the initial string_view
was not null-terminated, e.g. because it is a substring?
For example, this test (with a non-null terminated string) fails before c3a9c71 and succeeds after:
//txid "0x74d681e0e03bafa802c8aa084379aa98d9fcd632ddc2ed9782b586ec87451f20"
const char arr[66] = {
'7', '4', 'd', '6', '8', '1', 'e', '0', 'e', '0', '3', 'b', 'a', 'f', 'a', '8',
'0', '2', 'c', '8', 'a', 'a', '0', '8', '4', '3', '7', '9', 'a', 'a', '9', '8',
'd', '9', 'f', 'c', 'd', '6', '3', '2', 'd', 'd', 'c', '2', 'e', 'd', '9', '7',
'8', '2', 'b', '5', '8', '6', 'e', 'c', '8', '7', '4', '5', '1', 'f', '2', '0',
'a', 'b' // extra 2 characters
};
const char* ptr = arr; // not null-terminated
std::string_view view{ptr, 64};
BOOST_CHECK_EQUAL(TxidFromString(view).ToString(), view);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #30436 (review)
but internally it is calling the uint256S function which expects a nul-terminated string
Maybe I'm being too pedantic (or wrong), but I'm not sure this is 100% correct? I don't think the
uint256S(std::string_view str)
str
needs to be null-terminated.
Yes, both explanations are saying the same thing but mine is unnecessarily confusing. The uint256S
function "expects" a nul-terminated string because it accepts a bare char*
argument, and in C bare char*
strings are almost always nul-terminated. But it does not "need" a nul-terminated string. The string can be terminated with any character other than 0123456789AaBbCcDdEeFf
, as mentioned in the next sentence.
I agree as well. Thanks for brining it up! This should be fixed before merge, other than that, I'd say it is ready. |
Side note: we can use bugprone-suspicious-stringview-data-usage (from unreleased clang-19) to catch this going forward, when we have clang-19. I tried running clang-tidy-19 locally from master branch for longer than I probably should have, and eventually gave up. As a poor man's alternative, I looked for other stringview data misusage based on some rough grepping, and couldn't find anything, but that was definitely not an exhaustive search either. |
c3a9c71
to
649c88d
Compare
Thanks for the feedback! Helped me realize I was being too terse. (I think it was a mistake alone just to not have (Edit: where/were typo) Only modified commit message: --- c3a9c71c7077324bf0aa40f834f7537a14619340 2024-07-19 22:42:56.019823786 +0200
+++ 788fe9cc9ab979c5e14f544ae05dc46436892b81 2024-07-19 22:48:26.785770275 +0200
@@ -1,3 +1,5 @@
-fix: Make TxidFromString() respect string length
+fix: Make TxidFromString() respect string_view length
-Appears to have been a fully dormant bug. Introduced since inception of TxidFromString(), discussed in https://github.com/bitcoin/bitcoin/pull/28922#discussion_r1404437378.
+Prior to this, passing in string_view::data() meant uint256S() would only receive the a naked char* pointer and scan past the string_view::length() until it found a null terminator.
+
+Appears to have been a fully dormant bug as callers were either passing a string literal or std::string directly to TxidFromFromString(), meaning null terminator always existed at pointer[length()]. Bug existed since original merge of TxidFromString(), discussed in https://github.com/bitcoin/bitcoin/pull/28922#discussion_r1404437378. |
Prior to this, passing string_view::data() into uint256S() meant the latter would only receive the a naked char* pointer and potentially scan past the string_view::length() until it found a null terminator (or some other non-hex character). Appears to have been a fully dormant bug as callers were either passing a string literal or std::string directly to TxidFromFromString(), meaning null terminator always existed at pointer[length()]. Bug existed since original merge of TxidFromString(), discussed in bitcoin#28922 (comment).
bfcc016
to
09ce350
Compare
Credit @ryanofsky as author of Use
As I was changing Hope this strikes a good balance. Thank you again for your patience and hopefully my intuition in doing async review online has been refined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 09ce350
Note: adding co-authorship for the commits that reviewers have contributed to is a good way to make sure all participants are motivated to continue doing it.
re-ACK 09ce350 🕓 Show signatureSignature:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 09ce350. I think the current code changes are about as small as you could make to fix the bug without introducing a string copy, and the surrounding test improvements are all very nice and welcome.
It seems like may be some suggestions from paplorinc that could be responded to, but they are pretty minor. So this looks ready for merge.
return rv; | ||
} | ||
inline uint160 uint160S(const std::string& str) | ||
inline uint160 uint160S(std::string_view str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #30436 (comment)
it's not very important in this particular method, but
inline
+const
could help with the actual inlining - it's also clearer to the reader to understand what happens when mutating the parameter, since after inlining it should ideally avoid the copying (which is easy to do for a const). At least this was my thinking for recommending it, let me know if I'm mistaken.
Responded in the other thread #30436 (comment)
BOOST_CHECK_EQUAL_COLLECTIONS(R1L.begin(), R1L.end(), R1Array, R1Array + R1L.size()); | ||
BOOST_CHECK_EQUAL_COLLECTIONS(TmpL.begin(), TmpL.end(), R1Array, R1Array + TmpL.size()); | ||
BOOST_CHECK_EQUAL_COLLECTIONS(R2L.begin(), R2L.end(), R2Array, R2Array + R2L.size()); | ||
BOOST_CHECK_EQUAL_COLLECTIONS(ZeroL.begin(), ZeroL.end(), ZeroArray, ZeroArray + ZeroL.size()); | ||
BOOST_CHECK_EQUAL_COLLECTIONS(OneL.begin(), OneL.end(), OneArray, OneArray + OneL.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit in f11f816: Seems confusing to use the left size to calculate the right end. Seems easier to just use std::end(right)
, if it compiles?
Edit: Or use std::array
and the begin()
and end()
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally did that for the +32 case, but wasn't working for the +20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree it is bad form to use variable from the left side of the comparison in the right side.
Tested std::end(R1Array)
but it would be off by 1 thanks to the null-terminator in the string, resulting in a size/distance-mismatch between the collections/iterators being tested.
Follow-up fixing this issue: #30526
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
post-merge ACK 09ce350
@@ -79,6 +79,18 @@ const std::function<std::string(const char*)> G_TRANSLATION_FUN = nullptr; | |||
/** Random context to get unique temp data dirs. Separate from g_insecure_rand_ctx, which can be seeded from a const env var */ | |||
static FastRandomContext g_insecure_rand_ctx_temp_path; | |||
|
|||
std::ostream& operator<<(std::ostream& os, const arith_uint256& num) | |||
{ | |||
os << ArithToUint256(num).ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the ArithToUint256
conversion necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I missed the base_uint::ToString()
method. Not as familiar with arith_uint256
as I am with uint256
yet.
Should probably also document endianness of arith_uint256
strings to mirror the added comments to uint256
in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up fixing this issue: #30526
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to comparison operator giving most significance to the least significant bytes. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to comparison operator giving most significance to the least significant bytes. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to comparison operator giving most significance to the least significant bytes. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to them being reversed by GetHex() etc, and base_blob::Compare giving most significance to the beginning of the internal array. They are closer to "big-endian", but this commit tries to be even more precise than that. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to them being reversed by GetHex() etc, and base_blob::Compare() giving most significance to the beginning of the internal array. They are closer to "big-endian", but this commit tries to be even more precise than that. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to them being reversed by GetHex() etc, and base_blob::Compare() giving most significance to the beginning of the internal array. They are closer to "big-endian", but this commit tries to be even more precise than that. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
Github-Pull: bitcoin#30436 Rebased-From: f0eeee2
Github-Pull: bitcoin#30436 Rebased-From: f11f816
Github-Pull: bitcoin#30436 Rebased-From: 2f5577d
Clarify that hex strings are parsed as little-endian. Github-Pull: bitcoin#30436 Rebased-From: 01e314c
Prior to this, passing string_view::data() into uint256S() meant the latter would only receive the a naked char* pointer and potentially scan past the string_view::length() until it found a null terminator (or some other non-hex character). Appears to have been a fully dormant bug as callers were either passing a string literal or std::string directly to TxidFromFromString(), meaning null terminator always existed at pointer[length()]. Bug existed since original merge of TxidFromString(), discussed in bitcoin#28922 (comment). Github-Pull: bitcoin#30436 Rebased-From: 09ce350
Follow-up to bitcoin#30436. uint256 string representation was wrongfully documented as little-endian due to them being reversed by GetHex() etc, and base_blob::Compare() giving most significance to the beginning of the internal array. They are closer to "big-endian", but this commit tries to be even more precise than that. uint256_tests.cpp - Avoid using variable from the left side of the condition in the right side. setup_common.cpp - Skip needless ArithToUint256-conversion.
73e3fa1 doc + test: Correct uint256 hex string endianness (Hodlinator) Pull request description: This PR is a follow-up to #30436. Only changes test-code and modifies/adds comments. Byte order of hex string representation was wrongfully documented as little-endian, but are in fact closer to "big-endian" (endianness is a memory-order concept rather than a numeric concept). `[arith_]uint256` both store their data in arrays with little-endian byte order (`arith_uint256` has host byte order within each `uint32_t` element). **uint256_tests.cpp** - Avoid using variable from the left side of the condition in the right side. Credits to @maflcko: #30436 (comment) **setup_common.cpp** - Skip needless ArithToUint256-conversion. Credits to @stickies-v: #30436 (comment) --- <details> <summary> ## Logical reasoning for endianness </summary> 1. Comparing an `arith_uint256` (`base_uint<256>`) to a `uint64_t` compares the beginning of the array, and verifies the remaining elements are zero. ```C++ template <unsigned int BITS> bool base_uint<BITS>::EqualTo(uint64_t b) const { for (int i = WIDTH - 1; i >= 2; i--) { if (pn[i]) return false; } if (pn[1] != (b >> 32)) return false; if (pn[0] != (b & 0xfffffffful)) return false; return true; } ``` ...that is consistent with little endian ordering of the array. 2. They have the same endianness (but `arith_*` has host-ordering of each `uint32_t` element): ```C++ arith_uint256 UintToArith256(const uint256 &a) { arith_uint256 b; for(int x=0; x<b.WIDTH; ++x) b.pn[x] = ReadLE32(a.begin() + x*4); return b; } ``` ### String conversions The reversal of order which happens when converting hex-strings <=> uint256 means strings are actually closer to big-endian, see the end of `base_blob<BITS>::SetHexDeprecated`: ```C++ unsigned char* p1 = m_data.data(); unsigned char* pend = p1 + WIDTH; while (digits > 0 && p1 < pend) { *p1 = ::HexDigit(trimmed[--digits]); if (digits > 0) { *p1 |= ((unsigned char)::HexDigit(trimmed[--digits]) << 4); p1++; } } ``` Same reversal here: ```C++ template <unsigned int BITS> std::string base_blob<BITS>::GetHex() const { uint8_t m_data_rev[WIDTH]; for (int i = 0; i < WIDTH; ++i) { m_data_rev[i] = m_data[WIDTH - 1 - i]; } return HexStr(m_data_rev); } ``` It now makes sense to me that `SetHexDeprecated`, upon receiving a shorter hex string that requires zero-padding, would pad as if the missing hex chars where towards the end of the little-endian byte array, as they are the most significant bytes. "Big-endian" string representation is also consistent with the case where `SetHexDeprecated` receives too many hex digits and discards the leftmost ones, as a form of integer narrowing takes place. ### How I got it wrong in #30436 Previously I used the less than (`<`) comparison to prove endianness, but for `uint256` it uses `memcmp` and thereby gives priority to the *lower* bytes at the beginning of the array. ```C++ constexpr int Compare(const base_blob& other) const { return std::memcmp(m_data.data(), other.m_data.data(), WIDTH); } ``` `arith_uint256` is different in that it begins by comparing the bytes from the end, as it is using little endian representation, where the bytes toward the end are more significant. ```C++ template <unsigned int BITS> int base_uint<BITS>::CompareTo(const base_uint<BITS>& b) const { for (int i = WIDTH - 1; i >= 0; i--) { if (pn[i] < b.pn[i]) return -1; if (pn[i] > b.pn[i]) return 1; } return 0; } ``` (The commit documents that `base_blob::Compare()` is doing lexicographic ordering unlike the `arith_*`-variant which is doing numeric ordering). </details> ACKs for top commit: paplorinc: ACK 73e3fa1 ryanofsky: Code review ACK 73e3fa1 Tree-SHA512: 121630c37ab01aa7f7097f10322ab37da3cbc0696a6bbdbf2bbd6db180dc5938c7ed91003aaa2df7cf4a4106f973f5118ba541b5e077cf3588aa641bbd528f4e
8756ccd scripted-diff: Replace ParseHex[<std::byte>]("str") -> "str"_hex[_u8] (Hodlinator) 9cb6873 refactor: Prepare for ParseHex -> ""_hex scripted-diff (Hodlinator) 50bc017 refactor: Hand-replace some ParseHex -> ""_hex (Hodlinator) 5b74a84 util: Add consteval ""_hex[_v][_u8] literals (l0rinc) dc5f6f6 test refactor: util_tests - parse_hex clean up (Hodlinator) 2b5e6ef refactor: Make XOnlyPubKey tolerate constexpr std::arrays (Hodlinator) 403d86f refactor: vector -> span in CCrypter (Hodlinator) bd0830b refactor: de-Hungarianize CCrypter (Hodlinator) d99c816 refactor: Improve CCrypter related lines (Hodlinator) 7e1d9a8 refactor: Enforce lowercase hex digits for consteval uint256 (Hodlinator) Pull request description: Motivation: * Validates and converts the hex string into bytes at compile time instead of at runtime like `ParseHex()`. * Eliminates runtime dependencies: #30377 (comment), #30048 (comment) * Has stricter requirements than `ParseHex()` (disallows whitespace and uppercase hex digits) and replaces it in a bunch of places. * Makes it possible to derive other compile time constants. * Minor: should shave off a few runtime CPU cycles. `""_hex` produces `std::array<std::byte>` as the momentum in the codebase is to use `std::byte` over `uint8_t`. Also makes `uint256` hex string constructor disallow uppercase hex digits. Discussed: #30560 (comment) Surprisingly does not change the size of the Guix **bitcoind** binary (on x86_64-linux-gnu) by 1 single byte. Spawned already merged PRs: #30436, #30482, #30532, #30560. ACKs for top commit: l0rinc: ACK 8756ccd stickies-v: Rebase re-ACK 8756ccd, no changes since a096215 ryanofsky: Code review ACK 8756ccd, just rebasing since last review and taking advantage of CScript constructors in #29369, also tweaking a code comment Tree-SHA512: 9b2011b7c37e0ef004c669f8601270a214b388916316458370f5902c79c2856790b1b2c7c123efa65decad04886ab5eff95644301e0d84358bb265cf1f8ec195
Problem
Prior to this,
TxidFromString()
was passingstring_view::data()
intouint256S()
which meant it would only receive the a nakedchar*
pointer and potentially scan past thestring_view::length()
until it found a null terminator (or some other non-hex character).Appears to have been a fully dormant bug as callers were either passing a string literal or
std::string
directly toTxidFromFromString()
, meaning a null terminator always existed atpointer[length()]
. Bug existed since original merge ofTxidFromString()
.Solution
Make
uint256S()
(andbase_blob::SetHex()
) take and operate onstd::string_view
instead ofconst char*
and haveTxidFromString()
pass that in.(PR was prompted by comment in #30377 (comment) (referring to #28922 (comment))).