Description
Bug Report
Calling this a bug may not be entirely accurate, to be fair, as this is intentional behavior. It's just odd behavior that's hard to work with.
Setup
CometBFT version
Originally experienced on 0.37.4; these problems would still occur on main
Have you tried the latest version: no, but based on the code it seems these problems would still occur
What happened?
I'll preface this by saying that, after digging more into TxSearch, its shortcomings seem to be known to people familiar with it, so I understand if it's intended to be used differently than we are using it. However, this wasn't very clear when first learning about TxSearch!
We operate some off-chain infrastructure that needs to index historical events from specific CosmWasm contracts. We do this with the TxSearch RPC endpoint against an archive node, looking for all matching events in a sliding window of blocks. The query looks something like: "tx.height >= XXXX AND tx.height <= YYYY AND wasm-hpl_hook_merkle::post_dispatch._contract_address = '<contract address>'". A live example query can be found here. We have client-side logic to request each page as needed, etc.
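For concreteness, the query above can be assembled client-side along these lines. This is a sketch under our assumptions: the helper names are ours, and the contract address is a placeholder; the /tx_search paging parameters (query, page, per_page) are the standard CometBFT RPC ones.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildTxSearchQuery builds a tx_search query string for a sliding
// window of blocks plus one event-attribute condition. The attribute
// key and contract address are caller-supplied.
func buildTxSearchQuery(fromHeight, toHeight int64, attrKey, contractAddr string) string {
	return fmt.Sprintf(
		"tx.height >= %d AND tx.height <= %d AND %s = '%s'",
		fromHeight, toHeight, attrKey, contractAddr,
	)
}

// txSearchURL wraps the query into a full /tx_search RPC URL with
// paging parameters. rpcBase is e.g. an archive node's RPC address.
func txSearchURL(rpcBase, query string, page, perPage int) string {
	v := url.Values{}
	v.Set("query", fmt.Sprintf("%q", query)) // the HTTP endpoint expects the query quoted
	v.Set("page", fmt.Sprintf("%d", page))
	v.Set("per_page", fmt.Sprintf("%d", perPage))
	return rpcBase + "/tx_search?" + v.Encode()
}

func main() {
	q := buildTxSearchQuery(100, 1100,
		"wasm-hpl_hook_merkle::post_dispatch._contract_address",
		"<contract address>") // placeholder
	fmt.Println(q)
	fmt.Println(txSearchURL("https://rpc.example.com", q, 1, 100))
}
```

Our client walks pages until total_count is exhausted, incrementing page between requests.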
We soon started noticing some weird cases popping up -- notably:
- Sometimes txs that certainly met the conditions in the query were not being returned by TxSearch
- Sometimes txs that certainly did not meet the conditions in the query were being returned by TxSearch
- Sometimes the total_count of txs would change for the exact same query
- Changing the order of the conditions within the query could yield wildly different results
After poking around, it seems the main culprit is the logic that this comment describes:
cometbft/internal/state/txindex/kv/kv.go, lines 390 to 391 at 0b6c8ab
The server would time out while processing the RPC and return whatever values it had collected so far. This isn't really a bug, since it's clearly intentional behavior, but it feels like one unless you know the quirks of the RPC
What did you expect to happen?
A few thoughts:
- I think this timeout behavior is reasonable, but at a minimum it should be better documented, and ideally it shouldn't happen silently. We wasted about a day digging into why things weren't working as expected. Most users like us will assume the RPC worked correctly if no error is returned
- It feels like there's room for improvement in the general logic of TxSearch to make it less expensive, though it looks like steps have been taken recently to improve efficiency
a. If I understand correctly, pagination really just minimizes the bytes served over the wire: each time a TxSearch RPC is made for a particular page, the server redoes the work of finding all matching txs across all pages, and none of that work is cached for future page queries
b. It feels odd that if I'm e.g. just hoping to query the txs in a 1000-block range that match a condition, the logic still reads every tx in the DB that matches the query, and only after all those DB reads does it filter out the txs outside the desired block range. The Ethereum approach of using per-block bloom filters for events allows for more efficient querying of smaller block ranges. Because we require archive nodes to be able to index old txs, the current cometbft logic of reading the DB for every single matching tx outside the block range ends up being computationally expensive
c. It also feels odd that if there are two non-height conditions in a query, then when applying the second condition, instead of starting from the txs that matched the first condition, the indexer iterates through the DB directly to find all txs matching just the second condition, and only then takes the intersection of the two match sets. I get that this is a consequence of how the DB indexing is set up, but again it feels pretty suboptimal
- It's not clear to me what the best practice for indexing is in the Cosmos ecosystem. Any input would be very helpful! Should we be polling block by block instead of using TxSearch? Or relying on existing indexing infrastructure used by e.g. explorers? Or setting a super high timeout on TxSearch? Ideally we'd like not to need to operate a node with special timeouts configured, and to just point at an archive node.
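The per-condition matching described in point c above boils down to the shape below. This is a simplified model with illustrative names, not CometBFT's actual types: each condition is resolved by its own full scan of the index, and the results are only intersected afterwards.

```go
package main

import "fmt"

// intersect returns the tx hashes present in both match sets. In the
// real indexer, a and b are each produced by an independent index scan,
// regardless of how small the other set already is.
func intersect(a, b map[string]struct{}) map[string]struct{} {
	out := make(map[string]struct{})
	for h := range a {
		if _, ok := b[h]; ok {
			out[h] = struct{}{}
		}
	}
	return out
}

func main() {
	// Hypothetical hashes matching condition 1 and condition 2,
	// each found by its own scan.
	cond1 := map[string]struct{}{"tx1": {}, "tx2": {}, "tx3": {}}
	cond2 := map[string]struct{}{"tx2": {}, "tx3": {}, "tx4": {}}
	both := intersect(cond1, cond2)
	fmt.Println(len(both)) // tx2 and tx3 match both conditions
}
```

A cheaper alternative would be to probe the index only for the txs already matched by the first condition, rather than scanning for the second condition from scratch; whether the current key layout permits that is a separate question.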
How to reproduce it
Run a chain with many txs and a short indexer timeout period, then observe the outputs