TxSearch silently does not honor a query if a timeout occurs · Issue #2101 · cometbft/cometbft · GitHub
TxSearch silently does not honor a query if a timeout occurs #2101
Open
@tkporter

Bug Report

Calling this a bug may not be entirely accurate, to be fair, since this is intentional behavior. It's just odd behavior that's hard to work with.

Setup

CometBFT version
Originally experienced on 0.37.4; however, these problems would still occur on main.

Have you tried the latest version: no, but based on the code it seems these problems would still occur

What happened?

I'll preface this by saying that after digging more into TxSearch, it seems like its shortcomings are known to people familiar with it, so I understand if it's intended to be used in a different way than we are using it. However, this wasn't very clear when first learning about TxSearch!

We operate some off-chain infrastructure that needs to index historical events from specific CosmWasm contracts. We use the TxSearch RPC endpoint against an archive node to do this by looking for all events in a sliding window of blocks. The query is something like: "tx.height >= XXXX AND tx.height <= YYYY AND wasm-hpl_hook_merkle::post_dispatch._contract_address = '<contract address>'". A live example query can be found here. We have client-side logic to request each page as needed, etc.
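To make the query shape concrete, here is a minimal sketch of how a client might assemble the query string for one window. The function name `buildWindowQuery` and the example heights are illustrative assumptions; the sketch only builds the string and does not talk to a node.

```go
package main

import "fmt"

// buildWindowQuery assembles a TxSearch-style query string for one sliding
// window of blocks. The name and signature are hypothetical; eventKey and
// contract are supplied by the caller and are not validated or escaped here.
func buildWindowQuery(minHeight, maxHeight int64, eventKey, contract string) string {
	return fmt.Sprintf("tx.height >= %d AND tx.height <= %d AND %s = '%s'",
		minHeight, maxHeight, eventKey, contract)
}

func main() {
	// Example window; the heights are made up for illustration.
	q := buildWindowQuery(1000, 1999,
		"wasm-hpl_hook_merkle::post_dispatch._contract_address",
		"<contract address>")
	fmt.Println(q)
}
```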

We soon started noticing some weird cases popping up -- notably:

  1. Sometimes txs that certainly met the conditions in the query were not being returned by TxSearch
  2. Sometimes txs that certainly did not meet the conditions in the query were being returned by TxSearch
  3. Sometimes the total_count of txs would change for the exact same query
  4. Changing the order of the conditions within the query could result in wildly different results
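Two of these symptoms can be caught on the client side. The sketch below flags a `total_count` that drifts between pages of the same query and any returned tx whose height falls outside the requested window. The `Tx` and `Page` types are hypothetical, trimmed-down stand-ins for the RPC response, and a check like this cannot detect txs that are silently missing.

```go
package main

import "fmt"

// Tx is a hypothetical, trimmed-down view of one TxSearch result.
type Tx struct {
	Hash   string
	Height int64
}

// Page is a hypothetical view of one page of a TxSearch response.
type Page struct {
	Txs        []Tx
	TotalCount int
}

// checkPages inspects pages fetched for the same query and reports
// total_count drift between pages and txs outside [minHeight, maxHeight].
func checkPages(pages []Page, minHeight, maxHeight int64) []string {
	var problems []string
	for i, p := range pages {
		if i > 0 && p.TotalCount != pages[0].TotalCount {
			problems = append(problems, fmt.Sprintf(
				"page %d: total_count %d differs from first page (%d)",
				i+1, p.TotalCount, pages[0].TotalCount))
		}
		for _, tx := range p.Txs {
			if tx.Height < minHeight || tx.Height > maxHeight {
				problems = append(problems, fmt.Sprintf(
					"page %d: tx %s at height %d is outside [%d, %d]",
					i+1, tx.Hash, tx.Height, minHeight, maxHeight))
			}
		}
	}
	return problems
}

func main() {
	// Made-up data: the second page reports a different total_count and
	// contains a tx outside the requested window.
	pages := []Page{
		{Txs: []Tx{{Hash: "A", Height: 10}}, TotalCount: 2},
		{Txs: []Tx{{Hash: "B", Height: 99}}, TotalCount: 3},
	}
	for _, problem := range checkPages(pages, 5, 20) {
		fmt.Println(problem)
	}
}
```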

After poking around, it seems the main culprit is the logic that this comment describes:

// Search will exit early and return any result fetched so far,
// when a message is received on the context chan.

The server would time out when processing the RPC and return whatever values it had collected so far, without any error. This isn't really a bug, because it seems like very intentional behavior, but it feels like a bug unless you know the quirks of the RPC.
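The pattern that comment describes can be illustrated with a small self-contained simulation. This is not the CometBFT implementation: `searchUntilCancelled` and `simulateTimeout` are hypothetical names, and the "timeout" is simulated by cancelling the context partway through the scan. The point is that the caller gets a partial result and a nil error, so it cannot tell the result is incomplete.

```go
package main

import (
	"context"
	"fmt"
)

// searchUntilCancelled scans heights in order and collects those accepted by
// match. If ctx is cancelled mid-scan, it returns what it has collected so far
// with a nil error, so the caller cannot distinguish it from a full result.
func searchUntilCancelled(ctx context.Context, heights []int64, match func(int64) bool) ([]int64, error) {
	var out []int64
	for _, h := range heights {
		select {
		case <-ctx.Done():
			return out, nil // partial result, no error
		default:
		}
		if match(h) {
			out = append(out, h)
		}
	}
	return out, nil
}

// simulateTimeout cancels the context while height 3 is being examined,
// standing in for an RPC timeout firing partway through the search.
func simulateTimeout() ([]int64, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	match := func(h int64) bool {
		if h == 3 {
			cancel()
		}
		return h%2 == 1 // odd heights "match" the query
	}
	return searchUntilCancelled(ctx, []int64{1, 2, 3, 4, 5, 6}, match)
}

func main() {
	got, err := simulateTimeout()
	// A complete scan would return [1 3 5]; height 5 is silently missing.
	fmt.Println(got, err) // prints "[1 3] <nil>"
}
```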

What did you expect to happen?

A few thoughts:

  1. I think this timeout behavior is reasonable, but at a minimum I think it should be better documented, and ideally it shouldn't happen silently. We wasted about a day digging into why things weren't working as expected. I think most users like us will assume the RPC worked correctly if no error is returned
  2. It feels like there's room for improvement on the general logic in TxSearch to not be so expensive, but it looks like steps have been made recently to make this more efficient
    a. If I understand correctly, the pagination really just helps to minimize bytes being served over the wire, but each time a TxSearch RPC is made for a particular page, work is done again and again by the server to find all txs across all pages, and there's no caching of this for future page queries
    b. It feels odd that if I'm e.g. just hoping to query the txs in a 1000 block range that match a condition, then the logic still ends up reading every tx in the DB that matches that condition, and only after all those DB reads does it filter out the txs not in the desired block range. The Ethereum approach of using per-block bloom filters for events allows for more efficient querying of smaller block ranges. Because we require archive nodes to be able to index old txs, the current logic in cometbft of reading the DB for every single matching tx outside of the block range ends up being computationally expensive
    c. It also feels odd that if there are two non-height conditions in a query, when applying the second condition, instead of starting with the txs that matched the original condition, it will iterate through the DB directly and find all txs that match just that second condition, and only then get the intersection of the txs matching the first condition and the second. I get that this is because of the way the DB indexing is set up, but again this feels pretty suboptimal
  3. It's not clear to me what the best practice for indexing is in the Cosmos ecosystem. Any input would be very helpful! Should we be polling block by block instead of using TxSearch? Or relying on existing indexing infrastructure used by e.g. explorers? Or setting a super high timeout on TxSearch? Ideally we'd like to not necessarily need to operate a node with special timeouts configured, and to just point to an archive node.
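The cost concern in 2c can be shown with a toy model. The sketch below is not CometBFT's indexer: `index` and `intersectScans` are hypothetical, and each condition simply maps to a list of tx IDs. It scans every condition independently and then intersects, so the number of entries read is the sum of all matches per condition, even when the first condition alone narrows the candidates to a handful.

```go
package main

import "fmt"

// index maps a condition string to the tx IDs stored under it. This is a toy
// stand-in for a key-value event index, not CometBFT's actual layout. IDs are
// assumed to be unique within each condition's list.
type index map[string][]int

// intersectScans scans each condition independently and then intersects the
// results. It returns the matching IDs (in the order of the first condition)
// and the number of index entries read along the way.
func intersectScans(idx index, conds []string) ([]int, int) {
	if len(conds) == 0 {
		return nil, 0
	}
	reads := 0
	seen := make(map[int]int) // tx ID -> number of conditions it matched
	for _, c := range conds {
		for _, id := range idx[c] {
			reads++
			seen[id]++
		}
	}
	var out []int
	for _, id := range idx[conds[0]] {
		if seen[id] == len(conds) {
			out = append(out, id)
		}
	}
	return out, reads
}

func main() {
	// Made-up data: the first condition matches 2 txs, the second matches 5.
	idx := index{
		"contract=X": {1, 2},
		"action=Y":   {2, 3, 4, 5, 6},
	}
	out, reads := intersectScans(idx, []string{"contract=X", "action=Y"})
	fmt.Println(out, reads) // prints "[2] 7"
}
```

Only one tx matches both conditions here, yet all seven entries are read; with an archive node's history, the second scan is where most of the work goes.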

How to reproduce it

Run a chain with many txs and a short RPC timeout period, then observe the TxSearch outputs
