Description
Bug Report
Calling this a bug may not be entirely accurate, to be fair, as this is intentional behavior. It's just odd behavior that's hard to work with.
Setup
CometBFT version
Originally experienced on 0.37.4; these problems would still occur on main
Have you tried the latest version: no, but based on the code it seems these problems would still occur
What happened?
I'll preface this by saying that, after digging more into TxSearch, its shortcomings seem to be known to people familiar with it, so I understand if it's intended to be used differently than we are using it. However, this wasn't very clear when first learning about TxSearch!
We operate some off-chain infrastructure that needs to index historical events from specific CosmWasm contracts. We do this with the TxSearch RPC endpoint against an archive node, looking for all matching events in a sliding window of blocks. The query looks something like: "tx.height >= XXXX AND tx.height <= YYYY AND wasm-hpl_hook_merkle::post_dispatch._contract_address = '<contract address>'". A live example query can be found here. We have client-side logic to request each page as needed, etc.
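For concreteness, the query above can be assembled client-side along these lines. This is a sketch under our assumptions: the helper names are ours, and the contract address is a placeholder; the /tx_search paging parameters (query, page, per_page) are the standard CometBFT RPC ones.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildTxSearchQuery builds a tx_search query string for a sliding
// window of blocks plus one event-attribute condition. The attribute
// key and contract address are caller-supplied.
func buildTxSearchQuery(fromHeight, toHeight int64, attrKey, contractAddr string) string {
	return fmt.Sprintf(
		"tx.height >= %d AND tx.height <= %d AND %s = '%s'",
		fromHeight, toHeight, attrKey, contractAddr,
	)
}

// txSearchURL wraps the query into a full /tx_search RPC URL with
// paging parameters. rpcBase is e.g. an archive node's RPC address.
func txSearchURL(rpcBase, query string, page, perPage int) string {
	v := url.Values{}
	v.Set("query", fmt.Sprintf("%q", query)) // the HTTP endpoint expects the query quoted
	v.Set("page", fmt.Sprintf("%d", page))
	v.Set("per_page", fmt.Sprintf("%d", perPage))
	return rpcBase + "/tx_search?" + v.Encode()
}

func main() {
	q := buildTxSearchQuery(100, 1100,
		"wasm-hpl_hook_merkle::post_dispatch._contract_address",
		"<contract address>") // placeholder
	fmt.Println(q)
	fmt.Println(txSearchURL("https://rpc.example.com", q, 1, 100))
}
```

Our client walks pages until total_count is exhausted, incrementing page between requests.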
We soon started noticing some weird cases popping up -- notably:
- Sometimes txs that certainly met the conditions in the query were not being returned by TxSearch
- Sometimes txs that certainly did not meet the conditions in the query were being returned by TxSearch
- Sometimes the total_count of txs would change for the exact same query
- Changing the order of the conditions within the query could yield wildly different results
After poking around, it seems the main culprit is the logic that this comment describes:
cometbft/internal/state/txindex/kv/kv.go, lines 390 to 391 at 0b6c8ab
The server would time out while processing the RPC and return whatever values it had collected so far. This isn't really a bug, since it's clearly intentional behavior, but it feels like one unless you know the quirks of the RPC
What did you expect to happen?
A few thoughts:
- I think this timeout behavior is reasonable, but at a minimum it should be better documented, and ideally it shouldn't happen silently. We wasted about a day digging into why things weren't working as expected. Most users like us will assume the RPC worked correctly if no error is returned
- It feels like there's room for improvement in the general logic of TxSearch to make it less expensive, though it looks like steps have been taken recently to improve efficiency
a. If I understand correctly, pagination really just minimizes the bytes served over the wire: each time a TxSearch RPC is made for a particular page, the server redoes the work of finding all matching txs across all pages, and none of that work is cached for future page queries
b. It feels odd that if I'm e.g. just hoping to query the txs in a 1000-block range that match a condition, the logic still reads every tx in the DB that matches the query, and only after all those DB reads does it filter out the txs outside the desired block range. The Ethereum approach of using per-block bloom filters for events allows for more efficient querying of smaller block ranges. Because we require archive nodes to be able to index old txs, the current cometbft logic of reading the DB for every single matching tx outside the block range ends up being computationally expensive
c. It also feels odd that if there are two non-height conditions in a query, then when applying the second condition, instead of starting from the txs that matched the first condition, the indexer iterates through the DB directly to find all txs matching just the second condition, and only then takes the intersection of the two match sets. I get that this is a consequence of how the DB indexing is set up, but again it feels pretty suboptimal
- It's not clear to me what the best practice for indexing is in the Cosmos ecosystem. Any input would be very helpful! Should we be polling block by block instead of using TxSearch? Or relying on existing indexing infrastructure used by e.g. explorers? Or setting a super high timeout on TxSearch? Ideally we'd like not to need to operate a node with special timeouts configured, and to just point at an archive node.
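The per-condition matching described in point c above boils down to the shape below. This is a simplified model with illustrative names, not CometBFT's actual types: each condition is resolved by its own full scan of the index, and the results are only intersected afterwards.

```go
package main

import "fmt"

// intersect returns the tx hashes present in both match sets. In the
// real indexer, a and b are each produced by an independent index scan,
// regardless of how small the other set already is.
func intersect(a, b map[string]struct{}) map[string]struct{} {
	out := make(map[string]struct{})
	for h := range a {
		if _, ok := b[h]; ok {
			out[h] = struct{}{}
		}
	}
	return out
}

func main() {
	// Hypothetical hashes matching condition 1 and condition 2,
	// each found by its own scan.
	cond1 := map[string]struct{}{"tx1": {}, "tx2": {}, "tx3": {}}
	cond2 := map[string]struct{}{"tx2": {}, "tx3": {}, "tx4": {}}
	both := intersect(cond1, cond2)
	fmt.Println(len(both)) // tx2 and tx3 match both conditions
}
```

A cheaper alternative would be to probe the index only for the txs already matched by the first condition, rather than scanning for the second condition from scratch; whether the current key layout permits that is a separate question.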
How to reproduce it
Run a chain with many txs and a short indexer timeout period, then observe the outputs