Description
Self Checks
- I have searched for existing issues, including closed ones.
- I confirm that I am using English to submit this report (Language Policy).
- Non-English title submissions will be closed directly (Language Policy).
- Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
RAGFlow image version
Other environment information
Actual behavior
In the retrieval function of the Dealer class, located in rag/nlp/search.py (starting at line 348), the page_size parameter is forcibly overridden whenever the doc_ids parameter is provided. This makes the function's parameter semantics inconsistent and can lead to unexpected behavior.
Specifically, in the following code snippet:
```python
if doc_ids:
    similarity_threshold = 0
    page_size = 30
```
When doc_ids is truthy (i.e., a non-empty list of document IDs is passed), the local page_size variable is unconditionally reassigned to 30, regardless of the value the caller originally passed in.
Problem Analysis:
Parameter Semantic Inconsistency: The retrieval function explicitly accepts a page_size parameter, which typically implies that the caller can control the number of results returned. However, in the doc_ids scenario, this forced override renders the caller's page_size value ineffective, leading to unclear parameter semantics and unpredictable function behavior.
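To make the effect concrete, here is a minimal self-contained sketch that reproduces the pattern; the function name retrieval_sketch, its reduced signature, and all values are illustrative, not the actual RAGFlow API:

```python
# Hypothetical reduction of the pattern (not the actual RAGFlow function):
# the caller's page_size is silently replaced when doc_ids is given.
def retrieval_sketch(page, page_size, doc_ids=None, similarity_threshold=0.2):
    if doc_ids:
        similarity_threshold = 0
        page_size = 30  # the caller's page_size is discarded here
    return similarity_threshold, page_size

# Without doc_ids the caller's value is honored; with doc_ids it is not.
print(retrieval_sketch(page=1, page_size=100))                    # (0.2, 100)
print(retrieval_sketch(page=1, page_size=100, doc_ids=["doc1"]))  # (0, 30)
```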
Potential Computational Waste: The calculation of idx (which selects chunk indices from the reranked results) on line 382:

```python
idx = np.argsort(sim * -1)[(page - 1) * page_size:page * page_size]
```
Because line 382 runs before the override on line 390, this slice uses the original page_size value passed by the caller. If the caller passes page_size=100, for example, the system selects and processes indices for 100 candidate chunks, while the subsequent loop's length check and the final truncation use the page_size that has been forcibly set to 30 (when doc_ids is truthy). The result is a mismatch: more chunks are processed up front than are ultimately returned, wasting computation on the chunks that are discarded.
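A toy illustration of this mismatch, using made-up numbers rather than RAGFlow's actual data structures:

```python
import numpy as np

# Toy illustration (hypothetical values, not RAGFlow code): index selection
# runs before the override, while truncation runs after it.
rng = np.random.default_rng(0)
sim = rng.random(1000)        # similarity scores for 1000 reranked chunks
page, page_size = 1, 100      # the caller asked for 100 chunks per page

# Line ~382: the slice uses the caller's page_size -> 100 candidates selected
idx = np.argsort(sim * -1)[(page - 1) * page_size:page * page_size]

# Line ~390 (doc_ids branch): page_size is overridden afterwards
page_size = 30

# The result loop's length check / final truncation uses the new value
returned = idx[:page_size]
print(f"{len(idx)} candidates processed, {len(returned)} chunks returned")
# -> "100 candidates processed, 30 chunks returned"
```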
Limited Flexibility: This hardcoded limitation restricts the ability to flexibly control the number of returned chunks when doc_ids are specified. Users might need fewer than 30 chunks (e.g., just a few representative snippets from specific documents) or more than 30 (e.g., all relevant chunks from a set of specified documents).
Expected behavior
To address these issues and improve the function's behavior, I propose two possible solutions:
Remove the forced page_size = 30 override:
The most straightforward solution is to remove line 390 (page_size = 30) from rag/nlp/search.py, so that the caller's page_size is always respected.
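With that line removed, the doc_ids branch quoted earlier would reduce to the following (a sketch; the surrounding function is unchanged):

```python
if doc_ids:
    similarity_threshold = 0
    # page_size = 30  <- removed so the caller's page_size is respected
```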
Steps to reproduce
...
Additional information
No response