8000 Bugfix/browsermgr session by fischerdr · Pull Request #1200 · unclecode/crawl4ai · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Bugfix/browsermgr session #1200

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

fischerdr
Copy link
@fischerdr fischerdr commented Jun 8, 2025

Summary
This pull request addresses a critical bug where the BrowserManager would improperly share a single browser context across multiple, distinct user sessions. This led to unpredictable behavior, most notably causing active sessions to fail when another session was terminated.
The fix enforces strict context isolation, ensuring that every new session is provisioned with its own dedicated browser context. This change guarantees session integrity and resolves the "Target page, context or browser has been closed" error observed in tests.
List of files changed and why
crawl4ai/browser_manager.py:
Refactored the get_page() method to create a new, dedicated BrowserContext whenever a new session_id is provided. This prevents context sharing between sessions.
Removed debugging logs that were added to trace the issue.
tests/browser/test_playwright_strategy.py:
Removed unused imports and corrected minor linter errors for code quality.
Replaced bare except: clauses with specific exception handling for more robust tests.
The test logic was slightly improved to more accurately reflect the state after a session is killed, though the primary fix was in the manager itself.
How Has This Been Tested?
The changes were verified by running the test_playwright_strategy.py test suite. Specifically, the test_playwright_session_management test, which previously failed reliably, now passes.
The test procedure is as follows:
Launch the test script: python tests/browser/test_playwright_strategy.py
The script initializes two separate sessions (session1 and session2), each with its own browser context.
It then kills session1.
Finally, it verifies that session2 remains fully functional, confirming that its context was not affected by the termination of session1.
Checklist:
[x] My code follows the style guidelines of this project
[x] I have performed a self-review of my own code
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[x] I have added/updated unit tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes

Summary by CodeRabbit

  • Chores
    • Introduced a code style configuration file for Python, specifying formatting rules and exclusions.
    • Updated the ignore list to exclude the .cursor/ directory from version control.
  • Refactor
    • Improved code readability and maintainability in browser management logic, including clearer session handling and consistent formatting.
  • Tests
    • Enhanced test clarity and logging for browser-related tests without altering test outcomes.

fischerdr added 2 commits June 7, 2025 15:25
…orts in browser_manager.py for consistency

- Introduced a new .flake8 file to enforce coding standards including max line length, complexity, and docstring conventions.
- Refactored import statements in browser_manager.py for better organization and consistency, ensuring all imports are grouped and sorted.
- Cleaned up whitespace and formatting issues throughout the file to adhere to style guidelines.
Copy link
coderabbitai bot commented Jun 8, 2025

Walkthrough

A new .flake8 configuration file was added to define Python linting rules, and .gitignore was updated to exclude the .cursor/ directory. The crawl4ai/browser_manager.py file received extensive formatting, import organization, and readability improvements, along with enhanced session handling logic in the get_page method. The Playwright strategy test suite was refactored for clarity, error logging, and code style consistency.

Changes

File(s) Change Summary
.flake8 Added new Flake8 configuration specifying linting rules, ignores, and per-file exceptions.
.gitignore Added .cursor/ directory to ignore list under a new comment header.
crawl4ai/browser_manager.py Refactored imports, improved formatting, and enhanced session handling logic in get_page; minor code cleanups.
tests/browser/test_playwright_strategy.py Refactored imports, improved error logging, and enhanced code style in test functions; updated test runner logic.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BrowserManager
    participant ManagedBrowser

    Client->>BrowserManager: get_page(crawlerRunConfig)
    alt session_id provided and found
        BrowserManager->>ManagedBrowser: get existing session context and page
        BrowserManager-->>Client: return session page
    else session_id provided but not found
        BrowserManager->
8000
;>ManagedBrowser: create new browser context and page
        BrowserManager->>BrowserManager: store new session in sessions dict
        BrowserManager-->>Client: return new session page
    else no session_id
        BrowserManager->>ManagedBrowser: use shared or config-based context
        BrowserManager-->>Client: return page
    end
Loading

Poem

In code we hop, with linting set anew,
The cursor now ignored, as all good bunnies do.
Sessions managed smartly, tests with clearer sight,
Formatting and logging, everything just right.
With every tweak and tidy, our warren grows strong—
🐇 Happy coding, friends, as we bounce along!

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/browser/test_playwright_strategy.py (1)

232-296: Consider refactoring to reduce local variable complexity.

The session management test function has many local variables which could impact readability. Consider breaking this into smaller helper functions or reducing variable scope where possible.

Example refactor approach:

async def test_playwright_session_management():
    """Test session management with Playwright browser."""
    logger.info("Testing session management with Playwright browser", tag="TEST")
    
    browser_config = BrowserConfig(headless=True)
    manager = BrowserManager(browser_config=browser_config, logger=logger)
    
    try:
        await manager.start()
        logger.info("Browser launched successfully", tag="TEST")
        
        # Setup and verify sessions
        session_results = await _setup_and_verify_sessions(manager)
        return session_results
    except Exception as e:
        logger.error(f"Test failed: {str(e)}", tag="TEST")
        try:
            await manager.close()
        except Exception as e:
            logger.error(f"Error closing browser: {str(e)}", tag="TEST")
        return False

async def _setup_and_verify_sessions(manager):
    # Extract session setup logic here
    pass
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 232-232: Too many local variables (17/15)

(R0914)

crawl4ai/browser_manager.py (2)

974-984: Consider simplifying the else clause structure.

The else clause after the return statement can be simplified by removing the else and de-indenting the code, as suggested by static analysis.

Apply this refactoring:

         if crawlerRunConfig.session_id in self.sessions:
             context, page, _ = self.sessions[crawlerRunConfig.session_id]
             # Update last-used timestamp
             self.sessions[crawlerRunConfig.session_id] = (context, page, time.time())
             return page, context

-        # If session does not exist, create a new context and page specifically for it.
-        else:
-            context = await self.create_browser_context(crawlerRunConfig)
-            await self.setup_context(context, crawlerRunConfig)
-            page = await context.new_page()
-            self.sessions[crawlerRunConfig.session_id] = (
-                context,
-                page,
-                time.time(),
-            )
-            return page, context
+        # If session does not exist, create a new context and page specifically for it.
+        context = await self.create_browser_context(crawlerRunConfig)
+        await self.setup_context(context, crawlerRunConfig)
+        page = await context.new_page()
+        self.sessions[crawlerRunConfig.session_id] = (
+            context,
+            page,
+            time.time(),
+        )
+        return page, context

866-880: Consider simplifying nested if statements.

The nested if statements can be combined into a single condition for better readability.

Apply this refactoring:

-        if crawlerRunConfig:
-            # Check if there is value for crawlerRunConfig.proxy_config set add that to context
-            if crawlerRunConfig.proxy_config:
+        if crawlerRunConfig and crawlerRunConfig.proxy_config:
+            # Add proxy config to context
                 proxy_settings = {
                     "server": crawlerRunConfig.proxy_config.server,
                 }
🧰 Tools
🪛 Ruff (0.11.9)

866-868: Use a single if statement instead of nested if statements

(SIM102)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4bb0cc and 3e9d6a2.

📒 Files selected for processing (4)
  • .flake8 (1 hunks)
  • .gitignore (1 hunks)
  • crawl4ai/browser_manager.py (26 hunks)
  • tests/browser/test_playwright_strategy.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/browser/test_playwright_strategy.py (7)
crawl4ai/browser_manager.py (6)
  • BrowserManager (520-1069)
  • close (1036-1069)
  • start (179-267)
  • start (581-625)
  • get_page (952-1009)
  • kill_session (1011-1023)
crawl4ai/async_configs.py (2)
  • BrowserConfig (323-598)
  • CrawlerRunConfig (671-1323)
crawl4ai/async_logger.py (2)
  • AsyncLogger (78-321)
  • LogLevel (12-27)
tests/browser/test_cdp_strategy.py (1)
  • run_tests (209-225)
tests/browser/test_browser_manager.py (1)
  • run_tests (170-187)
tests/browser/test_builtin_strategy.py (1)
  • run_tests (142-157)
tests/browser/test_profiles.py (1)
  • run_tests (158-173)
🪛 Pylint (3.3.7)
tests/browser/test_playwright_strategy.py

[error] 25-25: Possibly using variable 'BrowserConfig' before assignment

(E0606)


[error] 28-28: Possibly using variable 'BrowserManager' before assignment

(E0606)


[refactor] 232-232: Too many local variables (17/15)

(R0914)

crawl4ai/browser_manager.py

[refactor] 253-258: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 967-983: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

🪛 Ruff (0.11.9)
crawl4ai/browser_manager.py

866-868: Use a single if statement instead of nested if statements

(SIM102)

🔇 Additional comments (7)
.gitignore (1)

231-233: LGTM: Clean repository maintenance.

Good addition to exclude cursor editor artifacts from version control.

.flake8 (1)

1-31: LGTM: Well-configured linting rules.

The flake8 configuration establishes good code quality standards with:

  • Reasonable complexity and line length limits
  • Appropriate exclusions for Black formatter compatibility
  • Sensible per-file ignores for different file types

This will help maintain consistent code style across the project.

tests/browser/test_playwright_strategy.py (3)

13-17: LGTM: Improved import organization.

The conditional import structure and explicit LogLevel import improve code clarity and organization.


59-61: LGTM: Enhanced exception handling.

Replacing bare except clauses with explicit exception catching and error logging significantly improves debugging capabilities and code robustness.

Also applies to: 107-108, 168-170, 226-228, 292-294


302-307: Verify focused test execution is intentional.

The run_tests function is currently only executing the session management test with other tests commented out. Ensure this is intentional for this PR focused on session isolation.

If this is temporary for PR testing, consider adding a comment indicating when to re-enable the other tests.

crawl4ai/browser_manager.py (2)

964-984: LGTM: Critical session isolation fix implemented correctly.

This change successfully addresses the main bug by ensuring each session gets its own dedicated browser context and page. The explicit session handling logic prevents context sharing between sessions, which was causing the "Target page, context or browser has been closed" error.

Key improvements:

  • Clear separation of session-specific logic with comments
  • New sessions create dedicated contexts instead of sharing
  • Proper session storage with timestamps for TTL management

This implementation aligns perfectly with the PR objectives to enforce strict session isolation.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 967-983: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


1-21: LGTM: Improved import organization.

The import formatting and organization improvements enhance code readability and maintain consistency with the new flake8 configuration.

Merge remote-tracking branch 'origin/main' into bugfix/browsermgr_session
@fischerdr
Copy link
Author

Also there is a bug in the browser_profiler.py

Bug Report

Title: TypeError on crwl cdp and crwl browser start due to incorrect ManagedBrowser initialization.

Reported By: User
Date: 2024-06-09
Affected Version: latest

Description:
When attempting to launch a browser instance using the CLI commands crwl cdp or crwl browser start, the application crashes and raises an AttributeError: 'NoneType' object has no attribute 'browser_type'. This prevents users from using the CDP and built-in browser management features.

Steps to Reproduce:

  1. Open a terminal in the project's root directory.
  2. Run the command: crwl cdp
  3. Observe the traceback and error message in the console.

Expected Behavior:
The command should launch a standalone browser instance with the Chrome DevTools Protocol (CDP) enabled. It should print the CDP connection URL to the console and keep the browser running until the user quits.

Actual Behavior:
The command fails immediately after logging the initial launch parameters, throwing the following error: Error launching CDP browser: 'NoneType' object has no attribute 'browser_type'.

Root Cause Analysis:
The ManagedBrowser class constructor requires a BrowserConfig object for its browser_config parameter. The launch_standalone_browser and launch_builtin_browser methods in crawl4ai/browser_profiler.py were calling the constructor with multiple keyword arguments instead of a single BrowserConfig instance. This resulted in the browser_config attribute being None inside the ManagedBrowser instance, which led to the AttributeError when its properties were accessed.

Resolution:
The issue was resolved by modifying the launch_standalone_browser and launch_builtin_browser methods in crawl4ai/browser_profiler.py to first construct a BrowserConfig object and then pass that object to the ManagedBrowser constructor.

Affected Files:

  • crawl4ai/browser_profiler.py
--- a/crawl4ai/browser_profiler.py
+++ b/crawl4ai/browser_profiler.py
@@ -621,22 +621,22 @@
         self.logger.info(f"Debugging port: {debugging_port}", tag="CDP")
         self.logger.info(f"Headless mode: {headless}", tag="CDP")
         
-        # Create managed browser instance
-        managed_browser = ManagedBrowser(
+        # Create a BrowserConfig object to pass to ManagedBrowser
+        browser_cfg = BrowserConfig(
             browser_type=browser_type,
             user_data_dir=profile_path,
             headless=headless,
-            logger=self.logger,
-            debugging_port=debugging_port
+            debugging_port=debugging_port,
+        )
+
+        # Create managed browser instance
+        managed_browser = ManagedBrowser(
+            browser_config=browser_cfg,
+            logger=self.logger
         )
         
         # Set up signal handlers to ensure cleanup on interrupt
         original_sigint = signal.getsignal(signal.SIGINT)
@@ -816,13 +816,19 @@
         user_data_dir = os.path.join(self.builtin_browser_dir, "user_data")
         os.makedirs(user_data_dir, exist_ok=True)
         
-        # Create managed browser instance
-        managed_browser = ManagedBrowser(
+        # Create a BrowserConfig object to pass to ManagedBrowser
+        browser_cfg = BrowserConfig(
             browser_type=browser_type,
             user_data_dir=user_data_dir,
             headless=headless,
+            debugging_port=debugging_port,
+        )
+
+        # Create managed browser instance
+        managed_browser = ManagedBrowser(
+            browser_config=browser_cfg,
             logger=self.logger,
-            debugging_port=debugging_port
         )
         
         try:

# Update last-used timestamp
self.sessions[crawlerRunConfig.session_id] = (context, page, time.time())
return page, context
# --- START: Session-specific logic ---
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the fix for the session failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant
0