HttpResponse of Crawled Page Returns Null in Abot2 · Issue #243 · sjdirect/abot · GitHub
HttpResponse of Crawled Page Returns Null in Abot2 #243
Open
@Nitish0949

Description

I encountered an issue using the Abot2 package while crawling web pages. The HttpResponseMessage property of the crawled page (entity.CrawledPage.HttpResponseMessage) is intermittently null for some pages.

Key Observations
This issue does not occur for all pages but only for certain ones.
The number of pages with a null httpResponse varies between different crawl runs.

Expected Behavior
The HttpResponseMessage should provide the HTTP response for all crawled pages.

Actual Behavior
The HttpResponseMessage is null for some pages, and the occurrence of these pages is inconsistent between crawl runs.
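
Until the cause is found, any code that reads HttpResponseMessage needs a null guard. The following is a minimal sketch, not Abot2's own code: `CrawlOutcome` and `Summarize` are hypothetical names, and the wiring shown in the trailing comment assumes that CrawledPage also exposes an `HttpRequestException` property, which should be checked against the installed Abot2 version.

```csharp
using System;
using System.Net.Http;

public static class CrawlOutcome
{
    // Returns a one-line summary for a crawled page without ever
    // dereferencing a null response.
    public static string Summarize(Uri uri, HttpResponseMessage response, Exception error)
    {
        if (response == null)
        {
            // No response object: report the innermost recorded error, if any.
            string reason = error?.GetBaseException().Message ?? "no exception recorded";
            return $"FAILED {uri}: {reason}";
        }

        return $"{(int)response.StatusCode} {uri}";
    }
}

// Assumed wiring inside a PageCrawlCompleted handler (not compiled here):
//   var page = e.CrawledPage;
//   Console.WriteLine(CrawlOutcome.Summarize(
//       page.Uri, page.HttpResponseMessage, page.HttpRequestException));
```

Logging the failed pages this way would also show whether every null response comes with the same underlying error.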

Additional Information
Error Message: When the issue occurs, the following error is logged:
The SSL connection could not be established, see inner exception.
Inner Exception: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
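
Since the logged message only says "see inner exception", printing the whole exception chain may help narrow this down. A small sketch using only the base class library; `ExceptionChain` and `Describe` are hypothetical helper names, not part of Abot2.

```csharp
using System;
using System.Collections.Generic;

public static class ExceptionChain
{
    // Returns "TypeName: Message" for the exception and every nested
    // InnerException, outermost first.
    public static IReadOnlyList<string> Describe(Exception ex)
    {
        var lines = new List<string>();
        for (Exception current = ex; current != null; current = current.InnerException)
        {
            lines.Add($"{current.GetType().Name}: {current.Message}");
        }
        return lines;
    }
}
```

Passing the exception recorded for a failed page to `Describe` and logging each line would show the type of the innermost exception (for example an IOException or SocketException), which the two messages above do not reveal.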

This issue appears to be related to handling HTTPS connections or certain server configurations.

Steps Already Taken
Verified SSL/TLS settings and configurations.
Checked network connectivity and ensured the target URLs are reachable.
Observed that this issue is not URL-specific but varies across crawl runs.
Ran the crawler with several different configurations; two of them are shown below.
Config 1:

public CrawlConfiguration CrawlConfig(int maxPages)
{
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    CrawlConfiguration crawlConfig = new()
    {
        MaxConcurrentThreads = 1,
        MinCrawlDelayPerDomainMilliSeconds = 1000,
        IsSslCertificateValidationEnabled = false,
        MaxPagesToCrawl = 5000,
        HttpRequestTimeoutInSeconds = 30,
        MaxRetryCount = 5,
        MinRetryDelayInMilliseconds = 5,
        CrawlTimeoutSeconds = 5000,
    };
    return crawlConfig;
}

Config 2:

public CrawlConfiguration CrawlConfig(int maxPages)
{
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    CrawlConfiguration crawlConfig = new()
    {
        MaxConcurrentThreads = 1,
        MinCrawlDelayPerDomainMilliSeconds = 1000,
        IsSslCertificateValidationEnabled = true,
        MaxPagesToCrawl = 5000,
        HttpRequestTimeoutInSeconds = 300,
        MaxRetryCount = 5,
        MinRetryDelayInMilliseconds = 5000,
        CrawlTimeoutSeconds = 5000,
    };
    return crawlConfig;
}
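
One thing that may be worth ruling out: on .NET Core and .NET 5+, HttpClient does not read ServicePointManager.SecurityProtocol, so the Tls12 line in both configs may have no effect if the crawler runs on such a runtime (the runtime is an assumption here). The sketch below requests a single URL outside the crawler with the TLS versions set on the handler itself; `TlsProbe`, `BuildHandler` and `ProbeAsync` are hypothetical names, and the 30-second timeout simply mirrors Config 1.

```csharp
using System;
using System.Net.Http;
using System.Security.Authentication;
using System.Threading.Tasks;

public static class TlsProbe
{
    // Builds a handler restricted to the given TLS versions.
    public static HttpClientHandler BuildHandler(SslProtocols protocols)
    {
        return new HttpClientHandler { SslProtocols = protocols };
    }

    // Requests one URL and returns either the status code or the
    // innermost error message. Timeouts surface as TaskCanceledException
    // and are deliberately not caught here.
    public static async Task<string> ProbeAsync(Uri uri, SslProtocols protocols)
    {
        // Disposing the client also disposes the handler it was given.
        using var client = new HttpClient(BuildHandler(protocols))
        {
            Timeout = TimeSpan.FromSeconds(30)
        };

        try
        {
            using HttpResponseMessage response = await client.GetAsync(uri);
            return $"{(int)response.StatusCode} {uri}";
        }
        catch (HttpRequestException ex)
        {
            return $"FAILED {uri}: {ex.GetBaseException().Message}";
        }
    }
}
```

Calling `await TlsProbe.ProbeAsync(uri, SslProtocols.Tls12)` repeatedly for one of the affected URLs would indicate whether the connection resets also happen without Abot2 in the picture.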

I would appreciate assistance in diagnosing and resolving this issue, or confirmation if this is a known bug.
