Nutch Htmlunit Plugin

Introduction

As implemented, Apache Nutch 2.1 ignores all AJAX requests, so fetched pages lack any dynamic HTML content that those requests would load.

This plugin uses HtmlUnit to fetch the whole page content, including the content loaded by the necessary AJAX requests. It was developed and tested with Apache Nutch 2.1; you can try it on other Nutch versions or refactor the source code to suit your design.

Quick Start

Copy the HtmlUnit dependencies into your apache-nutch-2.1/lib directory, using Ivy, Maven, or a manual copy. For the dependency list, see: http://htmlunit.sourceforge.net/dependencies.html
Copy runtime/local/plugins/* to your apache-nutch-2.1/plugins directory.
Edit your apache-nutch-2.1/conf/nutch-site.xml to enable this plugin, 'protocol-htmlunit', as in the sample below:


<property>
  <name>plugin.includes</name>
  <value>protocol-htmlunit|urlfilter-regex|parse-...</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>

Optionally, you can configure apache-nutch-2.1/conf/regex-urlfilter.txt so that HtmlUnit fetches only the specified URLs, including the internal AJAX requests. For details, see: https://github.com/xautlx/nutch-htmlunit/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/RegexHttpWebConnection.java
That's all. Now you can run apache-nutch-2.1/bin/nutch crawl urls and see the page contents parsed by HtmlUnit.
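
To illustrate the optional URL-filter step above, here is a minimal sketch of how rules in the style of regex-urlfilter.txt could decide which URLs (page URLs and internal AJAX requests alike) get fetched. It is not the plugin's RegexHttpWebConnection code. It assumes the usual Nutch rule format: one rule per line, '+' to accept or '-' to reject, followed by a regular expression, with the first matching rule deciding and unmatched URLs rejected. The class name UrlRuleSketch, its methods, and the example.com rules are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlRuleSketch {

    /** One rule: accept ('+') or reject ('-'), plus the regex that follows the sign. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses rule text, skipping blank lines and '#' comment lines. */
    static List<Rule> parse(String rulesText) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : rulesText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    /** The first rule whose regex is found in the URL decides; no match means reject. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules: skip images and stylesheets, allow example.com, reject the rest.
        String rulesText = String.join("\n",
                "# illustration only",
                "-\\.(gif|jpg|png|css)$",
                "+^https?://example\\.com/",
                "-.");
        List<Rule> rules = parse(rulesText);

        String[] urls = {
            "http://example.com/api/data.json",
            "http://example.com/logo.png",
            "http://other.org/page.html"
        };
        for (String url : urls) {
            System.out.println((accepts(rules, url) ? "fetch " : "skip  ") + url);
        }
    }
}
```

Because the first matching rule wins, the order of the rules matters: in this sketch the image rule must come before the host rule, otherwise http://example.com/logo.png would be accepted.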

Contact Author

E-Mail: xautlx@hotmail.com
