This repository was archived by the owner on May 22, 2025. It is now read-only.

tmanabe/HEPS

HEPS: a HEading-based Page Segmentation algorithm

All the details are in our paper.

batch.rb

  • Supports batch processing with HEPS.
  • Usage:
$ ruby batch.rb <path_to_PhantomJS_binary> ./html-dir ./target-dir
  • It was developed using:
    • CentOS release 6.5
    • Ruby 2.1.2p95
    • PhantomJS 2.0.1-development
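
The usage line above takes a PhantomJS binary, an input directory of HTML files, and a target directory. As a rough illustration of what such a batch driver might look like, here is a minimal sketch. It is not the actual batch.rb: the script name `heps.js`, the `.json` output extension, and the helper names `plan_batch` and `run_batch` are all assumptions made for this example.

```ruby
require "fileutils"

# Hypothetical name of the script handed to PhantomJS (an assumption,
# not taken from the repository).
HEPS_SCRIPT = "heps.js"

# Build one [command, output_path] pair per *.html file in html_dir.
# The ".json" output extension is an assumption for illustration.
def plan_batch(phantomjs, html_dir, target_dir)
  Dir.glob(File.join(html_dir, "*.html")).sort.map do |html_path|
    out_name = File.basename(html_path, ".html") + ".json"
    [[phantomjs, HEPS_SCRIPT, html_path], File.join(target_dir, out_name)]
  end
end

# Run each planned command and save its standard output to the target file.
def run_batch(phantomjs, html_dir, target_dir)
  FileUtils.mkdir_p(target_dir)
  plan_batch(phantomjs, html_dir, target_dir).each do |command, out_path|
    # Passing an array to IO.popen avoids shell interpolation of file names.
    output = IO.popen(command, &:read)
    File.write(out_path, output)
  end
end
```

Separating the planning step from the execution step keeps the file-name mapping easy to inspect without launching PhantomJS.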

Notes

  • This implementation ignores the childNodes of IFRAME and NOSCRIPT elements as well as SCRIPT and STYLE elements.
  • The current parameter values are roughly optimized for our entire data set (not only the training data set described in our paper).
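
The first note can be pictured as a tree walk that never descends into certain elements. The following is a minimal sketch of that idea on a toy node structure, not the HEPS implementation itself; `Node`, `OPAQUE_TAGS`, and `visible_text` are hypothetical names introduced only for this example.

```ruby
# Toy stand-in for a DOM node: tag is nil for text nodes.
Node = Struct.new(:tag, :text, :children)

# Elements whose child nodes are skipped, as listed in the note above.
OPAQUE_TAGS = %w[IFRAME NOSCRIPT SCRIPT STYLE].freeze

# Collect text from the tree in document order, without descending
# into the children of any element in OPAQUE_TAGS.
def visible_text(node, acc = [])
  if node.tag.nil?
    acc << node.text
  elsif !OPAQUE_TAGS.include?(node.tag.upcase)
    (node.children || []).each { |child| visible_text(child, acc) }
  end
  acc
end

page = Node.new("BODY", nil, [
  Node.new("H1", nil, [Node.new(nil, "Title", nil)]),
  Node.new("SCRIPT", nil, [Node.new(nil, "var x = 1;", nil)]),
  Node.new("P", nil, [Node.new(nil, "Body text", nil)])
])

visible_text(page)  # => ["Title", "Body text"]
```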

Link
