GitHub - stlukey/WebCrawler: A Page() object with web crawling capabilities.
WebCrawler

An HTML Page() object with web-crawling capabilities.

Args:
url (str):       URL of the web page.
level=1 (int):   depth of recursive web crawling:
                      0: just that page,
                      1: that page and its links,
                      2: that page, its links, and its links' links,
                      etc.

Usage:

>>> page = Page(url, level=1)

>>> vars(page)
{
    'url': "...",                              # url passed to __init__
    'source': "...",                           # urllib2.urlopen(url).read()
    'soup': <__name__.BeautifulSoup at ...>,   # BeautifulSoup(source)
    'title': "...",                            # <title> tag
    'description': "...",                      # meta tag description
    'rank': "...",                             # Google's rank of the url
    'keywords': ["...", ...],                  # meta tag keywords
    'urls': [...],                             # href of <a> tags
    'links': [...]                             # Page instances of urls with
                                               # level - 1
}

# Each Page object creates links as Page(url, level-1)
>>> page.links
[< __name__.Page at ....>,
 < __name__.Page at ....>,
 < __name__.Page at ....>,
 ....]

>>> page.links[0]
< __name__.Page at ....>

>>> page = page.links[0]
>>> page.links
[< __name__.Page at ....>,
 < __name__.Page at ....>,
 < __name__.Page at ....>,
 ....]
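The recursive level logic above can be sketched in a few lines. This is not the project's implementation (which uses urllib2 and BeautifulSoup); it is a minimal, stdlib-only illustration of how `level` controls crawl depth, with the page fetcher injected as a parameter so the sketch works offline.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


class Page:
    """Sketch of Page(url, level): each link becomes Page(url, level - 1).

    `fetch` is a hypothetical parameter added here for testability; the
    real project fetches with urllib2.urlopen(url).read() instead.
    """

    def __init__(self, url, level=1, fetch=None):
        self.url = url
        self.source = fetch(url) if fetch else ""
        parser = LinkParser()
        parser.feed(self.source)
        self.urls = parser.urls
        # level 0 stops the recursion; otherwise crawl each link one
        # level shallower, exactly as described in Args above.
        self.links = ([Page(u, level - 1, fetch) for u in self.urls]
                      if level > 0 else [])


# Usage with a fake three-page "site" instead of real HTTP:
site = {
    "a.html": '<a href="b.html">b</a>',
    "b.html": '<a href="c.html">c</a>',
    "c.html": "no links here",
}
page = Page("a.html", level=1, fetch=lambda u: site.get(u, ""))
```

With `level=1`, `page.links` holds one `Page` for `b.html`, and that page's own `links` list is empty because it was built with `level=0`.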
