Use beautifulsoup4 instead of beautifulscraper by Rafiot · Pull Request #28 · kgaughan/uwhoisd · GitHub

Use beautifulsoup4 instead of beautifulscraper #28

Merged · 1 commit · Aug 18, 2016
README (6 changes: 3 additions & 3 deletions)
```diff
@@ -3,7 +3,7 @@ uwhoisd
 
 .. image:: https://secure.travis-ci.org/kgaughan/uwhoisd.png?branch=master
    :width: 89px
-   :height: 13px
+   :height: 13px
    :target: http://travis-ci.org/kgaughan/uwhoisd
 
 A 'Universal WHOIS' proxy server: you query it for information about a
@@ -14,5 +14,5 @@ It is only intended for use with domain names currently, but could be
 generalised to work with other types of WHOIS server.
 
 The daemon comes with a scraper to pull WHOIS server information from IANA's
-root zone database at `tools/scraper.py`. This requires the `beautifulscraper
-<https://pypi.python.org/pypi/beautifulscraper>`_ package to run.
+root zone database at `tools/scraper.py`. This requires the `beautifulsoup4
+<https://pypi.python.org/pypi/beautifulsoup4>`_ package to run.
```
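The fetch-and-parse pattern the README now refers to looks roughly like this; the URL and CSS selector are the ones used in `tools/scraper.py`, the rest is an illustrative sketch:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the IANA root zone database and parse it with the
# stdlib-backed 'html.parser', as tools/scraper.py does.
page = requests.get('http://www.iana.org/domains/root/db').text
soup = BeautifulSoup(page, 'html.parser')

# CSS selectors take over the navigation beautifulscraper used to do.
for link in soup.select('#tld-table .tld a'):
    print(link.get_text(), link.attrs.get('href'))
```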
requirements-dev.txt (4 changes: 4 additions & 0 deletions)
```diff
@@ -2,3 +2,7 @@
 
 # Documentation.
 Sphinx
+
+# scrape IANA
+beautifulsoup4
+requests
```
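With these two additions, the scraper's dependencies install alongside the docs tooling via `pip install -r requirements-dev.txt`.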
tools/scraper.py (27 changes: 16 additions & 11 deletions)
```diff
@@ -6,10 +6,14 @@
 import socket
 import sys
 import time
-import urlparse
 
-import beautifulscraper
+try:
+    from urllib.parse import urljoin
+except ImportError:
+    from urlparse import urljoin
+
+from bs4 import BeautifulSoup
+import requests
 
 ROOT_ZONE_DB = 'http://www.iana.org/domains/root/db'
 SLEEP = 0
```
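The try/except import keeps the script runnable on both Python 2 and 3, since `urljoin` moved from `urlparse` to `urllib.parse` in Python 3. A quick sketch of the behaviour both versions share, with a made-up href of the kind found in the TLD table:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Hypothetical relative href, resolved against the root zone DB URL.
print(urljoin('http://www.iana.org/domains/root/db',
              '/domains/root/db/example.html'))
# -> http://www.iana.org/domains/root/db/example.html
```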
```diff
@@ -23,14 +27,14 @@ def main():
     """
     logging.basicConfig(stream=sys.stderr, level=logging.INFO)
 
-    print '[overrides]'
+    print('[overrides]')
 
     logging.info("Scraping %s", ROOT_ZONE_DB)
-    scraper = beautifulscraper.BeautifulScraper()
-    body = scraper.go(ROOT_ZONE_DB)
+    zone_page = requests.get(ROOT_ZONE_DB).text
+    soup = BeautifulSoup(zone_page, 'html.parser')
 
     no_server = []
-    for link in body.select('#tld-table .tld a'):
+    for link in soup.select('#tld-table .tld a'):
         if 'href' not in link.attrs:
             continue
@@ -43,9 +47,10 @@
 
         time.sleep(SLEEP)
 
-        zone_url = urlparse.urljoin(ROOT_ZONE_DB, link.attrs['href'])
+        zone_url = urljoin(ROOT_ZONE_DB, link.attrs['href'])
         logging.info("Scraping %s", zone_url)
-        body = scraper.go(zone_url)
+        b = requests.get(zone_url).text
+        body = BeautifulSoup(b, 'html.parser')
 
         title = body.find('h1')
         if title is None:
```
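A note on the parser argument: 'html.parser' is the interpreter's built-in backend, so it adds no dependency beyond beautifulsoup4 itself. If parsing speed ever became a concern, bs4 can also drive lxml; a hypothetical variant that would require adding lxml to the requirements:

```python
soup = BeautifulSoup(zone_page, 'lxml')  # assumes the lxml package is installed
```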
```diff
@@ -55,7 +60,7 @@
         if len(title_parts) != 2:
             logging.info("Could not find TLD in '%s'", title)
             continue
-        ace_zone = title_parts[1].encode('idna').lower()
+        ace_zone = title_parts[1].encode('idna').decode().lower()
 
         whois_server_label = body.find('b', text='WHOIS Server:')
         whois_server = ''
```
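The added `.decode()` matters on Python 3, where `str.encode('idna')` returns `bytes`; without it the zone would be printed as `b'...'`. On Python 2 the result is already a `str`, so the extra `.decode()` is a harmless round-trip and the line works on both. A minimal illustration, using an arbitrary example label:

```python
# Python 3: encode() yields bytes; decode back to str before lower().
print(u'münchen'.encode('idna'))           # b'xn--mnchen-3ya'
print(u'münchen'.encode('idna').decode())  # 'xn--mnchen-3ya'
```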
```diff
@@ -76,10 +81,10 @@
             no_server.append(ace_zone)
         else:
             logging.info("WHOIS server for %s is %s", ace_zone, whois_server)
-            print '%s=%s' % (ace_zone, whois_server)
+            print('%s=%s' % (ace_zone, whois_server))
 
     for ace_zone in no_server:
-        print '; No record for %s' % ace_zone
+        print('; No record for %s' % ace_zone)
 
     logging.info("Done")
```
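For context, the script emits an INI-style `[overrides]` section on stdout. A hypothetical excerpt of the output (server names illustrative):

```
[overrides]
com=whois.verisign-grs.com
org=whois.pir.org
; No record for example
```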