GitHub - Weflac/reptile

.idea
__pycache__
maoyan
tesseract
weixin
beautifulsoup_demo.py
geckodriver.log
ghostdriver.log
lxml_demo.py
mongodb.py
phantomjs_demo.py
pymysql_demo.py
pyquery_demo.py
readme
redis_demo.py
requests_demo.py
requirements.txt
selenium_demo.py
urllib-re.py


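# Web crawler: scraping data from the web

1. Required packages
   Install them with: pip install -r requirements.txt

   Built-in libraries:
     # urllib
     # re
   Request libraries:
     # requests   a more convenient HTTP request library
     # selenium   fetches data by driving a real browser
     # phantomjs  headless browser; fetches data through PhantomJS
   Parsing libraries:
     # pyquery         powerful, flexible page parser with a jQuery-like API
     # lxml            HTML/XML page parsing
     # beautifulsoup4  HTML parsing; can use lxml as its parser backend
     # jupyter         browser-based notebook for writing, debugging, and running code online, with Markdown support

   Storage libraries:
     # pymysql
     # pymongo
     # redis

   Charting libraries:
     # pyecharts  Python wrapper for Baidu's ECharts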
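
As a rough illustration of the built-in route (urllib plus re), the sketch below downloads a page and pulls out its links with a regular expression. The names `fetch_html`, `extract_links`, and `LINK_RE` are made up for this example and are not taken from the repository's scripts; the regex only handles double-quoted `href` attributes and assumes the page is UTF-8.

```python
import re
from urllib.request import Request, urlopen

# Hypothetical pattern: matches <a ... href="...">text</a> with a double-quoted href.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.S | re.I)


def fetch_html(url, timeout=10):
    """Download a page with urllib and decode it as UTF-8 (assumed encoding)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Return (href, text) pairs for every anchor the regex can match."""
    return [(href, text.strip()) for href, text in LINK_RE.findall(html)]
```

A call such as `extract_links(fetch_html("http://quotes.toscrape.com/"))` would return the page's links as a list of tuples. For anything beyond simple pages, one of the parsing libraries listed above is usually a safer choice than regular expressions.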

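2. Distributed crawler
    Install: pip install Scrapy
    Dependencies:
        # wheel  support for wheel packages (.whl files)
        # lxml
        # pyOpenSSL
        # Twisted
        # pywin32    https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/

     Creating a project:
        Run scrapy in a cmd window to list the available commands: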
            Available commands:
              bench         Run quick benchmark test
              fetch         Fetch a URL using the Scrapy downloader
              genspider     Generate new spider using pre-defined templates
              runspider     Run a self-contained spider (without creating a project)
              settings      Get settings values
              shell         Interactive scraping console
              startproject  Create new project
              version       Print Scrapy version
              view          Open URL in browser, as seen by Scrapy

       1. scrapy startproject quotetutorial   creates the project quotetutorial
       2. scrapy genspider quotes quotes.toscrape.com   generates a spider named quotes for the site quotes.toscrape.com

3. OCR image recognition
    Install: pip install pytesseract
    Dependencies:
        # pillow
        # tesseract ocr     # on Windows, download and run the installer: http://code.google.com/p/tesseract-ocr/