Weflac/reptile

# Web crawler for scraping data from the web

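1. Required packages
   Install the required packages with: pip install -r requirements.txt

   Built-in libraries:
     # urllib
     # re
   Request libraries:
     # requests  a more convenient HTTP request library
     # selenium  fetches data by driving a browser
     # phantomjs headless browser; data is fetched through PhantomJS
   Parsing libraries:
     # pyquery   powerful and flexible HTML parsing library, very similar to jQuery
     # lxml      HTML/XML parsing
     # beautifulsoup4 relies on lxml as its parser
     # jupyter   a notebook that runs in the browser; you can write, debug, and run code online, and write Markdown

   Storage libraries:
     # pymysql
     # pymongo
     # redis

   Charting libraries:
     # pyecharts  Baidu ECharts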
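
As a rough illustration of what the built-in libraries above can do on their own, here is a minimal sketch that fetches a page with urllib and pulls links out of it with re. The helper names `fetch` and `extract_links` and the sample HTML are hypothetical and not part of this repository; in practice requests and one of the parsing libraries would likely replace both helpers.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Simple pattern for href attributes inside <a> tags. Good enough for a
# sketch; a real parser (lxml, pyquery, beautifulsoup4) is more robust.
LINK_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def fetch(url, timeout=10):
    """Download a page and decode it using the charset the server reports."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> found in html."""
    return [urljoin(base_url, href) for href in LINK_RE.findall(html)]


# Hypothetical HTML snippet, used so the example runs without network access.
sample_html = (
    '<p><a href="/page/2/">Next</a> '
    '<a class="x" href="http://example.com/a">A</a></p>'
)
links = extract_links(sample_html, "http://quotes.toscrape.com/")
print(links)
```

To run it against a live page, pass the result of `fetch(url)` to `extract_links` in place of `sample_html`.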

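2. Distributed crawler
    Install with: pip install Scrapy
    Dependencies:
        # wheel  wheel files use the .whl extension
        # lxml
        # pyOpenSSL
        # Twisted
        # pywin32    https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/

     Create a project:
        Running scrapy in cmd lists the available commands: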
            Available commands:
              bench         Run quick benchmark test
              fetch         Fetch a URL using the Scrapy downloader
              genspider     Generate new spider using pre-defined templates
              runspider     Run a self-contained spider (without creating a project)
              settings      Get settings values
              shell         Interactive scraping console
              startproject  Create new project
              version       Print Scrapy version
              view          Open URL in browser, as seen by Scrapy

       1. scrapy startproject quotetutorial    creates the project quotetutorial
       2. scrapy genspider quotes quotes.toscrape.com    generates a spider named quotes for the site quotes.toscrape.com

3. OCR image recognition
    Install with: pip install pytesseract
    Dependencies:
        # pillow
        # tesseract ocr     # Windows needs the installer downloaded from: http://code.google.com/p/tesseract-ocr/
