Weflac/reptile
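# Web crawler: scraping data from the web

## 1. Required packages

Install the required packages with `pip install -r requirements.txt`.

Built-in libraries:

- `urllib`
- `re`

Request libraries:

- `requests`: a more convenient request library
- `selenium`: fetches data by driving a browser
- `phantomjs`: headless browser; data is fetched through PhantomJS

Parsing libraries:

- `pyquery`: a powerful, flexible HTML parsing library with a jQuery-like API
- `lxml`: HTML parsing
- `beautifulsoup4`: used here with the lxml parser
- `jupyter`: a notebook that runs in the browser, where you can write, debug, and run code online alongside Markdown

Storage libraries:

- `pymysql`
- `pymongo`
- `redis`

Charting libraries:

- `pyecharts`: Python bindings for Baidu ECharts

## 2. Distributed crawler

Install with `pip install Scrapy`.

Dependencies:

- `wheel`: wheel files use the `.whl` extension
- `lxml`
- `PyOpenssl`
- `Twisted`
- `Pywin32`: https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/

Creating a project: run `scrapy` in a cmd prompt to list the available commands:

- `bench`: Run quick benchmark test
- `fetch`: Fetch a URL using the Scrapy downloader
- `genspider`: Generate new spider using pre-defined templates
- `runspider`: Run a self-contained spider (without creating a project)
- `settings`: Get settings values
- `shell`: Interactive scraping console
- `startproject`: Create new project
- `version`: Print Scrapy version
- `view`: Open URL in browser, as seen by Scrapy

Steps:

1. `scrapy startproject quotetutorial` creates the project `quotetutorial`.
2. `scrapy genspider quotes quotes.toscrape.com` generates a spider named `quotes` for the site quotes.toscrape.com.

## 3. OCR image recognition

Install with `pip install pytesseract`.

Dependencies:

- `pillow`
- `tesseract ocr`: on Windows, download and run the installer from http://code.google.com/p/tesseract-ocr/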
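As a rough illustration of what the two built-in libraries from section 1 (`urllib` and `re`) can do on their own, here is a minimal sketch that downloads a page and pulls text out of it with a regular expression. The function names `fetch` and `extract_quotes`, the `User-Agent` value, and the assumption that each quote sits inside a `<span class="text">` element are all hypothetical choices made for this example; they are not taken from the repository's code.

```python
import re
from urllib.request import Request, urlopen

# Assumption: each quote is wrapped in <span class="text" ...>...</span>.
# Adjust the pattern to match the markup of the page you actually scrape.
QUOTE_RE = re.compile(r'<span class="text"[^>]*>(.*?)</span>', re.S)


def fetch(url, timeout=10):
    """Download a page with urllib and return its body as text."""
    # Some sites reject urllib's default User-Agent, so send a browser-like one.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_quotes(html):
    """Return the stripped text of every element matched by QUOTE_RE."""
    return [match.strip() for match in QUOTE_RE.findall(html)]


# Offline demonstration on a hand-written HTML fragment.
sample_html = (
    '<div class="quote"><span class="text">First quote</span></div>'
    '<div class="quote"><span class="text">Second quote</span></div>'
)
for quote in extract_quotes(sample_html):
    print(quote)

# Against a live site (requires network access):
#   html = fetch("http://quotes.toscrape.com")
#   print(extract_quotes(html))
```

Regular expressions are fragile against real-world HTML, which is one reason the parsing libraries listed in section 1 (`pyquery`, `lxml`, `beautifulsoup4`) are usually a better fit once the markup gets more complex.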