Python Web Crawler Tutorial: Downloading xkcd Comics

Hicoder

2022-01-27

0 Preface

Python version: 3.7.0

Development tools: IDLE (Python 3.7 64-bit), Google Chrome

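1 What is a web crawler?

A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the web according to a set of rules. (Baidu Baike)

Put simply, a web crawler is a program we write to automatically fetch the information on the web that is useful to us.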

2 Essential HTML and CSS knowledge

A basic web page, example.html:

<!doctype html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    Page body: the content shown in the browser goes here
  </body>
</html>

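2.1 The basic structure of HTML

To learn HTML in depth, see the runoob tutorial (菜鸟教程).

In example.html, "<!doctype html>" declares the document type as HTML. A word or letter wrapped in "<" and ">" forms an HTML tag; tags usually come in pairs, such as "<p></p>". "<head></head>" holds basic information about the page, such as its title (shown in the browser's title bar), encoding, author, and description. The content inside "<body></body>" is displayed in the browser. Tags can have attributes, such as src and href.

Common tags:

<!doctype html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <div>
      <h1> I am a level-1 heading </h1>
      <p> I am a paragraph </p>
      <div>
        <img src="https://img.baidu.com/1.jpg" alt="1.jpg" title="This is an image">
        <a href="http://linjianming.com/">This is a link</a>
      </div>
    </div>
  </body>
</html>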
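To see how tags and attributes look from a program's point of view, here is a minimal sketch that walks a fragment of the page above and collects every src and href attribute. It uses html.parser from the Python standard library rather than the bs4 module introduced later, and the class name AttrCollector is made up for this illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute, value) triples for src and href attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        for name, value in attrs:
            if name in ('src', 'href'):
                self.found.append((tag, name, value))


# A fragment of the "common tags" page shown above
html = '''<div>
  <img src="https://img.baidu.com/1.jpg" alt="1.jpg">
  <a href="http://linjianming.com/">link</a>
</div>'''

collector = AttrCollector()
collector.feed(html)
for tag, name, value in collector.found:
    print(tag, name, value)
```

The img tag contributes its src and the a tag its href; these are the same attributes the crawler in section 5 reads to find the image and the next page.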

2.2 What is CSS? What is a CSS selector?

To learn CSS in depth, see the runoob tutorial (菜鸟教程).

CSS tells the browser how to display the content of a page, which makes the page look better. Common CSS selectors: id, class, and tag selectors.

<!doctype html>
<html>
  <head>
    <title>Page title</title>
    <style>
      .div-0 { background-color: grey; }
      #header { font-size: 22px; color: red; }
      p { font-size: 14px; text-align: center; }
    </style>
  </head>
  <body>
    <div class="div-0">
      <h1 id="header"> I am a level-1 heading </h1>
      <p> I am a paragraph </p>
    </div>
  </body>
</html>

3 Python basics: learn them at 随心而码

4 Building a web crawler in Python

4.1 requests

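Install the requests module: pip3 install requests

Purpose: download files and web pages from the network.

Common function: requests.get() takes the URL to download.

import requests
res = requests.get('https://www.baidu.com/')
res.raise_for_status()  # check for errors; raises an exception if the request failed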
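As a rough sketch of how the two calls above might be wrapped for reuse, the helper below returns the page text, or None when anything goes wrong. The names fetch_text and is_error_status and the 10-second timeout are assumptions made for this illustration, not part of the original tutorial. The second helper builds a bare Response object by hand, purely to show offline what raise_for_status() does with a given status code.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the body of url as text, or None if the request fails."""
    try:
        res = requests.get(url, timeout=timeout)
        res.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class of the errors requests raises
        print('Request failed: %s' % exc)
        return None
    return res.text


def is_error_status(status_code):
    """Report whether raise_for_status() would raise for this status code."""
    res = requests.Response()  # hand-built response, no network involved
    res.status_code = status_code
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError:
        return True
    return False


print(is_error_status(200))
print(is_error_status(404))
```

A call such as fetch_text('https://www.baidu.com/') would perform the real download. Passing a timeout is a design choice here: without one, requests.get() can wait indefinitely on a server that never answers.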

4.2 bs4

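Install the bs4 module: pip3 install bs4

Purpose: parse HTML.

Usage: call select() with a CSS selector to find HTML elements, then call a tag's get() method to read data from an element. Common CSS selector patterns include id, class, tag, and attribute selectors; the example below uses the attribute selector input[type="submit"].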

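import requests, bs4
res = requests.get('https://www.baidu.com/')
res.encoding = 'utf-8'  # set the encoding so Chinese text is not garbled
res.raise_for_status()  # check for errors; raises an exception if the request failed
soup = bs4.BeautifulSoup(res.text, "html.parser")
iput = soup.select('input[type="submit"]')  # list of matching elements
iputValue = iput[0].get('value')  # value attribute of the first match
print(iputValue)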
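Besides the attribute selector used above, the id, class, and tag selectors from section 2.2 work with select() in the same way. The sketch below runs them against a small inline page so that no network access is needed. The page is adapted from the section 2.2 example, and the input element in it is an addition made only for this illustration.

```python
import bs4

# Inline page adapted from section 2.2; the <input> is added for illustration
html = '''<html><head><title>Page title</title></head>
<body>
  <div class="div-0">
    <h1 id="header">I am a level-1 heading</h1>
    <p>I am a paragraph</p>
    <input type="submit" value="Search">
  </div>
</body></html>'''

soup = bs4.BeautifulSoup(html, 'html.parser')

by_tag = soup.select('p')                       # tag selector
by_id = soup.select('#header')                  # id selector
by_class = soup.select('.div-0')                # class selector
nested = soup.select('div p')                   # <p> anywhere inside a <div>
by_attr = soup.select('input[type="submit"]')   # attribute selector

print(by_id[0].get_text())       # text inside the matched <h1>
print(by_attr[0].get('value'))   # value attribute of the matched <input>
```

select() always returns a list, which is empty when nothing matches, so it is worth checking the list before indexing into it.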

4.3 os

Built into Python; no manual installation needed.

Common functions:

os.makedirs() creates a new folder

os.path.join() builds a file path

os.path.basename() returns the base name of a file
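The three functions above can be tried together in a short sketch. It works inside a temporary directory so that it leaves nothing behind in the current folder; the temporary directory and the image URL are both assumptions made for this illustration.

```python
import os
import tempfile

# Scratch directory so the demo does not touch the current folder
base = tempfile.mkdtemp()

folder = os.path.join(base, 'xkcd')   # build a path portably
os.makedirs(folder, exist_ok=True)    # create the folder
os.makedirs(folder, exist_ok=True)    # no error: exist_ok=True allows repeats

comic_url = 'https://imgs.xkcd.com/comics/example.png'  # hypothetical URL
filename = os.path.basename(comic_url)          # last component of the path
target = os.path.join(folder, '1_' + filename)  # where the file would be saved

print(filename)
print(target)
```

This is the same combination the download script in section 5 uses to decide where each image is written.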

5 Downloading xkcd comics (from the book "Automate the Boring Stuff with Python")

#! python3
# Download every xkcd comic, starting from the first one.
import requests, os, bs4
url = 'https://xkcd.com/1/'  # starting URL
os.makedirs('xkcd', exist_ok=True)  # store the comics in ./xkcd
i = 1
while not url.endswith('#'):
    # Download the page
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    # Find the URL of the comic image
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'https:' + comicElem[0].get('src')
        # Download the image
        print('Downloading image %s...' % comicUrl)
        res = requests.get(comicUrl)
        res.raise_for_status()
        # Save the image to ./xkcd; the with block closes the file
        # only after every chunk has been written
        imagePath = os.path.join('xkcd', str(i) + '_' + os.path.basename(comicUrl))
        with open(imagePath, 'wb') as imageFile:
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
    # Get the Next button's URL; the loop ends once it ends with '#'
    nextLink = soup.select('a[rel="next"]')[0]
    url = 'https://xkcd.com' + nextLink.get('href')
    i = i + 1
print('Done.')
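The script builds the image URL by prepending 'https:' to the src attribute, which assumes the attribute always starts with '//'. As one possible alternative, the sketch below resolves src against the page URL with urljoin() from the standard library, which handles protocol-relative, root-relative, and absolute values alike. The function name image_target and the sample src value are hypothetical and are not part of the script above.

```python
import os
from urllib.parse import urljoin


def image_target(page_url, src, index, folder='xkcd'):
    """Return (absolute image URL, local file path) for one comic."""
    image_url = urljoin(page_url, src)  # resolve src against the page URL
    filename = '%d_%s' % (index, os.path.basename(image_url))
    return image_url, os.path.join(folder, filename)


# Hypothetical src value in the protocol-relative form the script expects
image_url, path = image_target('https://xkcd.com/1/',
                               '//imgs.xkcd.com/comics/example.png', 1)
print(image_url)
print(path)
```

Inside the loop, the pair returned by image_target(url, comicElem[0].get('src'), i) could stand in for comicUrl and the path passed to open().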


This work is licensed under Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
