
How to download font files with a Python crawler: downloading documents with a Python crawler

Backend 2024-05-30 23:54:30

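I have been learning Python web scraping recently, and after skimming some documentation I wanted to build something small to consolidate what I had learned.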

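So I wrote the code below, which downloads a novel from a website. The novel is organized chapter by chapter, so after scraping each chapter its content also has to be merged into a single TXT file.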

The Python version is 3.6, and the code uses the BeautifulSoup library.

The website looks like this:

[Image 1: screenshot of the website]

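As the screenshot above shows, every chapter on the site has its own download link, so I need to merge all the files together. Inspecting the page showed that it is GBK-encoded, so each downloaded file has to be decoded from GBK into a string before being written to the final file.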
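As a small illustration of the decoding step, the sketch below round-trips a short string through GBK. The sample bytes are produced locally rather than downloaded, and the variable names are made up for this example.

```python
# Hypothetical sample: in the real script these bytes would come from download().
raw_bytes = '第一章 开始'.encode('gbk')

# Decode the GBK bytes into a Python str before doing any text processing.
chapter_text = raw_bytes.decode('gbk')

# Decoding the same bytes as UTF-8 fails, which is why the page encoding matters.
try:
    raw_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(raw_bytes), 'bytes decoded into', len(chapter_text), 'characters')
print('valid UTF-8:', utf8_ok)
```

Decoding with the wrong codec either raises an error, as here, or silently produces garbled text, so it helps to check the page's declared encoding first.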

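I write the content to the final file as strings rather than raw bytes because, besides concatenating the files, I also care about the layout of the result: each chapter gets its title, and consecutive chapters are separated by a few line breaks, so that the novel-reading apps on my device can recognize the table of contents.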
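The layout described here can be sketched roughly as follows. The helper `join_chapters`, its `blank_lines` parameter, and the sample chapters are hypothetical and only illustrate the idea of a title line followed by the body, with blank lines between chapters.

```python
def join_chapters(chapters, blank_lines=2):
    """Join (title, body) pairs into one string, each title on its own line."""
    # One newline ends the body line; the rest become blank lines between chapters.
    separator = '\n' * (blank_lines + 1)
    parts = []
    for title, body in chapters:
        parts.append(title.strip() + '\n' + body.strip())
    return separator.join(parts) + '\n'


# Made-up sample data standing in for downloaded chapters.
sample = [('Chapter 1', 'First body.'), ('Chapter 2', 'Second body.')]
book_text = join_chapters(sample)
print(book_text)
```

Working with strings makes this kind of layout control straightforward. The same could be done with bytes, as long as the titles and separators are encoded consistently.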

The complete code is as follows:

import urllib.request
import urllib.error
import re
from bs4 import BeautifulSoup
import time
import random


def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    # Send a custom User-Agent; some sites block Python's default one and respond with 403 Forbidden
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        if num_retries > 0:
            # HTTPError is a subclass of URLError, so it is caught here too; `code` is an HTTPError attribute
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry on 5xx server errors; pass user_agent through so num_retries lands in the right parameter
                return download(url, user_agent, num_retries - 1)
    return html


def get_links(html):
    """
    return a list of links from html
    :param html:
    :return:
    """
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']')
    return webpage_regex.findall(html)


page_url = 'https://www.szyangxiao.com/txt197165.shtml'
html_result = download(page_url)

if html_result is None:
    exit(1)
else:
    # print(html_result)
    pass
# Parse the result and find the links that need to be visited
# download_links = filter_download_novel_links(get_links(str(html_result)))
#
# for link in download_links:
#     print(link)
soup = BeautifulSoup(html_result, 'html.parser')
fixed_html = soup.prettify()

# print(fixed_html)

uls = soup.find_all('ul', attrs={'class': 'clearfix'})
lis = uls[1].find_all('li', attrs={'class': 'min-width'})

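# Text mode without an explicit encoding uses the system's default encoding
with open(r'F:\红楼之庶子风流.txt', 'w') as target_file_writer:
    for li in lis[340:]:
        a = li.find('a')
        # The hrefs are assumed to be protocol-relative ('//host/path'); add the scheme only at the start
        href = a.get('href')
        if href.startswith('//'):
            href = 'https:' + href
        text = a.text.replace('下载《红楼之庶子风流 ', '').replace('》txt', '')
        content = download(href)
        if content is not None:
            # Write the chapter title, then the chapter body decoded from GBK
            target_file_writer.write(text)
            target_file_writer.write('\n')
            target_file_writer.write(str(content, 'gbk'))
            target_file_writer.write('\n')
        # Pause between requests to avoid hammering the site
        time.sleep(random.randint(5, 10))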

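That is the whole requirement and the code for it. I do not recommend reading free novels this way, of course; the code above is only an experiment.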
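The retry rule used in `download` (retry only when the server answers with a 5xx status) can also be written as a loop. The sketch below is one possible shape, not the code used in this post: `fetch_with_retries` and the injected `fetch` callable are hypothetical names, and the fake fetcher exists only so that the example runs without network access.

```python
import urllib.error


def fetch_with_retries(fetch, url, num_retries=2):
    """Call fetch(url); retry only on HTTP 5xx, up to num_retries extra attempts."""
    for attempt in range(num_retries + 1):
        try:
            return fetch(url)
        except urllib.error.URLError as e:
            # HTTPError subclasses URLError and carries a numeric `code`.
            code = getattr(e, 'code', None)
            retryable = code is not None and 500 <= code < 600
            if not retryable or attempt == num_retries:
                return None
    return None


# Fake fetcher for demonstration: fails twice with 503, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.HTTPError(url, 503, 'Service Unavailable', None, None)
    return b'ok'


result = fetch_with_retries(flaky_fetch, 'http://example.invalid/chapter')
print(result, 'after', len(calls), 'attempts')
```

Passing the fetch function in as an argument keeps the retry policy separate from the HTTP call, which makes the policy easy to exercise without touching a real site.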

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

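I have since made a small change to the code. On systems with a different locale, for example an English-language system, writing the content as strings raises an error, because a file opened in text mode without an explicit encoding uses the system's default encoding, which may not be able to represent Chinese characters. I therefore changed the code to write bytes to the file instead.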

# coding:utf-8
import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import time
import random


# Note: lxml could be used in place of BeautifulSoup here; this version still uses BeautifulSoup


def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    # Send a custom User-Agent; some sites block Python's default one and respond with 403 Forbidden
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        if num_retries > 0:
            # HTTPError is a subclass of URLError, so it is caught here too; `code` is an HTTPError attribute
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry on 5xx server errors; pass user_agent through so num_retries lands in the right parameter
                return download(url, user_agent, num_retries - 1)
    return html


page_url = 'https://www.szyangxiao.com/txt197165.shtml'
html_result = download(page_url)

if html_result is None:
    exit(1)
else:
    pass

# Parse the result and find the links that need to be visited
soup = BeautifulSoup(html_result, 'html.parser')
fixed_html = soup.prettify()

uls = soup.find_all('ul', attrs={'class': 'clearfix'})
lis = uls[1].find_all('li', attrs={'class': 'min-width'})

# Open the file in binary mode so that bytes are written directly
with open(r'C:\study\红楼之庶子风流.txt', 'wb') as target_file_writer:
    default_encode = 'utf-8'

    new_line = '\n'.encode(default_encode)

    for li in lis[340:]:
        a = li.find('a')
        # The hrefs are assumed to be protocol-relative ('//host/path'); add the scheme only at the start
        href = a.get('href')
        if href.startswith('//'):
            href = 'https:' + href
        # Convert the chapter title to bytes before writing it
        text = a.text.replace('下载《红楼之庶子风流 ', '').replace('》txt', '').encode(default_encode)

        content = download(href)
        if content is not None:
            target_file_writer.write(text)
            target_file_writer.write(new_line)
            # The source file is GBK-encoded, so decode it first and then re-encode it as UTF-8
            target_file_writer.write(content.decode('gbk').encode(default_encode))
            target_file_writer.write(new_line)
        time.sleep(random.randint(5, 10))

With this change the code runs regardless of the system's default encoding.
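An alternative that should give the same result, sketched here as an assumption rather than as the approach used above, is to stay in text mode and pass `encoding='utf-8'` to `open`. The file name and sample title are made up, and ASCII stands in for a locale encoding that cannot represent Chinese.

```python
import os
import tempfile

chapter_title = '第一章'

# Stand-in for a non-Chinese default encoding: ASCII cannot represent the title.
try:
    chapter_title.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Text mode with an explicit encoding, independent of the system locale.
# newline='\n' keeps line endings identical across platforms.
path = os.path.join(tempfile.mkdtemp(), 'novel.txt')
with open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.write(chapter_title + '\n')

# Read the raw bytes back to confirm what was written to disk.
with open(path, 'rb') as f:
    written = f.read()

print('ascii encode failed:', encode_failed)
print('bytes on disk:', len(written))
```

Either approach works. The point in both cases is that the output encoding is chosen explicitly instead of being inherited from the environment.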


https://www.xamrdz.com/backend/3t41962898.html
