Scrapy 2.6 Feed Exports: A Guide to Exporting Data Files

Tech 08-06 Source: Mr数据杨

Scrapy ships with feed exports out of the box, supporting multiple serialization formats and storage backends.

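This guide covers the data export step of crawling with the Scrapy framework for Python 3: generating an "export file" containing the scraped data (commonly called an "export feed") for other systems to consume.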

Serialization formats

Data type	Feed format value	Item exporter class
JSON	json	JsonItemExporter
JSON lines	jsonlines	JsonLinesItemExporter
CSV	csv	CsvItemExporter
XML	xml	XmlItemExporter
Pickle	pickle	PickleItemExporter
Marshal	marshal	MarshalItemExporter
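
To make the difference between the json and jsonlines formats concrete, here is a minimal sketch that uses only Python's standard json module rather than Scrapy's own exporters. The sample items are made up for illustration.

```python
import json

# Hypothetical scraped items, used only for illustration.
items = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 19.5},
]

# "json" style: the whole feed is a single JSON array.
json_feed = json.dumps(items)

# "jsonlines" style: one independent JSON object per line.
jsonlines_feed = "\n".join(json.dumps(item) for item in items)

print(json_feed)
print(jsonlines_feed)
```

Because every line of a JSON lines feed is a complete object, it can be read back one line at a time, which tends to suit large feeds better than a single JSON array.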


Storage

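When using feed exports, you define where the feed is stored with a URI. In Scrapy 2.6 the URI is a key of the FEEDS setting; the older FEED_URI setting is deprecated. Feed exports support several storage backend types, selected by the URI scheme.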

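The storage backends supported out of the box are: local filesystem, FTP, S3 (requires botocore), Google Cloud Storage (requires google-cloud-storage), and standard output.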

Storage URI parameters

The storage URI can also contain parameters, which are replaced when the feed is created.

%(time)s - replaced by a timestamp when the feed is created
%(name)s - replaced by the spider name

# Store on FTP, using one directory per spider
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
# Store on S3, using one directory per spider
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
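
The placeholders follow Python's printf-style formatting, so the substitution can be sketched with the standard library alone. The helper name fill_uri, the spider name "quotes", and the exact timestamp layout are assumptions made for this illustration, not Scrapy's internals.

```python
from datetime import datetime, timezone

# Feed URI template with the two built-in placeholders.
uri_template = (
    "ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json"
)


def fill_uri(template: str, spider_name: str, created: datetime) -> str:
    """Fill the placeholders with printf-style formatting (illustrative helper)."""
    # Assumption: a timestamp without colons, so it is safe in file names.
    timestamp = created.replace(microsecond=0).strftime("%Y-%m-%dT%H-%M-%S")
    return template % {"name": spider_name, "time": timestamp}


created = datetime(2022, 8, 6, 12, 30, 0, tzinfo=timezone.utc)
filled = fill_uri(uri_template, "quotes", created)
print(filled)
```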

Storage backends

Storage type	System restriction	URI scheme	Required library	Example
Local filesystem	Unix	file	-	file:///tmp/export.csv
FTP	-	ftp	-	ftp://user:pass@ftp.example.com/path/to/export.csv
S3	-	s3	botocore	s3://aws_key:aws_secret@mybucket/path/to/export.csv
Google Cloud Storage (GCS)	-	gs	google-cloud-storage	gs://mybucket/path/to/export.csv
Standard output	-	stdout	-	stdout:

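Item filtering

ItemFilter

Decides whether an item should be allowed to be exported to a particular feed.

item_filter example

Assign a custom filter class to the item_filter feed option. An item is exported only when the filter's accepts() method returns True for it.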

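class MyCustomFilter:
    def __init__(self, feed_options):
        # Options of the feed this filter is attached to
        self.feed_options = feed_options

    def accepts(self, item):
        # Export the item only if the field exists and has the expected value
        if "xxx" in item and item["xxx"] == "xxxx":
            return True
        return False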
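
As a more concrete sketch, the filter below accepts only items at or above a minimum price and is attached to a feed through item_filter. The class name MinPriceFilter, the price field, and the min_price key are all hypothetical; min_price in particular is an assumed custom key, not a built-in Scrapy feed option.

```python
class MinPriceFilter:
    """Hypothetical filter: accept only items at or above a minimum price."""

    def __init__(self, feed_options):
        self.feed_options = feed_options
        # "min_price" is an assumed custom key, not a built-in feed option.
        self.min_price = feed_options.get("min_price", 0)

    def accepts(self, item):
        return "price" in item and item["price"] >= self.min_price


# Sketch of how the filter might be attached to a feed in settings.py.
FEEDS = {
    "expensive.json": {
        "format": "json",
        "item_filter": MinPriceFilter,
        "min_price": 10,
    },
}

# Exercise the filter by hand, with plain dicts standing in for items.
item_filter = MinPriceFilter(FEEDS["expensive.json"])
cheap_ok = item_filter.accepts({"name": "Widget", "price": 9.99})
pricey_ok = item_filter.accepts({"name": "Gadget", "price": 19.5})
print(cheap_ok, pricey_ok)
```

Calling accepts() directly like this is a convenient way to unit-test a filter without running a crawl.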

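Post-Processing

Scrapy provides an option to activate plugins that post-process feeds before they are exported to the feed storage.

The feature is enabled through the postprocessing feed option. The built-in plugins are listed below.

Compresses the received data with gzip. Accepted feed options: gzip_compresslevel, gzip_mtime, gzip_filename.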

class scrapy.extensions.postprocessing.GzipPlugin(file: BinaryIO, feed_options: Dict[str, Any])

Compresses the received data with lzma. Accepted feed options: lzma_format, lzma_check, lzma_preset, lzma_filters.

class scrapy.extensions.postprocessing.LZMAPlugin(file: BinaryIO, feed_options: Dict[str, Any])

Compresses the received data with bz2. Accepted feed options: bz2_compresslevel.

class scrapy.extensions.postprocessing.Bz2Plugin(file: BinaryIO, feed_options: Dict[str, Any])
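
A plugin is switched on per feed by listing it under postprocessing, with its parameters given as sibling keys in the same feed dictionary. The fragment below is a minimal sketch of such a settings.py entry; the file name items.jl.gz and the compression level are arbitrary choices for illustration.

```python
# settings.py (sketch): compress a JSON lines feed with the gzip plugin.
FEEDS = {
    "items.jl.gz": {
        "format": "jsonlines",
        # Plugins are applied in the order they are listed.
        "postprocessing": ["scrapy.extensions.postprocessing.GzipPlugin"],
        # Plugin parameter, passed to the plugin through the feed options.
        "gzip_compresslevel": 5,
    },
}
```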

Feed export settings

- FEEDS (mandatory)
- FEED_EXPORT_ENCODING
- FEED_STORE_EMPTY
- FEED_EXPORT_FIELDS
- FEED_EXPORT_INDENT
- FEED_STORAGES
- FEED_STORAGE_FTP_ACTIVE
- FEED_STORAGE_S3_ACL
- FEED_EXPORTERS
- FEED_EXPORT_BATCH_ITEM_COUNT
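
As a rough illustration of how a few of these settings fit together, here is a sketch of a settings.py fragment. The values are arbitrary examples rather than recommendations, and the feed file name is an assumption.

```python
# settings.py (sketch): the values below are illustrative only.
FEED_EXPORT_ENCODING = "utf-8"          # encoding used for the feeds
FEED_EXPORT_FIELDS = ["name", "price"]  # fields to export, in this order
FEED_EXPORT_INDENT = 2                  # spaces of indentation per nesting level
FEED_STORE_EMPTY = False                # do not export feeds with no items
FEED_EXPORT_BATCH_ITEM_COUNT = 100      # start a new output file every 100 items

FEEDS = {
    # With batching enabled, the URI needs a batch placeholder such as
    # %(batch_id)d or %(batch_time)s so that each batch gets its own file.
    "items-%(batch_id)d.json": {"format": "json"},
}
```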

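FEEDS

Default: {} (a dictionary). Each key is a feed URI (or a pathlib.Path object) and each value is a nested dictionary of options for that feed. This setting is required to enable the feed export feature.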

{
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
           'export_empty_fields': True,
        },
    },
    '/home/user/documents/items.xml': {
        'format': 'xml',
        'fields': ['name', 'price'],
        'encoding': 'latin1',
        'indent': 8,
    },
    pathlib.Path('items.csv'): {
        'format': 'csv',
        'fields': ['price', 'name'],
    },
}

Main base settings:

FEED_STORAGES_BASE

A dictionary of the built-in feed storage backends, keyed by URI scheme.

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
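
The role of the dictionary keys can be sketched with the standard library: parse the feed URI, take its scheme, and look up the matching storage class path. The helper storage_path_for is hypothetical and only illustrates the lookup; it is not Scrapy's own code.

```python
from urllib.parse import urlparse

# Copy of the scheme -> storage class path mapping listed above.
FEED_STORAGES_BASE = {
    "": "scrapy.extensions.feedexport.FileFeedStorage",
    "file": "scrapy.extensions.feedexport.FileFeedStorage",
    "stdout": "scrapy.extensions.feedexport.StdoutFeedStorage",
    "s3": "scrapy.extensions.feedexport.S3FeedStorage",
    "ftp": "scrapy.extensions.feedexport.FTPFeedStorage",
}


def storage_path_for(uri: str) -> str:
    """Hypothetical helper: pick the storage class path from the URI scheme."""
    scheme = urlparse(uri).scheme
    return FEED_STORAGES_BASE[scheme]


s3_storage = storage_path_for("s3://mybucket/path/to/export.csv")
# An absolute path has an empty scheme, which maps to the file storage.
plain_path_storage = storage_path_for("/tmp/export.csv")
print(s3_storage)
print(plain_path_storage)
```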

FEED_EXPORTERS_BASE

A dictionary of the built-in feed exporters, keyed by feed format.

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
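
The project-level FEED_EXPORTERS setting is layered on top of this base dictionary, and assigning None to a format disables its built-in exporter. The sketch below imitates that layering with plain dictionaries; the helper effective_exporters and the myproject.exporters.TsvItemExporter path are hypothetical.

```python
# Subset of the base mapping listed above, enough for the illustration.
FEED_EXPORTERS_BASE = {
    "json": "scrapy.exporters.JsonItemExporter",
    "csv": "scrapy.exporters.CsvItemExporter",
    "xml": "scrapy.exporters.XmlItemExporter",
}

# Project-level overrides, as they might appear in settings.py.
FEED_EXPORTERS = {
    "csv": None,  # disable the built-in CSV exporter
    "tsv": "myproject.exporters.TsvItemExporter",  # hypothetical custom exporter
}


def effective_exporters(base: dict, overrides: dict) -> dict:
    """Hypothetical helper: merge both dicts and drop disabled formats."""
    merged = {**base, **overrides}
    return {fmt: path for fmt, path in merged.items() if path is not None}


exporters = effective_exporters(FEED_EXPORTERS_BASE, FEED_EXPORTERS)
print(sorted(exporters))
```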