After several months away, today I'll show you how to scrape reviews of the Douban movie 《魔女2》 (The Witch: Part 2) with a Python crawler. No more small talk.
The film's review data was scraped in June and July. Let's start happily.
Python version: 3.6.4
Required modules:
requests module
json module
re module
os module
pandas module
time module
bs4 (BeautifulSoup) module
Install Python, add it to your PATH environment variable, and pip install the required modules.
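Of these, json, re, os, and time ship with the Python standard library; only the third-party packages need installing, e.g. pip install requests pandas beautifulsoup4 (beautifulsoup4 provides the BeautifulSoup parser used in the code below).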
This article uses the short reviews of 《魔女2》 on Douban as the example and walks through how to scrape them.
1. Fetch the page content
import requests

# URL of the comments page to scrape
douban_url = 'https://movie.douban.com/subject/34832354/comments?start=40&limit=20&status=P&sort=new_score'
# Send the request with requests
get_response = requests.get(douban_url)
# Turn the response into text (the whole page)
get_data = get_response.text
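The bare requests.get above can work, but Douban may reject requests that don't carry a browser-like User-Agent, and the functions later in this article call a helper named Agent_info() for their headers. That helper isn't shown in the article; a guessed minimal version (the header strings are placeholders, not the author's originals) could look like this:

import random


def Agent_info():
    """Return request headers with a browser-like User-Agent (a reconstruction, not the original helper)."""
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    ]
    return {"User-Agent": random.choice(user_agents)}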
2. Analyze the page content and pick out what we want
(Screenshot: the scraped page)
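What the screenshot is meant to show is the page structure the parsing code relies on: every short review sits in a div with class comment-item, with the post time, star rating, review text, and upvote count in child span elements. A quick sketch to confirm that structure yourself (the User-Agent value here is just a placeholder):

import requests
from bs4 import BeautifulSoup

douban_url = 'https://movie.douban.com/subject/34832354/comments?start=40&limit=20&status=P&sort=new_score'
page = requests.get(douban_url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(page.text, "html.parser")

# Each short review lives in a div with class "comment-item"
first = soup.find('div', 'comment-item')
if first is not None:
    print(first.find('span', 'comment-time').attrs['title'])  # when the review was posted
    print(first.find('span', 'short').string)                 # the review text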
3. Use the re module to parse the data
def get_nextUrl(html):
    """Grab the URL of the next page of comments."""
    try:
        # href of the "next page" link
        url = html.find('a', 'next').attrs['href']
        # print(url)
        next_start = re.search(r'[0-9]\d{0,5}', url).group(0)
        print("Now at start=" + str(next_start) + ", please wait a moment\n")
        next_url = "https://movie.douban.com/subject/34832354/comments?percent_type=" \
                   "&start={}&limit=20&status=P&sort=new_score&comments_only=1&ck=Cuyu".format(next_start)
        # print(next_url)
        return next_url
    except Exception:
        print("Reached the last page~")
(Screenshot: run output)
(Screenshot: the scraped data stored as a table)
import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Request one page of comments (the JSON endpoint) and return it as parsed HTML."""
    headers = Agent_info()
    try:
        r = requests.get(url=url, headers=headers, timeout=30)
        r.encoding = r.apparent_encoding
        status = r.status_code  # status code of the request
        # With comments_only=1 the response is JSON; the markup lives under the "html" key
        datas = json.loads(r.text)["html"]
        str_html = "{}".format(datas)
        html = BeautifulSoup(str_html, "html.parser")
        print("Crawler status code: " + str(status))
        # print(type(html))
        return html
    except Exception as e:
        print("Failed to fetch the data!")
        print(e)


def etl_data(html):
    """Pull the fields we want out of the parsed page."""
    comments = html.find_all('div', 'comment-item')
    # print(comments[0])
    datas = []
    for span in comments:
        # Time the short review was posted
        times = span.find('span', 'comment-time').attrs['title']
        # Username
        name = span.find('a').attrs["title"]
        # Star rating given by the user
        try:
            level = span.find('span', 'rating').attrs['class'][0][-2:]
            if level == '10':
                level = "1 star"
            elif level == '20':
                level = "2 stars"
            elif level == '30':
                level = "3 stars"
            elif level == '40':
                level = "4 stars"
            elif level == '50':
                level = "5 stars"
        except Exception:
            level = "No rating"
        content = span.find('span', 'short').string.strip()
        content = re.sub(r'\n', '', content)  # strip newlines inside the review text
        love_point = span.find('span', 'vote-count').string.strip()
        arr = [times, name, level, content, love_point]
        datas.append(arr)
    df = pd.DataFrame(datas)
    df.columns = ["Time", "User", "Stars", "Comment", "Upvotes"]
    # print(arr)
    return df

The code was tested on 2022-08-06 with no problems. If this post gets more than 100 likes, I'll follow up with a crawler for the Douban Reading Top 250.
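The functions above do the fetching and parsing, but the article doesn't show the driver code that chains them together or how the table in the screenshot was written out. A minimal sketch, assuming the helpers defined above and an output file name of my own choosing (comments_witch2.csv):

import time

import pandas as pd


def crawl(first_url, max_pages=20):
    """Page through the comments, collecting every page into one DataFrame."""
    frames = []
    url = first_url
    for _ in range(max_pages):
        html = get_html(url)
        if html is None:          # request failed
            break
        frames.append(etl_data(html))
        url = get_nextUrl(html)
        if url is None:           # no "next page" link: we reached the end
            break
        time.sleep(2)             # be polite to Douban's servers
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


start_url = ("https://movie.douban.com/subject/34832354/comments?percent_type="
             "&start=0&limit=20&status=P&sort=new_score&comments_only=1&ck=Cuyu")
df = crawl(start_url)
df.to_csv("comments_witch2.csv", index=False, encoding="utf-8-sig")

Writing the CSV with utf-8-sig keeps the Chinese text readable if the file is opened directly in Excel.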
(Screenshot: word cloud of the short reviews)
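The article only shows the finished word cloud, not the code that produced it. A minimal sketch, assuming the wordcloud and jieba packages (neither is in the module list above) and a locally available Chinese font file:

import jieba
from wordcloud import WordCloud

# df is the DataFrame from the crawl sketch above
text = " ".join(jieba.cut(" ".join(df["Comment"])))
wc = WordCloud(font_path="simhei.ttf",        # path to any font that covers Chinese (assumption)
               width=800, height=600,
               background_color="white").generate(text)
wc.to_file("wordcloud_witch2.png")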