Building a Paper Downloader in Python

Scripts column 2024/11/16 Anonymous

极乐门资源网 Design By www.ioogu.com

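In the course of research and study we inevitably need to look up the relevant literature, and many readers probably already know SCI-HUB, a powerful tool that helps us search for papers and download their full text. SCI-HUB has benefited a great many researchers, and it is a pleasure to use.

However, when a senior labmate asked me, "xx, could you help me download a few papers?" [...]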

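I. Code Analysis

The analysis follows the same routine as always: capture and inspect the traffic -> simulate the request -> integrate the code. Since kimol has to get back to work shortly, I won't go into detail today.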
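To make the "simulate the request" step a little more concrete: the search box is submitted as a URL-encoded form, and the script only has to reproduce that body. The sketch below uses only the standard library to show what such a body looks like. The two field names are the ones used by the search code in this article; `build_search_body` and the sample query are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

def build_search_body(query):
    """Encode the search form as a browser would (hypothetical helper)."""
    fields = {'sci-hub-plugin-check': '', 'request': query}
    return urlencode(fields).encode('ascii')

# Made-up, DOI-like query used only to show the encoding
body = build_search_body('10.1000/example doi')
print(body)

# Decoding the body again recovers the original fields
print(parse_qs(body.decode('ascii'), keep_blank_values=True))
```

The length of this body changes with the query, which is why a fixed `Content-Length` value is best avoided; `requests` fills in the correct one when it is given the form as `data=`.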

1. Searching for a paper

Search for the paper by its URL, PMID, DOI, or title, then use the bs4 library to find the link to the PDF. The code is as follows:

import requests
from bs4 import BeautifulSoup

def search_article(artName):
    '''
    Search for a paper
    ---------------
    Input: paper title (a URL, PMID, or DOI also works)
    ---------------
    Output: the PDF link, or "" if nothing was found
    '''
    url = 'https://www.sci-hub.ren/'
    # Content-Length is left out on purpose: requests computes it from the form body.
    # 'br' is left out of Accept-Encoding: requests decodes Brotli only when an
    # optional Brotli package is installed.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate',
               'Content-Type': 'application/x-www-form-urlencoded',
               'Origin': 'https://www.sci-hub.ren',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    data = {'sci-hub-plugin-check': '',
            'request': artName}
    # A timeout keeps the script from hanging forever on a dead connection
    res = requests.post(url, headers=headers, data=data, timeout=30)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    iframe = soup.find(id='pdf')
    if iframe is None:  # no matching paper found
        return ''
    else:
        downUrl = iframe['src']
        if not downUrl.startswith('http'):  # protocol-relative link such as //host/path
            downUrl = 'https:' + downUrl
        return downUrl
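The last few lines of the function assume that a `src` without a scheme is protocol-relative (`//host/path`). A slightly more general way to build the absolute link is `urllib.parse.urljoin`, which also copes with site-relative paths. The following is only a sketch: `normalize_pdf_url` is a hypothetical helper, and the sample hosts and paths are invented.

```python
from urllib.parse import urljoin, urldefrag

BASE_URL = 'https://www.sci-hub.ren/'  # the search page used in this article

def normalize_pdf_url(src, base=BASE_URL):
    """Turn an iframe src into an absolute URL (hypothetical helper)."""
    # urljoin resolves absolute, protocol-relative (//host/...) and
    # site-relative (/path/...) values against the base URL
    absolute = urljoin(base, src)
    # Drop any fragment such as '#view=FitH'; it is never sent to the server
    return urldefrag(absolute).url

# Invented examples covering the three cases
print(normalize_pdf_url('//example.org/papers/a.pdf#view=FitH'))
print(normalize_pdf_url('/downloads/a.pdf'))
print(normalize_pdf_url('https://example.org/a.pdf'))
```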

2. Downloading the paper

Once we have the link to the paper, a single request sent with requests is enough to download it:

import requests

def download_article(downUrl):
    '''
    Download the paper from its link
    ----------------------
    Input: paper link
    ----------------------
    Output: the PDF file as bytes
    '''
    # 'br' is left out of Accept-Encoding: requests decodes Brotli only when an
    # optional Brotli package is installed.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    # A timeout keeps the script from hanging forever on a dead connection
    res = requests.get(downUrl, headers=headers, timeout=60)
    return res.content
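A download URL does not always answer with a PDF; a server may return an HTML error page instead, and the function above would hand those bytes back unchanged. One inexpensive safeguard is to check for the PDF signature before saving. The sketch below assumes the file starts directly with `%PDF-`; `looks_like_pdf` and the sample byte strings are illustrative only.

```python
def looks_like_pdf(data):
    """Return True if the bytes begin with the PDF signature '%PDF-'."""
    return data.startswith(b'%PDF-')

# Invented sample payloads: a PDF header and an HTML page
sample_pdf = b'%PDF-1.4\n% rest of the file'
sample_html = b'<html><body>Not found</body></html>'

print(looks_like_pdf(sample_pdf))   # True
print(looks_like_pdf(sample_html))  # False
```

In the main loop, such a check could sit between the download and the `open(...)` call, so that a failed download is reported instead of being saved with a `.pdf` extension.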

II. Complete Code

After putting the two functions above together, my complete code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 5 16:32:22 2021
@author: kimol_love
"""
import os
import time
import requests
from bs4 import BeautifulSoup
 
def search_article(artName):
    '''
    Search for a paper
    ---------------
    Input: paper title (a URL, PMID, or DOI also works)
    ---------------
    Output: the PDF link, or "" if nothing was found
    '''
    url = 'https://www.sci-hub.ren/'
    # Content-Length is left out on purpose: requests computes it from the form body.
    # 'br' is left out of Accept-Encoding: requests decodes Brotli only when an
    # optional Brotli package is installed.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate',
               'Content-Type': 'application/x-www-form-urlencoded',
               'Origin': 'https://www.sci-hub.ren',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    data = {'sci-hub-plugin-check': '',
            'request': artName}
    # A timeout keeps the script from hanging forever on a dead connection
    res = requests.post(url, headers=headers, data=data, timeout=30)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    iframe = soup.find(id='pdf')
    if iframe is None:  # no matching paper found
        return ''
    else:
        downUrl = iframe['src']
        if not downUrl.startswith('http'):  # protocol-relative link such as //host/path
            downUrl = 'https:' + downUrl
        return downUrl
  
def download_article(downUrl):
    '''
    Download the paper from its link
    ----------------------
    Input: paper link
    ----------------------
    Output: the PDF file as bytes
    '''
    # 'br' is left out of Accept-Encoding: requests decodes Brotli only when an
    # optional Brotli package is installed.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    # A timeout keeps the script from hanging forever on a dead connection
    res = requests.get(downUrl, headers=headers, timeout=60)
    return res.content
 
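def welcome():
    '''
    Welcome screen
    '''
    # 'cls' exists only on Windows; other systems use 'clear'
    os.system('cls' if os.name == 'nt' else 'clear')
    # Raw string, so the backslashes in the banner are not treated as escapes
    title = r'''
       _____  _____ _____      _    _ _    _ ____
      / ____|/ ____|_   _|    | |  | | |  | |  _ \
     | (___ | |      | |______| |__| | |  | | |_) |
      \___ \| |      | |______|  __  | |  | |  _ <
      ____) | |____ _| |_     | |  | | |__| | |_) |
     |_____/ \_____|_____|    |_|  |_|\____/|____/

    '''
    print(title)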
 
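if __name__ == '__main__':
    while True:
        welcome()
        request = input('Enter a URL, PMID, DOI, or paper title: ')
        print('Searching...')
        downUrl = search_article(request)
        if downUrl == '':
            print('No matching paper found, please search again!')
        else:
            print('Paper link: %s' % downUrl)
            print('Downloading...')
            pdf = download_article(downUrl)
            # URLs and DOIs contain characters such as '/' that are not
            # allowed in file names, so replace them before saving
            fileName = ''.join('_' if c in '\\/:*?"<>|' else c for c in request)
            with open('%s.pdf' % fileName, 'wb') as f:
                f.write(pdf)
            print('---Download complete---')
        time.sleep(0.8)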
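The loop above handles one query at a time. When a whole list of papers is requested, the same two functions can be driven from a small batch helper. The following is a sketch rather than part of the original script: `safe_filename`, `download_many`, and their parameters are hypothetical names, and the search and download functions are passed in as arguments so the helper can be tried without touching the network.

```python
import os
import time

def safe_filename(name, max_len=100):
    """Replace characters that are not allowed in file names (hypothetical helper)."""
    cleaned = ''.join('_' if c in '\\/:*?"<>|' else c for c in name).strip()
    return (cleaned or 'paper')[:max_len] + '.pdf'

def download_many(queries, search, download, out_dir='.', delay=1.0):
    """Download each query in turn; return {query: saved path, or None if not found}."""
    results = {}
    for query in queries:
        url = search(query)
        if not url:  # search returned "", meaning nothing was found
            results[query] = None
            continue
        path = os.path.join(out_dir, safe_filename(query))
        with open(path, 'wb') as f:
            f.write(download(url))
        results[query] = path
        time.sleep(delay)  # pause between downloads to avoid hammering the server
    return results

# Possible usage with the functions from this article:
# download_many(['title one', 'title two'], search_article, download_article)
```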

Sure enough, once I ran the code, I easily finished the task my senior labmate had given me. Sweet, isn't it?

Tags:

Python paper downloader, Python, downloader
