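Today, while revising my paper, my advisor gave me a task: package the PDFs of every paper I cite, named in a set format, and send them to her. But I had always saved items to Zotero online, so some of them have no PDF at all. What now? And even where a PDF exists, the file names follow all sorts of conventions. Am I supposed to rename them one by one?
This paper cites 96 references, so fixing them by hand one at a time is out of the question; that would be far too dumb. A programmer never does the same thing more than three times, so a script it is.
Fortunately, Zotero can export to CSV, and the export records where each PDF is stored.
Open the exported CSV and you will see that the “File Attachments” column (left figure) holds the path of each saved PDF. All we need to do is follow that path to fetch every PDF, and then build the reference list from the publication year, authors and title.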
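Before copying anything, it helps to know how many entries have no PDF at all. Here is a minimal sketch, assuming only the “File Attachments” column name from the export; the helper name `split_by_attachment` and the two sample rows are made up for illustration.

```python
import csv
import io


def split_by_attachment(rows, column="File Attachments"):
    """Partition exported rows into (with_pdf, without_pdf) lists."""
    with_pdf, without_pdf = [], []
    for row in rows:
        field = row.get(column) or ""
        # A cell may list several attachments, so look for ".pdf" anywhere in it.
        if ".pdf" in field.lower():
            with_pdf.append(row)
        else:
            without_pdf.append(row)
    return with_pdf, without_pdf


# Tiny in-memory stand-in for the exported CSV (made-up rows).
sample = io.StringIO(
    "Title,File Attachments\n"
    "Paper A,C:\\Zotero\\storage\\AAAA1111\\a.pdf\n"
    "Paper B,\n"
)
have, missing = split_by_attachment(csv.DictReader(sample))
print(len(have), "with PDF;", len(missing), "without:", [r["Title"] for r in missing])
```

With the real export you would pass `csv.DictReader(open('doc.csv', newline='', encoding='utf-8-sig'))` instead of the in-memory sample.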
That gave me the first version of the code.
The idea is simply to take each path in File Attachments and copy that file into a folder.
import csv
import shutil
copySuccess = 0
copyFail = 0
mycsvfile = 'doc.csv'
mypdfdic = r'.\pdf'  # output folder; it must already exist
with open(mycsvfile, newline='', encoding='utf-8-sig') as csvfile:
    cr = csv.DictReader(csvfile)
    for row in cr:
        print('Copying...{}'.format(row["File Attachments"]))
        try:
            # treat the whole cell as a single file path
            shutil.copy(row["File Attachments"], mypdfdic)
            copySuccess = copySuccess + 1
        except Exception:
            copyFail = copyFail + 1
print('Done.{}Succeed,{}Failed.'.format(copySuccess, copyFail))
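However, most of the files failed to copy. Debugging showed that some File Attachments cells contain several attachments: besides the PDF there is also a web page file. For example: C:\Users\xxr00\Zotero\storage\S9GT3DMY\Agostino 等。 - 2008 - Voluntary, spontaneous, and reflex blinking in Par.pdf; C:\Users\xxr00\Zotero\storage\8WUZSCZA\mds.html
So we need to exclude the web pages. That is easy: split the cell at each semicolon and keep only the pieces whose last three letters are “pdf”. I also found that a cell ending in a stray semicolon cannot be copied either, since the copy needs an exact path.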
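The split-and-filter idea can be sketched on its own as follows. This is only an illustration: `pdf_attachments` is a hypothetical helper, and the file name `paper.pdf` is a shortened stand-in for the real one. It assumes no file name itself contains a semicolon.

```python
def pdf_attachments(field):
    """Return the paths in a 'File Attachments' cell that end in .pdf."""
    # Strip each piece so a leading space or a trailing ';' does no harm.
    parts = (p.strip() for p in field.split(";"))
    return [p for p in parts if p.lower().endswith(".pdf")]


cell = (r"C:\Users\xxr00\Zotero\storage\S9GT3DMY\paper.pdf; "
        r"C:\Users\xxr00\Zotero\storage\8WUZSCZA\mds.html;")
print(pdf_attachments(cell))
```

Using `str.endswith` on the lower-cased piece also catches files saved as `.PDF`, which a plain comparison of the last three letters would miss.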
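But I dropped this idea, because I suspected it would run into other pitfalls. A simpler way is to use the location given in File Attachments to get the storage key (something like S9GT3DMY) and then read every PDF in the corresponding folder. Who knows how nasty string handling can get once it turns nasty, so the less I touch it the better.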
import csv
import shutil
import os
copySuccess = 0
copyFail = 0
mycsvfile = 'doc.csv'
mypdfdic = r'.\pdf'
zoterodic = r'C:\Users\xxr00\Zotero\storage'
with open(mycsvfile, newline='', encoding='utf-8-sig') as csvfile:
    cr = csv.DictReader(csvfile)
    for row in cr:
        print('Copying...{}'.format(row["File Attachments"]))
        docname = row["File Attachments"]
        try:
            tmp = docname.split("\\")
            keyword = tmp[5]  # storage key such as S9GT3DMY
            rowpdf_dic = zoterodic + '\\' + keyword
            files = os.listdir(rowpdf_dic)
            pdfdir = ''
            for tmpstr in files:
                strlen = len(tmpstr)
                pdf_yorn = tmpstr[strlen-3:strlen]  # last three characters
                if pdf_yorn == 'pdf':
                    pdfdir = rowpdf_dic + '\\' + tmpstr
                    shutil.copy(pdfdir, mypdfdic)
                    copySuccess = copySuccess + 1
        except Exception:
            copyFail = copyFail + 1
print('Done.{}Succeed,{}Failed.'.format(copySuccess, copyFail))
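There are quite a few messy variables in there, but all it really does is list the files under the corresponding path, check whether each one is a PDF, and copy it if so. That's all. Of course, you have to make sure the folder contains only one PDF file, otherwise this won't do. This version still keeps each PDF's original name, though, which is not what we want, so the next step is to rename the PDFs.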
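The script above takes the storage key from a fixed position in the path (`tmp[5]`), which only works for a storage folder at exactly that depth. A hedged alternative, not what my script does, is to let `pathlib` find the parent folder. The helper names `storage_key` and `pdfs_in` are hypothetical, and the demo folder is a temporary one created just for the example.

```python
import tempfile
from pathlib import Path, PureWindowsPath


def storage_key(attachment_path):
    """Name of the folder that directly contains the attachment."""
    # PureWindowsPath parses backslash paths on any operating system.
    return PureWindowsPath(attachment_path.strip()).parent.name


def pdfs_in(folder):
    """All PDF files in a folder, sorted, matched case-insensitively."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


key = storage_key(r"C:\Users\xxr00\Zotero\storage\S9GT3DMY\some paper.pdf")
print(key)

# Demo on a throwaway folder: two PDFs and one web snapshot.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.PDF", "snapshot.html"):
        (Path(tmp) / name).touch()
    found = [p.name for p in pdfs_in(tmp)]
print(found)
```

Returning a list rather than a single path makes the "only one PDF per folder" assumption explicit: the caller can check `len(...)` and decide what to do when it is not 1.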
import csv
import shutil
import os
copySuccess = 0
copyFail = 0
mycsvfile = 'doc.csv'
mypdfdic = r'.\pdf'
zoterodic = r'C:\Users\xxr00\Zotero\storage'
with open(mycsvfile, newline='', encoding='utf-8-sig') as csvfile:
    cr = csv.DictReader(csvfile)
    os.chdir(mypdfdic)  # switch the working directory to the output folder
    mypdfdic = os.getcwd()
    for row in cr:
        print('Copying...{}'.format(row["File Attachments"]))
        docname = row["File Attachments"]
        try:
            tmp = docname.split("\\")
            keyword = tmp[5]  # storage key such as S9GT3DMY
            rowpdf_dic = zoterodic + '\\' + keyword
            files = os.listdir(rowpdf_dic)
            pdfdir = ''
            for tmpstr in files:
                strlen = len(tmpstr)
                pdf_yorn = tmpstr[strlen-3:strlen]
                if pdf_yorn == 'pdf':
                    pdfdir = rowpdf_dic + '\\' + tmpstr
                    oldname = tmpstr
                    shutil.copy(pdfdir, mypdfdic)
            # Rename the copied file.
            # Authors are the tricky part: one author gives just the surname,
            # two are joined with "and", three or more become "et.al".
            authorname = row["Author"]
            tmp = authorname.split(";")
            for i in range(len(tmp)):
                au = tmp[i]
                tmp[i] = au[0:au.rfind(',', 1)]  # keep the text before the last comma
            if len(tmp) == 1:
                author_final = tmp[0]
            elif len(tmp) == 2:
                author_final = tmp[0] + ' and ' + tmp[1]
            else:
                author_final = tmp[0] + ' et.al,'
            year_fianl = row["Publication Year"]
            title_final = row['Title']
            publication_final = row['Publication Title']
            newname = '(' + author_final + year_fianl + ') ' + title_final + ', ' + publication_final + '.pdf'
            os.rename(oldname, newname)
            copySuccess = copySuccess + 1
        except Exception:
            copyFail = copyFail + 1
print('Done.{}Succeed,{}Failed.'.format(copySuccess, copyFail))
This version can finally export the PDFs under their new names as well.
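The author rule (one author: surname only; two: joined with "and"; three or more: first surname plus an et al. marker) can be pulled out into a small function. This is a sketch under assumptions: it expects the `Last, First; Last, First` layout of the Author column, it writes the conventional "et al." rather than the `et.al` string my script uses, and the `"Unknown"` fallback and the sample names are made up.

```python
def author_label(author_field):
    """Build a short author label from a 'Last, First; Last, First' string."""
    # Take the text before the first comma of each author and trim spaces.
    surnames = [a.split(",")[0].strip() for a in author_field.split(";") if a.strip()]
    if not surnames:
        return "Unknown"  # hypothetical fallback for an empty Author cell
    if len(surnames) == 1:
        return surnames[0]
    if len(surnames) == 2:
        return surnames[0] + " and " + surnames[1]
    return surnames[0] + " et al."


print(author_label("Smith, J.; Lee, K."))
print(author_label("Smith, J.; Lee, K.; Park, M."))
```

Unlike slicing at `rfind(',')`, splitting on the comma leaves a name without any comma (an institution, say) intact instead of cutting off its last character.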
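However, I found that some of the papers were not renamed... Looking into it, the cause seems to be that title plus publication name is too long. Nothing for it: drop the publication name, and whenever the string is longer than 200 characters, replace everything past the first 200 with ...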
import csv
import shutil
import os
copySuccess = 0
copyFail = 0
mycsvfile = 'doc.csv'
mypdfdic = r'.\pdf'
zoterodic = r'C:\Users\xxr00\Zotero\storage'
with open(mycsvfile, newline='', encoding='utf-8-sig') as csvfile:
    cr = csv.DictReader(csvfile)
    os.chdir(mypdfdic)  # switch the working directory to the output folder
    mypdfdic = os.getcwd()
    for row in cr:
        print('Copying...{}'.format(row["File Attachments"]))
        docname = row["File Attachments"]
        try:
            tmp = docname.split("\\")
            keyword = tmp[5]  # storage key such as S9GT3DMY
            rowpdf_dic = zoterodic + '\\' + keyword
            files = os.listdir(rowpdf_dic)
            pdfdir = ''
            for tmpstr in files:
                strlen = len(tmpstr)
                pdf_yorn = tmpstr[strlen-3:strlen]
                if pdf_yorn == 'pdf':
                    pdfdir = rowpdf_dic + '\\' + tmpstr
                    oldname = tmpstr
                    shutil.copy(pdfdir, mypdfdic)
            # Rename the copied file.
            # Authors are the tricky part: one author gives just the surname,
            # two are joined with "and", three or more become "et.al".
            authorname = row["Author"]
            tmp = authorname.split(";")
            for i in range(len(tmp)):
                au = tmp[i]
                tmp[i] = au[0:au.rfind(',', 1)]  # keep the text before the last comma
            if len(tmp) == 1:
                author_final = tmp[0]
            elif len(tmp) == 2:
                author_final = tmp[0] + ' and ' + tmp[1]
            else:
                author_final = tmp[0] + ' et.al,'
            year_fianl = row["Publication Year"]
            title_final = row['Title']
            finalname = '(' + author_final + year_fianl + ') ' + title_final + ', '
            # cap the name at 200 characters and mark the cut with ...
            if len(finalname) > 200:
                finalname = finalname[0:200] + '...'
            newname = finalname + '.pdf'
            os.rename(oldname, newname)
            copySuccess = copySuccess + 1
        except Exception:
            copyFail = copyFail + 1
print('Done.{}Succeed,{}Failed.'.format(copySuccess, copyFail))
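The 200-character rule can also be written as a standalone helper, which makes it easy to check on its own. A minimal sketch; the function name `shorten` and its parameters are hypothetical.

```python
def shorten(stem, limit=200, marker="..."):
    """Cut a file-name stem to at most `limit` characters, marking the cut."""
    if len(stem) <= limit:
        return stem
    return stem[:limit] + marker


long_title = "A" * 250
print(len(shorten(long_title)))  # 200 kept characters plus the 3-character marker
print(shorten("short title"))
```

The limit of 200 is the value I picked by trial. What Windows traditionally restricts is the full path (260 characters by default), so a deeply nested output folder may need a smaller limit.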
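Then I found that some files still had not been renamed. Honestly, string handling is a job not even a dog would take, and I stand by that. Back to debugging.
It turned out the new name contained a colon... which clashes with the drive-letter syntax. Nothing for it: to keep the hassle to a minimum, I replace the English colon with a Chinese full-width colon.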
import csv
import shutil
import os
copySuccess = 0
copyFail = 0
mycsvfile = 'doc.csv'
mypdfdic = r'.\pdf'
zoterodic = r'C:\Users\xxr00\Zotero\storage'
with open(mycsvfile, newline='', encoding='utf-8-sig') as csvfile:
    cr = csv.DictReader(csvfile)
    os.chdir(mypdfdic)  # switch the working directory to the output folder
    mypdfdic = os.getcwd()
    for row in cr:
        print('Copying...{}'.format(row["File Attachments"]))
        docname = row["File Attachments"]
        try:
            tmp = docname.split("\\")
            keyword = tmp[5]  # storage key such as S9GT3DMY
            rowpdf_dic = zoterodic + '\\' + keyword
            files = os.listdir(rowpdf_dic)
            pdfdir = ''
            for tmpstr in files:
                strlen = len(tmpstr)
                pdf_yorn = tmpstr[strlen-3:strlen]
                if pdf_yorn == 'pdf':
                    pdfdir = rowpdf_dic + '\\' + tmpstr
                    oldname = tmpstr
                    shutil.copy(pdfdir, mypdfdic)
            # Rename the copied file.
            # Authors are the tricky part: one author gives just the surname,
            # two are joined with "and", three or more become "et.al".
            authorname = row["Author"]
            tmp = authorname.split(";")
            for i in range(len(tmp)):
                au = tmp[i]
                tmp[i] = au[0:au.rfind(',', 1)]  # keep the text before the last comma
            if len(tmp) == 1:
                author_final = tmp[0]
            elif len(tmp) == 2:
                author_final = tmp[0] + ' and ' + tmp[1]
            else:
                author_final = tmp[0] + ' et.al,'
            year_fianl = row["Publication Year"]
            title_final = row['Title']
            finalname = '(' + author_final + year_fianl + ') ' + title_final + ', '
            # cap the name at 200 characters and mark the cut with ...
            if len(finalname) > 200:
                finalname = finalname[0:200] + '...'
            newname = finalname + '.pdf'
            # a colon is not allowed in a Windows file name; use the full-width one
            newname = newname.replace(':', ':')
            os.rename(oldname, newname)
            copySuccess = copySuccess + 1
        except Exception:
            copyFail = copyFail + 1
print('Done.{}Succeed,{}Failed.'.format(copySuccess, copyFail))
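The colon is only one of the characters Windows refuses in a file name; the others are `< > " / \ | ? *`. Rather than patching them one at a time as they turn up, a title can be cleaned in a single pass. A minimal sketch, assuming it is acceptable to swap each one for its full-width look-alike; `safe_filename` is a hypothetical helper and not part of my script.

```python
# Characters that Windows does not allow in a file name.
FORBIDDEN = '<>:"/\\|?*'

# Each ASCII character has a full-width twin exactly 0xFEE0 code points higher,
# e.g. ':' (U+003A) becomes the full-width colon (U+FF1A).
_TABLE = str.maketrans({c: chr(ord(c) + 0xFEE0) for c in FORBIDDEN})


def safe_filename(name):
    """Replace every forbidden character with its full-width look-alike."""
    return name.translate(_TABLE)


print(safe_filename('Testing for nonlinearity: the "method"?'))
```

Note that this maps the double quote to the full-width quotation mark (U+FF02), not to a curly quote; any character that is legal in file names would do just as well.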
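Of course, if Zotero never saved a PDF in the first place, there is nothing the script can do, and that part probably does have to be done by hand. I started filling those in first. In fact, you can right-click an item to see whether it has a PDF. You can also select all the items, right-click, and choose "Find Available PDF", then see how many Zotero manages to find for you, haha.
As it turned out, whether it finds anything is pure luck; most of them you have to track down yourself.
After filling in the missing papers, I found that running the code still hit one problem. The likely cause is that Zotero stored the attachments in more than one location, which produces entries like this: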
C:\Users\xxr00\Zotero\storage\ZFTV27TR\Theiler 等。 - 1992 - Testing for nonlinearity in time series the metho.pdf; C:\Users\xxr00\Zotero\storage\UQ2GUHS2\016727899290102S.html
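That is, the PDF and the web page are stored in different folders; in the example above, in ZFTV27TR and UQ2GUHS2.
By this point I was thoroughly fed up with string handling. Fine, just patch it. In short: split on semicolons and take whichever piece is a PDF.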
import csv
import shutil
import os
copySuccess = 0
copyFail = 0
mycsvfile = 'doc.csv'
mypdfdic = r'.\pdf'
zoterodic = r'C:\Users\xxr00\Zotero\storage'
with open(mycsvfile, newline='', encoding='utf-8-sig') as csvfile:
    cr = csv.DictReader(csvfile)
    os.chdir(mypdfdic)  # switch the working directory to the output folder
    mypdfdic = os.getcwd()
    for row in cr:
        print('Copying...{}'.format(row["File Attachments"]))
        docname = row["File Attachments"]
        try:
            # the cell may list several attachments; keep the first one ending in pdf
            tmp = docname.split(";")
            for i in range(len(tmp)):
                strlen = len(tmp[i])
                if tmp[i][strlen-3:strlen] == 'pdf':
                    tmp = tmp[i]
                    break
            tmp = tmp.split("\\")
            keyword = tmp[5]  # storage key such as ZFTV27TR
            rowpdf_dic = zoterodic + '\\' + keyword
            files = os.listdir(rowpdf_dic)
            pdfdir = ''
            for tmpstr in files:
                strlen = len(tmpstr)
                pdf_yorn = tmpstr[strlen-3:strlen]
                if pdf_yorn == 'pdf':
                    pdfdir = rowpdf_dic + '\\' + tmpstr
                    oldname = tmpstr
                    shutil.copy(pdfdir, mypdfdic)
            # Rename the copied file.
            # Authors are the tricky part: one author gives just the surname,
            # two are joined with "and", three or more become "et.al".
            authorname = row["Author"]
            tmp = authorname.split(";")
            for i in range(len(tmp)):
                au = tmp[i]
                tmp[i] = au[0:au.rfind(',', 1)]  # keep the text before the last comma
            if len(tmp) == 1:
                author_final = tmp[0]
            elif len(tmp) == 2:
                author_final = tmp[0] + ' and ' + tmp[1]
            else:
                author_final = tmp[0] + ' et.al'
            year_fianl = row["Publication Year"]
            title_final = row['Title']
            finalname = '(' + author_final + ',' + year_fianl + ') ' + title_final + ', '
            # cap the name at 200 characters and mark the cut with ...
            if len(finalname) > 200:
                finalname = finalname[0:200] + '...'
            newname = finalname + '.pdf'
            # colons and double quotes are not allowed in Windows file names
            newname = newname.replace(':', ':')
            newname = newname.replace('"', '“')
            os.rename(oldname, newname)
            copySuccess = copySuccess + 1
        except Exception:
            copyFail = copyFail + 1
print('Done.{}Succeed,{}Failed.'.format(copySuccess, copyFail))
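All test cases pass. Funnily enough, it felt just like writing one of those big simulation problems from my undergraduate ACM days. Still, it's finally done, and from now on I can just reuse it.
Finally, a little polish based on the results so that the generated names look nicer, mainly adjusting the spaces.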
import csv
import shutil
import os
copySuccess = 0
copyFail = 0
# The only settings you need to change
mycsvfile = 'doc.csv'  # path of the CSV exported from Zotero
mypdfdic = r'.\pdf'  # folder where the PDFs will be saved
zoterodic = r'C:\Users\xxr00\Zotero\storage'  # folder where Zotero keeps the original files
with open(mycsvfile, newline='', encoding='utf-8-sig') as csvfile:
    cr = csv.DictReader(csvfile)
    os.chdir(mypdfdic)  # switch the working directory to the output folder
    mypdfdic = os.getcwd()
    for row in cr:
        print('Copying...{}'.format(row["File Attachments"]))
        docname = row["File Attachments"]
        try:
            # the cell may list several attachments; keep the first one ending in pdf
            tmp = docname.split(";")
            for i in range(len(tmp)):
                strlen = len(tmp[i])
                if tmp[i][strlen-3:strlen] == 'pdf':
                    tmp = tmp[i]
                    break
            tmp = tmp.split("\\")
            keyword = tmp[5]  # storage key such as ZFTV27TR
            rowpdf_dic = zoterodic + '\\' + keyword
            files = os.listdir(rowpdf_dic)
            pdfdir = ''
            for tmpstr in files:
                strlen = len(tmpstr)
                pdf_yorn = tmpstr[strlen-3:strlen]
                if pdf_yorn == 'pdf':
                    pdfdir = rowpdf_dic + '\\' + tmpstr
                    oldname = tmpstr
                    shutil.copy(pdfdir, mypdfdic)
            # Rename the copied file.
            # Authors are the tricky part: one author gives just the surname,
            # two are joined with "and", three or more become "et.al".
            authorname = row["Author"]
            tmp = authorname.split(";")
            for i in range(len(tmp)):
                au = tmp[i]
                tmp[i] = au[0:au.rfind(',', 1)]  # keep the text before the last comma
            if len(tmp) == 1:
                author_final = tmp[0]
            elif len(tmp) == 2:
                author_final = tmp[0] + ' and ' + tmp[1]
            else:
                author_final = tmp[0] + ' et.al'
            year_fianl = row["Publication Year"]
            title_final = row['Title']
            finalname = '(' + author_final + ', ' + year_fianl + ') ' + title_final
            # cap the name at 200 characters and mark the cut with ...
            if len(finalname) > 200:
                finalname = finalname[0:200] + '...'
            newname = finalname + '.pdf'
            # colons and double quotes are not allowed in Windows file names
            newname = newname.replace(':', ':')
            newname = newname.replace('"', '“')
            os.rename(oldname, newname)
            copySuccess = copySuccess + 1
        except Exception:
            copyFail = copyFail + 1
print('Done.{}Succeed,{}Failed.'.format(copySuccess, copyFail))
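One design note for anyone reusing this: the script copies each file under its old name and then renames it, which is why it needs `os.chdir`. A hedged alternative is to copy straight to the new name, since `shutil.copy` accepts a full target path. The sketch below is not what my script does; `copy_as` and the " (1)" suffix for name clashes are made up, and the demo runs in a temporary folder.

```python
import shutil
import tempfile
from pathlib import Path


def copy_as(src, dest_dir, new_name):
    """Copy `src` into `dest_dir` under `new_name`, never overwriting a file."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    stem, suffix = Path(new_name).stem, Path(new_name).suffix
    target = dest_dir / new_name
    counter = 1
    while target.exists():
        # Name already taken: append " (1)", " (2)", ... before the extension.
        target = dest_dir / f"{stem} ({counter}){suffix}"
        counter += 1
    shutil.copy(src, target)
    return target


# Demo: copy the same source twice under the same requested name.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "original.pdf"
    src.write_bytes(b"%PDF-1.4 demo")
    out = Path(tmp) / "out"
    names = [copy_as(src, out, "renamed.pdf").name for _ in range(2)]
print(names)
```

Because nothing depends on the current working directory, the CSV path and the output folder can both stay relative to wherever the script was started.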