Batch-renaming PDF files by their document number or text content

May 30, 2023

The problem

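At work you sometimes have to rename a whole batch of PDF or Word documents to a standard naming scheme, and doing that by hand is expensive. So I wrote a small tool on top of an OCR library that batch-renames files according to the number printed at a fixed position in each PDF.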

The solution

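The algorithm has four steps: (1) convert the PDF to an image and save it; (2) deskew the image (some pages were photographed at an angle); (3) run OCR on the region of the image where the number is printed; (4) rename the file. Recognition accuracy is high, but a small number of files are still misread and have to be corrected by hand.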
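
The deskew step (2) in the script below turns the first Hough line into a rotation angle. As a minimal sketch of that mapping, assuming the line is given in OpenCV's (rho, theta) normal form with theta in radians, the helper `skew_angle_degrees` (a hypothetical name, not part of the original script) reproduces the angle logic without the script's integer rounding or the `+1` in the denominator:

```python
import math


def skew_angle_degrees(theta):
    """Map a Hough normal angle theta (radians, 0 <= theta < pi) to a small skew angle in degrees."""
    # The line itself is perpendicular to its normal, so its inclination is theta - 90 degrees.
    angle = math.degrees(theta) - 90.0
    # Near-vertical lines (page borders, table rules) are folded back towards zero,
    # mirroring the 60-degree cutoffs used in the script.
    if angle > 60:
        angle -= 90
    elif angle < -60:
        angle += 90
    return angle


if __name__ == "__main__":
    for degrees in (90, 92, 2, 178):
        print(degrees, skew_angle_degrees(math.radians(degrees)))
```

A perfectly horizontal text line (theta = 90°) gives no correction, while a vertical rule tilted by 2° yields the same 2° correction as a horizontal line tilted by 2°.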

import datetime
import glob
import math
import os

import cv2  # opencv-python
import fitz  # PyMuPDF
import numpy as np
import pytesseract
from scipy import ndimage

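def pyMuPDF_fitz(pdfPath, imagePath, name):
    """Render the first page of pdfPath to imagePath/name.png."""
    startTime_pdf2img = datetime.datetime.now()  # start time

    print("imagePath=" + imagePath)
    pdfDoc = fitz.open(pdfPath)
    page = pdfDoc[0]  # first page only
    rotate = 0
    # A zoom factor of 5 on each axis renders the page at 5x the default resolution.
    # Without a matrix, PyMuPDF renders at 72 dpi (792x612 for a US Letter page).
    zoom_x = 5  # (1.33333333-->1056x816)   (2-->1584x1224)
    zoom_y = 5
    # snake_case names of current PyMuPDF; older releases used preRotate/getPixmap/writePNG
    mat = fitz.Matrix(zoom_x, zoom_y).prerotate(rotate)
    pix = page.get_pixmap(matrix=mat, alpha=False)

    os.makedirs(imagePath, exist_ok=True)  # create the image folder if it does not exist

    pix.save(os.path.join(imagePath, name + '.png'))  # write the image into that folder
    pdfDoc.close()  # release the file handle so the PDF can be renamed afterwards

    #endTime_pdf2img = datetime.datetime.now()  # end time
    #print('pdf2img time=', (endTime_pdf2img - startTime_pdf2img).seconds)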

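def rotate(img):
    """Deskew a page image using the first line found by a Hough transform."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # edge detection
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    # Hough transform, adapted from https://blog.csdn.net/feilong_csdn/article/details/81586322
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 0)
    if lines is None:  # no line detected: return the image unchanged
        return img
    # use only the first line returned by HoughLines
    rho, theta = lines[0][0]
    a = np.cos(theta)
    b = np.sin(theta)
    x0 = a * rho
    y0 = b * rho
    # two points far apart on the detected line
    x1 = int(x0 + 1000 * (-b))
    y1 = int(y0 + 1000 * (a))
    x2 = int(x0 - 1000 * (-b))
    y2 = int(y0 - 1000 * (a))
    # slope of the line; the +1 keeps the denominator non-zero for vertical lines
    t = float(y2 - y1) / ((x2 - x1) + 1)
    # convert the slope to an angle in degrees
    rotate_angle = math.degrees(math.atan(t))
    # fold near-vertical lines back to a small correction angle
    if rotate_angle > 60:
        rotate_angle = -90 + rotate_angle
    elif rotate_angle < -60:
        rotate_angle = 90 + rotate_angle
    # rotate the image by that angle to correct the skew
    rotate_img = ndimage.rotate(img, rotate_angle)
    return rotate_img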

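if __name__ == "__main__":
    imagePath = './images'  # rendered page images
    rotatedPath = './imgs'  # deskewed page images
    pdf_dir = './Documents'
    os.makedirs(rotatedPath, exist_ok=True)  # cv2.imwrite fails silently if the folder is missing
    for file in glob.glob(os.path.join(pdf_dir, '*.pdf')):
        print('Processing:', file)
        name = os.path.splitext(os.path.basename(file))[0]
        #print (name)
        pyMuPDF_fitz(file, imagePath, name)
        img = cv2.imread(os.path.join(imagePath, name + '.png'))
        img = rotate(img)
        cv2.imwrite(os.path.join(rotatedPath, name + '.png'), img)
        # binarized, lightly blurred copy of the page; the OCR call below still reads the unprocessed crop
        left_thred = 230
        gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cv2.imread returns BGR, not RGB
        ret, thresh1 = cv2.threshold(gray_img, left_thred, 255, cv2.THRESH_BINARY)
        blur = cv2.blur(thresh1, (3, 3))
        # pixel region that contains the number; adjust it to your own layout and zoom factor
        y = 735
        x = 575
        h = 80
        w = 462
        crop_img = img[y:y+h, x:x+w].copy()
        text = pytesseract.image_to_string(crop_img)
        print(text)
        # drop newlines plus the trailing whitespace/form feed that Tesseract may append
        new_name = text.replace("\n", "").strip()
        if not new_name:
            print('No text recognized, skipping:', file)
            continue
        target = os.path.join(pdf_dir, new_name + '.pdf')
        if not os.path.exists(target):
            os.rename(file, target)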
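
OCR output can contain characters that are not allowed in file names, and two PDFs can be read as the same number. The following is a minimal sketch of one way to guard the rename step against both cases. The helpers `ocr_text_to_stem` and `unique_target` are hypothetical names, not part of the script above, and the forbidden-character set assumes Windows file-name rules:

```python
import os
import re

# Characters Windows does not allow in file names, plus control characters.
_INVALID = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def ocr_text_to_stem(text, fallback=""):
    """Turn raw OCR output into a file-name stem; return fallback if nothing usable remains."""
    stem = " ".join(text.split())   # collapse newlines, form feeds and runs of spaces
    stem = _INVALID.sub("", stem)   # drop characters that are invalid in file names
    stem = stem.strip(" .")         # Windows also rejects trailing dots and spaces
    return stem or fallback


def unique_target(directory, stem, ext=".pdf"):
    """Return a path in directory that does not exist yet, appending _1, _2, ... if needed."""
    candidate = os.path.join(directory, stem + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{stem}_{counter}{ext}")
        counter += 1
    return candidate
```

In the main loop this could be used as `stem = ocr_text_to_stem(text)`, followed by `os.rename(file, unique_target(pdf_dir, stem))` when `stem` is not empty. Files that collide then get a numeric suffix instead of being silently left under their old name.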

Results

Figure 1 shows how the code and its folders are laid out; Figure 2 shows the files after renaming.

Figure 1

Figure 2

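Note: Tesseract-OCR has to be downloaded and installed separately, and its path has to be configured for the pytesseract library in Anaconda.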

Configuring the Tesseract executable

Typically, open pytesseract.py in the C:\ProgramData\Anaconda3\Lib\site-packages\pytesseract directory, find the line tesseract_cmd = 'tesseract', and change it to tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'. Use the path where Tesseract is installed on your own machine.
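
A change made inside site-packages is lost whenever pytesseract is reinstalled or upgraded. pytesseract also lets you set `pytesseract.pytesseract.tesseract_cmd` from your own script, so the following sketch shows that alternative. The helpers `find_tesseract` and `configure_pytesseract` are hypothetical names, and the default candidate path is only the install location mentioned above:

```python
import os
import shutil


def find_tesseract(candidates=("C:/Program Files/Tesseract-OCR/tesseract.exe",)):
    """Return a tesseract executable: PATH is checked first, then the candidate paths."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract():
    """Point pytesseract at the executable found by find_tesseract()."""
    import pytesseract  # imported here so find_tesseract() works without pytesseract installed

    exe = find_tesseract()
    if exe is None:
        raise FileNotFoundError("tesseract executable not found; install Tesseract-OCR first")
    pytesseract.pytesseract.tesseract_cmd = exe
    return exe
```

Calling `configure_pytesseract()` once at the top of the script, before the first `pytesseract.image_to_string` call, would replace the manual edit of pytesseract.py.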
