AskOverflow.Dev

Accepted
YorSubs
Asked: 2024-04-01 21:35:49 +0800 CST

PowerShell, ImageMagick, GhostScript: extract each page of a PDF as a separate image

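I want to take a PDF and extract each page as an image. I have been able to do this with ImageMagick and GhostScript, but the resulting quality is very poor. I have tried many different output options without any luck. The script below should be fairly self-explanatory. It works, but the image quality is really disappointing compared with viewing the PDF itself.

  • Is there a way to get high-quality images out of ImageMagick?
  • What about other tools? Preferably something programmatic, since processing a large number of PDFs one at a time in a GUI would be awkward.
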
# Extract each page from a PDF as a png using ImageMagick
# ImageMagick requires GhostScript for PDF manipulation so have to make sure that is installed
# Current install folder: C:\Program Files\ImageMagick-7.1.1-Q16-HDRI
# Chocolatey package does not include the 'identify.exe' command

# Path to the PDF file
$pdfFilePath = "C:\0\MyFile.pdf"
# Output directory for images
$outputDirectory = "C:\0"
# Image type to output to (tried jpg, png, tiff etc)
$imageExtension = "jpg"

# Check if running as Admin
if (!([Security.Principal.WindowsPrincipal][Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole] "Administrator")) { Write-Host "Please run this script as an administrator."; exit }
# Check if Chocolatey is installed, if not, install it
if (!(Test-Path "$env:ProgramData\chocolatey")) { Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1')) }
# Check for magick.exe on path; if not installed, install ImageMagick
$imageMagickExePath = Get-ChildItem -Path "C:\Program Files\ImageMagick-*" -Filter "magick.exe" -Recurse | Select-Object -First 1 -ExpandProperty FullName
if (!(Get-Command "magick.exe" -ea silent) -and ($null -eq $imageMagickExePath)) { Write-Host "ImageMagick not found. Installing..."; choco install imagemagick -y }
if ($null -eq $imageMagickExePath) { Write-Host "Error: magick.exe not found at $imageMagickExePath"; exit }
Write-Host "magick.exe found at '$imageMagickExePath'"
# Check for gswin64.exe on path; if not installed, install GhostScript
$gsExePath = Get-ChildItem -Path "C:\Program Files\gs\gs*\bin" -Filter "gswin64.exe" -Recurse | Select-Object -First 1 -ExpandProperty FullName
if (!(Get-Command "gswin64.exe" -ea silent) -and ($null -eq $gsExePath)) { Write-Host "GhostScript not found. Installing..."; choco install ghostscript -y }
if ($null -eq $gsExePath) { Write-Host "Error: gswin64.exe not found, this is required by ImageMagick for PDF manipulation"; exit }
Write-Host "gswin64.exe found at '$gsExePath'"
# Add Ghostscript directory to the PATH temporarily so that ImageMagick can use it
$env:Path += ";$($gsExePath | Split-Path -Parent)"

# Create the output directory if it doesn't exist
if (-not (Test-Path $outputDirectory)) { New-Item -ItemType Directory -Force -Path $outputDirectory | Out-Null }

# Convert each page of the PDF to PNG
$imageNamePrefix = [System.IO.Path]::GetFileNameWithoutExtension($pdfFilePath)
$imageNamePrefix = $imageNamePrefix -replace '\s+', '_'

# Use ImageMagick's identify command to get the total number of pages in the PDF
$numberOfPages = (identify "$pdfFilePath" 2>$null | Measure-Object -Line).Lines
Write-Host "'$pdfFilePath' has $numberOfPages pages"
# Use ImageMagick's convert command to convert PDF
Start-Process $imageMagickExePath -ArgumentList "convert `"$pdfFilePath`" -density 600 -quality 100 -antialias -resize 300% `"$outputDirectory\$imageNamePrefix-%d.$imageExtension`"" -NoNewWindow -Wait

# Determine the maximum number of digits to normalise all page numbers to that length
$maxDigits = $numberOfPages.ToString().Length

# Normalize page numbers
for ($i = 0; $i -le $numberOfPages; $i++) {
    $pageNumber = "{0:D$maxDigits}" -f $i
    $oldFileName = Join-Path $outputDirectory "$imageNamePrefix-$i.$imageExtension"
    $newFileName = Join-Path $outputDirectory "$imageNamePrefix-$pageNumber.$imageExtension"
    if ((Test-Path $oldFileName) -and !(Test-Path $newFileName)) { Rename-Item -Path $oldFileName -NewName $newFileName }
}

# Remove the Ghostscript directory from the PATH
$env:Path = $env:Path -replace [regex]::Escape(";"+($gsExePath | Split-Path -Parent))

# command-line tools for manipulating PDFs:
# https://libgen.rs/search.php?req=pdf+hacks&lg_topic=libgen&open=0&view=simple&res=25&phrase=1&column=defs
# pdftk (PDF Toolkit): pdftk is a command-line tool for manipulating PDF files. It can merge, split, rotate, watermark, and decrypt PDF files.
# QPDF: QPDF is another command-line tool for structural, content-preserving transformation of PDF files. It's particularly useful for linearizing PDFs, decrypting, and compressing them.
# Poppler Utilities (pdftohtml, pdftotext, pdfimages): Poppler is a PDF rendering library and its utilities provide command-line tools for converting PDFs to various formats such as HTML, text, and images. pdftohtml converts PDF to HTML, pdftotext converts PDF to plain text, and pdfimages extracts images from PDFs.
# Ghostscript: Ghostscript is a suite of software based on an interpreter for Adobe Systems' PostScript and Portable Document Format (PDF) page description languages. It can be used for a wide variety of tasks including converting PDFs to various formats, merging PDFs, and more.
# MuPDF: MuPDF is a lightweight PDF, XPS, and E-book viewer. It also includes command-line tools for extracting text and images from PDFs.
# PDFMiner: PDFMiner is a tool for extracting information from PDF documents. It includes a command-line tool for extracting text, images, and other content from PDFs.
powershell

1 Answer

  1. Best Answer
    K J
    2024-04-01T22:04:17+08:00

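    Assuming you want 300 DPI PNGs, the simplest approach is a "drop on me" .CMD file, or a link placed in the "SendTo" folder. Either way, a single file can be exported as images in a moment, and the same file is easily adapted to handle a folder of files.

    This uses the 2024 64-bit build of the Poppler pdftoppm binary from https://github.com/oschwartz10612/poppler-windows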

    "path to\pdftoppm.exe" -png -r 300 -aa yes -progress "%~1" "%~dpn1"
    
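To see what `-r 300` implies for the output size, here is a minimal PowerShell sketch. It assumes the PDF default of 72 points per inch; the function name `Get-RenderedPixelSize` and the A4 dimensions in points (595 x 842) are illustrative choices, not something taken from the answer, and a real renderer may round slightly differently.

```powershell
# Hypothetical helper: estimate the pixel size of a rendered page.
# Assumes PDF user-space units of 1/72 inch (the PDF default).
function Get-RenderedPixelSize {
    param(
        [double]$WidthPoints,
        [double]$HeightPoints,
        [int]$Dpi = 300
    )
    [pscustomobject]@{
        Width  = [int][math]::Round($WidthPoints  / 72 * $Dpi)
        Height = [int][math]::Round($HeightPoints / 72 * $Dpi)
    }
}

# Example: an A4 page is roughly 595 x 842 points.
$size = Get-RenderedPixelSize -WidthPoints 595 -HeightPoints 842 -Dpi 300
"{0} x {1} pixels" -f $size.Width, $size.Height
# 2479 x 3508 pixels
```

Doubling the resolution doubles each dimension and so roughly quadruples the pixel count, which is worth keeping in mind when choosing a value for `-r`.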

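    So this 12-page PDF is exported to the same working folder.

    For more complex usage, extend the CMD file so that it runs over the current working directory, which may contain multiple files and/or subfolders.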
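
One way to sketch that batch case in PowerShell rather than CMD: collect every PDF under a root folder and derive, for each one, the output prefix that `%~dpn1` would give (drive, path and file name without extension). `Get-PdfOutputPrefix` is a hypothetical name, and nothing here runs a converter; it only lists what would be processed.

```powershell
# Hypothetical helper: list PDFs under a folder (including subfolders)
# and compute the per-file output prefix, i.e. the PowerShell
# counterpart of CMD's %~dpn1 (full path minus the extension).
function Get-PdfOutputPrefix {
    param([string]$Root)
    Get-ChildItem -Path $Root -Filter '*.pdf' -File -Recurse |
        ForEach-Object {
            [pscustomobject]@{
                Pdf    = $_.FullName
                Prefix = Join-Path $_.DirectoryName $_.BaseName
            }
        }
}
```

For example, `Get-PdfOutputPrefix -Root 'C:\0'` (the folder from the question) returns one object per PDF, whose `Pdf` and `Prefix` values can be passed to pdftoppm or Ghostscript as the input file and output name.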

    A similar command for GhostScript could be:

    "path to\bin\gswin64c" -sDEVICE=pngalpha -r300 -o"%~dpn1-%%04d.png" -f "%~1"
    

    Per @KenS's comment, this is simplified by using %%04d, which zero-pads the page number to four digits (000#).
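
For reference, `%04d` in a Ghostscript output name is a printf-style placeholder that is replaced by the page number, zero-padded to four digits. The PowerShell sketch below reproduces the resulting names with the `-f` format operator; `report` is a made-up base name used only for illustration.

```powershell
# Illustrative only: mimic the names produced by an output
# template such as "report-%04d.png" for pages 1..3.
$baseName = 'report'   # hypothetical PDF name without extension
$names = 1..3 | ForEach-Object { '{0}-{1:D4}.png' -f $baseName, $_ }
$names
# report-0001.png
# report-0002.png
# report-0003.png
```

Fixed-width numbering keeps the files in page order when they are sorted by name, which is what the renaming loop in the question was trying to achieve after the fact.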

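    The differences in output can be tuned with additional switches, but with the two commands as shown above, GhostScript (bottom left) produces the more compact files.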

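    Following on from the original post: in PowerShell, for a folder containing some PDFs, the following loops through them and extracts PNGs for each PDF. Replace the paths to pdftoppm.exe and gswin64c.exe, and the location of the folder containing the PDFs, as required. (Note on the gswin64c.exe page numbering used in the CMD file above: a % typed at the console has to be written as %% inside a CMD script, so use %%04d in a CMD script, whereas plain %04d works in PowerShell.)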

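    # Adjust these paths to match your installation and PDF folder
    $pdftoppm = "C:\Poppler\Library\bin\pdftoppm.exe"
    $gswin64c = "C:\Program Files\gs\gs10.03.0\bin\gswin64c.exe"
    $pdfs = Get-ChildItem -Path "C:\0\*.pdf"

    foreach ($pdf in $pdfs) {
        $pdfFullName = $pdf.FullName
        # Full path without extension, matching %~dpn1 in the CMD version,
        # so the images land next to the PDF rather than in the current directory
        $outputPrefix = Join-Path $pdf.DirectoryName $pdf.BaseName
        # Poppler: writes <prefix>-<page>.png
        Start-Process $pdftoppm -ArgumentList "-png -r 300 -aa yes -progress `"$pdfFullName`" `"$outputPrefix`"" -NoNewWindow -Wait
        # Ghostscript: writes <prefix>-0001.png, <prefix>-0002.png, ...
        Start-Process $gswin64c -ArgumentList "-sDEVICE=pngalpha -r300 -o`"$outputPrefix-%04d.png`" -f `"$pdfFullName`"" -NoNewWindow -Wait
    }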
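
An alternative sketch, not part of the original answer: build each tool's arguments as an array and invoke the tool with the call operator `&`. PowerShell then takes care of quoting paths that contain spaces, and `$LASTEXITCODE` is available afterwards. The function names are hypothetical; the switches are the same ones used above.

```powershell
# Hypothetical helpers that only build argument arrays.
function Get-PdftoppmArgs {
    param([string]$Pdf, [string]$Prefix, [int]$Dpi = 300)
    @('-png', '-r', "$Dpi", '-aa', 'yes', '-progress', $Pdf, $Prefix)
}

function Get-GhostscriptArgs {
    param([string]$Pdf, [string]$Prefix, [int]$Dpi = 300)
    # %04d stays literal: PowerShell does not treat % specially in strings.
    @('-sDEVICE=pngalpha', "-r$Dpi", "-o${Prefix}-%04d.png", '-f', $Pdf)
}

# Usage sketch (paths are assumptions; adjust to your install):
# $pdftoppm = 'C:\Poppler\Library\bin\pdftoppm.exe'
# foreach ($pdf in Get-ChildItem -Path 'C:\0\*.pdf') {
#     $prefix   = Join-Path $pdf.DirectoryName $pdf.BaseName
#     $toolArgs = Get-PdftoppmArgs -Pdf $pdf.FullName -Prefix $prefix
#     & $pdftoppm @toolArgs
#     if ($LASTEXITCODE -ne 0) { Write-Warning "pdftoppm failed: $($pdf.Name)" }
# }
```

Keeping the argument construction separate from the invocation also makes it easy to print the arguments for a dry run before converting a large folder.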
    
    
