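I want to take a PDF and extract each page as an image. I have been able to do this using ImageMagick and GhostScript, but the resulting quality is very poor. I have tried many different output options without any luck. The script below should be fairly self-explanatory. It works, but the image quality is really disappointing compared with opening the PDF.
- Is there a way to get high-quality images out of ImageMagick?
- What about other tools, preferably ones that can be scripted, since working through a large number of PDFs one at a time in a GUI would be awkward?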
# Extract each page from a PDF as a png using ImageMagick
# ImageMagick requires GhostScript for PDF manipulation so have to make sure that is installed
# Current install folder: C:\Program Files\ImageMagick-7.1.1-Q16-HDRI
# Chocolatey package does not include the 'identify.exe' command
# Path to the PDF file
$pdfFilePath = "C:\0\MyFile.pdf"
# Output directory for images
$outputDirectory = "C:\0"
# Image type to output to (tried jpg, png, tiff etc)
$imageExtension = "jpg"
# Check if running as Admin
if (!([Security.Principal.WindowsPrincipal][Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole] "Administrator")) { Write-Host "Please run this script as an administrator."; exit }
# Check if Chocolatey is installed, if not, install it
if (!(Test-Path "$env:ProgramData\chocolatey")) { Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1')) }
# Check for magick.exe on path; if not installed, install ImageMagick
$imageMagickExePath = Get-ChildItem -Path "C:\Program Files\ImageMagick-*" -Filter "magick.exe" -Recurse | Select-Object -First 1 -ExpandProperty FullName
if (!(Get-Command "magick.exe" -ea silent) -and ($null -eq $imageMagickExePath)) { Write-Host "ImageMagick not found. Installing..."; choco install imagemagick -y }
if ($null -eq $imageMagickExePath) { Write-Host "Error: magick.exe not found under 'C:\Program Files\ImageMagick-*'"; exit }
Write-Host "magick.exe found at '$imageMagickExePath'"
# Check for gswin64.exe on path; if not installed, install GhostScript
$gsExePath = Get-ChildItem -Path "C:\Program Files\gs\gs*\bin" -Filter "gswin64.exe" -Recurse | Select-Object -First 1 -ExpandProperty FullName
if (!(Get-Command "gswin64.exe" -ea silent) -and ($null -eq $gsExePath)) { Write-Host "GhostScript not found. Installing..."; choco install ghostscript -y }
if ($null -eq $gsExePath) { Write-Host "Error: gswin64.exe not found, this is required by ImageMagick for PDF manipulation"; exit }
Write-Host "gswin64.exe found at '$gsExePath'"
# Add Ghostscript directory to the PATH temporarily so that ImageMagick can use it
$env:Path += ";$($gsExePath | Split-Path -Parent)"
# Create the output directory if it doesn't exist
if (-not (Test-Path $outputDirectory)) { New-Item -ItemType Directory -Force -Path $outputDirectory | Out-Null }
# Convert each page of the PDF to PNG
$imageNamePrefix = [System.IO.Path]::GetFileNameWithoutExtension($pdfFilePath)
$imageNamePrefix = $imageNamePrefix -replace '\s+', '_'
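# Use ImageMagick's identify subcommand to get the total number of pages in the PDF
# (called through magick.exe, because a standalone identify.exe may not be installed)
$numberOfPages = (& $imageMagickExePath identify "$pdfFilePath" 2>$null | Measure-Object -Line).Lines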
Write-Host "'$pdfFilePath' has $numberOfPages pages"
# Use ImageMagick's convert command to convert PDF
Start-Process $imageMagickExePath -ArgumentList "convert `"$pdfFilePath`" -density 600 -quality 100 -antialias -resize 300% `"$outputDirectory\$imageNamePrefix-%d.$imageExtension`"" -NoNewWindow -Wait
# Determine the maximum number of digits to normalise all page numbers to that length
$maxDigits = $numberOfPages.ToString().Length
# Normalize page numbers
# (%d in the output name starts at 0, so the last page index is $numberOfPages - 1)
for ($i = 0; $i -lt $numberOfPages; $i++) {
    $pageNumber = "{0:D$maxDigits}" -f $i
    $oldFileName = Join-Path $outputDirectory "$imageNamePrefix-$i.$imageExtension"
    $newFileName = Join-Path $outputDirectory "$imageNamePrefix-$pageNumber.$imageExtension"
    if ((Test-Path $oldFileName) -and !(Test-Path $newFileName)) { Rename-Item -Path $oldFileName -NewName $newFileName }
}
# Remove the Ghostscript directory from the PATH
$env:Path = $env:Path -replace [regex]::Escape(";"+($gsExePath | Split-Path -Parent))
# command-line tools for manipulating PDFs:
# https://libgen.rs/search.php?req=pdf+hacks&lg_topic=libgen&open=0&view=simple&res=25&phrase=1&column=defs
# pdftk (PDF Toolkit): pdftk is a command-line tool for manipulating PDF files. It can merge, split, rotate, watermark, and decrypt PDF files.
# QPDF: QPDF is another command-line tool for structural, content-preserving transformation of PDF files. It's particularly useful for linearizing PDFs, decrypting, and compressing them.
# Poppler Utilities (pdftohtml, pdftotext, pdfimages): Poppler is a PDF rendering library and its utilities provide command-line tools for converting PDFs to various formats such as HTML, text, and images. pdftohtml converts PDF to HTML, pdftotext converts PDF to plain text, and pdfimages extracts images from PDFs.
# Ghostscript: Ghostscript is a suite of software based on an interpreter for Adobe Systems' PostScript and Portable Document Format (PDF) page description languages. It can be used for a wide variety of tasks including converting PDFs to various formats, merging PDFs, and more.
# MuPDF: MuPDF is a lightweight PDF, XPS, and E-book viewer. It also includes command-line tools for extracting text and images from PDFs.
# PDFMiner: PDFMiner is a tool for extracting information from PDF documents. It includes a command-line tool for extracting text, images, and other content from PDFs.
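Assuming you want 300 DPI PNGs, the simplest approach is a "drop on me" .CMD file, or a shortcut placed in the "SendTo" folder. Either way, a single file can be exported to images on the fly, and the approach is easily adapted to a folder of files.
This uses the PDFtoPPM binary from the 2024 64-bit build of Poppler, available from https://github.com/oschwartz10612/poppler-windows
So this 12-page PDF is exported into the same working folder.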
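As a rough illustration of the kind of call involved, here is a minimal PowerShell sketch rather than the exact CMD file. It assumes `pdftoppm.exe` was unpacked to a hypothetical `C:\Apps\poppler\Library\bin` folder and reuses `C:\0\MyFile.pdf` from the question; the helper name `Get-PdfToPpmArguments` is made up for this example. The `-png` and `-r` switches are standard `pdftoppm` options.

```powershell
# Hypothetical locations (assumptions): adjust both to match your system.
$pdftoppmExe = 'C:\Apps\poppler\Library\bin\pdftoppm.exe'
$pdfFilePath = 'C:\0\MyFile.pdf'

# Illustrative helper (not part of Poppler): builds the pdftoppm argument list.
function Get-PdfToPpmArguments {
    param(
        [Parameter(Mandatory)][string]$PdfPath,
        [int]$Dpi = 300
    )
    # Output prefix = the PDF path without its extension, so the images land
    # next to the PDF; pdftoppm appends "-<page number>.png" to this prefix.
    $prefix = $PdfPath -replace '\.pdf$', ''
    return @('-png', '-r', "$Dpi", $PdfPath, $prefix)
}

$pdftoppmArgs = Get-PdfToPpmArguments -PdfPath $pdfFilePath

if ((Test-Path $pdftoppmExe) -and (Test-Path $pdfFilePath)) {
    # Passing the arguments as an array lets PowerShell quote paths with spaces.
    & $pdftoppmExe @pdftoppmArgs
} else {
    Write-Host "Would run: $pdftoppmExe $($pdftoppmArgs -join ' ')"
}
```

Raising or lowering `-Dpi` trades file size against sharpness; 300 matches the assumption stated above.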
For more complex usage, extend the CMD file to run over the current working directory, which may contain multiple files and/or subfolders.
A similar command for GhostScript might look like this
Following @KenS's comment, the page numbering is simplified to 000# digits by using %%04d
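A comparable Ghostscript call might be sketched in PowerShell as follows. This is an illustration under assumptions, not the original command: the `gswin64c.exe` path (including the version folder) is hypothetical, and `Get-GhostscriptPngArguments` is a made-up helper name. `-sDEVICE=png16m`, `-r` and `-o` are standard Ghostscript switches. Because this is PowerShell rather than a CMD script, the page counter is written with a single `%`.

```powershell
# Hypothetical locations (assumptions): adjust both to match your system.
$gsExe       = 'C:\Program Files\gs\gs10.02.1\bin\gswin64c.exe'
$pdfFilePath = 'C:\0\MyFile.pdf'

# Illustrative helper (not part of Ghostscript): builds the argument list.
function Get-GhostscriptPngArguments {
    param(
        [Parameter(Mandatory)][string]$PdfPath,
        [int]$Dpi = 300
    )
    $prefix = $PdfPath -replace '\.pdf$', ''
    # %04d is Ghostscript's page counter, zero-padded to four digits.
    # -o sets the output file pattern and also implies batch, no-pause operation.
    return @('-sDEVICE=png16m', "-r$Dpi", '-o', "${prefix}-%04d.png", $PdfPath)
}

$gsArgs = Get-GhostscriptPngArguments -PdfPath $pdfFilePath

if ((Test-Path $gsExe) -and (Test-Path $pdfFilePath)) {
    & $gsExe @gsArgs
} else {
    Write-Host "Would run: $gsExe $($gsArgs -join ' ')"
}
```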
The difference in output can be tuned by changing additional switches, but as the 2 commands above show, GhostScript (lower left) produces more compact files.
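Following on from the original post, in PowerShell, for a folder containing some PDFs, the following will loop through and extract PNGs from each PDF.
(Replace the locations of `pdftoppm.exe`, `gswin64c.exe`, and the folder containing the PDFs as required. Note that for the `gswin64c.exe` numbering in the CMD above, `%` on the console becomes `%%` inside a CMD script, hence `%%04d` for a CMD script, whereas `%04d` works in PowerShell.)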
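A minimal sketch of such a loop, using `pdftoppm.exe` only, could look like this. The executable path and the helper name `Get-PdfExportJobs` are assumptions made for illustration; the folder `C:\0` is taken from the question. Swapping in `gswin64c.exe` would mean replacing the argument list with Ghostscript switches and a `%04d` output pattern.

```powershell
# Hypothetical locations (assumptions): adjust both to match your system.
$pdftoppmExe = 'C:\Apps\poppler\Library\bin\pdftoppm.exe'
$pdfFolder   = 'C:\0'

# Illustrative helper: emits one job object per PDF found in the folder.
function Get-PdfExportJobs {
    param(
        [Parameter(Mandatory)][string]$Folder,
        [int]$Dpi = 300
    )
    Get-ChildItem -Path $Folder -Filter '*.pdf' -File | ForEach-Object {
        # Output prefix: same folder and base name as the PDF itself.
        $prefix = Join-Path $_.DirectoryName $_.BaseName
        [pscustomobject]@{
            Pdf       = $_.FullName
            Arguments = @('-png', '-r', "$Dpi", $_.FullName, $prefix)
        }
    }
}

if ((Test-Path $pdftoppmExe) -and (Test-Path $pdfFolder)) {
    foreach ($job in (Get-PdfExportJobs -Folder $pdfFolder)) {
        Write-Host "Exporting '$($job.Pdf)'"
        $jobArgs = $job.Arguments
        & $pdftoppmExe @jobArgs
    }
} else {
    Write-Host "pdftoppm.exe or the PDF folder was not found; nothing was exported."
}
```

Adding `-Recurse` to `Get-ChildItem` would extend the same loop to subfolders.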