
Question

Asked: 2021-08-23 23:25:34 +0800 CST

Getting printer-ready black text on a white background from a scanned PDF file (removing the grayscale or color background)


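"How can I convert a photo of a paper document into a scanned document?" is related, but not the same, because I am talking about PDF files. In the answers to the linked question, processing the images looks complicated, especially because each image has to be handled separately. Given that my PDF has hundreds of pages, the solution I am hoping for is not to process or edit the images, but simply to scan the digital photos and record them the way a real scanner would. I mean something like a "virtual scanner" whose input would be a photo-based PDF or a collection of photos, and whose output is a "normal" scanned document. (The Scantailor tool recommended there, and also here, now seems to lack a Linux version.)

This is not about OCR, nor about converting images to text.

To clarify what I mean, I will post a few examples.

There are PDF files that are text-based rather than image-based: text documents (say, docx or odt) exported to PDF. They look ready to print:

The above is not what I am discussing here.

What I am interested in is the PDFs in the pictures below, namely the difference between scanned text pages that look too much like images and scanned text pages that look like digitized text.

The first kind is made of images that look like photographs of book pages:

or

Such a copy is hard to reprint on paper, because the background gets printed too.

The second kind is what one expects from scanned text, and it is printable:

or

An image-like PDF may already have been OCRed and have searchable text, and still look like a collection of (page) photographs: OCR is not the issue here.

What I want is the crisp black-and-white look of a "scanned" PDF, with all the "real-life" details removed (shadows especially) that are normal in a photo but should not be present on a printed page.

As @vanadium noted in the comments, I am looking for a software solution that cleans up pictures of documents automatically, the way Google Scan does on a smartphone.

As @user535733 said in the comments, the problem here seems to be, at least to some extent, one of converting grayscale (scanned/image) text to black and white.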

4 Answers


pLumo · Answer 1 · 2021-08-23T23:42:42+08:00

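scantailor is no longer maintained, but you can still build it from source and use it.

However, the original repository requires qt4, which is not easy to install on recent Ubuntu releases. You can use, for example, this fork that has been adapted to qt5.

Prerequisites: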

sudo apt install libjpeg-dev zlib1g-dev libpng-dev libtiff-dev libboost-dev libxrender-dev libboost-all-dev

Installation:

git clone https://github.com/victl/scantailor
cd scantailor
cmake .
make
sudo make install

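Disclaimer: I do not know the maintainer of this fork and cannot vouch for the safety of their version.

Another option is to use Scantailor advanced. You can install it via snap...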

sudo snap install scantailor-advanced

...or flatpak.

...or via a PPA.

sudo add-apt-repository ppa:alex-p/scantailor
sudo apt update
sudo apt install scantailor # or scantailor-advanced

Quick test:

cipricus · Answer 2 · 2021-08-25T05:25:36+08:00

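As a direct solution for PDFs (without extracting the images by hand):

While using ocrmypdf to restore OCR (as described at the end of the supplementary part of this answer), I noticed that ocrmypdf -h lists an option that sounds like exactly what was asked for: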

--remove-background Attempt to remove background from gray or color pages, setting it to white

The initial PDF already had OCR, which causes an error unless one of the following options is used:

-f, --force-ocr Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF)

or

-s, --skip-text Skip OCR on any pages that already contain text, but include the page in final output; useful for PDFs that contain a mix of images, text pages, and/or previously OCRed pages

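Applying each of these options separately to one of my large files, which has hundreds of pages that already contain OCR, crashed the process.

The best solution, in my view, is to first print the initial file to PDF (which removes the OCR) and then run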

ocrmypdf input.pdf output.pdf -l <LANG> --remove-background -v

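For English, the -l option is not needed. -v gives verbose output in the terminal.

The resulting PDF is larger than the input (because of the --remove-background option): reduce its size as described below.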
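For several files, the command above can be wrapped in a small loop. The following is only a sketch under stated assumptions: the function names and the _clean.pdf suffix are made up for illustration, and DRY_RUN is a hypothetical switch that prints the commands instead of running them. Only the ocrmypdf options quoted above are used.

```shell
#!/bin/sh
# Sketch: run ocrmypdf --remove-background on every PDF in a directory.
# Names, the "_clean" suffix and DRY_RUN are illustrative assumptions.

# Derive the output name: book.pdf -> book_clean.pdf
clean_name() {
    printf '%s_clean.pdf\n' "${1%.pdf}"
}

# Process one file. $1 = input PDF, $2 = optional language code for -l.
clean_pdf() {
    in=$1
    lang=${2:-}
    set -- ocrmypdf "$in" "$(clean_name "$in")" --remove-background -v
    # -l is only needed for languages other than English.
    if [ -n "$lang" ]; then
        set -- "$@" -l "$lang"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"      # show the command only
    else
        "$@"                    # run it
    fi
}

# Process every *.pdf in directory $1, skipping earlier results.
clean_dir() {
    dir=$1
    lang=${2:-}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue
        case "$f" in
            *_clean.pdf) continue ;;
        esac
        clean_pdf "$f" "$lang"
    done
}
```

With the functions loaded, DRY_RUN=1 followed by clean_dir ~/scans ron would list the commands for a (hypothetical) ~/scans folder without touching any file; unsetting DRY_RUN runs them.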

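About Scan Tailor, as a complement to the main answer

Even its icon illustrates the fact that it is meant exactly for what is asked here:

Here is how to use Scan Tailor with a PDF:

Extract all PDF pages as image files, because this tool does not handle PDFs directly and needs images. Master PDF Editor can do that, but on my machine it crashed after extracting about 80 images. It can still be used by setting a new batch/range of pages to extract. (PDF Mod crashed before any processing.) After a few trials, I preferred the reliable but slower CLI way, with a command like pdftoppm MY_PDF.pdf NAME -tiff, as said here. Other values can be used instead of tiff (which gives tif files), such as png or jpeg. See here a set of Dolphin service menu actions for the various extraction options: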

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=pdf;tif;jpeg;
X-KDE-Submenu=PDF action: EXTRACT ALL pages
Icon=application-pdf

[Desktop Action pdf]
Name=Extract pages as pdf
Icon=application-pdf
Exec=bash -c 'pdf=$(pdftk "%u" burst); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';

[Desktop Action tif]
Name=Extract pages as tif
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(pdftoppm "$f" "${f%%.*}" -tiff); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';


[Desktop Action jpeg]
Name=Extract pages as jpeg
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(pdftoppm "$f" "${f%%.*}" -jpeg); kdialog --title "Extract pages" --msgbox "Extracted! $pdf";';

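Load and process the resulting images in Scan Tailor. Put the resulting image files in a separate folder, then add that folder in Scan Tailor under New Project > Input Directory. (I installed the program from the PPA, as said by @N0rbert in a comment under the main answer.) Some pages that contain real images rather than text may look better if "Grayscale and color" is selected for them instead of the default "Black and white" (which is meant for text). Run the listed stages one by one. Check the pages before running the last one ("Output").

Create a new PDF from the resulting images. (First check that the resulting tif files are what you want.) There are many ways to create the new PDF. Again, the GUI tools I quickly tried crashed or gave odd results, so I preferred to put the resulting tif files in a separate folder and run the command img2pdf *.tif -o out.pdf there, as said here. (This may require the files to be named/numbered properly. More on that here.)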
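The extraction and rebuilding steps around Scan Tailor can be sketched as a pair of shell functions. This is an illustration rather than a tested workflow: the function names and the page prefix are assumptions, sort -V and xargs -0 are GNU options, and file names are assumed to contain no newlines. Sorting explicitly is one way to sidestep the naming/numbering issue mentioned above.

```shell
#!/bin/sh
# Sketch of the extract and rebuild steps; names are illustrative.

# Extract one TIFF per page of PDF $1 into directory $2 (page-<n>.tif).
extract_pages() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir" && pdftoppm -tiff "$pdf" "$outdir/page"
}

# List the .tif files in directory $1 in natural page order,
# so that page-2.tif sorts before page-10.tif.
list_pages() {
    find "$1" -maxdepth 1 -name '*.tif' | sort -V
}

# Build PDF $2 from the images in directory $1, in page order.
# With a very large number of files xargs may split the call;
# a few hundred pages should fit in a single invocation.
build_pdf() {
    dir=$1
    out=$2
    list_pages "$dir" | tr '\n' '\0' | xargs -0 img2pdf -o "$out"
}
```

A possible sequence is extract_pages MY_PDF.pdf pages, then processing the pages folder in Scan Tailor, then build_pdf on whichever folder Scan Tailor wrote its output to.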

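The resulting "tailored" PDF will be smaller than the initial PDF, but the percentage of reduction depends on factors I am not aware of. (I suppose, though, that in step 1 the pages should be extracted in the format they already have inside the initial PDF. I think jpeg and tif should be used rather than png. Before processing with the commands above and below, run pdfimages -list your.pdf in a terminal to see details on format, dpi, and more.)

The final PDF can be reduced further with this command: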

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

More details here.
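The same Ghostscript call can be wrapped so that the preset is a parameter. This is a minimal sketch: shrink_pdf and shrink_name are hypothetical names, and the output naming scheme is an assumption. The accepted presets are the four -dPDFSETTINGS values that appear in this answer.

```shell
#!/bin/sh
# Sketch: shrink a PDF with a chosen Ghostscript preset.

# Output name: book.pdf + ebook -> book_ebook.pdf
shrink_name() {
    printf '%s_%s.pdf\n' "${1%.pdf}" "$2"
}

# $1 = input PDF, $2 = preset (default: ebook)
shrink_pdf() {
    in=$1
    preset=${2:-ebook}
    # Reject anything that is not one of the presets used in this answer.
    case "$preset" in
        screen|ebook|printer|prepress) ;;
        *) echo "unknown preset: $preset" >&2; return 2 ;;
    esac
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 "-dPDFSETTINGS=/$preset" \
        -dNOPAUSE -dQUIET -dBATCH \
        "-sOutputFile=$(shrink_name "$in" "$preset")" "$in"
}
```

For example, shrink_pdf output.pdf screen would write output_screen.pdf next to the input, leaving the original untouched so the sizes can be compared.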

Here is a set of Dolphin service menu actions based on the link above:

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=shrink;shrink0;shrink1;shrink2;
X-KDE-Submenu=PDF action: SHRINK
Icon=application-pdf

[Desktop Action shrink]
Name=Shrink pdf to "printer" size, 300dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/printer    -sOutputFile="${f%.pdf}_printer.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

[Desktop Action shrink0]
Name=Shrink pdf to "prepress" size, 300dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress    -sOutputFile="${f%.pdf}_prepress.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';


[Desktop Action shrink1]
Name=Shrink pdf to "ebook" size, 150dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook    -sOutputFile="${f%.pdf}_small.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

[Desktop Action shrink2]
Name=Shrink pdf to "screen" size, 72dpi
Icon=application-pdf
Exec=bash -c 'f="%u"; pdf=$(gs -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/screen    -sOutputFile="${f%.pdf}_smaller.pdf" "$f"); kdialog --title "Shrink" --msgbox "Done! $pdf";';

I also got some help from this answer.

OCR (text search and copy capability) is lost during the above procedure, if present in the initial pdf. In order to get OCR, use ocrmypdf input.pdf output.pdf for English, as said here. For other languages, look for them with apt-cache search tesseract-ocr, and install them. Add -l <LANG> at the end of the command for specific languages; more here; see their names also here.

Here is a Dolphin service menu action for Romanian OCR with two options (one with progress in terminal and fixed output name, the other with background process but with output name based on input; I would like to have both process in terminal and output name based on input but don't know how; if someone can do it, please post here!). For English, replace "Romanian" and remove the -l ron variable:

[Desktop Entry]
Type=Service
ServiceTypes=KonqPopupMenu/Plugin
MimeType=application/pdf;
Actions=ocr1;ocr2;
X-KDE-Submenu=PDF action: apply OCR
Icon=application-pdf

[Desktop Action ocr1]
Name=Apply OCR Romanian (see progress in terminal; output name: ocr_ro.pdf!)
Icon=application-pdf
Exec=konsole --noclose -e ocrmypdf "%u" ocr_ro.pdf -l ron

[Desktop Action ocr2]
Name=Apply OCR Romanian (background process: NO terminal! input>output name)
Icon=application-pdf
Exec=bash -c 'f="%u"; ocrmypdf "$f" "${f%.pdf}_ocr.pdf" -l ron;'
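One possible way to get both progress in a terminal and an output name based on the input is to move the naming logic into a small wrapper script and let konsole run that script. The sketch below is untested in Dolphin; the script name ocr-named.sh, the function names and the default language are assumptions for illustration.

```shell
#!/bin/sh
# ocr-named.sh (hypothetical name): OCR a PDF and name the result
# after the input, e.g. book.pdf -> book_ocr.pdf.

# Derive the output name from the input name.
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# $1 = input PDF, $2 = language code (defaults to ron, as above)
ocr_named() {
    in=$1
    lang=${2:-ron}
    ocrmypdf "$in" "$(ocr_output_name "$in")" -l "$lang"
}

# When saved as a standalone script, uncomment the next line:
# ocr_named "$@"
```

Assuming the script is executable and saved under some path of your choice, an action line along the lines of Exec=konsole --noclose -e /path/to/ocr-named.sh "%u" should then show the progress in a terminal while still deriving the output name from the input.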

(Extracting and processing images, as well as 'printing as pdf' removes OCR, but reducing size with ghostscript as above does not, so the "shrinking" can be applied before or after the OCR.)

Angel115 · Answer 3 · 2021-08-24T00:41:07+08:00

I got good results using ImageMagick and the following script: http://www.fmwconcepts.com/imagemagick/shadowhighlight/index.php

Here is the result with the following parameters:

./shadowhighlight -ma 100 -sa 100 -ha 00 -hw 0 -bc 20 inputFile.png OutputFile.png
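Since the PDF in the question has hundreds of pages, the command would have to be repeated for every page image. A rough sketch of such a loop follows, assuming the pages were already extracted as PNG files and that the shadowhighlight script sits in the current directory. The function name, the folder arguments and the DRY_RUN switch are hypothetical; the parameters are copied from the command above.

```shell
#!/bin/sh
# Sketch: apply shadowhighlight to every PNG in $1, writing to $2.
# DRY_RUN=1 prints the commands instead of running them.
clean_pages() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for f in "$src"/*.png; do
        [ -e "$f" ] || continue
        # Same parameters as the single-file command above.
        set -- ./shadowhighlight -ma 100 -sa 100 -ha 00 -hw 0 -bc 20 \
            "$f" "$dst/$(basename "$f")"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s\n' "$*"
        else
            "$@"
        fi
    done
}
```

Running it first with DRY_RUN=1 on a copy of a few pages is a cheap way to check the file names before processing the whole set.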

Ajay · Answer 4 · 2021-08-27T04:55:26+08:00

Just install GIMP (preferably as an AppImage). These are the options:

First option: select Colors > Threshold and it is done; your image will be black and white. You have to do this for each page, though.

Second option: select Image > Mode > Indexed > Use black and white (1-bit) palette.

However many pages your PDF has, this converts all of them to 1-bit black and white.

Edit on 02/11/2021, in response to a question raised by cipricus:

Here are steps that I follow:

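Scan the pages with "Simple Scan" or XSane (I found Simple Scan does a better job in color), or use an already scanned PDF.
File > Open, or drag and drop the PDF file into GIMP. Here you need to give the width x height of the image you need. (Check which dpi you need, 150 dpi or 300 dpi, and set the width accordingly.)
The pages of a multi-page PDF now open as layers.
Go to Image > Mode > Indexed > Use black and white (1-bit) palette.
Now I export the PDF with File > Export As.
Check that every page of the exported PDF meets the requirements. If not, I process each defective page separately as follows:
a) Select Image > Mode > Grayscale.
b) If there is too much gray/noise on the page, select Colors > Exposure and adjust as needed.
c) Select Colors > Threshold; once done, the image is black and white. Repeat this for each defective page until it matches the required quality.
d) Insert the edited page as a layer into the original PDF's layer stack, delete the defective page's layer, and export the PDF again.
Hope this helps.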
