Why does my UTF-8 file name always match a regex bracket expression in Perl?

jsx97 · Asked 2025-04-15 04:03:18 +0800 CST

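Here is a script that fixes Cyrillic file names that get corrupted when files are moved from Windows to a Mac (based on an answer to a question about recovering file names that were garbled by the use of different encodings):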

#!/bin/zsh

# Usage: <script> <target directory>
# Requires Perl::Rename

find "$1" -mindepth 1 -print0 |
  rename -0 -d -e '
    use Unicode::Normalize qw(NFC);
    use Encode qw(:all);

    if ($_ =~ /[†°Ґ£§•¶І®©™Ђђ≠]/) {
      my $check = DIE_ON_ERR | LEAVE_SRC;
      my $new = eval {encode("UTF-8",
                      decode("cp866",
                      encode("mac-cyrillic",
                      NFC(decode("UTF-8", $_, $check)), $check), $check))
                     };
      if ($new) {$_ = $new;} else {warn $@;}
    }'

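I want it to rename only those files in the target directory whose names contain at least one of the following characters: †°Ґ£§•¶І®©™Ђђ≠. But for some reason the script renames all the files there: for example, the correct file name срочно.txt is changed to the meaningless ёЁюўэю.txt. What am I doing wrong?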

The path to my test folder is simple, /Users/john/scripts/test: no spaces, no Cyrillic letters, no special characters.

The script uses the macOS (BSD) version of find.

Update two days after the question was answered: the versions by Stéphane Chazelas and choroba work fine for me. terdon's version does not work for me yet.

4 Answers

choroba · Answer 1 · 2025-04-15T05:46:27+08:00

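By default, Perl does not treat its source code as UTF-8. If you use UTF-8 encoded characters in the code, you need to tell Perl so, otherwise it treats them as individual bytes (in this case, byte 209 matches).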

use utf8;

Also, you should use rename's -u option to tell Perl that the file names are UTF-8 encoded (or specify any other encoding as needed). So write your script as:

#!/bin/zsh

# Usage: <script> <target directory>
# Requires Perl::Rename

find "$1" -mindepth 1 -print0 |
  rename -0 -u -d -e '
    use Unicode::Normalize qw(NFC);
    use Encode qw(:all);
    use utf8;
    if ($_ =~ /[†°Ґ£§•¶І®©™Ђђ≠]/) {
      my $check = DIE_ON_ERR | LEAVE_SRC;
      my $new = eval {encode("UTF-8",
                      decode("cp866",
                      encode("mac-cyrillic",
                      NFC(decode("UTF-8", $_, $check)), $check), $check))
                     };
      if ($new) {$_ = $new;} else {warn $@;}
    }'

Tested with the following Makefile (fix is the script itself):

.PHONY: test
test:
    mkdir path
    touch path/срочно.txt
    touch path/†°Ґ£§•¶І®©™Ђђ≠
    ./fix path
    ls path

.PHONY: clean
clean:
    rm -rf path

Output:

абвгдежзийклмн  срочно.txt

Stéphane Chazelas · Answer 2 · 2025-04-16T00:40:04+08:00

You are matching against the undecoded file name, so you need to do the decoding (the decode("UTF-8", $_, $check) and NFC() parts) before doing the matching.

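Also, as already mentioned, perl on Unix by default interprets its code as iso8859-1 (or rather at byte level, without doing any decoding), not UTF-8, so /[†°Ґ£§•¶І®©™Ђђ≠]/ won't work unless you tell it that those characters are expressed in UTF-8.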

So /[†°Ґ£§•¶І®©™Ђђ≠]/ is actually the same as /[\x{E2}\x{80}\x{A0}\x{C2}\x{B0}\x{D2}\x{90}\x{C2}\x{A3}\x{C2}\x{A7}\x{E2}\x{80}\x{A2}\x{C2}\x{B6}\x{D0}\x{86}\x{C2}\x{AE}\x{C2}\x{A9}\x{E2}\x{84}\x{A2}\x{D0}\x{82}\x{D1}\x{92}\x{E2}\x{89}\x{A0}]/, where you'll recognise \xe2\x80\xa0 as the UTF-8 encoding of the † character:

$ printf %s '†' | iconv -t utf-8 |  od -An -vtx1
 e2 80 a0

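That regex matches any string that contains a \xe2, or a \x80, or a \xa0 character, and so on. If you don't do any decoding of the file name, it matches any string in which the encoding of some character contains a \xe2, \x80... byte, and thousands of characters contain such bytes when encoded in UTF-8, for example р (U+0440), whose UTF-8 encoding is 0xd1 0x80.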

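use utf8, as suggested by others, tells perl that its code is encoded in UTF-8, but that needs to be done at the start of the script. Here, the perl code is passed as a regular argument to the rename script (not as a code argument to perl) and is evaluated by that script as part of an eval statement, so a use utf8 added there does not apply. Compare: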

$ perl -e 'use utf8; printf "%#x\n", ord("≠")'
0x2260

Same as:

$ perl -Mutf8 -e 'printf "%#x\n", ord("≠")'
0x2260

That is the code point of ≠, properly decoded from UTF-8. With:

$ rename 'use utf8; printf "%#x\n", ord("≠")' .
0xe2

That is the value of the first byte of the UTF-8 encoding of ≠ (and also the code point of â (U+00E2), which is encoded as 0xe2 in iso8859-1).

$ rename -u -e 'printf "%#x\n", ord("≠")' .
0xe2

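Using -u doesn't help, because it concerns the decoding/encoding of the file names, not of the perl code. Here we don't want to use -u anyway, because we want to do our own decoding/encoding and check whether it succeeds.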

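Here, you could use PERL_UNICODE=A rename... to tell perl that the script's Arguments are expected to be encoded in UTF-8, or you could use \x{HHHH} or \N{character name} to represent those characters and keep the code in ASCII: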

find "$@" -depth -mindepth 1 -print0 |
  rename -0 -d -e '
    use Unicode::Normalize qw(NFC);
    use Encode qw(:all);
    use utf8;
    my $check = DIE_ON_ERR | LEAVE_SRC;
    my $new = eval {NFC(decode("UTF-8", $_, $check))};
    if ($new) {
      if ($new =~ /[\N{DAGGER}\N{DEGREE SIGN}\N{CYRILLIC CAPITAL LETTER GHE WITH UPTURN}\N{POUND SIGN}\N{SECTION SIGN}\N{BULLET}\N{PILCROW SIGN}\N{CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I}\N{REGISTERED SIGN}\N{COPYRIGHT SIGN}\N{TRADE MARK SIGN}\N{CYRILLIC CAPITAL LETTER DJE}\N{CYRILLIC SMALL LETTER DJE}\N{NOT EQUAL TO}]/) {
        $new = eval {encode("UTF-8",
                     decode("cp866",
                     encode("mac-cyrillic", $new, $check), $check))
                    };
        if ($new) {$_ = $new;} else {warn $@;}
      }
    } else {warn $@}'

(I used uconv -x name to get those character names, and uconv -x hex/perl to get the \x{HHHH} form.)

Or have find do the matching (assuming the find/fnmatch() implementation works with multibyte characters), with:

find . -depth -mindepth 1 '(' -name '*[†°Ґ£§•¶І®©™Ђђ≠]*' -o \
  -name $'*=\u0338*' ')' -print0 |
  rename -0 -d -e '
    use Unicode::Normalize qw(NFC);
    use Encode qw(:all);
    my $check = DIE_ON_ERR | LEAVE_SRC;
    my $new = eval {encode("UTF-8",
                    decode("cp866",
                    encode("mac-cyrillic",
                    NFC(decode("UTF-8", $_, $check)), $check), $check))
                   };
    if ($new) {$_ = $new} else {warn$@}'

(where =\u0338 is the decomposed form of the ≠ character, which macOS may use in file names¹).

Or use zsh globs instead of find:

print -rNC1 -- $^@/**/*(=$'\u0338'|[†°Ґ£§•¶І®©™Ђђ≠])*(NDod) |
  same rename command as above.

¹ And NFC() in the perl code converts it to its Composed form (≠), which the decode/encode chain then converts to н.

terdon · Answer 3 · 2025-04-15T22:20:07+08:00

The problem is that you are matching against $_ without treating it as Unicode. You need to decode $_ into Unicode first and then match. The following should work:

#!/bin/zsh

# Usage: <script> <target directory>
# Requires Perl::Rename

find "$1" -mindepth 1 -print0 |
  rename -0 -d -e '
    use Unicode::Normalize qw(NFC);
    use Encode qw(:all);

    if (decode("UTF-8",$_) =~ /[†°Ґ£§•¶І®©™Ђђ≠]/) {
      my $check = DIE_ON_ERR | LEAVE_SRC;
      my $new = eval {encode("UTF-8",
                      decode("cp866",
                      encode("mac-cyrillic",
                      NFC(decode("UTF-8", $_, $check)), $check), $check))
                     };
      if ($new) {$_ = $new;} else {warn $@;}
    }'

I tested with (where foo.sh is the script above):

$ /home/terdon/perl5/bin/rename --version
/home/terdon/perl5/bin/rename using File::Rename version 2.02, File::Rename::Options version 2.01

and:

$ ls -l
total 0
-rw-r--r-- 1 terdon terdon 0 Apr 15 18:11 abd§•¶
-rw-r--r-- 1 terdon terdon 0 Apr 15 18:11 file.foo
-rw-r--r-- 1 terdon terdon 0 Apr 15 18:11 срочно.txt

$ foo.sh .

$ ls -l
total 0
-rw-r--r-- 1 terdon terdon 0 Apr 15 18:11 abdдеж
-rw-r--r-- 1 terdon terdon 0 Apr 15 18:11 file.foo
-rw-r--r-- 1 terdon terdon 0 Apr 15 18:11 срочно.txt

jsx97 · Answer 4 · 2025-04-16T06:39:20+08:00

Here is my own version, with a few extra tweaks and a test case.

#!/bin/zsh

perl -MFile::Rename -e 1 2>/dev/null || {
  echo "Error: Perl module File::Rename is not installed." >&2
  exit 1
}

rename_all=false

find "$1" -mindepth 1 -depth -print0 |
  while IFS= read -r -d '' file; do
    name="${file##*/}"
    if $rename_all || echo "$name" | grep -q '[†°Ґ£§•¶І®©™Ђђ≠]'; then
      rename -0 -d -e '
        use Unicode::Normalize qw(NFC);
        use Encode qw(:all);
        my $check = DIE_ON_ERR | LEAVE_SRC;
        my $new = eval {
          encode("UTF-8",
            decode("cp866",
              encode("mac-cyrillic",
                NFC(decode("UTF-8", $_, $check)), $check), $check))
        };
        if ($new) { $_ = $new } else { warn $@ }' "$file"
    fi
  done

Before:

target-dir
├── abc1.txt
├── срочно1.txt
├── бваг™вга†1
│   ├── abc2.txt
│   ├── срочно2.txt
│   ├── бваг™вга†2
│   │   ├── abc.txt
│   │   ├── срочно.txt
│   │   └── бваг™вга†.txt
│   └── бваг™вга†2.txt
└── бваг™вга†1.txt

After:

target-dir
├── abc1.txt
├── срочно1.txt
├── структура1
│   ├── abc2.txt
│   ├── срочно2.txt
│   ├── структура2
│   │   ├── abc.txt
│   │   ├── срочно.txt
│   │   └── структура.txt
│   └── структура2.txt
└── структура1.txt

rename --version: /Users/john/perl5/bin/rename using File::Rename version 2.02, File::Rename::Options version 2.01
