Question

schrodingerscatcuriosity

Asked: 2022-04-28 16:10:44 +0800 CST

If a column has multiple values, duplicate the row once for each value

I have a file in the following format, with the columns separated by tabs:

C1  C2     C3
a   b,c    d
e   f,g,h  i
j   k      l
...

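Now I need to expand each row into as many rows as there are comma-separated values in the second column (when it holds more than one). Each resulting row must carry exactly one of those values and none of the others. The result would look like this: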

C1  C2  C3
a   b   d
a   c   d
e   f   i
e   g   i
e   h   i
j   k   l
...

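Since this was needed for work as soon as possible, I just threw together a don't-do-this-at-home script that reads the file line by line with while, because I lack the awk skills and had not explored other possible solutions with other tools. The script is as follows:

(Meanwhile I am revising the script.)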

# DON'T DO THIS AT HOME SCRIPT
> duplicados.txt
while IFS= read -r line; do
  # get the value of the column of interest
  cues="$(echo "$line" | awk -F'\t' '{ print $18 }')"
  # if the column has commas then it has multiple values
  if [[ "$cues" =~ , ]]; then
    # count the commas
    c=$(printf "%s" "$cues" | sed 's/[^,]*//g' | wc -c)
    # loop according to the number of commas
    for i in $(seq $(($c + 1))); do
      # get each value of the column of interest according to the position
      cue="$(echo "$cues" | awk -F',' -v c=$i '{ print $c; ++c }')"
      # save the line to a file substituting the whole column for the value
      echo "$line" | sed "s;$cues;$cue;" >> duplicados.txt
    done
    continue
  fi
  # save the single value lines
  echo "$line" >> duplicados.txt
done < inmuebles.txt

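With this I get the desired result (as far as I can tell). As you can imagine, the script is slow and very inefficient. How could I do this with awk or other tools?

A sample of the real data looks like this; the column of interest is number 18: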

1409233 UNION   VIAMONTE    Estatal Provincial  DGEP    3321    VIAMONTE                            -33.7447365;-63.0997115 Rural Aglomerado    140273900   140273900-ESCUELA NICOLAS AVELLANEDA
1402961 UNION   SAN MARCOS SUD  Estatal Provincial  DGEA, DGEI, DGEP    3029, 3311, Z11 SAN MARCOS SUD                          -32.629557;-62.483976 / -32.6302699949582;-62.4824499999125 / -32.632417;-62.484932 Urbano  140049404, 140164000, 140170100, 140173100  140049404-C.E.N.M.A. N° 201 ANEXO SEDE SAN MARCOS SUD, 140164000-C.E.N.P.A. N° 13 CASA DE LA CULTURA(DOC:BERSANO), 140170100-ESCUELA HIPOLITO BUCHARDO, 140173100-J.DE INF. HIPOLITO BUCHARDO
1402960 UNION   SAN ANTONIO DE LITIN    Estatal Provincial  DGEA, DGEI, DGETyFP 3029, TZONAXI, Z11  SAN ANTONIO DE LITIN    3601300101020009    360102097366    0250347         SI / SI -32.212126;-62.635999 / -32.2122558;-62.6360432 / -32.2131931096409;-62.6291815804363   Rural Aglomerado    140049401, 140313000, 140313300, 140483400, 140499800   140049401-C.E.N.M.A. N° 201 ANEXO SAN ANTONIO DE LITIN, 140313000-I.P.E.A. Nº 214. MANUEL BELGRANO, 140313300-J.DE INF. PABLO A. PIZZURNO, 140483400-C.E.N.P.A. DE SAN ANTONIO DE LITIN, 140499800-C.E.N.P.A. B DE SAN ANTONIO DE LITIN

5 Answers

steeldriver · Answer 1 · 2022-04-28T16:21:35+08:00

Best Answer

You can do this in awk by splitting the compound column and looping over the result:

awk -F'\t' 'BEGIN{OFS=FS} {n=split($2,a,/,/); for(i=1;i<=n;i++){$2 = a[i]; print}}' file

Perhaps more cleanly, you could do it with Miller, specifically using the nest verb:

$ cat file
C1      C2      C3
a       b,c     d
e       f,g,h   i
j       k       l

$ mlr --tsv nest --explode --values --across-records --nested-fs ',' -f C2 file
C1      C2      C3
a       b       d
a       c       d
e       f       i
e       g       i
e       h       i
j       k       l

More compactly, --explode --values --across-records --nested-fs ',' can be replaced with --evar ','
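For comparison, the same explode-by-column-name idea can be sketched with Python's standard csv module. This is only an illustration under assumptions: the helper name explode_column is made up here, the input is assumed to have a header row, and quoting is switched off so fields are taken literally.

```python
import csv
import io


def explode_column(tsv_text, column, sep=","):
    """Yield one row (as a dict) per sep-separated value found in `column`."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        for value in row[column].split(sep):
            # Copy the row and override only the column being exploded.
            yield dict(row, **{column: value})


# Same sample as in the question, with real tab characters.
sample = "C1\tC2\tC3\na\tb,c\td\ne\tf,g,h\ti\nj\tk\tl\n"

for row in explode_column(sample, "C2"):
    print("\t".join(row.values()))
```

The loop prints the data rows only; the header would have to be written separately if it is needed in the output.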

10

Philippos · Answer 2 · 2022-04-29T00:05:42+08:00

Since you also tagged the question with sed, I feel urged to add a sed solution:

sed -e '/,/{s//\n/;h;s/[^\t]*\n//;x;s/\n[^\t]*//p;G;D;}'

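(Note: for readability I used \n for newlines and \t for tabs, as you can with GNU sed. For a portable solution, use a backslash followed by an actual newline instead of \n, and an actual tab instead of \t, typed as Ctrl-V followed by Tab.)

Lines with a comma are copied to the hold space. One copy prints what comes before the comma; the other copy, holding the part after the comma, goes into the next cycle. In detail:

- To avoid confusion with multiple commas, we replace the first one with a newline: s//\n/
- h saves a copy to the hold space before we mess with the line
- s/[^\t]*\n// removes the part of the field before the first comma
- Then we x to exchange the buffers
- s/\n[^\t]*//p removes the part of the field starting at the comma and prints the line
- G appends the hold space to the pattern space. This may contain additional commas, so
- D deletes the first line (the one already printed) and starts over with the rest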
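The cycle described in these steps can be mimicked in a few lines of Python. This is a rough sketch of the idea, not a translation of sed's internals; the function name explode_like_sed is made up for illustration. Like the sed command, it acts on the first comma found anywhere on the line, so it assumes that only the column of interest contains commas.

```python
def explode_like_sed(line):
    """Emit one row per comma-separated value, peeling off one value per cycle."""
    printed = []
    while "," in line:                              # /,/ : the line still has a comma
        head, tail = line.split(",", 1)             # s//\n/ : cut at the first comma
        before, tab1, _value = head.rpartition("\t")
        _rest, tab2, after = tail.partition("\t")
        printed.append(head + tab2 + after)         # copy without the text after the comma
        line = before + tab1 + tail                 # copy without the text before it (D restarts)
    printed.append(line)                            # no comma left: printed as is
    return printed


for row in explode_like_sed("e\tf,g,h\ti"):
    print(row)
```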

4

dave_thompson_085 · Answer 3 · 2022-04-29T20:04:12+08:00

awk (or perl in awk mode) is probably the best standard solution, but you can do this reasonably efficiently in most shells, especially those with arrays (ksh, bash, zsh):

set -f # split but don't glob unquoted substitutions
#bash
while IFS=$'\t' read -ra ary; do 
#ksh
while read -r line; do IFS=$'\t'; ary=($line)
#zsh I haven't worked out

  IFS=,; for v in ${ary[17]}; do 
    ary[17]=$v; IFS=$'\t'; printf '%s\n' "${ary[*]}"
  done
  # bash,ksh arrays are 0-origin versus 1-origin fields in awk
  # we don't need to special-case no-comma, it splits to a single value
done <input >output

For older or more limited shells without arrays, use the positional parameters instead, e.g. (details may vary):

set -f
while read -r line; do IFS=$'\t'; set -- $line
  IFS=,; for v in ${18}; do
    # can't alter $num so yucky
    # loop with the [ builtin rather than $(seq $#): while IFS=, is in effect,
    # seq's newline-separated output would not be split into separate words
    i=1; while [ "$i" -le "$#" ]; do
      case $i in (1);; (*) printf '\t';; esac
      case $i in (18) printf %s "$v";; (*) eval printf %s \"\${$i}\";; esac
      i=$((i+1))
    done
    printf '\n' # terminate the output record
  done
done <input >output
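One thing worth checking against the real data: tab counts as IFS whitespace in the shell, so a run of consecutive tabs is treated as a single delimiter and empty cells disappear, which shifts later columns to lower positions. If the file has empty cells before column 18, a split that keeps them is needed. A minimal Python sketch of such a split follows; explode_field is a hypothetical helper and the sample row is made up.

```python
def explode_field(line, index, sep=","):
    """Return one tab-joined row per sep-separated value in fields[index] (0-based)."""
    fields = line.split("\t")      # str.split with an explicit separator keeps empty cells
    rows = []
    for value in fields[index].split(sep):
        fields[index] = value
        rows.append("\t".join(fields))
    return rows


# Hypothetical row: the second cell is empty, the third holds two values.
row = "x\t\tp,q\tz"
for out in explode_field(row, 2):
    print(repr(out))
```

The index is 0-based here, as in the bash and ksh arrays above, so column 18 of the real file would be index 17.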

guest_7 · Answer 4 · 2022-05-01T00:59:40+08:00

Using perl:

## Column of Interest
CoI=2 
perl -sF'\t' -aple '$"="\t";
  $_ = join $\, map { $F[$I]=$_;"@F" } split /,/, $F[$I]
' -- -I="$((CoI-1))" file

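Perl options used:
- -p makes Perl read the file line by line and print the record before moving on to the next one.
- -a turns on autosplit mode: the input record ($_) is split and the pieces are stored in the array @F.
- -l strips the trailing newline from each input record and sets the output record separator to newline.
- -s turns on rudimentary switch processing. With its help we set the global variable $I from the command line.
- -F specifies the field separator.
- -e specifies the Perl code.
Perl built-in variables used:
- $_ is the record currently being processed.
- $" is the separator placed between array elements when an array is interpolated into a double-quoted string.
- @F holds the split fields of the current record. It is zero-indexed.
- $\ is the output record separator.
Perl code:
- Split the column of interest $F[$I] on commas, assign each piece in turn to that column, join the array @F with the $" variable each time, and assign the results, joined with $\, to the input record ($_).
- The default action is to print the input record automatically.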

Using awk:

CoI=2
awk -F '\t' -v coi="$CoI" '
BEGIN { OFS=FS;s[1]=ORS }
NF >= coi {
  split($(coi),a,",")
  for (i=t=""; ++i in a;) {
    $(coi) = a[i]
    t = t s[i>1] $0
  }
  $0=t
}1
' file

Using GNU sed in extended regular expression mode (-E):

CoI=2
sed -E '
  s/[^\t]+/\n&\n/'"$CoI"'
  s/(\n.*)(\n.*)/\2\1,/
  :loop
    s/\n(.*\n)([^,]+),/\2\1/
    P;/\n$/d
    s/[^\t]+/\n/'"$CoI"'
  tloop
' file

Here is a step-by-step trace of what sed is doing:

   pat_spc       output
 a b,c,d e         -
 a \nb,c,d\n e     -
 a \n e\nb,c,d,    -
 a b e\nc,d,      a b e
 a \n e\nc,d,       -
 a c e\nd,        a c e
 a \n e\nd,         -
 a d e\n          a d e
/\n$/ stop, fetch next line
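The trace can be reproduced outside sed. The following is a rough Python sketch of the same marker-and-loop idea, not the answer's code: nth_field_sub and explode_with_marker are hypothetical helper names, and the sketch assumes every line has at least coi fields and that the column of interest is not the last one.

```python
import re


def nth_field_sub(text, n, repl):
    """Replace the n-th run of non-tab characters, like s/[^\\t]+/.../n in sed."""
    matches = list(re.finditer(r"[^\t]+", text))
    if len(matches) < n:
        return text, False
    m = matches[n - 1]
    return text[:m.start()] + repl(m.group()) + text[m.end():], True


def explode_with_marker(line, coi):
    """Return the rows the sed loop would print for one input line."""
    printed = []
    # Wrap the column of interest in newlines, then move it to the end
    # of the pattern space and append a trailing comma.
    ps, _ = nth_field_sub(line, coi, lambda field: "\n" + field + "\n")
    ps = re.sub(r"(\n.*)(\n.*)", r"\2\1,", ps, count=1, flags=re.S)
    while True:
        # Move the next value from the list at the end into the marker position.
        ps = re.sub(r"\n(.*\n)([^,]+),", r"\2\1", ps, count=1, flags=re.S)
        printed.append(ps.split("\n", 1)[0])   # P: print up to the first newline
        if ps.endswith("\n"):                  # /\n$/d: the value list is exhausted
            break
        # Put the marker back in place of the value just printed.
        ps, changed = nth_field_sub(ps, coi, lambda field: "\n")
        if not changed:                        # mirrors t: loop only after a substitution
            break
    return printed


for row in explode_with_marker("a\tb,c,d\te", 2):
    print(row)
```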

Using python3:

CoI=2
python3 -c 'import sys

ifile,coi = sys.argv[1:]
coi = int(coi)-1
fs,rs,ofs,ors = ("\t","\n") * 2

with open(ifile) as f:
  for l in f:
    F = l.rstrip(rs).split(fs)
    for e in F[coi].split(","):
      F[coi] = e
      print(*F,sep=ofs)
' file "$CoI"

Using Bourne shell built-ins with the positional-parameter array:

cleanup() {
  echo cleaning up temp files... >&2
  /bin/rm -f -- "$temp"
}
trap cleanup EXIT
set -u

#-------------------+
# user input section
#-------------------+
CoI=2
inp='inmuebles.txt'
#-------------------+

: <<\_README_
1. CoI standing for column of interest.
2. CoI must be a positive integer.
3. CoI must not be more than the number of fields in the input.
4. inp stores the input file name, 
possibly with full or relative path, to make it accessible to the script.
5. All lines must have same number of fields.
6. No field to have TAB and/or newline.
7. Field separator is TAB.
8. File should be readable by the user and be a regular ascii text file.
_README_

IFS=$(printf '\t')
temp=$(mktemp) del=

while IFS= read -r line <&3
do
  set -f;set -- $line;set +f

  : ${del:=$(dc <<eof
1 $# $CoI-+n
eof
)}

  while case $# in "$del") break;; esac
  do
    printf '%s\t' "$1"
    shift
  done > "$temp"

  x1=$1;shift

  for csv in $(set -f;IFS=',';set -- $x1;printf '%s\t' "$@")
  do
    printf '%s' "$(cat < "$temp")" "$csv"
    case $# in 0) echo; continue;; esac
    printf '\t%s\n' "$*"
  done
done 3< "$inp"
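Some of the constraints listed in the README heredoc (a positive CoI, CoI within the field count, equal field counts on all lines) can be checked before running the script. A rough sketch under assumptions: check_input is a hypothetical helper, not part of the answer, and it only looks at field counts and the CoI value.

```python
def check_input(lines, coi):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not (isinstance(coi, int) and coi >= 1):
        problems.append("CoI must be a positive integer")
        return problems
    # Count TAB-separated fields on every line (empty fields included).
    widths = {len(line.rstrip("\n").split("\t")) for line in lines}
    if len(widths) > 1:
        problems.append(f"lines have differing field counts: {sorted(widths)}")
    if widths and coi > min(widths):
        problems.append(f"CoI={coi} exceeds the number of fields ({min(widths)})")
    return problems


sample_lines = ["C1\tC2\tC3\n", "a\tb,c\td\n", "j\tk\tl\n"]
print(check_input(sample_lines, 2))
```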

Praveen Kumar BS · Answer 5 · 2022-04-28T23:22:31+08:00

 while read line
 do
 fic=$(echo $line | awk '{print $1}')
 laco=$(echo $line | awk '{print $NF}')
 secon_colu=$(echo $line| awk '$2 ~ /,/{print $2}')
 if [[ "$secon_colu" =~ "," ]]
 then
 for ko in $(echo $line | awk '$2 ~ /,/{print $2}'| sed 's/,/ /g')
 do
 echo "$fic $ko  $laco"
 done
 else
 echo $line
 fi
 done<file.txt

Output

C1 C2 C3
a b  d
a c  d
e f  i
e g  i
e h  i
j k l

-2
