Questions asked by Hot JAMS - computer

Hot JAMS

Asked: 2021-07-24 03:13:39 +0800 CST

Using find in bash to operate on files according to their extension pattern

6

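I am using the "find" command to delete some useless files with a given extension from all subfolders of the directory where the command is executed.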

find -type f -name *.txt -delete
find -type f -name *.xml -delete
find -type f -name *.pdbqt -delete
find -type f -name *.txt -delete

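Is it possible to combine these 4 commands into a single one that deletes all 4 types of files at once? And is it possible to use find to delete every file that does not have the *.dlg extension?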
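Both requests can probably be covered by grouping the -name tests. The sketch below is only an illustration: it works in a throw-away directory with made-up file names, and it relies on -delete, which GNU and BSD find provide but POSIX does not require.

```shell
#!/bin/bash
# Demo in a temporary directory so that nothing real is deleted.
# All file names below are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/sub1" "$demo/sub2"
touch "$demo/sub1/a.txt" "$demo/sub1/b.xml" "$demo/sub2/c.pdbqt" \
      "$demo/sub2/d.log" "$demo/sub1/keep.dlg"

# 1) One command for several extensions: group the -name tests with
#    \( ... -o ... \). The patterns are quoted so that the shell does not
#    expand them before find sees them.
find "$demo" -type f \( -name '*.txt' -o -name '*.xml' -o -name '*.pdbqt' \) -delete
left_after_step1=$(find "$demo" -type f | wc -l)

# 2) Delete every regular file whose name does NOT end in .dlg.
find "$demo" -type f ! -name '*.dlg' -delete
left_after_step2=$(find "$demo" -type f | wc -l)

echo "files left after step 1: $left_after_step1"
echo "files left after step 2: $left_after_step2"
rm -rf "$demo"
```

Replacing -delete with -print first is a cheap way to check which files a pattern would hit before removing anything.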

Hot JAMS

Asked: 2021-07-23 23:40:24 +0800 CST

A bash workflow for operating on a large number of tar.gz archives

5

ls -t
pnmrnp40_to_69  pnmrnp9028_to_9100  pnmrnp00_to_39  pnmrnp70_to_9028

Each pnmrnp* subdirectory contains many files, which are either *.tar.gz archives or *.md5sum files (I don't know what those are, so they should be deleted).

charlie@Precision-7920-Tower:~/Documents/script/mega_data/pnmrnp/pnmrnp40_to_69$ ls -t
ligands57_dir_results.tar.gz.md5sum  ligands40_dir_results.tar.gz.md5sum
ligands57_dir_results.tar.gz         ligands69_dir_results.tar.gz
ligands69_dir_results.tar.gz.md5sum  ligands68_dir_results.tar.gz
ligands68_dir_results.tar.gz.md5sum  ligands67_dir_results.tar.gz
ligands67_dir_results.tar.gz.md5sum  ligands66_dir_results.tar.gz
ligands66_dir_results.tar.gz.md5sum  ligands65_dir_results.tar.gz

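I need a simple bash workflow that enters each subdirectory and

deletes all *.md5sum files
unpacks every *.tar.gz into the same subfolder (keeping the name of the original archive).

Here is my workflow in bash: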

#!/bin/bash
# assuming that the script is in the folder that contains all subdirectories
dir="$PWD"

# loop over each subdirectory
for subdir in "${dir}"/*/; do
cd "${subdir}" || continue
# unpack each archive in the same place
for tar in *.tar.gz; do
tar xzvf "$tar"
done
# return to initial dir
cd ..
done

Is it possible to make this script more efficient, so that it copes with a large number of archives?
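One way to avoid the nested loops and the cd calls is to let find walk the tree and hand each archive to tar with -C. The following is a sketch under assumptions: the directory and file names are invented to mimic the layout above, and "keeping the name of the original archive" is read as leaving the archive where it is.

```shell
#!/bin/bash
# Build a small throw-away tree that mimics the layout in the question.
root=$(mktemp -d)
for sub in pnmrnp00_to_39 pnmrnp40_to_69; do
    mkdir -p "$root/$sub/payload"
    echo "data for $sub" > "$root/$sub/payload/result.txt"
    tar -czf "$root/$sub/ligands1_dir_results.tar.gz" -C "$root/$sub" payload
    rm -r "$root/$sub/payload"
    touch "$root/$sub/ligands1_dir_results.tar.gz.md5sum"
done

# 1) Remove the checksum files in all subdirectories in one pass.
find "$root" -type f -name '*.md5sum' -delete

# 2) Unpack every archive inside the directory where it lives.
#    -C makes tar change into that directory, so no cd is needed, and
#    "{} +" passes many archives to each sh invocation.
find "$root" -type f -name '*.tar.gz' -exec sh -c '
    for archive in "$@"; do
        tar -xzf "$archive" -C "$(dirname "$archive")"
    done
' sh {} +

md5_left=$(find "$root" -name '*.md5sum' | wc -l)
unpacked=$(find "$root" -type f -name 'result.txt' | wc -l)
echo "md5sum files left: $md5_left, unpacked files: $unpacked"
rm -rf "$root"
```

The v flag of tar is dropped here on purpose: printing every extracted file name can slow things down noticeably when there are many archives.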

Hot JAMS

Asked: 2021-05-29 00:17:47 +0800 CST

Post-processing of a multi-column CSV: removing repeated lines + sorting

5

I am working with a CSV produced by concatenating (via cat) several CSVs:

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
1000,   lig40,  1,  0.805136,   -5.5200,    79
1000,   lig868, 1,  0.933209,   -5.6100,    42
1000,   lig278, 1,  0.933689,   -5.7600,    40
1000,   lig619, 3,  0.946354,   -7.6100,    20
1000,   lig211, 1,  0.960048,   -5.2800,    39
1000,   lig40,  2,  0.971051,   -4.9900,    40
1000,   lig868, 3,  0.986384,   -5.5000,    29
1000,   lig12,  3,  0.988506,   -6.7100,    16
1000,   lig800, 16, 0.995574,   -4.5300,    40
1000,   lig800, 1,  0.999935,   -5.7900,    22
1000,   lig619, 1,  1.00876,    -7.9000,    3
1000,   lig619, 2,  1.02254,    -7.6400,    1
1000,   lig12,  1,  1.02723,    -6.8600,    5
1000,   lig12,  2,  1.03273,    -6.8100,    4
1000,   lig211, 2,  1.03722,    -5.2000,    19
1000,   lig211, 3,  1.03738,    -5.0400,    21
ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V1,   lig40,  1,  0.513472,   -6.4600,    150
10V1,   lig211, 2,  0.695981,   -6.8200,    91
10V1,   lig278, 1,  0.764432,   -7.0900,    70
10V1,   lig868, 1,  0.787698,   -7.3100,    62
10V1,   lig211, 1,  0.83416,    -6.8800,    54
10V1,   lig868, 3,  0.888408,   -6.4700,    44
10V1,   lig278, 2,  0.915932,   -6.6600,    35
10V1,   lig12,  1,  0.922741,   -9.3600,    19
10V1,   lig12,  8,  0.934144,   -7.4600,    24
10V1,   lig40,  2,  0.949955,   -5.9000,    34
10V1,   lig800, 5,  0.964194,   -5.9200,    30
10V1,   lig868, 2,  0.966243,   -6.9100,    20
10V1,   lig12,  2,  0.972575,   -8.3000,    10
10V1,   lig619, 6,  0.979168,   -8.1600,    9
10V1,   lig619, 4,  0.986202,   -8.7800,    5
10V1,   lig800, 2,  0.989599,   -6.2400,    20
10V1,   lig619, 1,  0.989725,   -9.2900,    3
10V1,   lig12,  7,  0.991535,   -7.5800,    9
ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V2,   lig40,  1,  0.525767,   -6.4600,    146
10V2,   lig211, 2,  0.744702,   -6.8200,    78
10V2,   lig278, 1,  0.749015,   -7.0900,    74
10V2,   lig868, 1,  0.772025,   -7.3100,    66
10V2,   lig211, 1,  0.799829,   -6.8700,    63
10V2,   lig12,  1,  0.899345,   -9.1600,    25
10V2,   lig12,  4,  0.899606,   -7.5500,    32
10V2,   lig868, 3,  0.903364,   -6.4800,    40
10V2,   lig278, 3,  0.913145,   -6.6300,    36
10V2,   lig800, 5,  0.94576,    -5.9100,    35

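To post-process this CSV I need to 1) remove the repeats of the header line

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)

keeping the header only at the beginning of the merged CSV (in the first line!)

Then I need to sort all lines (ignoring the header in the first line) by the numbers in the 4th column (dG(rescored)).

For the first task I tried the following awk one-liner, which is meant to find the first line and then remove its repeats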

 awk '{first=$1;gsub("ID(Prot)","");print first,$0}' mycsv.csv > csv_without_repeats.csv

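However, it fails to recognise the header line, which means the pattern is not defined correctly.

Then, to sort the data by the values in the 4th column, I use:

LC_ALL=C sort -k4,4g

How can I pipe this into my AWK code, or otherwise have AWK do everything directly?

For example, I tried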

awk '{first=$1;gsub(/ID(Prot)?(\([-azA-Z]+\))?/,"");print first,$0}' | LC_ALL=C sort -k4,4g input.csv > sorted_and_without_repeats.csv

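However, the script has to be terminated by hand, even though it produces a correctly sorted CSV (which still contains the repeated header lines because of the problem in the awk part).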
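A possible approach is to print the first line once, filter out every later line that is identical to it, and sort only what remains. This is a minimal sketch on a shortened copy of the data; it assumes all header lines are byte-for-byte identical and that sort supports -g (GNU and BSD sort do).

```shell
#!/bin/bash
# Shortened sample in the same layout as the concatenated CSV.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
1000,   lig40,  1,  0.805136,   -5.5200,    79
1000,   lig619, 1,  1.00876,    -7.9000,    3
ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V1,   lig40,  1,  0.513472,   -6.4600,    150
10V1,   lig211, 2,  0.695981,   -6.8200,    91
EOF

result=$(
    # the header goes out first, untouched
    head -n 1 "$csv"
    # awk remembers line 1 and drops every later copy of it;
    # sort then orders the data rows on field 4.
    # -t, makes the comma the field separator, -g compares as floats.
    awk 'NR == 1 {header = $0; next} $0 != header' "$csv" |
        LC_ALL=C sort -t, -k4,4g
)
printf '%s\n' "$result"
rm -f "$csv"
```

Note that the input file is given to awk rather than to sort. In the pipeline from the question awk has no input file and therefore waits on standard input, while sort reads input.csv directly and ignores the pipe.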

Hot JAMS

Asked: 2021-05-28 07:11:18 +0800 CST

Sorting lines according to a column value

5

I need to sort all lines of a CSV file by the number contained in the second column, from min to max, ignoring the first line (the header):

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V1,   lig1,   1,  0.893101,   -7.2300,    36
10V1,   lig1,   3,  1.04024,    -6.5800,    4
10V1,   lig1,   4,  1.03044,    -6.5200,    7
10V1,   lig10,  1,  0.895754,   -6.0300,    47
10V1,   lig10,  2,  0.668236,   -5.9500,    112
10V1,   lig10,  3,  1.0103, -5.8200,    19
10V1,   lig1001,    1,  0.594972,   -5.6500,    142
10V1,   lig1001,    2,  1.05779,    -5.5000,    10
10V1,   lig1001,    3,  1.11195,    -4.9500,    2
10V1,   lig3,   1,  1.01583,    -5.6000,    20
10V1,   lig3,   2,  0.972203,   -5.2600,    36
10V1,   lig3,   3,  0.694967,   -5.2400,    118
10V1,   lig8,   1,  0.931977,   -7.4000,    25
10V1,   lig8,   2,  1.00413,    -7.1100,    9

The lines should be sorted like lig1, lig3, lig8, lig10, lig1001, etc.:

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
    10V1,   lig1,   1,  0.893101,   -7.2300,    36
    10V1,   lig1,   3,  1.04024,    -6.5800,    4
    10V1,   lig1,   4,  1.03044,    -6.5200,    7
    10V1,   lig3,   1,  1.01583,    -5.6000,    20
    10V1,   lig3,   2,  0.972203,   -5.2600,    36
    10V1,   lig3,   3,  0.694967,   -5.2400,    118
    10V1,   lig8,   1,  0.931977,   -7.4000,    25
    10V1,   lig8,   2,  1.00413,    -7.1100,    9
    10V1,   lig10,  1,  0.895754,   -6.0300,    47
    10V1,   lig10,  2,  0.668236,   -5.9500,    112
    10V1,   lig10,  3,  1.0103, -5.8200,    19
    10V1,   lig1001,    1,  0.594972,   -5.6500,    142
    10V1,   lig1001,    2,  1.05779,    -5.5000,    10
    10V1,   lig1001,    3,  1.11195,    -4.9500,    2

I tried

sort -k2.4,2n "${csv}" > sorted.csv

but it does not recognise the second value correctly.
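As far as I can tell, the attempt above fails because, without -t, sort counts the blanks in front of "lig" as part of field 2, so character 4 of that field is not where the number starts. The sketch below sets the comma as separator and adds the b modifier to the key start so that leading blanks are skipped before counting characters. The sample rows are a shuffled subset of the data above.

```shell
#!/bin/bash
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
10V1,   lig10,  1,  0.895754,   -6.0300,    47
10V1,   lig1001,    1,  0.594972,   -5.6500,    142
10V1,   lig3,   1,  1.01583,    -5.6000,    20
10V1,   lig1,   1,  0.893101,   -7.2300,    36
10V1,   lig8,   1,  0.931977,   -7.4000,    25
EOF

sorted=$(
    # keep the header on top
    head -n 1 "$csv"
    # -t,      : fields are separated by commas
    # -k2.4b,2 : key starts at the 4th character of field 2, counted after
    #            skipping leading blanks (b), and ends at the end of field 2
    # n        : compare that key numerically
    # -s       : stable sort, rows with the same ligand keep their order
    tail -n +2 "$csv" | LC_ALL=C sort -s -t, -k2.4b,2n
)
printf '%s\n' "$sorted"
rm -f "$csv"
```

The offset 4 is tied to the three-letter prefix "lig"; a different prefix length would need a different offset.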

Hot JAMS

Asked: 2021-05-26 05:30:43 +0800 CST

bash: counting the number of directories

5

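As part of my bash routine, I am trying to find the number of subdirectories located in the directory $storage and assign it to a variable that will be used later in the same script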

number_dirs=$(ls -ld "${storage}"/* | wc -l)
  printf >&2 '%s is the number of the directories... ' "${number_dirs}" ;sleep 0.2
  printf >&2 "Keep calm!\n"

This works for around 2-4K directories, but not for much larger numbers. How can I use the find command to do the same thing?
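A find-based count does not expand a glob on the command line, so it should not depend on the number of entries. The sketch below builds a temporary directory with invented names; -mindepth and -maxdepth are extensions available in GNU and BSD find.

```shell
#!/bin/bash
# Temporary stand-in for "${storage}" with 50 directories and one file.
storage=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    mkdir "$storage/dir_$i"
    i=$((i + 1))
done
touch "$storage/not_a_dir.txt"

# -mindepth 1 leaves out $storage itself, -maxdepth 1 stops find from
# descending, and -type d counts directories only (ls -ld "$storage"/*
# would count regular files as well).
number_dirs=$(find "$storage" -mindepth 1 -maxdepth 1 -type d | wc -l)

printf >&2 '%s is the number of the directories... ' "${number_dirs}"
printf >&2 'Keep calm!\n'
rm -rf "$storage"
```

The count is off if a directory name contains a newline, which is an assumption worth checking for unusual data sets.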

Hot JAMS

Asked: 2021-05-19 03:23:34 +0800 CST

awk: computing min/max of multi-column data over a large number of input files

6

I am working on post-processing a large number of CSV data files located in different directories. Each CSV file has the following 3-column format:

ID, POP, dG
1, 24, -6.8100
2, 22, -6.7900
3, 11, -6.6800
4, 18, -6.1100
5, 5, -6.0700
6, 1, -6.0600
7, 11, -6.0300
8, 36, -6.0100

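The following bash function contains awk code that computes, for all processed CSVs at once, the minimum of dG (3rd column, always a negative float) and the maximum of POP (2nd column, always positive), and stores them in the new bash variables highestPOP and lowestDG, which are used by a second awk script (not considered here):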

home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
storage="${home}"/results_bench
cd "${storage}"
# pattern of the csv file located inside each of sub-directory of "${storage}"
str='*str1.csv'
rescore_data4 () {
str_name=$(basename "${str}" .csv)
mkdir -p "${rescore}"/"${str_name}"
# 1- calculate max POP and dGmin for ALL rescored CSVs at once
read highestPOP lowestDG < <(
    awk -F ', ' '
        FNR == 1 {
            next
            }
        NR == 2 || $2 > popMAX {popMAX = $2}
        NR == 2 || $3 < dGmin  {dGmin  = $3}
        END {printf "%d %.2f\n", popMAX, dGmin}
    ' "${storage}"/*_*_*/${str}
)
#
# 2- run rescoring routine using the min/max values
awk -F', *' -v OFS=', ' -v highest_POP="${highestPOP}" -v lowest_dG="${lowestDG}" '
   ... some awk code
'
}

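In the first awk script, $str is the glob mask of the target CSV files, which are located in different directories (matching the glob pattern "*_*_*"). Although this generally works, there is a bug in the first AWK code (the one computing min/max over all processed CSVs): sometimes, when the number of input CSVs is large or they contain many lines, the lowest dG value cannot be computed. The problem is always related to the dG variable (always negative): the script reports dg=0.000, which is incorrect.

To fix this, I tried to modify the AWK code by defining two new variables (holding the min and max values) at the start and then comparing each value in the columns against them: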

   read highestPOP lowestDG < <(
    awk -F ', ' '
        FNR == 1 {
            dGmin = ""                              # initialize the min value
            POPmax = ""   
            next
            }
        NR == 2 || POPmax == "" || $2 > POPmax {POPmax = $2 }
        NR == 2 || dGmin == "" || $3 < dGmin  {dGmin  = $3 }
        END {printf "%d %.2f\n", POPmax, dGmin}
    ' "${storage}"/*_*_*/${str}
)

Now it technically works, but the second solution does not seem to report the min and max values correctly. How can I fix the awk script properly?
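In the second attempt both variables are reset at the header of every file, so the END block only reflects the last file read. For the first attempt, one plausible cause of dg=0.000 (an assumption, not confirmed by the question) is a blank or truncated line: an empty $3 is compared as a string and sorts before any number. The sketch below skips such lines, forces numeric values, and initialises from the first data row overall. The directory and file names are invented.

```shell
#!/bin/bash
# Two hypothetical result folders; the first CSV ends with a blank line.
work=$(mktemp -d)
mkdir -p "$work/10V1_cne_lig12" "$work/10V1_cne_lig40"
cat > "$work/10V1_cne_lig12/a_str1.csv" <<'EOF'
ID, POP, dG
1, 24, -6.8100
2, 22, -6.7900

EOF
cat > "$work/10V1_cne_lig40/b_str1.csv" <<'EOF'
ID, POP, dG
1, 142, -5.6500
2, 10, -9.5000
EOF

minmax=$(
    awk -F', *' '
        FNR == 1 { next }               # skip the header of every file
        NF < 3   { next }               # skip blank or truncated lines
        {
            pop = $2 + 0; dg = $3 + 0   # force numeric comparison
            if (!seen++) { popMAX = pop; dGmin = dg }   # first data row overall
            if (pop > popMAX) popMAX = pop
            if (dg  < dGmin)  dGmin  = dg
        }
        END { printf "%d %.2f\n", popMAX, dGmin }
    ' "$work"/*_*_*/*str1.csv
)
# split "MAX MIN" into the two shell variables
highestPOP=${minmax% *}
lowestDG=${minmax#* }
echo "highestPOP=$highestPOP lowestDG=$lowestDG"
rm -rf "$work"
```

Using a flag for the first data row instead of NR == 2 keeps the initialisation correct even when the first file contains only a header.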

Hot JAMS

Asked: 2021-05-18 04:34:41 +0800 CST

Using bash variables in awk code

5

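The following function, written in bash, contains AWK code that performs math operations on multi-column data and finally saves the results for all processed CSVs in an output file.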

home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
storage="${home}"/results_bench
cd "${storage}"
# pattern of the csv file located inside each of sub-directory of "${storage}"
str='*str1.csv'

     rescore_data3 () {
str_name=$(basename "${str}" .csv)
mkdir -p "${rescore}"/"${str_name}"
# loop all directories contained target csv file
while read -r d; do
awk -F', *' -v OFS=', ' '
    FNR==1 {
        if (suffix)                             # suppress the empty line
            printf "%s %.3f (%d)\n", suffix, dGmin, dGminid
                                                # report the results for dGmin
        dGmin = ""                              # initialize the min value
        path=FILENAME
        sub(/\/[^/]+$/,"",path)
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
        if (FNR==NR)
            print "lig(CNE)" " " "dG(" prefix ")" " " "ClusterID"        # print the header line
        next
    }
    {
        dG = sqrt((($3+10)/10)^2+(($2-100)/100)^2)
        if (dGmin == "" || dG < dGmin) {
            dGmin = dG                          # update the min dG value
            dGminid = $1                        # update the ID with the min dG
        }
    }
    END {
        printf "%s %.3f (%d)\n", suffix, dGmin, dGminid # report results for dGmin
    }
' "${d}_"*/${str} > "${rescore}/"${str_name}"/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
}

Basically, each processed CSV contains 3 columns:

#input_str1.csv located in the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500 # the line with ID=1, has lowest value in dG
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200

Applying rescore_data3() to 5 CSV files produces the following output (a single line contains the information about a single CSV):

# 10V1.csv
lig dG(10V1) ID
lig12 0.947 (1)
lig40 0.595 (1)
lig199 1.060 (1)
lig211 0.756 (2)
lig278 0.818 (1)

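I need to replace the constants (10 and 100) in the math equation of the AWK code with flexible variables computed over all processed CSV files: 10 should be replaced by the lowest value of dG (3rd column of each input.csv) and 100 by the highest value of POP (2nd column of each input.csv). In the end, the modified equation in the AWK script should still contain the $2 and $3 variables (taken from the specific CSV) as well as ${the_lowest_dG} and ${the_highest_POP} (computed only once, at the start, over all CSVs):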

dG = sqrt((($3-{the_lowest_dG})/{the_lowest_dG})^2+(($2-{the_highest_POP})/{the_highest_POP})^2)

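Edited: here is a possible solution, based on the AWK code proposed by glenn jackman and integrated into my function. To compute the lowest dG and the highest POP over all input CSVs, I run this awk code before my AWK function (which has also been updated to accept these two variables and use them further in the math equation):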

rescore_data4 () {
# name of the target CSV file to be rescored
str_name=$(basename "${str}" .csv)
#make dir for output 
mkdir -p "${rescore}"/"${str_name}"
# 1 - calculate max POP and dGmin for ALL rescored CSVs at once
read highestPOP lowestDG < <(
    awk -F ', ' '
        FNR == 1 {next}
        NR == 2 || $2 > pop {pop = $2}
        NR == 2 || $3 < dg  {dg  = $3}
        END {print pop, dg}
    ' "${storage}"/*_*_*/${str} ## < applied on all *.csv files in each of the subdirectory matching *_*_* pattern
)
printf >&2 'DEBUG INFO: this is topPOP= %d and dGmin= %.1f computed for %s...  ' "${highestPOP}" "${lowestDG}" "${str_name}"; sleep 0.1 
#
# 2- Apply the following AWK code for rescoring and final data collecting
while read -r d; do
# run rescoring routine using the min/max values 
awk -F', *' -v OFS=', ' -v highest_POP="${highestPOP}" -v lowest_dG="${lowestDG}" '
    FNR==1 {
        if (suffix)                             # suppress the empty line
            #print suffix " " dGmin " (" dGminid ")"
            printf "%s %.3f (%d)\n", suffix, dGmin, dGminid
            #printf "%s %.3f (%d) %.3f (%d)\n", suffix, dGmin, dGminid, dGmax, dGmaxid
                                                # report the results
        dGmin = ""                              # initialize the min value
        
        path=FILENAME
        sub(/\/[^/]+$/,"",path)
        prefix=suffix=FILENAME
        sub(/_.*/, "", prefix)
        sub(/\/[^\/]+$/, "", suffix); sub(/^.*_/, "", suffix)
        if (FNR==NR)
            print "lig(CNE)" " " "dG(" prefix ")" " " "ClusterID"        # print the header line
            #print "lig(CNE)" " " "dGmin(" prefix ")" " " "ID(dGmin)" " " "dGmax(" prefix ")" " " "ID(dGmax)"         # print the header line
        next
    }
    {
        dG = sqrt((($3-lowest_dG)/lowest_dG)^2+(($2-highest_POP)/highest_POP)^2)
        if (dGmin == "" || dG < dGmin) {
            dGmin = dG                          # update the min dG value
            dGminid = $1                        # update the ID with the min dG
        }
    }
    END {
        #print suffix " " dGmin " (" dGminid ")"    # report the results
        printf "%s %.3f (%d)\n", suffix, dGmin, dGminid
        #printf "%s %.3f (%d) %.3f (%d)\n", suffix, dGmin, dGminid, dGmax, dGmaxid
    }
' "${d}_"*/${str} > "${rescore}/"${str_name}"/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
}

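Although this generally works well, there is a bug in the newly introduced awk part: with a large number of input CSVs containing more than 10 lines, the lowest dG value sometimes cannot be computed.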
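The mechanism for handing shell values to awk is the -v option: the shell expands the variable and awk receives it under the name given on the left of the equals sign. Below is a reduced sketch of the rescoring equation for a single CSV. The two limits are hard-coded purely for illustration; in the function above they would come from the first pass over all CSVs.

```shell
#!/bin/bash
# Single sample CSV in the 3-column format described above.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
EOF

# Hypothetical limits, normally computed over ALL CSVs beforehand.
highestPOP=150
lowestDG=-5.65

best=$(
    # shell variables on the right of "=", awk variable names on the left
    awk -F', *' -v highest_POP="$highestPOP" -v lowest_dG="$lowestDG" '
        FNR == 1 { next }               # skip the header
        {
            dG = sqrt((($3 - lowest_dG) / lowest_dG)^2 + (($2 - highest_POP) / highest_POP)^2)
            if (dGmin == "" || dG < dGmin) {
                dGmin = dG              # smallest rescored value so far
                dGminid = $1            # ID of the row that produced it
            }
        }
        END { printf "%.3f (%d)\n", dGmin, dGminid }
    ' "$csv"
)
echo "$best"
rm -f "$csv"
```

The name used after the dollar sign must be the shell variable that was actually set (highestPOP here). A misspelled shell name expands to an empty string, and awk then divides by zero in the equation.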

Hot JAMS

Asked: 2021-05-13 02:27:12 +0800 CST

sed/awk: transforming multi-column data

6

I am working with a multi-column data file produced by an AWK script:

# output.csv
lig12, dG(rescored)
1, 0.596625
2, 1.05873
3, 1.11285
4, 0.697402

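I need to convert this output.csv into a single-line format that contains: the first column of the first line (lig12), the minimum value detected in the second column (over all lines, here 0.596625) and its corresponding ID number from the first column (here 1):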

lig12, 0.596625 (1)

Which combination of sed/awk could do this, so that it can be applied to the output.csv produced by my initial AWK script?
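awk alone is probably enough here, without sed. A minimal sketch, assuming the first line always carries the label in its first column and every other line holds an ID and a number:

```shell
#!/bin/bash
csv=$(mktemp)
cat > "$csv" <<'EOF'
lig12, dG(rescored)
1, 0.596625
2, 1.05873
3, 1.11285
4, 0.697402
EOF

summary=$(
    awk -F', *' '
        NR == 1 { name = $1; next }     # remember the label from the header
        # keep the smallest value seen so far, its original text and its ID
        min == "" || $2 + 0 < min { min = $2 + 0; raw = $2; id = $1 }
        END { printf "%s, %s (%s)\n", name, raw, id }
    ' "$csv"
)
echo "$summary"
rm -f "$csv"
```

The original text of the value is printed instead of the converted number so that the digits appear exactly as they were written in the file.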
