Question

Ralf_Reddings

Asked: 2024-11-11 00:00:08 +0800 CST

Use the page title as the HTML filename when downloading with wget

I can download a single standalone HTML file with the following command:

wget https://www.bbc.co.uk/news/articles/c99rgj0xkryo

However, wget saves the file as index.html rather than Nation falls silent as King leads Remembrance ceremony.html. How can I make wget use the page title as the filename?

In this case I don't care whether the links in the offline file are broken. I only care about downloading the standalone page.

I am on:

Windows 11
PowerShell 7.4

2 Answers

JayCravens · Answer 1 · 2024-11-11T03:32:33+08:00

wget has the -O flag for setting the output filename.

wget -O "Nation falls silent as King leads Remembrance ceremony.html" https://www.bbc.co.uk/news/articles/c99rgj0xkryo

Here is a script that automatically uses the title as the filename.

You need HTML Tidy (https://www.html-tidy.org). It is in the repositories of most distributions.

#!/bin/bash

url="$1"

wget -O "temp_index.html" "$url"

tidy -m "temp_index.html"

# Take only the first line containing <title, then keep what follows the first '>'
title_data=$(grep "<title" "temp_index.html" | head -n 1 | cut -d'>' -f2-)

if [[ "$title_data" =~ "</title>" ]]; then
    title_data=$(echo "$title_data" | sed 's/........$//')
fi

mv "temp_index.html" "$title_data".html

exit 0

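After a few tests, I noticed that, depending on how the page is formatted, </title> sometimes ends up on the next line and sometimes does not. So I added a check to handle both cases.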

Save the script as: html_to_title.sh
Make it executable: chmod +x html_to_title.sh
Usage: ./html_to_title.sh www.example.com
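
As a possible alternative to the line-by-line check for </title>, the title can be pulled out after joining all lines into one, so it no longer matters where the closing tag lands. The following is only a sketch: the function name extract_title is hypothetical, it assumes lowercase <title> tags, and it does not decode HTML entities such as &amp;.

```shell
#!/bin/bash
# Sketch: print the text of the first <title> element read from stdin.
# Assumptions: lowercase <title> tags; HTML entities are left undecoded.
extract_title() {
    local html
    # Join all lines so a title split across lines becomes a single line.
    html=$(tr '\r\n\t' '   ')
    # Fail if there is no complete <title>...</title> pair.
    case "$html" in
        *'<title'*'</title>'*) ;;
        *) return 1 ;;
    esac
    # 1) drop everything from the first </title> onward
    # 2) drop everything up to and including the opening <title ...> tag
    # 3) trim the ends and collapse runs of spaces
    printf '%s\n' "$html" | sed \
        -e 's/<\/title>.*//' \
        -e 's/.*<title[^>]*>//' \
        -e 's/^ *//' -e 's/ *$//' -e 's/  */ /g'
}
```

It could be used as: wget -qO- "$url" | extract_title. Cutting at the first </title> means later <title> elements (for example inside inline SVG) are ignored.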

I haven't used Windows in a long time, so this may be outdated, and I have no way to test it, but here is an attempt at a PowerShell version.

param (
    [string]$url
)

$tempFile = "temp_index.html"
Invoke-WebRequest -Uri $url -OutFile $tempFile

# Pretty print HTML... Can windows use tidy for this?
$content = Get-Content -Path $tempFile -Raw
$cleanContent = $content

# Compacted HTML will not work, somehow you must use $cleanContent for beautification

Set-Content -Path $tempFile -Value $cleanContent

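# Capture only the text between <title ...> and </title>; (?s) lets '.' span line breaks
$titleData = ($cleanContent -match '(?s)<title[^>]*>(.*?)</title>') ? $matches[1].Trim() : "Untitled"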

# Sanitize title for filename, if missed characters, add them here
  
$titleData = $titleData -replace '[<>:"/\\|?*]', '_'  
$newFileName = "$titleData.html"
Rename-Item -Path $tempFile -NewName $newFileName

exit 0

Usage: .\Get-PageTitle.ps1 "http://example.com"

As far as Windows goes, this is the best I can do.

Destroy666 · Answer 2 · 2024-11-16T22:54:19+08:00

Best Answer

With PowerShell 5, getting the title is much simpler:

(Invoke-WebRequest -Uri https://www.bbc.co.uk/news/articles/c99rgj0xkryo).ParsedHtml.title

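However, ParsedHtml has essentially been removed from PowerShell 7, because requests there always use -UseBasicParsing. So you need to use an external library or a workaround until it is reimplemented.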

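Then, to sanitize the filename, there are helper methods that cover all the invalid characters, such as GetInvalidFileNameChars(), which can be used in a function like this: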

Function Remove-InvalidFileNameChars {
 param(
    [Parameter(Mandatory=$true,
      Position=0,
      ValueFromPipeline=$true,
      ValueFromPipelineByPropertyName=$true)]
    [String]$Name
 )

  $invalidChars = [IO.Path]::GetInvalidFileNameChars() -join ''
  $re = "[{0}]" -f [RegEx]::Escape($invalidChars)
  return ($Name -replace $re)
}

Source.
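
For the bash script in the first answer, which passes the title to mv unchanged, a rough shell counterpart of this sanitization might look like the sketch below. The function name sanitize_filename is hypothetical. Unlike GetInvalidFileNameChars(), it covers only the printable characters that Windows rejects and leaves control characters alone, and it substitutes an underscore rather than deleting the character.

```shell
#!/bin/bash
# Sketch: replace characters that Windows does not allow in filenames
# (< > : " / \ | ? *) with underscores. Control characters are not handled.
sanitize_filename() {
    # '#' is used as the sed delimiter so '/' can appear in the bracket list.
    printf '%s\n' "$1" | sed 's#[<>:"/\\|?*]#_#g'
}
```

In that script it could be applied just before the rename, for example: mv "temp_index.html" "$(sanitize_filename "$title_data").html".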
