ViSa

Asked: 2024-10-20 16:25:35 +0800 CST

How do I properly use BeautifulSoup to scrape web page elements?

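I don't have a web design or website/HTML background, and I'm new to this field.

I'm trying to scrape elements from this link, which contains containers/cards.

I tried the code below and had some success, but I'm not sure how to do it properly so that I get only the information content, without HTML/CSS elements in the results.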

from bs4 import BeautifulSoup as bs
import requests

url = 'https://ihgfdelhifair.in/mis/Exhibitors'

page = requests.get(url)
soup = bs(page.text, 'html')

As practice, I'd like to extract the information from the following:

cards = soup.find_all('div', class_="row Exhibitor-Listing-box")
cards

It displays the following:

[<div class="row Exhibitor-Listing-box">
 <div class="col-md-3">
 <div class="card">
 <div class="container">
 <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>                                                   SHEENU</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>State : </span> UTTAR PRADESH</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>City : </span> AGRA</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Hall No. : </span> 12</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Stand No. : </span> G-15/43</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Mobile No. : </span> +91-5624010111, +91-7055166000</p>
 <p style="margin-bottom: 5px!important; font-size: 11px;"><span>Website : </span> www.artifactdecor.com</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Source Retail : </span> Y</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Vriksh Certified : </span> N</p>
 </div>

Now, when I use the code below to extract the elements:

for element in cards:
    title = element.find_all('h4')
    email = element.find_all('p')
    print(title)
    print(email)

Output: it gives me the information I need, but with HTML/CSS content that I don't want:

[<h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>, <h4><b>  10G HOUSE OF CRAFT</b></h4>, <h4><b>  2 S COLLECTION</b></h4>, <h4><b>  ........]
[<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> [email protected]</p>, <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>        ..................]

So how can I get the title, email, contact person, state, and city out of the results without the HTML/CSS?

2 Answers

Alex Duchnowski · Answer 1 · 2024-10-20T16:53:17+08:00

Best Answer

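As Manos Kounelakis suggested, what you are probably looking for is the `text` attribute of BeautifulSoup HTML elements. It is also more natural to split the HTML by the `card` class rather than by the `row` elements, since each `card` element corresponds to one visual card on the screen. Here is some code that prints the information fairly nicely: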

import requests
from bs4 import BeautifulSoup as bs

url = "https://ihgfdelhifair.in/mis/Exhibitors"

page = requests.get(url)
soup = bs(page.text, features="html5lib")

cards = soup.find_all("div", class_="card")

for element in cards:
    title = element.find("h4").text
    other_info = [" ".join(elem.text.split()) for elem in element.find_all("p")]
    print("Title:", title)
    for info in other_info:
        print(info)
    print("-" * 80)
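
If the fields are needed as data rather than as printed lines, the same `.text` idea can be extended to build one dictionary per card. The following is only a minimal sketch, not part of the original answer: it runs on a shortened copy of the HTML shown in the question instead of the live page, it uses the built-in `html.parser` so that `html5lib` is not required, and the helper name `card_to_dict` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup shown in the question (assumption: the
# live page uses the same structure for every card).
SAMPLE_HTML = """
<div class="row Exhibitor-Listing-box">
 <div class="col-md-3">
  <div class="card">
   <div class="container">
    <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
    <p><span>Contact Person : </span>            SHEENU</p>
    <p><span>State : </span> UTTAR PRADESH</p>
    <p><span>City : </span> AGRA</p>
   </div>
  </div>
 </div>
</div>
"""


def card_to_dict(card):
    """Hypothetical helper: turn one card element into a plain dict."""
    record = {"Title": card.find("h4").get_text(strip=True)}
    for p in card.find_all("p"):
        # Collapse runs of whitespace, then split "Label : value" once.
        line = " ".join(p.get_text().split())
        label, sep, value = line.partition(":")
        if sep:
            record[label.strip()] = value.strip()
    return record


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [card_to_dict(card) for card in soup.find_all("div", class_="card")]
print(records)
```

A list of dictionaries like `records` is easy to filter down to the wanted keys (title, contact person, state, city) or to write out with the `csv` module.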

SIGHUP · Answer 2 · 2024-10-20T18:56:30+08:00

The page you are scraping is rendered with JavaScript. This means that when you fetch the HTML with an HTTP(S) GET, it may not yet be fully ready for parsing.
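
Whether JavaScript rendering is actually a problem for a given page can be checked by counting the target elements in the raw response before reaching for a browser. A rough sketch follows; the helper name `count_cards` is hypothetical, and the two HTML strings are stand-ins written for this example, not real responses from the site.

```python
from bs4 import BeautifulSoup


def count_cards(html):
    """Count <div class="card"> elements present in a raw HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_="card"))


# Stand-in for a server-rendered response: the cards are already in the HTML.
static_html = '<div class="card"><h4>A</h4></div><div class="card"><h4>B</h4></div>'
# Stand-in for a JavaScript-rendered page: only a placeholder and a script tag.
dynamic_html = '<div id="app"></div><script src="bundle.js"></script>'

print(count_cards(static_html))   # cards present: requests + BeautifulSoup suffices
print(count_cards(dynamic_html))  # no cards: a real browser may be needed
```

Against the real page, `count_cards(requests.get(url).text)` would report how many cards are visible without a browser.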

You may get better results with Selenium, which you could use like this:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions

options = ChromeOptions()
options.add_argument("--headless=true")
url = "https://ihgfdelhifair.in/mis/Exhibitors"

labels = {
    "email",
    "contact",
    "state",
    "city"
}

def text(e):
    if t := e.text.strip():
        return t
    if (t := e.get_attribute("textContent")) is not None:
        return t.strip()
    return ""

with webdriver.Chrome(options=options) as driver:
    driver.get(url)
    for card in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.card"))):
        wait = WebDriverWait(card, 10)
        if title := wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h4"))):
            print("Title :", text(title))
        for p in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "p"))):
            t = text(p).split()
            if t and t[0].lower() in labels:  # skip empty <p> elements
                print(*t)
        print()

Sample output (incomplete, for brevity):

Title : 1 ARTIFACT DECOR (INDIA)
Email : [email protected]
Contact Person : SHEENU
State : UTTAR PRADESH
City : AGRA

Title : 10G HOUSE OF CRAFT
Email : [email protected]
Contact Person : MR. SUSHIL KUMAR
State : UTTAR PRADESH
City : MORADABAD

Title : 2 S COLLECTION
Email : [email protected],[email protected]
Contact Person : MR. SHARIQ AHMAD SIDDIQUI
State : UTTAR PRADESH
City : MORADABAD
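
The label filter in the Selenium code compares only the first word of each line, which is why `contact` matches `Contact Person`. An alternative sketch, assuming every line has the form `Label : value`, splits on the first colon and compares the whole label. The function name `parse_field` and the `WANTED` set are illustrative and not from the answer; the sample lines are taken from the first card in the question.

```python
WANTED = {"email", "contact person", "state", "city"}


def parse_field(line):
    """Split 'Label : value' on the first colon; return None if no colon."""
    label, sep, value = line.partition(":")
    if not sep:
        return None
    return label.strip(), value.strip()


lines = [
    "Contact Person : SHEENU",
    "State : UTTAR PRADESH",
    "City : AGRA",
    "Hall No. : 12",
    "Website : www.artifactdecor.com",
    "",  # an empty paragraph should simply be skipped
]

fields = {}
for line in lines:
    parsed = parse_field(line)
    # Keep only the labels we asked for, compared case-insensitively.
    if parsed and parsed[0].lower() in WANTED:
        fields[parsed[0]] = parsed[1]

print(fields)
```

Splitting on the first colon only keeps values that themselves contain a colon intact.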
