如何将 for 循环拆分为 3 个单独的数据框？

Question

Sebastian Meckovski

Asked: 2024-10-25 22:03:01 +0800 CST2024-10-25 22:03:01 +0800 CST 2024-10-25 22:03:01 +0800 CST

如何有效地从数据集中删除重叠的圆

772

我有一个包含约 20,000 条记录的数据集，这些记录代表全球人口超过 20,000 的城市。我估算了半径，它或多或少描述了城市的大小。它并不完全准确，但对于我的目的来说已经足够了。

我正在将其加载到我的 Panda 数据框对象中。以下是示例

name_city,country_code,latitude,longitude,geohash,estimated_radius,population
Vitry-sur-Seine,FR,48.78716,2.40332,u09tw9qjc3v3,1000,81001
Vincennes,FR,48.8486,2.43769,u09tzkx5dr13,500,45923
Villeneuve-Saint-Georges,FR,48.73219,2.44925,u09tpxrmxdth,500,30881
Villejuif,FR,48.7939,2.35992,u09ttdwmn45z,500,48048
Vigneux-sur-Seine,FR,48.70291,2.41357,u09tnfje022n,500,26692
Versailles,FR,48.80359,2.13424,u09t8s6je2sd,1000,85416
Vélizy-Villacoublay,FR,48.78198,2.19395,u09t9bmxdspt,500,21741
Vanves,FR,48.82345,2.29025,u09tu059nwwp,500,26068
Thiais,FR,48.76496,2.3961,u09tqt2u3pmt,500,29724
Sèvres,FR,48.82292,2.21757,u09tdryy15un,500,23724
Sceaux,FR,48.77644,2.29026,u09tkp7xqgmw,500,21511
Saint-Mandé,FR,48.83864,2.41579,u09tyfz1eyre,500,21261
Saint-Cloud,FR,48.84598,2.20289,u09tfhhh7n9u,500,28839
Paris,FR,48.85341,2.3488,u09tvmqrep8n,12000,2138551
Orly,FR,48.74792,2.39253,u09tq6q1jyzt,500,20528
Montrouge,FR,48.8162,2.31393,u09tswsyyrpr,500,38708
Montreuil,FR,48.86415,2.44322,u09tzx7n71ub,2000,111240
Montgeron,FR,48.70543,2.45039,u09tpf83dnpn,500,22843
Meudon,FR,48.81381,2.235,u09tdy73p38y,500,44652
Massy,FR,48.72692,2.28301,u09t5yqqvupx,500,38768
Malakoff,FR,48.81999,2.29998,u09tsr6v13tr,500,29420
Maisons-Alfort,FR,48.81171,2.43945,u09txtbkg61z,1000,53964
Longjumeau,FR,48.69307,2.29431,u09th0q9tq1s,500,20771
Le Plessis-Robinson,FR,48.78889,2.27078,u09te9txch23,500,22510
Le Kremlin-Bicêtre,FR,48.81471,2.36073,u09ttwrn2crz,500,27867
Le Chesnay,FR,48.8222,2.12213,u09t8rc3cjwz,500,29154
La Celle-Saint-Cloud,FR,48.85029,2.14523,u09tbufje6p6,500,21539
Ivry-sur-Seine,FR,48.81568,2.38487,u09twq8egqrc,1000,57897
Issy-les-Moulineaux,FR,48.82104,2.27718,u09tezd5njkr,1000,61447
Fresnes,FR,48.75568,2.32241,u09tkgenkj6r,500,24803
Fontenay-aux-Roses,FR,48.79325,2.29275,u09ts4t92cn3,500,24680
Clamart,FR,48.80299,2.26692,u09tes6dp0dn,1000,51400
Choisy-le-Roi,FR,48.76846,2.41874,u09trn12bez7,500,35590
Chevilly-Larue,FR,48.76476,2.3503,u09tmmr7mfns,500,20125
Châtillon,FR,48.8024,2.29346,u09tshnn96xx,500,32383
Châtenay-Malabry,FR,48.76507,2.26655,u09t7t6mn7yj,500,32715
Charenton-le-Pont,FR,48.82209,2.41217,u09twzu3r9hq,500,30910
Cachan,FR,48.79632,2.33661,u09tt5j7nvqd,500,26540
Bagnolet,FR,48.86667,2.41667,u09tyzzubrxb,500,33504
Bagneux,FR,48.79565,2.30796,u09tsdbx727w,500,38900
Athis-Mons,FR,48.70522,2.39147,u09tn6t2mr16,500,31225
Alfortville,FR,48.80575,2.4204,u09txhf6p7jp,500,37290
Quinze-Vingts,FR,48.84656,2.37439,u09tyh0zz6c8,500,26265
Croulebarbe,FR,48.81003,2.35403,u09tttd5hc5f,500,20062
Gare,FR,48.83337,2.37513,u09ty1cdbxcq,1000,75580
Maison Blanche,FR,48.82586,2.3508,u09tv2rz1xgx,1000,64302

以下是数据样本的直观表示：

我的目标是找到一种有效的算法，可以去除相交的圆，只保留最大的圆population。

我最初的方法是使用半正矢公式确定哪些圆相交。问题是，为了检查每条记录是否与其他记录相交，需要遍历整个数据集。这样做的时间复杂度太高了。

我的第二种方法是根据国家代码分离数据集，然后按块进行比较：

  def _remove_intersecting_circles_for_country(df_country):
    """Helper function to remove intersections within a single country."""
    indices_to_remove = set()
    for i in range(len(df_country)):
      for j in range(i + 1, len(df_country)):
        distance = haversine(df_country['latitude'].iloc[i], df_country['longitude'].iloc[i],
                            df_country['latitude'].iloc[j], df_country['longitude'].iloc[j])
        if distance < df_country['estimated_radius'].iloc[i] + df_country['estimated_radius'].iloc[j]:
          if df_country['population'].iloc[i] < df_country['population'].iloc[j]:
            indices_to_remove.add(df_country.index[i]) 
          else:
            indices_to_remove.add(df_country.index[j])
    return indices_to_remove

  all_indices_to_remove = set()
  for country_code in df['country_code'].unique():
    df_country = df[df['country_code'] == country_code]
    indices_to_remove = _remove_intersecting_circles_for_country(df_country)
    all_indices_to_remove.update(indices_to_remove)

  new_df = df.drop(index=all_indices_to_remove)
  return new_df

这显著提高了性能，因为要检查每条记录，我们只需要检查所有具有相同的记录country_code。但这仍然会产生很多不必要的比较

3 个回答

Voted

Pieter · Answer 1 · 2024-10-26T05:56:41+08:00

一旦将圆变成多边形，如果使用空间索引来执行此操作，则确定多边形之间的交点会非常快。

因此，您可以：

将点缓冲到圆。在 WGS84 中，这会不精确，因此需要通过回避等距投影来完成缓冲。
使用空间索引可以非常快速地计算哪些圆相交。例如geopandas.sjoin在后台使用空间索引。
现在您可以确定哪些圆圈与代表人口较多城市的圆圈相交，然后将其删除。

结果（红色：原有城市，蓝色：保留城市）：

代码：

from pathlib import Path
import geopandas as gpd
from matplotlib import pyplot as plt
from pyproj import CRS, Transformer
from shapely.geometry import Point
from shapely.ops import transform


def geodesic_point_buffer(lat, lon, distance):
    # Azimuthal equidistant projection
    aeqd_proj = CRS.from_proj4(
        f"+proj=aeqd +lat_0={lat} +lon_0={lon} +x_0=0 +y_0=0")
    tfmr = Transformer.from_proj(aeqd_proj, aeqd_proj.geodetic_crs)
    buf = Point(0, 0).buffer(distance)  # distance in metres
    return transform(tfmr.transform, buf)

# Read the csv file with data
#csv_path = Path(__file__).resolve().with_suffix(".csv")
csv_path = "https://raw.githubusercontent.com/theroggy/pysnippets/refs/heads/main/pysnippets/stackoverflow_questions/2024/Q4/2024-10-25_circles.csv"
points_df = gpd.read_file(csv_path, autodetect_type=True)

# Convert the points to circles by buffering them
points_buffer_gdf = gpd.GeoDataFrame(
    points_df,
    geometry=points_df.apply(
        lambda row : geodesic_point_buffer(row.latitude, row.longitude, row.estimated_radius), axis=1
    ),
    crs=4326,
)

# Determine the intersecting city buffers (result includes self-intersections)
intersecting_gdf = points_buffer_gdf.sjoin(points_buffer_gdf)

# Get all city buffers that intersect a city with a larger population
intersecting_larger_population_df = intersecting_gdf.loc[
    intersecting_gdf.population_left < intersecting_gdf.population_right
]

# Remove the city buffers that intersect with a larger population city buffer
result_gdf = points_buffer_gdf[
    ~points_buffer_gdf.index.isin(intersecting_larger_population_df.index)
]

# Plot result
ax = points_buffer_gdf.boundary.plot(color="red")
result_gdf.boundary.plot(color="blue", ax=ax)
plt.show()

mosfet · Answer 2 · 2024-10-26T00:14:26+08:00

从视觉表现来看，首先想到的是使用具有最外轮廓的图像处理方法（并将地理坐标映射到像素坐标）。不确定是否适合城市之间间距很大的非常大的数据集（但为什么要查看里昂是否与巴黎重叠？），但在提供的近巴黎样本上，它的时间比当前解决方案更快。

与当前的 O(n^2) 相比，我们的时间复杂度为 O(n log(n)) + O(n)。然而，主要的权衡可能是创建具有足够分辨率的图像。考虑到底层优化的 C++ OpenCV，它的轮廓处理应该还不错

import numpy as np
import cv2
import pandas as pd
from typing import Tuple
from io import StringIO

def detect_non_overlapping_circles(df: pd.DataFrame, image_size:Tuple[int, int] = (128, 128)):
    img = np.zeros(image_size, dtype=np.uint8) # Black image

    lat_min, lat_max = df['latitude'].min(), df['latitude'].max()
    lon_min, lon_max = df['longitude'].min(), df['longitude'].max()

    def geo_to_pixel(lat: float, lon: float):
        x = int((lon - lon_min) / (lon_max - lon_min) * (image_size[0] - 100) + 50)
        y = int((lat - lat_min) / (lat_max - lat_min) * (image_size[1] - 100) + 50)
        return (x, y)

    # Convert radius to pixels based on the image size and coordinate range
    def radius_to_pixel(radius_meters: float):
        # Using an approximate conversion factor
        earth_circumference = 40075000  # meters
        meters_per_degree = earth_circumference / 360
        lat_range = lat_max - lat_min
        pixels_per_meter = (image_size[1] - 100) / (lat_range * meters_per_degree)
        return int(radius_meters * pixels_per_meter)

    df_sorted = df.sort_values('population', ascending=False)

    kept_indices = []

    for idx, row in df_sorted.iterrows():
        test_img = img.copy()
    
        # Current circle
        center = geo_to_pixel(row['latitude'], row['longitude'])
        radius = radius_to_pixel(row['estimated_radius'])
    
        # Find outermost contour at each time
        cv2.circle(test_img, center, radius, 255, -1)    
        contours, _ = cv2.findContours(test_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
        # If exactly one contour, this circle dont overlap with any one
        if len(contours) == len(kept_indices) + 1:
            kept_indices.append(idx)
            img = test_img  # Update the main image

    return df.loc[kept_indices]

def main(df):
    result_df = detect_non_overlapping_circles(df)
    
    print(f"Original number of cities: {len(df)}")
    print(f"Cities after removing overlaps: {len(result_df)}")
    print(result_df[['name_city', 'population', 'estimated_radius']].to_string())

    return result_df

# Not sure if you're working with csv but from what provided
data = 
"""name_city,country_code,latitude,longitude,geohash,estimated_radius,population
Vitry-sur-Seine,FR,48.78716,2.40332,u09tw9qjc3v3,1000,81001
Vincennes,FR,48.8486,2.43769,u09tzkx5dr13,500,45923
Villeneuve-Saint-Georges,FR,48.73219,2.44925,u09tpxrmxdth,500,30881
Villejuif,FR,48.7939,2.35992,u09ttdwmn45z,500,48048
Vigneux-sur-Seine,FR,48.70291,2.41357,u09tnfje022n,500,26692
""" # ...

df = pd.read_csv(StringIO(data))
df.dropna()

proc_df = main(df)

使用提供的 csv 和城市间距，它可以很好地处理 (128, 128) 图像分辨率。在我的计算机上，处理 (1024, 1024) 图像的执行时间仍然比_remove_intersecting_circles_for_country

Jim Mischel · Answer 3 · 2024-10-26T04:23:00+08:00

降低复杂性的常用方法是在地图上叠加网格。例如，叠加一个边长为 100 英里的方格网格。创建一个数据结构来告诉您：

城市位于哪个网格区域（城市中心）
哪些城市（同样是城市圆的中心）位于该网格范围内

假设没有一个城市的半径大于 100 英里，那么一个城市只能与以下中心的城市相交：

在自己的网格方格内
在相邻的网格方块中
对于大城市来说，相邻的网格方块相邻。

在下图中，标记为“0”的网格是包含您当前正在检查的城市的网格。它唯一可能与之相交的城市是网格位置标记为“1”或“2”的城市。同样，假设您的网格大小大于任何单个城市的规模。

    X X X X X X X
    X 2 2 2 2 2 X
    X 2 1 1 1 2 X
    X 2 1 O 1 2 X
    X 2 1 1 1 2 X
    X 2 2 2 2 2 X
    X X X X X X X

这大大减少了您在尝试确定重叠时需要查看的城市数量。您将有...十几个？一百个？而不是 20,000 个城市？

构建该网格以便您可以分类存放物品的开销很小，但性能优势却是巨大的。

在构建该数据结构时，为每个网格方格不仅包含哪些城市位于其中心，还包含哪些城市侵占了它，这可能是有意义的。因此，如果“超人前传”位于您标记为 C2 的网格方格中，并且其周长侵占了 C3，则 C2 应该包含一个条目，实际上显示“超人前传位于 C2”。C3 有一个条目显示“超人前传侵占了网格方格 C3。”

不过，在需要时，这些信息在运行时很容易计算。也许保留它是不必要的。

更新

想想看，因为您有每个城市的纬度/经度，所以您已经拥有了构建网格所需的大部分信息。让您的网格纬度为一度，经度为一度。在赤道处，您的方格面积约为 69 英里（111 公里）见方。当您靠近两极时，面积会压缩，但这并不重要。要确定城市属于哪个网格，您只需将纬度/经度和索引的小数部分切掉即可。您可能需要对数字进行一些标准化（即从 -180...180 转换为 0...360）。

无论如何，这可以让你将问题缩小到更易于管理的规模。

如何有效地从数据集中删除重叠的圆

更新

Vue 3：创建时出错“预期标识符但发现‘导入’”[重复]

为什么这个简单而小的 Java 代码在所有 Graal JVM 上的运行速度都快 30 倍，但在任何 Oracle JVM 上却不行？

具有指定基础类型但没有枚举器的“枚举类”的用途是什么？

如何修复未手动导入的模块的 MODULE_NOT_FOUND 错误？

`(表达式，左值) = 右值` 在 C 或 C++ 中是有效的赋值吗？为什么有些编译器会接受/拒绝它？

何时应使用 std::inplace_vector 而不是 std::vector？

在 C++ 中，一个不执行任何操作的空程序需要 204KB 的堆，但在 C 中则不需要

PowerBI 目前与 BigQuery 不兼容：Simba 驱动程序与 Windows 更新有关

AdMob：MobileAds.initialize() - 对于某些设备，“java.lang.Integer 无法转换为 java.lang.String”

我正在尝试仅使用海龟随机和数学模块来制作吃豆人游戏

如何有效地从数据集中删除重叠的圆

3 个回答

更新

相关问题