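Suppose a hospital clinic keeps a record of how many patients visit each day. I have 10 years of data, but patients do not visit the clinic every day, so some dates are missing. As an example, the data looks like this (in R):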
library(dplyr)
set.seed(123)
start_date <- as.Date("2010-01-01")
end_date <- as.Date("2019-12-31")
all_dates <- seq.Date(start_date, end_date, by="day")
num_visits <- sample(1:length(all_dates), size = 3000, replace = FALSE)
visit_dates <- all_dates[num_visits]
num_patients <- sample(1:100, size = length(visit_dates), replace = TRUE)
clinic_data <- data.frame(date = visit_dates, num_patients = num_patients)
hospital_data <- clinic_data %>% arrange(date)
date num_patients
2010-01-01 90
2010-01-02 96
2010-01-04 65
2010-01-05 80
2010-01-06 15
2010-01-07 87
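I want to answer the following question: on average, for any given month, what percentage of that month's patients will have visited the clinic by day $y$ of the month? For example, suppose I know that 900 people visited the hospital in some month. Based on past trends, what cumulative percentage of those 900 people would likely have visited by the 19th?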
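To make the target quantity concrete, here is a minimal base R sketch for a single month. The numbers are invented for illustration (they are not taken from the simulated data), and `one_month` and `pct_by_19` are hypothetical names.

```r
# Hypothetical single month: day of month and patients seen on that day.
# The counts are made up so that the monthly total is 900.
one_month <- data.frame(
  day          = c(1, 5, 12, 19, 27),
  num_patients = c(100, 200, 150, 250, 200)
)

# Cumulative percentage of the month's patients seen up to each visit day
one_month$cuml_pct <- 100 * cumsum(one_month$num_patients) /
  sum(one_month$num_patients)

# Share of the month's patients who had visited by the 19th (inclusive)
pct_by_19 <- with(one_month,
                  100 * sum(num_patients[day <= 19]) / sum(num_patients))
print(round(pct_by_19, 1))  # 700 of 900 patients, about 77.8
```

The question then amounts to averaging this per-month curve across all 120 months in the data.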
I tried to do this by working through the logical steps manually:
library(ggplot2)
hospital_data$year <- as.numeric(format(as.Date(hospital_data$date), "%Y"))
hospital_data$month <- as.numeric(format(as.Date(hospital_data$date), "%m"))
hospital_data$day <- as.numeric(format(as.Date(hospital_data$date), "%d"))
hospital_data <- hospital_data[order(hospital_data$date), ]
yearly_totals <- aggregate(num_patients ~ year, data = hospital_data, FUN = sum)
names(yearly_totals)[2] <- "yearly_total"
hospital_data <- merge(hospital_data, yearly_totals, by = "year")
results <- by(hospital_data, hospital_data$year, function(df) {
df$cumulative_patients <- cumsum(df$num_patients)
df$cumulative_percentage <- df$cumulative_patients / df$yearly_total * 100
return(df)
})
results <- do.call(rbind, results)
avg_results <- aggregate(cumulative_percentage ~ day, data = results, FUN = mean, na.rm = TRUE)
avg_results <- avg_results[order(avg_results$day), ]
ggplot(avg_results, aes(x = day, y = cumulative_percentage)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(1, 31, by = 5)) +
scale_y_continuous(limits = c(0, 100)) +
labs(title = "Average Cumulative Percentage of Yearly Patients by Day",
x = "Day of Month",
y = "Average Cumulative Percentage of Patients") +
theme_minimal() +
theme(panel.grid.minor = element_blank())
But my chart does not show this cumulative percentage.
Does anyone know where I went wrong?
Edit:
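library(tidyverse)
library(lubridate)  # floor_date() and day(); attached by tidyverse >= 2.0.0, loaded explicitly for older versions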
result <- hospital_data %>%
mutate(month = floor_date(date, "month"),
day = day(date)) %>%
group_by(month) %>%
arrange(month, day) %>%
mutate(month_total = sum(num_patients),
cuml = cumsum(num_patients),
cuml_pct = cuml / month_total) %>%
ungroup() %>%
group_by(day) %>%
summarize(avg_cuml_pct = mean(cuml_pct, na.rm = TRUE)) %>%
arrange(day)
result <- result %>%
mutate(avg_cuml_pct = cummax(avg_cuml_pct))
ggplot(result, aes(day, avg_cuml_pct)) +
geom_line() +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
scale_x_continuous(breaks = seq(0, 31, by = 5)) +
labs(x = "Day of Month",
y = "Average Cumulative Percentage of Monthly Patients",
title = "Average Cumulative Patient Percentage by Day of Month") +
theme_minimal()
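Maybe something like this? Each light grey line shows the cumulative percentage of patients by day for a single month. The dark line is the unweighted average of those monthly curves. You may want a weighted average instead, but since many of the months are of similar size, it makes little difference here.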
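The code behind that plot is not shown, so the following is only a sketch of one way to produce it, not necessarily the original approach. It rebuilds the simulated data from the question, and it assumes that days with no recorded visits count as zero patients (so every month has a value for each of its days). The names `daily`, `monthly_curves`, `avg_curve`, and `month_start` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Rebuild the simulated data from the question
set.seed(123)
all_dates <- seq.Date(as.Date("2010-01-01"), as.Date("2019-12-31"), by = "day")
visit_dates <- sort(sample(all_dates, size = 3000))
hospital_data <- data.frame(
  date = visit_dates,
  num_patients = sample(1:100, size = length(visit_dates), replace = TRUE)
)

# Assumption: a date with no row means zero patients that day
daily <- data.frame(date = all_dates) %>%
  left_join(hospital_data, by = "date") %>%
  mutate(num_patients = coalesce(num_patients, 0L),
         month_start = as.Date(format(date, "%Y-%m-01")),
         day = as.integer(format(date, "%d")))

# One cumulative curve per calendar month (the grey lines)
monthly_curves <- daily %>%
  arrange(date) %>%
  group_by(month_start) %>%
  mutate(cuml_pct = cumsum(num_patients) / sum(num_patients)) %>%
  ungroup()

# Unweighted average across months for each day of month (the dark line)
avg_curve <- monthly_curves %>%
  group_by(day) %>%
  summarize(avg_cuml_pct = mean(cuml_pct))

p <- ggplot() +
  geom_line(data = monthly_curves,
            aes(day, cuml_pct, group = month_start),
            colour = "grey80") +
  geom_line(data = avg_curve, aes(day, avg_cuml_pct), colour = "black") +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(x = "Day of Month",
       y = "Cumulative Percentage of Monthly Patients") +
  theme_minimal()
print(p)
```

Because the cumulative sum restarts within each calendar month, every grey line ends at 100% on that month's last day. Days 29 to 31 are averaged only over the months that actually contain them.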
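Alternatively, we can do the same thing on a weighted basis. Note, however, that because some months have 31 days, the weighted curve only reaches 100% on the 31st, even though months with 28, 29, or 30 days are already complete before then.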
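One possible reading of the weighted version, sketched below under that assumption, is to pool all months together: sum the patients seen on each day of the month across the whole dataset, then take the cumulative share of the grand total. This is the same as weighting each month's cumulative curve by its patient count. The data is rebuilt from the question's simulation, and `weighted_curve` and `wtd_cuml_pct` are hypothetical names.

```r
library(dplyr)

# Rebuild the simulated data from the question
set.seed(123)
all_dates <- seq.Date(as.Date("2010-01-01"), as.Date("2019-12-31"), by = "day")
visit_dates <- sort(sample(all_dates, size = 3000))
hospital_data <- data.frame(
  date = visit_dates,
  num_patients = sample(1:100, size = length(visit_dates), replace = TRUE)
)

# Pool patients by day of month across all months, then accumulate
weighted_curve <- hospital_data %>%
  mutate(day = as.integer(format(date, "%d"))) %>%
  group_by(day) %>%
  summarize(patients = sum(num_patients)) %>%
  arrange(day) %>%
  mutate(wtd_cuml_pct = cumsum(patients) / sum(patients))

print(tail(weighted_curve, 3))
```

Patients seen on the 31st exist only in 31-day months, so the pooled curve stays slightly below 100% at day 30 and reaches 100% only at day 31.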