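For reporting purposes, we have a DWH (data warehouse) that runs an ETL (extract, transform, load) process to retrieve data from selected tables in our production OLTP (online transaction processing) database.
The ETL extracts data incrementally, so only the rows that have changed are pulled. For now, we assume this does not affect the size of the data.
It is a straightforward one-to-one mapping, so for the selected tables the DWH has the same columns as the OLTP database. The DWH is SQL Server and the OLTP database is MySQL. The MySQL data types naturally have to be converted to their SQL Server counterparts, and we follow the defaults of Microsoft SSMA (SQL Server Migration Assistant).
We noticed that the data in SQL Server is several times larger than the same data in MySQL. For example, in our Magento e-commerce application: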
- The `sales_order` table contains 7,100,000 rows and is 5.5 GB in size.
- In the data warehouse, however, the same table is 20 GB with the same number of rows.
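A size comparison like this is easier to reason about when the MySQL figure is split into row data and secondary indexes. The query below is a minimal sketch of one way to do that; it assumes the schema is named `msab_magento`, as in the table definition further down, and the numbers InnoDB reports here are estimates rather than exact byte counts.

```sql
-- MySQL: approximate on-disk size of sales_order, split into data and indexes.
-- For InnoDB, data_length covers the clustered index (the rows themselves)
-- and index_length covers all secondary indexes.
SELECT table_name,
       table_rows,                                          -- estimated row count
       ROUND(data_length  / 1024 / 1024 / 1024, 2) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 2) AS index_gb,
       ROUND(data_free    / 1024 / 1024 / 1024, 2) AS free_gb
FROM   information_schema.tables
WHERE  table_schema = 'msab_magento'   -- assumed schema name
  AND  table_name   = 'sales_order';
```

Whether the 5.5 GB quoted above includes the eleven secondary indexes or only the row data changes how the 20 GB on the SQL Server side should be read.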
See the partial table definitions below.
We checked the SQL Server database: it uses the `SQL_Latin1_General_CP1_CI_AS` collation and the `Simple` recovery model. The MySQL OLTP database uses the default collation `latin1_swedish_ci`.
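The same checks can be expressed as queries on the SQL Server side. This is a sketch rather than the exact procedure used for the numbers above; it relies only on standard catalog views and assumes it is run in the DWH database.

```sql
-- Database-level settings mentioned above.
SELECT name, collation_name, recovery_model_desc
FROM   sys.databases
WHERE  name = DB_NAME();

-- Space used by each index on the table.
-- index_id 1 is the clustered index, which holds the row data itself.
SELECT i.index_id,
       i.name                                   AS index_name,
       i.type_desc,
       p.data_compression_desc,
       SUM(ps.row_count)                        AS row_count,
       SUM(ps.used_page_count)     * 8 / 1024.0 AS used_mb,      -- pages are 8 KB
       SUM(ps.reserved_page_count) * 8 / 1024.0 AS reserved_mb
FROM   sys.dm_db_partition_stats AS ps
JOIN   sys.indexes    AS i ON i.object_id = ps.object_id
                          AND i.index_id  = ps.index_id
JOIN   sys.partitions AS p ON p.partition_id = ps.partition_id
WHERE  ps.object_id = OBJECT_ID(N'msab_magento.sales_order')
GROUP  BY i.index_id, i.name, i.type_desc, p.data_compression_desc
ORDER  BY i.index_id;
```

If most of the space sits under `index_id` 1, the row data itself accounts for the growth rather than the unique nonclustered index.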
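Our questions:
- In our setup, why is SQL Server several times larger than MySQL for the same data? If we have missed something that would make the DWH smaller, please point it out.
- The direct mapping between OLTP and DWH is simple to implement and has worked well so far. However, we know that many columns are fetched but never used in reports. We would therefore like to know whether there is a better design or best practice for the data warehouse.
Any hints and suggestions are greatly appreciated.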
Details of the example partial table definitions:
- MySQL OLTP (see also the model in the Magento open-source repository):
-- msab_magento.sales_order definition
CREATE TABLE `sales_order` (
`entity_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Entity ID',
`state` varchar(32) DEFAULT NULL COMMENT 'State',
`status` varchar(32) DEFAULT NULL COMMENT 'Status',
`coupon_code` varchar(255) DEFAULT NULL COMMENT 'Coupon Code',
`protect_code` varchar(255) DEFAULT NULL COMMENT 'Protect Code',
`shipping_description` varchar(255) DEFAULT NULL COMMENT 'Shipping Description',
`is_virtual` smallint(5) unsigned DEFAULT NULL COMMENT 'Is Virtual',
`store_id` smallint(5) unsigned DEFAULT NULL COMMENT 'Store ID',
`customer_id` int(10) unsigned DEFAULT NULL COMMENT 'Customer ID',
`base_discount_amount` decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Amount',
`base_discount_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Canceled',
`base_discount_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Invoiced',
`base_discount_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Refunded',
`base_grand_total` decimal(20,4) DEFAULT NULL COMMENT 'Base Grand Total',
`base_shipping_amount` decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Amount',
`base_shipping_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Canceled',
`base_shipping_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Invoiced',
`base_shipping_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Refunded',
`base_shipping_tax_amount` decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Tax Amount',
`base_shipping_tax_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Tax Refunded',
`base_subtotal` decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal',
`base_subtotal_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal Canceled',
`base_subtotal_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal Invoiced',
`base_subtotal_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal Refunded',
`base_tax_amount` decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Amount',
`base_tax_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Canceled',
`base_tax_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Invoiced',
`base_tax_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Refunded',
`base_to_global_rate` decimal(20,4) DEFAULT NULL COMMENT 'Base To Global Rate',
`base_to_order_rate` decimal(20,4) DEFAULT NULL COMMENT 'Base To Order Rate',
`base_total_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Base Total Canceled',
`base_total_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Base Total Invoiced',
`base_total_invoiced_cost` decimal(20,4) DEFAULT NULL COMMENT 'Base Total Invoiced Cost',
`base_total_offline_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Total Offline Refunded',
`base_total_online_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Total Online Refunded',
`base_total_paid` decimal(20,4) DEFAULT NULL COMMENT 'Base Total Paid',
`base_total_qty_ordered` decimal(12,4) DEFAULT NULL COMMENT 'Base Total Qty Ordered',
`base_total_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Base Total Refunded',
`discount_amount` decimal(20,4) DEFAULT NULL COMMENT 'Discount Amount',
`discount_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Discount Canceled',
`discount_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Discount Invoiced',
`discount_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Discount Refunded',
`grand_total` decimal(20,4) DEFAULT NULL COMMENT 'Grand Total',
`shipping_amount` decimal(20,4) DEFAULT NULL COMMENT 'Shipping Amount',
`shipping_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Shipping Canceled',
`shipping_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Shipping Invoiced',
`shipping_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Shipping Refunded',
`shipping_tax_amount` decimal(20,4) DEFAULT NULL COMMENT 'Shipping Tax Amount',
`shipping_tax_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Shipping Tax Refunded',
`store_to_base_rate` decimal(12,4) DEFAULT NULL COMMENT 'Store To Base Rate',
`store_to_order_rate` decimal(12,4) DEFAULT NULL COMMENT 'Store To Order Rate',
`subtotal` decimal(20,4) DEFAULT NULL COMMENT 'Subtotal',
`subtotal_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Subtotal Canceled',
`subtotal_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Subtotal Invoiced',
`subtotal_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Subtotal Refunded',
`tax_amount` decimal(20,4) DEFAULT NULL COMMENT 'Tax Amount',
`tax_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Tax Canceled',
`tax_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Tax Invoiced',
`tax_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Tax Refunded',
`total_canceled` decimal(20,4) DEFAULT NULL COMMENT 'Total Canceled',
`total_invoiced` decimal(20,4) DEFAULT NULL COMMENT 'Total Invoiced',
`total_offline_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Total Offline Refunded',
`total_online_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Total Online Refunded',
`total_paid` decimal(20,4) DEFAULT NULL COMMENT 'Total Paid',
`total_qty_ordered` decimal(12,4) DEFAULT NULL COMMENT 'Total Qty Ordered',
`total_refunded` decimal(20,4) DEFAULT NULL COMMENT 'Total Refunded',
`can_ship_partially` smallint(5) unsigned DEFAULT NULL COMMENT 'Can Ship Partially',
`can_ship_partially_item` smallint(5) unsigned DEFAULT NULL COMMENT 'Can Ship Partially Item',
`customer_is_guest` smallint(5) unsigned DEFAULT NULL COMMENT 'Customer Is Guest',
`customer_note_notify` smallint(5) unsigned DEFAULT NULL COMMENT 'Customer Note Notify',
`billing_address_id` int(11) DEFAULT NULL COMMENT 'Billing Address ID',
`customer_group_id` int(11) DEFAULT NULL,
...
`reward_points_balance_refund` int(11) DEFAULT NULL COMMENT 'Reward Points Balance Refund',
PRIMARY KEY (`entity_id`),
UNIQUE KEY `SALES_ORDER_INCREMENT_ID_STORE_ID` (`increment_id`,`store_id`),
KEY `SALES_ORDER_STATUS` (`status`),
KEY `SALES_ORDER_STATE` (`state`),
KEY `SALES_ORDER_STORE_ID` (`store_id`),
KEY `SALES_ORDER_CREATED_AT` (`created_at`),
KEY `SALES_ORDER_CUSTOMER_ID` (`customer_id`),
KEY `SALES_ORDER_EXT_ORDER_ID` (`ext_order_id`),
KEY `SALES_ORDER_QUOTE_ID` (`quote_id`),
KEY `SALES_ORDER_UPDATED_AT` (`updated_at`),
KEY `SALES_ORDER_SEND_EMAIL` (`send_email`),
KEY `SALES_ORDER_EMAIL_SENT` (`email_sent`),
CONSTRAINT `SALES_ORDER_CUSTOMER_ID_CUSTOMER_ENTITY_ENTITY_ID` FOREIGN KEY (`customer_id`) REFERENCES `customer_entity` (`entity_id`) ON DELETE SET NULL,
CONSTRAINT `SALES_ORDER_STORE_ID_STORE_STORE_ID` FOREIGN KEY (`store_id`) REFERENCES `store` (`store_id`) ON DELETE SET NULL
) ENGINE=InnoDB AUTO_INCREMENT=71xxxxx DEFAULT CHARSET=utf8 COMMENT='Sales Flat Order';
- SQL Server DWH, generated by Microsoft SSMA for MySQL:
/****** Object: Table [msab_magento].[sales_order] Script Date: 10/11/2023 3:17:43 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [msab_magento].[sales_order](
[entity_id] [bigint] IDENTITY(2956088,1) NOT NULL,
[state] [nvarchar](32) NULL,
[status] [nvarchar](32) NULL,
[coupon_code] [nvarchar](255) NULL,
[protect_code] [nvarchar](255) NULL,
[shipping_description] [nvarchar](255) NULL,
[is_virtual] [int] NULL,
[store_id] [int] NULL,
[customer_id] [bigint] NULL,
[discount_amount] [decimal](20, 4) NULL,
[discount_canceled] [decimal](20, 4) NULL,
[discount_invoiced] [decimal](20, 4) NULL,
[discount_refunded] [decimal](20, 4) NULL,
[grand_total] [decimal](20, 4) NULL,
[shipping_amount] [decimal](20, 4) NULL,
[shipping_canceled] [decimal](20, 4) NULL,
[shipping_invoiced] [decimal](20, 4) NULL,
[shipping_refunded] [decimal](20, 4) NULL,
[shipping_tax_amount] [decimal](20, 4) NULL,
[shipping_tax_refunded] [decimal](20, 4) NULL,
[store_to_base_rate] [decimal](12, 4) NULL,
[store_to_order_rate] [decimal](12, 4) NULL,
[subtotal] [decimal](20, 4) NULL,
[subtotal_canceled] [decimal](20, 4) NULL,
[subtotal_invoiced] [decimal](20, 4) NULL,
[subtotal_refunded] [decimal](20, 4) NULL,
[tax_amount] [decimal](20, 4) NULL,
[tax_canceled] [decimal](20, 4) NULL,
[tax_invoiced] [decimal](20, 4) NULL,
[tax_refunded] [decimal](20, 4) NULL,
[total_canceled] [decimal](20, 4) NULL,
[total_invoiced] [decimal](20, 4) NULL,
[total_offline_refunded] [decimal](20, 4) NULL,
[total_online_refunded] [decimal](20, 4) NULL,
[total_paid] [decimal](20, 4) NULL,
[total_qty_ordered] [decimal](12, 4) NULL,
[total_refunded] [decimal](20, 4) NULL,
[can_ship_partially] [int] NULL,
[can_ship_partially_item] [int] NULL,
[customer_is_guest] [int] NULL,
[customer_note_notify] [int] NULL,
[billing_address_id] [int] NULL,
[customer_group_id] [int] NULL,
[edit_increment] [int] NULL,
...
[shipping_incl_tax] [decimal](20, 4) NULL,
CONSTRAINT [PK_sales_order_entity_id] PRIMARY KEY CLUSTERED
(
[entity_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY],
CONSTRAINT [sales_order$SALES_ORDER_INCREMENT_ID_STORE_ID] UNIQUE NONCLUSTERED
(
[increment_id] ASC,
[store_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
ALTER TABLE [msab_magento].[sales_order] ADD DEFAULT (NULL) FOR [state]
GO
ALTER TABLE [msab_magento].[sales_order] ADD DEFAULT (NULL) FOR [status]
GO
...
ALTER TABLE [msab_magento].[sales_order] ADD DEFAULT (NULL) FOR [shipping_incl_tax]
GO
EXEC sys.sp_addextendedproperty @name=N'MS_SSMA_SOURCE', @value=N'msab_magento.sales_order' , @level0type=N'SCHEMA',@level0name=N'msab_magento', @level1type=N'TABLE',@level1name=N'sales_order'
GO
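SQL Server has several table compression options. The one most commonly used for large data warehouse tables is columnstore, which can yield 10x compression on tables with millions of rows.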
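A low-risk way to see what columnstore would do for this table is to try it on a copy. The sketch below assumes a SQL Server version and edition that support clustered columnstore indexes, enough free space for a temporary second copy, and a hypothetical table name `sales_order_cci`. If the table contains `(max)` columns, which the `TEXTIMAGE_ON` clause suggests, a clustered columnstore index needs SQL Server 2017 or later.

```sql
-- Work on a copy so the original table and its primary key stay untouched.
SELECT *
INTO   [msab_magento].[sales_order_cci]        -- hypothetical name
FROM   [msab_magento].[sales_order];

-- SELECT ... INTO creates a heap; this converts it to columnar storage.
CREATE CLUSTERED COLUMNSTORE INDEX [CCI_sales_order_cci]
    ON [msab_magento].[sales_order_cci];

-- Compare the two tables side by side.
EXEC sp_spaceused N'msab_magento.sales_order';
EXEC sp_spaceused N'msab_magento.sales_order_cci';
```

The copy has no primary key or unique constraint, so this measures storage only; whether the ETL's incremental updates perform acceptably against a columnstore table would need separate testing.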
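However, ROW and PAGE compression both change the storage format of every DECIMAL column from fixed width to variable width. In an uncompressed table, a `DECIMAL(20,4)` column is fixed width and takes 13 bytes. With dozens of such columns per row, that fixed width adds up.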
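The fixed widths can be checked directly, and SQL Server can estimate what compression would save before anything is rebuilt. The following is a minimal sketch, assuming an edition and version where data compression is available; the string literal `'complete'` is only an illustrative value.

```sql
-- Bytes used by the SQL Server types that SSMA chose, in an uncompressed row.
SELECT DATALENGTH(CAST(1 AS decimal(20, 4)))            AS decimal_20_4_bytes, -- 13
       DATALENGTH(CAST(1 AS decimal(12, 4)))            AS decimal_12_4_bytes, -- 9
       DATALENGTH(CAST(1 AS bigint))                    AS bigint_bytes,       -- 8
       DATALENGTH(CAST(1 AS int))                       AS int_bytes,          -- 4
       DATALENGTH(CAST(N'complete' AS nvarchar(32)))    AS nvarchar_bytes,     -- 16
       DATALENGTH(CAST('complete'  AS varchar(32)))     AS varchar_bytes;      -- 8

-- Estimate the effect of PAGE compression on every index of the table.
-- The procedure samples the data into tempdb; it does not modify the table.
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'msab_magento',
     @object_name      = N'sales_order',
     @index_id         = NULL,      -- all indexes
     @partition_number = NULL,      -- all partitions
     @data_compression = N'PAGE';   -- or N'ROW' to compare

-- Run only after reviewing the estimate: rebuilds all indexes with compression.
ALTER INDEX ALL ON [msab_magento].[sales_order]
    REBUILD WITH (DATA_COMPRESSION = PAGE);
```

Unlike InnoDB, an uncompressed SQL Server row reserves the full width of a fixed-width column even when the value is NULL, which is why a table with this many nullable DECIMAL columns tends to benefit from ROW or PAGE compression.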