AskOverflow.Dev

Dolan Antenucci
Asked: 2011-10-25 09:11:28 +0800 CST

How to best store Google Web Ngram data?


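This is a continuation of "How to best store Google ngrams in a database?", which covered how to store the Google Ngram Book data.

I'm looking to store the Google NGram Web data, which has a slightly different format (no page/year information; just counts):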

...
ceramics collectables collectibles 55
ceramics collectables fine 130
...
serve as the incoming 92
serve as the incubator 99

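Since this is a very simple data structure, what is a good way to store this data so that it can be imported fairly quickly and the count for a specific ngram can be retrieved quickly?

I like the idea of a relational database, simply because of the common methods for accessing it, but I'm guessing most non-relational databases (e.g. tokyo hashtable) have very common access methods as well.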
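For illustration, here is a minimal sketch of the import side, assuming each line is the n-gram followed by whitespace and the count, as in the sample above. It uses Python's built-in sqlite3 module as a stand-in for whichever store is chosen; the helper names (`parse_ngram_line`, `create_store`, `bulk_load`) are hypothetical, and the table and column names follow the example queries in this question.

```python
import sqlite3

def parse_ngram_line(line):
    """Split 'w1 w2 ... wN count' into (ngram, count)."""
    # Assumption: the count is the last whitespace-separated field.
    ngram, count = line.strip().rsplit(None, 1)
    return ngram, int(count)

def create_store(path=":memory:"):
    """Open a SQLite database with a single ngram -> count table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngram_table ("
        " ngram TEXT PRIMARY KEY,"
        " ngram_count INTEGER NOT NULL)"
    )
    return conn

def bulk_load(conn, lines):
    """Load an iterable of raw lines inside one transaction."""
    # A generator keeps memory flat when `lines` is a large open file.
    rows = (parse_ngram_line(line) for line in lines if line.strip())
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO ngram_table (ngram, ngram_count) VALUES (?, ?)", rows
        )

# Sample lines taken from the excerpt above.
sample = [
    "ceramics collectables collectibles 55",
    "ceramics collectables fine 130",
    "serve as the incoming 92",
    "serve as the incubator 99",
]
conn = create_store()
bulk_load(conn, sample)
```

Wrapping the whole batch in one transaction is usually what makes bulk inserts fast; a duplicate n-gram would raise an integrity error here, which assumes the input files contain each n-gram only once.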

Update

Example queries:

# primary query
> SELECT ngram_count FROM ngram_table WHERE ngram = 'ceramics collectables fine';

ceramics collectables collectibles 55
ceramics collectables fine 130

# secondary query (not needed, but nice if have option)
SELECT ngram_count FROM ngram_table WHERE ngram LIKE '%collectables%';

ceramics collectables collectibles 55
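The two queries above behave very differently for lookups: an equality match on an indexed column can be answered from the index, while a pattern with a leading wildcard generally cannot. The sketch below shows this with Python's sqlite3 module, used here only as a stand-in (the question's target is MySQL, whose planner output looks different); the helper names `exact_count`, `substring_matches` and `plan` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ngram_table ("
    " ngram TEXT PRIMARY KEY,"
    " ngram_count INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO ngram_table VALUES (?, ?)",
    [
        ("ceramics collectables collectibles", 55),
        ("ceramics collectables fine", 130),
        ("serve as the incoming", 92),
        ("serve as the incubator", 99),
    ],
)

def exact_count(conn, ngram):
    """Primary query: count for one exact n-gram, or None if absent."""
    row = conn.execute(
        "SELECT ngram_count FROM ngram_table WHERE ngram = ?", (ngram,)
    ).fetchone()
    return row[0] if row else None

def substring_matches(conn, word):
    """Secondary query: every n-gram containing `word` (wildcards not escaped)."""
    return conn.execute(
        "SELECT ngram, ngram_count FROM ngram_table"
        " WHERE ngram LIKE ? ORDER BY ngram",
        (f"%{word}%",),
    ).fetchall()

def plan(conn, sql, params):
    """Return SQLite's query-plan descriptions for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the description is the last column

exact_plan = plan(
    conn, "SELECT ngram_count FROM ngram_table WHERE ngram = ?", ("x",)
)
like_plan = plan(
    conn, "SELECT ngram_count FROM ngram_table WHERE ngram LIKE ?", ("%x%",)
)
```

In SQLite the first plan reports an index search and the second a full scan. If the substring query ever became important on the full data set, something like a full-text index would probably be needed rather than `LIKE '%...%'`.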
mysql partitioning
1 Answer

  1. Best Answer
    RolandoMySQLDBA
    2011-10-25T09:59:39+08:00

    Here is the script you need:

    USE test
    DROP TABLE IF EXISTS ngram_key;
    DROP TABLE IF EXISTS ngram_rec;
    DROP TABLE IF EXISTS ngram_blk;
    CREATE TABLE ngram_key
    (
        NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        NGRAM VARCHAR(64) NOT NULL,
        PRIMARY KEY (NGRAM),
        KEY (NGRAM_ID)
    ) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
    CREATE TABLE ngram_rec
    (
        NGRAM_ID BIGINT UNSIGNED NOT NULL,
        NGRAM_COUNT SMALLINT NOT NULL,
        PRIMARY KEY (NGRAM_ID)
    ) ENGINE=MyISAM ROW_FORMAT=FIXED;
    CREATE TABLE ngram_blk
    (
        NGRAM VARCHAR(64) NOT NULL,
        NGRAM_COUNT SMALLINT NOT NULL
    ) ENGINE=BLACKHOLE;
    DELIMITER $$
    CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
    BEGIN
        DECLARE NEW_ID BIGINT;
        INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
        SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
        INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
    END; $$
    DELIMITER ;
    INSERT INTO ngram_blk VALUES
    ('rolando',85),
    ('pamela',86),
    ('dominique',87),
    ('diamond',88),
    ('rolando edwards',185),
    ('pamela edwards',186),
    ('dominique edwards',187),
    ('diamond edwards',188),
    ('rolando angel edwards',285),
    ('pamela claricia edwards',286),
    ('dominique sharlisee edwards',287),
    ('diamond ashley edwards',288);
    SELECT * FROM ngram_key;
    SELECT * FROM ngram_rec;
    SELECT A.ngram NGram,B.* FROM 
    ngram_key A,ngram_rec B
    WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
    AND A.ngram_id=B.ngram_id;
    

    Here is what the sample data produces:

    mysql> USE test
    Database changed
    mysql> DROP TABLE IF EXISTS ngram_key;
    Query OK, 0 rows affected, 1 warning (0.00 sec)
    
    mysql> DROP TABLE IF EXISTS ngram_rec;
    Query OK, 0 rows affected, 1 warning (0.00 sec)
    
    mysql> DROP TABLE IF EXISTS ngram_blk;
    Query OK, 0 rows affected, 1 warning (0.00 sec)
    
    mysql> CREATE TABLE ngram_key
        -> (
        ->     NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        ->     NGRAM VARCHAR(64) NOT NULL,
        ->     PRIMARY KEY (NGRAM),
        ->     KEY (NGRAM_ID)
        -> ) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
    Query OK, 0 rows affected (0.53 sec)
    
    mysql> CREATE TABLE ngram_rec
        -> (
        ->     NGRAM_ID BIGINT UNSIGNED NOT NULL,
        ->     NGRAM_COUNT SMALLINT NOT NULL,
        ->     PRIMARY KEY (NGRAM_ID)
        -> ) ENGINE=MyISAM ROW_FORMAT=FIXED;
    Query OK, 0 rows affected (0.04 sec)
    
    mysql> CREATE TABLE ngram_blk
        -> (
        ->     NGRAM VARCHAR(64) NOT NULL,
        ->     NGRAM_COUNT SMALLINT NOT NULL
        -> ) ENGINE=BLACKHOLE;
    Query OK, 0 rows affected (0.05 sec)
    
    mysql> DELIMITER $$
    mysql> CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
        -> BEGIN
        ->     DECLARE NEW_ID BIGINT;
        ->     INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
        ->     SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
        ->     INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.NGRAM_COUNT);
        -> END; $$
    Query OK, 0 rows affected (0.08 sec)
    
    mysql> DELIMITER ;
    mysql> INSERT INTO ngram_blk VALUES
        -> ('rolando',85),
        -> ('pamela',86),
        -> ('dominique',87),
        -> ('diamond',88),
        -> ('rolando edwards',185),
        -> ('pamela edwards',186),
        -> ('dominique edwards',187),
        -> ('diamond edwards',188),
        -> ('rolando angel edwards',285),
        -> ('pamela claricia edwards',286),
        -> ('dominique sharlisee edwards',287),
        -> ('diamond ashley edwards',288);
    Query OK, 12 rows affected (0.10 sec)
    Records: 12  Duplicates: 0  Warnings: 0
    
    mysql> SELECT * FROM ngram_key;
    +----------+-----------------------------+
    | NGRAM_ID | NGRAM                       |
    +----------+-----------------------------+
    |       11 | dominique sharlisee edwards |
    |        1 | rolando                     |
    |        9 | rolando angel edwards       |
    |        4 | diamond                     |
    |        8 | diamond edwards             |
    |        2 | pamela                      |
    |        3 | dominique                   |
    |        6 | pamela edwards              |
    |        5 | rolando edwards             |
    |       12 | diamond ashley edwards      |
    |        7 | dominique edwards           |
    |       10 | pamela claricia edwards     |
    +----------+-----------------------------+
    12 rows in set (0.00 sec)
    
    mysql> SELECT * FROM ngram_rec;
    +----------+-------------+
    | NGRAM_ID | NGRAM_COUNT |
    +----------+-------------+
    |        1 |          85 |
    |        2 |          86 |
    |        3 |          87 |
    |        4 |          88 |
    |        5 |         185 |
    |        6 |         186 |
    |        7 |         187 |
    |        8 |         188 |
    |        9 |         285 |
    |       10 |         286 |
    |       11 |         287 |
    |       12 |         288 |
    +----------+-------------+
    12 rows in set (0.00 sec)
    
    mysql> SELECT A.ngram NGram,B.* FROM
        -> ngram_key A,ngram_rec B
        -> WHERE A.ngram IN ('rolando angel edwards','rolando edwards','rolando')
        -> AND A.ngram_id=B.ngram_id;
    +-----------------------+----------+-------------+
    | NGram                 | NGRAM_ID | NGRAM_COUNT |
    +-----------------------+----------+-------------+
    | rolando               |        1 |          85 |
    | rolando angel edwards |        9 |         285 |
    | rolando edwards       |        5 |         185 |
    +-----------------------+----------+-------------+
    3 rows in set (0.00 sec)
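The pattern in the script above (insert into a table that stores nothing, and let a trigger fan each row out into a key table and a count table) can be tried without a MySQL server. The following is only a rough analogue using Python's sqlite3 module: SQLite has no BLACKHOLE engine and no partitioning, so a view with an `INSTEAD OF INSERT` trigger is swapped in for `ngram_blk`, and the `lookup` helper is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ngram_key (
    ngram_id INTEGER PRIMARY KEY AUTOINCREMENT,
    ngram    TEXT NOT NULL UNIQUE
);
CREATE TABLE ngram_rec (
    ngram_id    INTEGER PRIMARY KEY,
    ngram_count INTEGER NOT NULL
);
-- Stand-in for the BLACKHOLE table: a view that stores nothing itself.
CREATE VIEW ngram_blk AS
    SELECT k.ngram AS ngram, r.ngram_count AS ngram_count
    FROM ngram_key k JOIN ngram_rec r ON r.ngram_id = k.ngram_id;
-- Each insert into the view is redirected into the two real tables.
CREATE TRIGGER populate_ngram INSTEAD OF INSERT ON ngram_blk
BEGIN
    INSERT OR IGNORE INTO ngram_key (ngram) VALUES (NEW.ngram);
    INSERT OR IGNORE INTO ngram_rec (ngram_id, ngram_count)
        SELECT ngram_id, NEW.ngram_count FROM ngram_key
        WHERE ngram = NEW.ngram;
END;
""")

with conn:  # one transaction for the batch
    conn.executemany(
        "INSERT INTO ngram_blk (ngram, ngram_count) VALUES (?, ?)",
        [
            ("rolando", 85),
            ("rolando edwards", 185),
            ("rolando angel edwards", 285),
            ("rolando", 999),  # duplicate n-gram: ignored, first count wins
        ],
    )

def lookup(conn, ngram):
    """Join the key and count tables to fetch the count for one n-gram."""
    row = conn.execute(
        "SELECT r.ngram_count FROM ngram_key k"
        " JOIN ngram_rec r ON r.ngram_id = k.ngram_id"
        " WHERE k.ngram = ?",
        (ngram,),
    ).fetchone()
    return row[0] if row else None
```

One thing to check before loading real data with the MySQL script: a signed `SMALLINT` tops out at 32767, so if any count in the corpus exceeds that, `NGRAM_COUNT` would need a wider type such as `INT UNSIGNED` or `BIGINT`. SQLite's `INTEGER` in this sketch does not have that limit.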
    

    Give it a try!!!
