目标
我在使用 AI 构建用于查找押韵词的 Trie 方面取得了很大进展。基本上,假设您有 1000 万个使用CMU 发音系统编写的英语单词,其中每个音素都是 Trie 中的一个节点(ARPABET
在此图中)。
以下是主要特点:
- Trie 节点是 1+ ASCII 符号。例如
B
“b”音和CH
“ch”音等。 - Trie 可以分页并从特定页面跳转。因此它需要一个
limit
, 和page
属性。 - 分页可以选择性地限制为特定音素序列长度的匹配。例如,音素的个数
g l a d
为 4。 - Trie 搜索输入(由用户提供)是音素数组。音素是 ASCII 中的声音单位(如上所示
ARPABET
)。 - 输入到 trie 中会扩展为所有可能的韵律。它使用音素到音素数组的映射进行扩展(仅在下面的函数中部分实现,需要数周时间才能正确微调此处的值,我很快就需要 TODO)。
- 我们可以将扩展的输入称为“
rhymingPhonemeArray
”。 - 中的每个音素序列都
rhymingPhonemeArray
有权重。所谓“加权”,这个权重基本上就是“映射的音素与原始音素的接近程度”。这样我们就可以判断“这个音素序列(具有较低的累积权重)比这个音素序列押韵更好rhymingPhonemeArray
”。
问题
我现在面临的问题(我们找到的解决方案,下面用 JavaScript 分享)是:
- 遍历整个 trie 来查找所有可能的匹配,这是未优化的。理想情况下,它只遍历它需要的内容(页面/限制数量和偏移量)。
- 之后 Trie 会对整个匹配集进行排序,而不是按照我们的要求排序,也就是获取已排序的页面/限制数量。这是关键问题,不确定是否/如何以优化的方式做到这一点,或者是否可能。
- 是
rhymingPhonemeArray
迭代的,所以如果我们分页并且rhymingPhonemeArray
是这样的[G-A-D (cum-weight: 10), G-L-A-D (cum-weight: 24 or whatever), G-R-A-D (cum-weight: 29), etc.]
,你会G-A-D
先找到与押韵的所有内容,然后再分页,分页G-L-A-D
等等。我想避免这种分组。相反,累积权重需要通过所有 10m 个单词的“全局集”进行排序和分页。
因此对于 (3),它应该找到(类似这样的内容):
input: G-A-D
matches:
G-A-D
G-L-A-D
A-G-A-D
A-R-G-A-D
G-R-A-D
A-G-R-A-D
A-R-G-L-A-D
A-R-G-R-A-D
...
我说的“类似这样的事情”的意思是,请注意它并不像这样(首先,G-A-D
找到所有匹配项,然后G-L-A-D
找到所有匹配项等等):
input: G-A-D
matches:
G-A-D
A-G-A-D
A-R-G-A-D
G-L-A-D
A-R-G-L-A-D
G-R-A-D
A-G-R-A-D
A-R-G-R-A-D
...
相反,在第一种情况下matches
,它更加交织在一起,应该基于每个单词的全局累积权重。
问题
如何修改以下“Trie 实现”来解决上述 1、2 和 3 的问题? 注意:它不需要 100% 准确,只需寻找如何正确解决此分页问题的关键见解(即使只是在高级/伪代码级别)。它们都是同一潜在问题的各个方面,这是我已经说过的,但我会再次说明:
下面的 Trie 实现不能正确且有点高效地通过按全局累积权重排序的词进行分页,这些词与
rhymingPhonemeArray
(从输入生成的phonemes
)完全匹配。
有没有可能解决这个问题,而不必遍历整个 Trie?考虑到(记住),输入被扩展为rhymingPhonemeArray
(可能有很多种可能性,但我们实际上会将输入限制为 3 个音节,这与主题无关)。如果不可能,你能解释一下为什么不可能吗?
如果可能的话,您将如何修改此 trie 以支持分页并跳转到特定页面,而不必遍历所有内容,同时分页结果根据每个单词的累积权重进行全局排序?
Trie 实现
我们最终得到的 Trie 如下:
class TrieNode {
constructor() {
this.children = {}; // Store child nodes
this.isWord = false; // Flag to check if node marks the end of a word
this.word = null; // Store the word if this is an end node
this.cumulativeWeight = 0; // Store the cumulative weight for sorting
this.phonemeLength = 0; // Length of the phoneme sequence
}
}
class PhoneticTrie {
constructor() {
this.root = new TrieNode(); // Root node of the trie
}
// Insert words and phoneme clusters into the Trie
insert(word, phonemes, weight) {
let node = this.root;
for (let phoneme of phonemes) {
if (!node.children[phoneme]) {
node.children[phoneme] = new TrieNode();
}
node = node.children[phoneme];
}
node.isWord = true;
node.word = word;
node.cumulativeWeight = weight; // Store the cumulative weight at this node
node.phonemeLength = phonemes.length; // Store the length of the phoneme sequence
}
// Global sorting of rhyming phoneme matches using a min-heap
searchWithPagination({ phonemes, limit, page, length }) {
const minHeap = new MinHeap(); // Min-Heap to store sorted results
// Generate rhyming phoneme variations
const rhymingPhonemesArray = expandToRhymingPhonemes(phonemes);
// Search the Trie for all rhyming phoneme sequences and insert results into the global heap
for (let rhyme of rhymingPhonemesArray) {
this.searchAndInsertToHeap(rhyme.phonemes, this.root, rhyme.weight, minHeap, length);
}
// Paginate results directly from the globally sorted heap
const paginatedResults = [];
const startIndex = (page - 1) * limit;
let index = 0;
while (minHeap.size() > 0 && paginatedResults.length < limit) {
const wordData = minHeap.extractMin();
if (index >= startIndex) {
paginatedResults.push(wordData.word);
}
index++;
}
return paginatedResults;
}
// Search a specific phoneme sequence in the Trie and insert matches into the heap
searchAndInsertToHeap(phonemes, node, rhymeWeight, heap, targetLength, depth = 0, phonemeSeq = []) {
if (depth === phonemes.length) {
if (node.isWord && node.word) {
// If length filtering is specified, ensure the word matches the phoneme length
if (!targetLength || node.phonemeLength === targetLength) {
heap.insert({
word: node.word,
cumulativeWeight: node.cumulativeWeight + rhymeWeight, // Add the rhyme weight here
phonemeLength: node.phonemeLength, // Include phoneme length for sorting
});
}
}
return;
}
const phoneme = phonemes[depth];
if (!node.children[phoneme]) return; // No match
this.searchAndInsertToHeap(phonemes, node.children[phoneme], rhymeWeight, heap, targetLength, depth + 1, [...phonemeSeq, phoneme]);
}
}
class MinHeap {
constructor() {
this.heap = []; // Store the heap elements
}
// Insert an item into the heap based on its weight and phoneme length
insert({ word, cumulativeWeight, phonemeLength }) {
this.heap.push({ word, cumulativeWeight, phonemeLength });
this.bubbleUp(this.heap.length - 1);
}
bubbleUp(index) {
let currentIndex = index;
while (currentIndex > 0) {
const parentIndex = Math.floor((currentIndex - 1) / 2);
// Sort primarily by cumulative weight, secondarily by phoneme length
if (
this.heap[currentIndex].cumulativeWeight > this.heap[parentIndex].cumulativeWeight ||
(this.heap[currentIndex].cumulativeWeight === this.heap[parentIndex].cumulativeWeight &&
this.heap[currentIndex].phonemeLength > this.heap[parentIndex].phonemeLength)
) {
break;
}
[this.heap[currentIndex], this.heap[parentIndex]] = [this.heap[parentIndex], this.heap[currentIndex]];
currentIndex = parentIndex;
}
}
// Extract the item with the lowest weight
extractMin() {
if (this.heap.length === 0) return null;
if (this.heap.length === 1) return this.heap.pop();
const min = this.heap[0];
this.heap[0] = this.heap.pop();
this.bubbleDown(0);
return min;
}
bubbleDown(index) {
const lastIndex = this.heap.length - 1;
while (true) {
let leftChildIdx = 2 * index + 1;
let rightChildIdx = 2 * index + 2;
let smallestIdx = index;
if (
leftChildIdx <= lastIndex &&
(this.heap[leftChildIdx].cumulativeWeight < this.heap[smallestIdx].cumulativeWeight ||
(this.heap[leftChildIdx].cumulativeWeight === this.heap[smallestIdx].cumulativeWeight &&
this.heap[leftChildIdx].phonemeLength < this.heap[smallestIdx].phonemeLength))
) {
smallestIdx = leftChildIdx;
}
if (
rightChildIdx <= lastIndex &&
(this.heap[rightChildIdx].cumulativeWeight < this.heap[smallestIdx].cumulativeWeight ||
(this.heap[rightChildIdx].cumulativeWeight === this.heap[smallestIdx].cumulativeWeight &&
this.heap[rightChildIdx].phonemeLength < this.heap[smallestIdx].phonemeLength))
) {
smallestIdx = rightChildIdx;
}
if (smallestIdx === index) break;
[this.heap[index], this.heap[smallestIdx]] = [this.heap[smallestIdx], this.heap[index]];
index = smallestIdx;
}
}
size() {
return this.heap.length;
}
}
function expandToRhymingPhonemes(phonemes) {
// todo: this function would have a huge map of phoneme substitutions...
const phonemeSubstitutions = {
"k": ["g", "p"], // Example of substitutions for the phoneme "k"
"æ": ["a", "e"], // Example for "æ"
"t": ["d", "s"], // Example for "t"
};
const rhymingPhonemesArray = [];
function generateRhymes(sequence, depth = 0, currentWeight = 0) {
if (depth === sequence.length) {
rhymingPhonemesArray.push({ phonemes: sequence.slice(), weight: currentWeight });
return;
}
const phoneme = sequence[depth];
const substitutions = phonemeSubstitutions[phoneme] || [phoneme];
for (const sub of substitutions) {
generateRhymes(
[...sequence.slice(0, depth), sub, ...sequence.slice(depth + 1)],
depth + 1,
currentWeight + (sub === phoneme ? 0 : 1) // Substitution adds to weight
);
}
}
generateRhymes(phonemes);
return rhymingPhonemesArray;
}
const trie = new PhoneticTrie();
trie.insert("glad", ["g", "l", "a", "d"], 3);
trie.insert("grad", ["g", "r", "a", "d"], 2);
trie.insert("blad", ["b", "l", "a", "d"], 4);
trie.insert("grin", ["g", "r", "i", "n"], 5);
// Search for similar words to "g-l-a-d" and paginate the results, with optional length filtering
const resultsPage1 = trie.searchWithPagination({
phonemes: ["g", "l", "a", "d"],
limit: 2,
page: 1,
length: 4, // Only consider words with exactly 4 phonemes
});
const resultsPage2 = trie.searchWithPagination({
phonemes: ["g", "l", "a", "d"],
limit: 2,
page: 2,
length: 4,
});
console.log(resultsPage1); // Output: ["grad", "glad"] (sorted by weight and length)
console.log(resultsPage2); // Output: ["blad", "grin"]
更新
- 这里是对 Trie 进行进一步扩展/尝试以解决这一关键问题的要点。