一时兴起想写一个随机句子生成器。正巧在学上下文无关语言 (Context-Free Language, CFL),意识到句子的结构很大程度上是一个上下文无关语法。查阅 X-bar 理论得到也确实类似如此。
同时正好在学 Rust,所以试了一下拿 Rust 编写这个生成器,并编译到 WebAssembly (WASM)。效果很成功!
Context Free Language ¶
A context-free grammar is a notation for describing languages.
It is more powerful than finite automata or RE’s, but still cannot define all possible languages.
Useful for nested structures, e.g., parentheses in programming languages. But it also works for natural language.
The basic idea of CFL to use “variables” to stand for sets of strings (i.e., languages). These variables are defined recursively, in terms of one another. The recursive rules (“productions”) involve only concatenation.
Lets start with a example. Consider these productions:
It means is in the language. And recursively, if is in the language, then so is .
Definations:
- Terminals: symbols of the alphabet of the language being defined.
- Variables (nonterminals): a finite set of other symbols, each of which represents a language.
- Start symbol: the variable whose language is the one being defined.
A production has the form . Then we derive strings in the language of a CFG by starting with the start symbol, and repeatedly replacing some variable by the body of one of its productions.
For example,
We are not going to introduce CFL in detail, so let’s talk about how does it related to natural language. Let us consider a classic subject-verb-object structure (SVO), where a sentence consists of a subject, a verb, and an object in that order.
CFG of natural language grammar ¶
现在继续考虑自然语言。我们只考虑最简单的句子结构,不考虑时态、时体的话,可以构造这样一个 CFG
- (句子 → 名词短语 + 动词短语)
- (纯名词)
- (形容词 + 名词)
- (不及物动词 e.g. 上班)
- (及物动词 e.g. 打宿傩)
- (双宾语及物动词 e.g. 给它一本书)
现在考虑一个从句也可以修饰一个名词,例如 “打宿傩的我”,因此又可以构造回环
Implement by Rust ¶
现在已经有了 CFG,需要考虑的是如何实现它。显而易见的,我们可以用一个 enum
存放可能出现的 Variables,并用 match
匹配 terminals. 因此我们能写出类似这样的代码:
enum CFGSymbol {
Sentence,
NP,
VP,
N,
VInt,
VTra,
VDit,
ADJ,
ADV,
}
fn expand_str(symbol: &CFGSymbol) -> Option<&'static [&'static str]> {
Some(match symbol {
N => &["小豆泥", "薯条", "葡萄", "咖啡"],
VInt => &["上班", "发疯"],
VTra => &["吃", "摸"],
VDit => &["给", "送给"],
ADJ => &["可爱的", "好吃的"],
ADV => &["超大声地", "超小声地"],
_ => return None,
})
}
关于 Rules,如果打算用常量存储,就不能用 Vec
。但没关系,我们可以用固定的数组配合元组描述一个规则。可以让
写作
(S, &[A1, A2, /* ... */, An])
于是我们可以写出 CFG 规则
const CFG_RULES: &[(CFGSymbol, &[CFGSymbol])] = &[
(Sentence, &[NP, VP]),
(NP, &[Ns]),
(Ns, &[N]),
(Ns, &[ADJ, N]),
(Ns, &[VP, CharDe, N]),
(Ns, &[ADJ, N]),
(VP, &[Vs]),
(VP, &[ADV, Vs]),
(Vs, &[VInts]),
(Vs, &[VTras]),
(Vs, &[VDits]),
(VInts, &[VInt]),
(VTras, &[VTra, NP]),
(VDits, &[VDitts, NP, NP]),
];
剩下的工作就简单了:从规则一条条推导,随机选择一条推导规则继续,直到所有 Symbol 都是 terminal 就生成了一条随机的句子。
完整代码(没必要看的)
extern crate rand;
extern crate wasm_bindgen;
use rand::seq::SliceRandom;
use wasm_bindgen::prelude::*;
#[derive(std::fmt::Debug, PartialEq, Eq)]
enum CFGSymbol {
/// 句子
Sentence,
/// 名词短语
NP,
Ns,
/// noun. 名词
N,
/// 动词短语
VP,
Vs,
/// adj. 形容词
ADJ,
/// Transitive (单宾语)及物动词
VTra,
/// 单宾语及物动词,后置
VTra2,
VTras,
/// Intransitive 不及物动词
VInt,
VInts,
/// Ditransitive 双宾语及物
VDit,
VDitts,
VDits,
/// adv. 副词
ADV,
// 把/被
CharBaBei,
// 的
CharDe,
}
use crate::CFGSymbol::*;
// S -> SS, + 概率权重
const CFG_RULES: &[(CFGSymbol, &[CFGSymbol], i32)] = &[
(Sentence, &[NP, VP], 5), // S V O 结构
(Sentence, &[NP, CharBaBei, NP, VTra2], 1), // noun. 把/被 noun. verb-tra.
(Sentence, &[NP, CharBaBei, NP, VDit, NP], 1), // noun. 把/被 noun. verb-tra.
(NP, &[Ns], 3),
(Ns, &[N], 6),
(Ns, &[ADJ, N], 3),
(Ns, &[VP, CharDe, N], 1),
(Ns, &[ADJ, N], 1),
(VP, &[ADV, Vs], 1),
(VP, &[Vs], 3),
(Vs, &[VInts], 1),
(Vs, &[VTras], 1),
(Vs, &[VDits], 1),
(VInts, &[VInt], 1),
(VTras, &[VTra, NP], 1),
(VDits, &[VDitts, NP], 1),
(VDitts, &[VDit, NP], 1),
];
fn expand_str(symbol: &CFGSymbol) -> Option<&'static [&'static str]> {
Some(match symbol {
// nouns
N => &[
"小豆泥",
"猫",
"纸",
"鬼",
"薯条",
"KFC",
"麦当劳",
"霸王龙",
"饮料",
"皇上",
"甜点",
"毛球",
"狸花",
"葡萄",
"咖啡",
"茶",
"兔子",
"鳄鱼",
"小黑猫",
"undefined",
"错误",
"芝士",
"自助餐",
"蛋挞",
"Windows11",
],
VTra => &[
"打",
"吃",
"啃",
"摸",
"揉",
"喝",
"买",
"是",
"吃掉了",
"拿走了",
],
VTra2 => &["吃掉了", "拿走了"],
VInt => &[
"喵喵叫",
"睡觉",
"做深蹲",
"写作业",
"学习",
"画画",
"emo",
"吃饭",
"蹦蹦跳跳",
"摸鱼",
"看电视",
"ypt启动",
"关闭ypt",
"打maimai",
"写代码",
"不知道在干什么",
"思考中",
"呜呜呜",
"刷TL",
"上班",
"工作",
"发疯",
"正在更新",
"看同人",
],
VDit => &["给", "送给"],
ADJ => &[
"可爱的",
"好吃的",
"无聊的",
"开心的",
"美味的",
"刚重生的",
"你主页上的",
"谁的",
"终末时的",
"无限的",
"超厉害的",
],
ADV => &[
"悄悄",
"静静",
"想",
"轻轻地",
"不想",
"正在",
"在",
"在你身后",
"几乎处处",
"随时",
"超大声地",
"超小声地",
"又开始",
"重新",
],
CharBaBei => &["把", "被"],
CharDe => &["的"],
_ => return None,
})
}
const TO_STR_POSSIBILITY: f64 = 0.5;
fn generate(ctx: &mut String, symbol: &CFGSymbol) -> String {
ctx.push_str(&format!("[{:?} ", symbol));
let mut symbol_rules = Vec::<&[CFGSymbol]>::new();
for (sym, rule, weight) in CFG_RULES.iter() {
for _ in 0..if sym == symbol { *weight } else { 0 } {
symbol_rules.push(rule);
}
}
let force_expand = symbol_rules.len() == 0;
let res = match expand_str(symbol) {
Some(strs) if force_expand || rand::random::<f64>() < TO_STR_POSSIBILITY => {
let res = strs
.choose(&mut rand::thread_rng())
.unwrap_or_else(|| &"")
.to_string();
ctx.push_str(&format!("{}", res));
res
}
Some(_) | None => {
if let Some(chosen) = symbol_rules.choose(&mut rand::thread_rng()) {
chosen
.iter()
.map(|sym| generate(ctx, sym))
.collect::<Vec<String>>()
.join("")
} else {
"".to_string()
}
}
};
ctx.push_str(&format!("] "));
res
}
#[wasm_bindgen]
pub struct SentenceRes {
ctx: String,
res: String,
}
#[wasm_bindgen]
impl SentenceRes {
#[wasm_bindgen(getter)]
pub fn ctx(&self) -> String {
self.ctx.clone()
}
#[wasm_bindgen(getter)]
pub fn res(&self) -> String {
self.res.clone()
}
}
#[wasm_bindgen]
pub fn generate_sentence() -> SentenceRes {
let mut ctx = String::new();
let res = generate(&mut ctx, &Sentence);
SentenceRes { ctx, res }
}
WASM ¶
上述的 Rust 代码依照 MDN web docs 的教程编译成了 WASM,可以在下面按按钮体验一下随机句子生成器。
[Sentence [NP [Ns [N 芝士] ] ] [VP [Vs [VInts [VInt 喵喵叫] ] ] ] ]
芝士喵喵叫
Copyright
上图的 Syntax Generator 代码来自 https://github.com/mshang/syntree ,许可证如下
Copyright (C) 2011 by Miles Shang <mail@mshang.ca>
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
This software uses the following libraries:
base64.js
Copyright (C) 1999 Masanao Izumo <iz@onicos.co.jp>
Version: 1.0
LastModified: Dec 25 1999
This library is free. You can redistribute it and/or modify it.
canvas2image.js
Canvas2Image v0.1
Copyright (c) 2008 Jacob Seidelin, jseidelin@nihilogic.dk
MIT License [http://www.opensource.org/licenses/mit-license.php]
jQuery v1.7.1
Copyright (c) 2011 John Resig, http://jquery.com/
MIT license [jquery.org/license]
jQuery UI v1.8.18
Copyright (c) 2011 Paul Bakaus, http://jqueryui.com/
MIT license [jquery.org/license]