一时兴起想写一个随机句子生成器。正巧在学上下文无关语言 (Context-Free Language, CFL),意识到句子的结构很大程度上是一个上下文无关语法。查阅 X-bar 理论得到也确实类似如此。
同时正好在学 Rust,所以试了一下拿 Rust 编写这个生成器,并编译到 WebAssembly (WASM)。效果很成功!
Context Free Language ¶
A context-free grammar is a notation for describing languages.
It is more powerful than finite automata or RE’s, but still cannot define all possible languages.
Useful for nested structures, e.g., parentheses in programming languages. But it also works for natural language.
The basic idea of CFL to use “variables” to stand for sets of strings (i.e., languages). These variables are defined recursively, in terms of one another. The recursive rules (“productions”) involve only concatenation.
Lets start with a example. Consider these productions:
It means is in the language. And recursively, if is in the language, then so is .
Definations:
- Terminals: symbols of the alphabet of the language being defined.
- Variables (nonterminals): a finite set of other symbols, each of which represents a language.
- Start symbol: the variable whose language is the one being defined.
A production has the form . Then we derive strings in the language of a CFG by starting with the start symbol, and repeatedly replacing some variable by the body of one of its productions.
For example,
We are not going to introduce CFL in detail, so let’s talk about how does it related to natural language. Let us consider a classic subject-verb-object structure (SVO), where a sentence consists of a subject, a verb, and an object in that order.
CFG of natural language grammar ¶
现在继续考虑自然语言。我们只考虑最简单的句子结构,不考虑时态、时体的话,可以构造这样一个 CFG
- (句子 → 名词短语 + 动词短语)
- (纯名词)
- (形容词 + 名词)
- (不及物动词 e.g. 上班)
- (及物动词 e.g. 打宿傩)
- (双宾语及物动词 e.g. 给它一本书)
现在考虑一个从句也可以修饰一个名词,例如 “打宿傩的我”,因此又可以构造回环
Implement by Rust ¶
现在已经有了 CFG,需要考虑的是如何实现它。显而易见的,我们可以用一个 enum
存放可能出现的 Variables,并用 match
匹配 terminals. 因此我们能写出类似这样的代码:
enum CFGSymbol { |
关于 Rules,如果打算用常量存储,就不能用 Vec
。但没关系,我们可以用固定的数组配合元组描述一个规则。可以让
写作
(S, &[A1, A2, /* ... */, An]) |
于是我们可以写出 CFG 规则
const CFG_RULES: &[(CFGSymbol, &[CFGSymbol])] = &[ |
剩下的工作就简单了:从规则一条条推导,随机选择一条推导规则继续,直到所有 Symbol 都是 terminal 就生成了一条随机的句子。
完整代码(没必要看的)
extern crate rand; |
WASM ¶
上述的 Rust 代码依照 MDN web docs 的教程编译成了 WASM,可以在下面按按钮体验一下随机句子生成器。
[Sentence [NP [Ns [N 芝士] ] ] [VP [Vs [VInts [VInt 喵喵叫] ] ] ] ]
芝士喵喵叫
Copyright
上图的 Syntax Generator 代码来自 https://github.com/mshang/syntree ,许可证如下
Copyright (C) 2011 by Miles Shang <mail@mshang.ca> |