DaCiDian is an open-source lexicon for Chinese automatic speech recognition (ASR).
In a mainstream ASR system, the lexicon is a core component that maps words to acoustic modeling units (such as phones). DaCiDian splits this mapping into two independent layers:
word --> PinYin syllable --> phoneme
This design serves two purposes:
Anyone familiar with PinYin (basically every Mandarin speaker) can enrich DaCiDian's vocabulary by adding new entries (words) to the layer-1 mapping.
ASR system developers can easily adapt DaCiDian to their own phone set by defining their own layer-2 mapping.
    ...
    裤子 KU_4 ZI_0
    好事 HAO_4 SHI_4;HAO_3 SHI_4
    教授 JIAO_1 SHOU_4;JIAO_4 SHOU_4
    ...
    语音识别 YU_3 YIN_1 SHI_2 BIE_2
    傅里叶变换 FU_4 LI_3 YE_4 BIAN_4 HUAN_4
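The layer-1 entries above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_word_to_pinyin` is hypothetical, and the format (a word, whitespace, then one or more pronunciations separated by `;`) is inferred from the sample entries rather than taken from a formal specification.

```python
def parse_word_to_pinyin(lines):
    """Parse layer-1 entries into {word: [[syllable, ...], ...]}.

    Assumed format (inferred from the sample entries): the word comes first,
    followed by whitespace, then alternative pronunciations separated by ';'.
    """
    lexicon = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) < 2:
            continue  # skip blank lines and lines without a pronunciation
        word, prons = parts
        # Each alternative pronunciation is a whitespace-separated syllable list.
        lexicon[word] = [p.split() for p in prons.split(";")]
    return lexicon


sample = [
    "裤子 KU_4 ZI_0",
    "好事 HAO_4 SHI_4;HAO_3 SHI_4",
    "语音识别 YU_3 YIN_1 SHI_2 BIE_2",
]
entries = parse_word_to_pinyin(sample)
for word, prons in entries.items():
    print(word, prons)
```

Keeping every alternative pronunciation as its own syllable list makes it straightforward to emit one lexicon line per pronunciation later on.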
pinyin_to_phone is a user-defined mapping from PinYin syllables to the target phone set.
Taking the traditional PinYin Initial-Final structure as an example, the mapping would be defined as follows:
    A $0 a
    AI $0 ai
    AN $0 an
    ANG $0 ang
    AO $0 ao
    BA b a
    BAI b ai
    BAN b an
    BANG b ang
    BAO b ao
    ... ... ...
    ZONG z ong
    ZOU z ou
    ZU z u
    ZUAN z uan
    ZUI z ui
    ZUN z un
    ZUO z uo
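To illustrate how the two layers compose, the sketch below loads a few rows of the Initial-Final mapping and expands a sequence of toned syllables into phones. The function names are hypothetical, and the tone handling is an assumption made purely for illustration: the tone digit is appended to the last unit (the final) with an underscore. Nothing above specifies how tones are carried over, so adjust this to your own phone set.

```python
def parse_pinyin_to_phone(lines):
    """Parse layer-2 rows such as 'BA b a' into {'BA': ['b', 'a']}."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] == "...":
            continue  # skip blank lines and elision markers
        table[fields[0]] = fields[1:]
    return table


def syllables_to_phones(syllables, table):
    """Expand toned syllables (e.g. 'ZUO_4') into a flat phone list."""
    phones = []
    for syllable in syllables:
        base, _, tone = syllable.partition("_")
        units = list(table[base])  # copy so the table is not modified
        if tone:
            # Assumption: the tone is attached to the final (last unit).
            units[-1] = f"{units[-1]}_{tone}"
        phones.extend(units)
    return phones


mapping = parse_pinyin_to_phone(["A $0 a", "BA b a", "... ... ...", "ZUO z uo"])
# An arbitrary syllable sequence, chosen only to exercise the mapping.
print(syllables_to_phones(["ZUO_4", "A_0"], mapping))
```

Because the layer-1 file only refers to syllables, switching to a different phone set means replacing the table passed to `syllables_to_phones`; the word entries themselves stay untouched.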