Statistical learning of sub-words in Vietnamese language
Journal
THE 1ST INTERNATIONAL CONFERENCE ON INNOVATIONS FOR COMPUTING, ENGINEERING AND MATERIALS, 2021: ICEM, 2021
AIP Conference Proceedings
ISSN
0094-243X
Date Issued
2021
Author(s)
D. Q. Nguyen
T. H. Le
DOI
10.1063/5.0068414
Abstract
Sub-words have recently attracted much attention and have been employed to improve many natural language processing applications. In this paper, we suggest a procedure to extract sub-word units from a text collection. The sub-word units are evaluated on two Vietnamese databases to analyze and discuss their statistics and characteristics for the Vietnamese language, including sub-word types, sub-word frequency, top sub-word distribution, and unknown sub-words in different text types. The experimental results also point out several problems in the splitting of training and testing data in a current Vietnamese language processing example, Optical Character Recognition (OCR) error correction.