Statistical learning of sub-words in Vietnamese language

Authors: D. Q. Nguyen, T. H. Le
Dates: 2022-11-23, 2022-11-23, 2021
URL: https://repository.vlu.edu.vn/handle/123456789/1560
DOI: 10.1063/5.0068414

Abstract: Sub-words have recently attracted much attention and have been employed to improve many natural language processing applications. In this paper, we suggest a procedure to extract sub-word units from a text collection. The sub-word units are evaluated on two Vietnamese databases to analyze and discuss their statistics and characteristics for the Vietnamese language, including sub-word types, sub-word frequency, the distribution of the top sub-words, and unknown sub-words in different text types. The experimental results also point out several problems in how training and testing data are split in a current Vietnamese language processing example: Optical Character Recognition (OCR) error correction.

Language: en-US
Keywords: Learning and learning models; Educational assessment; Natural language processing
Type: proceedings-article
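
The statistics named in the abstract (sub-word types, sub-word frequency, the share of text covered by the top sub-words, and unknown sub-words between data splits) can be illustrated with a minimal Python sketch. This record does not describe the paper's actual extraction procedure, so the sketch swaps in plain character n-grams over whitespace-separated tokens as an assumption. The function names, the n-gram range, and the sample strings are all hypothetical and are not taken from the paper.

```python
import unicodedata
from collections import Counter


def char_ngrams(token, n_min=1, n_max=3):
    """Return all character n-grams of a token for n in [n_min, n_max]."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def subword_counts(text, n_min=1, n_max=3):
    """Count sub-word units over whitespace-separated, lowercased tokens.

    Text is NFC-normalized first so that Vietnamese letters with diacritics
    are treated as single characters where a precomposed form exists.
    """
    text = unicodedata.normalize("NFC", text).lower()
    counts = Counter()
    for token in text.split():
        counts.update(char_ngrams(token, n_min, n_max))
    return counts


def top_k_coverage(counts, k):
    """Fraction of all sub-word occurrences covered by the k most frequent types."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


def unknown_rate(train_counts, test_counts):
    """Fraction of test sub-word types that never occur in the training data."""
    if not test_counts:
        return 0.0
    unseen = [s for s in test_counts if s not in train_counts]
    return len(unseen) / len(test_counts)


if __name__ == "__main__":
    # Hypothetical sample strings, used only to exercise the functions.
    train = subword_counts("học máy học sâu")
    test = subword_counts("học tiếng Việt")
    print("sub-word types in train:", len(train))
    print("most frequent:", train.most_common(3))
    print("top-10 coverage:", top_k_coverage(train, 10))
    print("unknown type rate on test:", unknown_rate(train, test))
```

A high unknown rate between two splits of the same collection would be one simple signal of the kind of train/test splitting problem the abstract mentions, although the paper's own criteria may differ.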