A Comparison of several Deep Learning based Models for Diacritic Restoration Problem in Vietnamese Text

Tran, Quang Linh; Lam, Gia Huy; Duong, Van Binh; Vuong, Cong Dat; Do, Trong Hop

Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/1902

Full metadata record

DC Field	Value	Language
dc.contributor.author	Tran, Quang Linh	-
dc.contributor.author	Lam, Gia Huy	-
dc.contributor.author	Duong, Van Binh	-
dc.contributor.author	Vuong, Cong Dat	-
dc.contributor.author	Do, Trong Hop	-
dc.date.accessioned	2021-12-27T10:05:30Z	-
dc.date.available	2021-12-27T10:05:30Z	-
dc.date.issued	2021	-
dc.identifier.uri	http://elib.vku.udn.vn/handle/123456789/1902	-
dc.description	The 10th Conference on Information Technology and its Applications; Topic: Image and Natural Language Poster; pp. 65-74.	vi_VN
dc.description.abstract	Diacritic restoration is a challenging problem in natural language processing (NLP). With diacritic restoration, one can text faster and more easily. Diacritic restoration is also helpful in making use of diacritic-missing texts, which are normally discarded in many NLP applications. This paper deals with the diacritic restoration problem for Vietnamese text. Three state-of-the-art deep learning models, namely Gated Recurrent Unit (GRU), Bidirectional Long Short-Term Memory and Bidirectional Gated Recurrent Unit (Bidirectional GRU), were examined for the problem, and the last one turned out to be the best among them. Besides the choice of deep learning model, the paper found that word tokenization, which is the final pre-processing step applied to the data before it is fed to the deep learning models, also influences the final accuracy. Of the two word tokenization methods examined, morpheme-based tokenization and phrase-based tokenization, the former yields better results regardless of the deep learning model applied. The experimental results show that the combination of morpheme-based tokenization and Bidirectional GRU achieves the best diacritic restoration performance, with a BLEU score of 88.06%.	vi_VN
dc.language.iso	en	vi_VN
dc.publisher	Da Nang Publishing House	vi_VN
dc.subject	Diacritic Restoration	vi_VN
dc.subject	Neural Network	vi_VN
dc.subject	Machine Translation	vi_VN
dc.subject	Natural Language Processing	vi_VN
dc.subject	Word Tokenization	vi_VN
dc.title	A Comparison of several Deep Learning based Models for Diacritic Restoration Problem in Vietnamese Text	vi_VN
dc.type	Working Paper	vi_VN
Appears in Collections:	CITA 2021
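
Illustrative note: the abstract frames diacritic restoration as mapping diacritic-missing text back to fully marked text. The Python sketch below shows one way such source/target pairs could be built from ordinary Vietnamese sentences. It is not taken from the paper. The function names strip_diacritics and make_pair are hypothetical, and whitespace splitting is only an assumed, rough stand-in for the morpheme-based tokenization the abstract mentions.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with Vietnamese diacritics removed (hypothetical helper)."""
    # "đ"/"Đ" are separate letters, not a base letter plus a combining mark,
    # so Unicode decomposition leaves them unchanged; map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter and combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, circumflex, breve, horn).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence):
    """Build a (source, target) token pair for a restoration model.

    Assumption: tokens are whitespace-separated syllables. The paper's
    actual tokenizers are not described in this record.
    """
    target = sentence.split()
    source = [strip_diacritics(token) for token in target]
    return source, target


if __name__ == "__main__":
    src, tgt = make_pair("Tiếng Việt")
    print(src, tgt)
```

A model such as the Bidirectional GRU named in the abstract would then be trained to predict the target tokens from the source tokens. Model training itself is outside the scope of this sketch.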
