DaNangNLP Toolkit for Vietnamese Text Preprocessing and Word Segmentation

Nguyen, Ket Doan; Nguyen, Tran Tien; Nguyen, Duc Bao; Ton, That Ron; Vo, Van Nam; Pham, Van Nam; Phung, Anh Sang; Huynh, Cong Phap; Nguyen, Huu Nhat Minh

Please use this identifier to cite or link to this item: https://elib.vku.udn.vn/handle/123456789/4041

Full metadata record

DC Field	Value	Language
dc.contributor.author	Nguyen, Ket Doan	-
dc.contributor.author	Nguyen, Tran Tien	-
dc.contributor.author	Nguyen, Duc Bao	-
dc.contributor.author	Ton, That Ron	-
dc.contributor.author	Vo, Van Nam	-
dc.contributor.author	Pham, Van Nam	-
dc.contributor.author	Phung, Anh Sang	-
dc.contributor.author	Huynh, Cong Phap	-
dc.contributor.author	Nguyen, Huu Nhat Minh	-
dc.date.accessioned	2024-07-31T03:50:00Z	-
dc.date.available	2024-07-31T03:50:00Z	-
dc.date.issued	2024-07	-
dc.identifier.isbn	978-604-80-9774-5	-
dc.identifier.uri	https://elib.vku.udn.vn/handle/123456789/4041	-
dc.description	Proceedings of the 13th International Conference on Information Technology and Its Applications (CITA 2024); pp: 296-307	vi_VN
dc.description.abstract	Recent research has focused on Vietnamese large language models; however, preprocessing steps play an important complementary role in the future success of Vietnamese language processing. In this paper, we design and develop DaNangNLP, a novel toolkit that handles the key Vietnamese language preprocessing steps. Although many modules for Vietnamese language processing have been successful, existing toolkits still exhibit shortcomings, especially in word segmentation of complex Vietnamese sentences. We have therefore developed a practical and robust natural language processing pipeline tailored to the Vietnamese language to address the issues present in previous Vietnamese processing toolkits. The DaNangNLP pipeline, built on novel built-in word dictionaries, handles typical preprocessing steps for Vietnamese text: sentence segmentation, word regex, word normalization, and word segmentation. In the evaluation, the proposed semantic-based word segmentation outperformed frequency-based word segmentation and existing toolkits on complex sentences.	vi_VN
dc.language.iso	en	vi_VN
dc.publisher	Vietnam-Korea University of Information and Communication Technology	vi_VN
dc.relation.ispartofseries	CITA;	-
dc.subject	Sentence Segmentation	vi_VN
dc.subject	Regular Expression	vi_VN
dc.subject	Word Segmentation	vi_VN
dc.subject	Word Normalization	vi_VN
dc.subject	Vietnamese Language Processing	vi_VN
dc.title	DaNangNLP Toolkit for Vietnamese Text Preprocessing and Word Segmentation	vi_VN
Appears in Collections:	CITA 2024 (Proceeding - Vol 2)
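
The abstract names a sequence of preprocessing steps (sentence segmentation, normalization, dictionary-backed word segmentation) without showing what such a pipeline looks like. The Python sketch below is only an illustration of that general shape and is not the DaNangNLP implementation: it swaps in a plain greedy longest-match lookup rather than the semantic-based segmentation the paper proposes, and the dictionary entries and all function names (`split_sentences`, `normalize`, `segment_words`, `preprocess`) are hypothetical.

```python
import re
import unicodedata

# Hypothetical toy dictionary of multi-syllable words (illustration only,
# not the toolkit's built-in dictionaries). Entries are NFC-normalized
# and stored as tuples of lowercase syllables.
TOY_DICTIONARY = {
    tuple(unicodedata.normalize("NFC", word).split())
    for word in ["xử lý", "ngôn ngữ", "tự nhiên", "đà nẵng"]
}
MAX_WORD_SYLLABLES = max(len(entry) for entry in TOY_DICTIONARY)


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def normalize(sentence):
    """Apply Unicode NFC so composed and decomposed diacritics compare
    equal, then collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def segment_words(sentence):
    """Greedy longest-match segmentation: adjacent syllables that form a
    dictionary word are joined with '_'; anything else stays a single token."""
    syllables = re.findall(r"\w+|[^\w\s]", sentence)
    words, i = [], 0
    while i < len(syllables):
        longest = min(MAX_WORD_SYLLABLES, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = tuple(s.lower() for s in syllables[i:i + n])
            if n == 1 or candidate in TOY_DICTIONARY:
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
    return words


def preprocess(text):
    """Run sentence splitting, normalization, and word segmentation."""
    return [segment_words(normalize(s)) for s in split_sentences(text)]
```

Calling `preprocess("Xử lý ngôn ngữ tự nhiên. Tôi ở Đà Nẵng!")` returns one token list per sentence, with dictionary words joined by underscores. Greedy matching of this kind cannot resolve overlapping candidates, which is the sort of ambiguity in complex sentences that the abstract says motivates the semantic-based approach.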
