Thesis: Enhancing the quality of machine translation system using cross-lingual word embedding models

Enhancing the quality of Machine Translation System Using Cross-Lingual Word Embedding Models

Nguyen Minh Thuan
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Associate Professor Nguyen Phuong Thai

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science

November 2018

ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."

Hanoi, November 15th, 2018

Signed ........................................................................

ABSTRACT

In recent years, Machine Translation has shown promising results and attracted much interest from researchers. Two approaches that have been widely used for machine translation are Phrase-based Statistical Machine Translation (PBSMT) and Neural Machine Translation (NMT). Both approaches rely heavily on large bilingual corpora, which require much effort and financial support to build. A lack of bilingual data leads to a poor phrase-table, one of the main components of PBSMT, and to the unknown word problem in NMT. In contrast, monolingual data are available for most languages. Taking advantage of this, many word embedding and cross-lingual word embedding models have appeared and improved the quality of various natural language processing tasks. The purpose of this thesis is to propose two models that use cross-lingual word embeddings to address this impediment. The first model enhances the quality of the phrase-table in SMT, and the second tackles the unknown word problem in NMT.

Publications:
• Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen and Chi-Mai Luong. Enhancing the quality of Phrase-table in Statistical Machine Translation for Less-Common and Low-Resource Languages. In the 2018 International Conference on Asian Language Processing (IALP 2018).

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my university lecturers, and especially to my supervisors, Assoc. Prof. Nguyen Phuong Thai, Dr. Nguyen Van Vinh and MSc. Vu Huy Hien. They have been my inspiration, guiding me past many obstacles in completing this thesis. I am grateful to my family. They have encouraged and motivated me and created the best conditions for me to accomplish this thesis. I would also like to thank my brother, Nguyen Minh Thong, and my friends, Tran Minh Luyen and Hoang Cong Tuan Anh, for giving me much useful advice and for supporting my thesis, my studies and my daily life. Finally, I sincerely acknowledge the Vietnam National University, Hanoi, and especially the TC.02-2016-03 project, named “Building a machine translation system to support translation of documents between Vietnamese and Japanese to help managers and businesses in Hanoi approach Japanese market”, for financially supporting my master's study.

To my family

Table of Contents

1 Introduction
2 Literature review
  2.1 Machine Translation
    2.1.1 History
    2.1.2 Approaches
    2.1.3 Evaluation
    2.1.4 Open-Source Machine Translation
      2.1.4.1 Moses - an Open Statistical Machine Translation System
      2.1.4.2 OpenNMT - an Open Neural Machine Translation System
  2.2 Word Embedding
    2.2.1 Monolingual Word Embedding Models
    2.2.2 Cross-Lingual Word Embedding Models
3 Using Cross-Lingual Word Embedding Models for Machine Translation Systems
  3.1 Enhancing the quality of Phrase-table in SMT Using Cross-Lingual Word Embedding
    3.1.1 Recomputing Phrase-table weights
    3.1.2 Generating new phrase pairs
  3.2 Addressing the Unknown Word Problem in NMT Using Cross-Lingual Word Embedding Models
4 Experiments and Results
  4.1 Settings
  4.2 Results
    4.2.1 Word Translation Task
    4.2.2 Impact of Enriching the Phrase-table on SMT system
    4.2.3 Impact of Removing the Unknown Words on NMT system
5 Conclusion

List of Figures

2.1 The CBOW model predicts the current word based on the context, and the Skip-gram model predicts surrounding words based on the current word
2.2 Toy illustration of the cross-lingual embedding model
3.1 Flow of training phase
3.2 Flow of testing phase
3.3 Example in testing phase

List of Tables

3.1 The sample of new phrase pairs generated by using projections of word vector representations
4.1 Monolingual corpora
4.2 Bilingual corpora
4.3 Bilingual dictionaries
4.4 The precision of word translation retrieval top-k nearest neighbors in Vietnamese-English and Japanese-Vietnamese language pairs
4.5 Results on UET and TED dataset in the PBSMT system for Vietnamese-English and Japanese-Vietnamese respectively
4.6 Translation examples of the PBSMT in Vietnamese-English
4.7 Results of removing unknown words on UET and TED dataset in the NMT system for Vietnamese-English and Japanese-Vietnamese respectively
4.8 Translation examples of the NMT system in Vietnamese-English