Mongolian-Chinese Neural Machine Translation
Data Augmentation and Pseudo-Parallel Corpus Construction
These works focus on alleviating data scarcity in Mongolian-Chinese translation through data expansion, back-translation, and semantic mining, constructing pseudo-parallel corpora to improve model performance.
- Improving Mongolian-Chinese Translation Quality Using Noise-Enhanced mBART(Bailun Wang, Yatu Ji, Nier Wu, 2025, Lecture Notes in Computer Science)
- A Semantic Uncertainty Sampling Strategy for Back-Translation in Low-Resources Neural Machine Translation(Yepai Jia, Yatu Ji, Xiang Xue, Lei Shi, Qing-dao-er-ji Ren, Nier Wu, Na Liu, Chen Zhao, Fu Liu, 2025, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop))
- Combining Discrete Lexicon Probabilities with NMT for Low-Resource Mongolian-Chinese Translation(Jinting Li, H. Hou, Jing Wu, Hongbin Wang, Wenting Fan, Zhong Ren, 2017, 2017 18th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT))
- A Mongolian–Chinese Neural Machine Translation Method Based on Semantic-Context Data Augmentation(Huinuan Zhang, Yatu Ji, Nier Wu, Min Lu, 2024, Applied Sciences)
- Research on Mongolian-Chinese Translation Model Based on Transformer with Soft Context Data Augmentation Technique(Qing-dao-er-ji Ren, Yuan Li, Shi Bao, Yong-chao Liu, Xiuli Chen, 2022, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences)
- A Language-Driven Data Augmentation Method for Mongolian-Chinese Neural Machine Translation(Xuerong Wei, Qing-Dao-Er-Ji Ren, 2024, 2024 International Conference on Asian Language Processing (IALP))
- Research on Dynamic Curriculum Learning in Mongolian-Chinese Neural Machine Translation(Chunyue Hu, Qintu Si, Siriguleng Wang, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- Data Augmentation Under Scarce Condition for Neural Machine Translation(Dan Luo, Shumin Shi, Rihai Su, Heyan Huang, 2019, 2019 IEEE 6th International Conference on Cloud Computing and Intelligence Systems (CCIS))
- Research on the Application of BERT in Mongolian-Chinese Neural Machine Translation(Xiu Zhi, Siriguleng Wang, 2021, 2021 13th International Conference on Machine Learning and Computing)
- Constraint-Augmented Mongolian-Chinese Neural Machine Translation Based on Dynamic Feedback Alignment (Student Abstract)(Shuting Dai, Yatu Ji, Yanli Wang, Lei Shi, Qing-dao-er-ji Ren, Nier Wu, Na Liu, 2026, Proceedings of the AAAI Conference on Artificial Intelligence)
Model Architecture Optimization and Pre-processing Techniques
These studies improve translation quality and efficiency through architectural modifications (e.g., DCLA, hybrid encoding, decoder optimization) and pre-processing methods tailored to the agglutinative morphology of Mongolian.
- Research on Mongolian-Chinese machine translation based on the end-to-end neural network(Qing-dao-er-ji Ren, Y. Su, Nier Wu, 2019, International Journal of Wavelets, Multiresolution and Information Processing)
- Lite Mongolian-Chinese Neural Machine Translation: Dynamic Convolution with Long-Range Attention(Chen Zhao, Yatu Ji, Qing-dao-er-ji Ren, Min Lu, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- A Mongolian-Chinese neural machine translation model based on Transformer’s two-branch gating structure(Genmao Zhang, Yonghong Tian, Jia Hao, Junjin Zhang, 2022, 2022 4th International Conference on Intelligent Information Processing (IIP))
- Key Research of Pre-processing on Mongolian-Chinese Neural Machine Translation(Jian Du, H. Hou, Jing Wu, Zhipeng Shen, Jinting Li, Hongbin Wang, 2016, Proceedings of the 2016 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE 2016))
- A Study on Non-Autoregressive Mongolian-Chinese Neural Machine Translation for Multilingual Pre-Training(Xiaoli Zheng, Yonghong Tian, Chang Ma, Kangkang Sun, 2024, 2024 7th International Conference on Machine Learning and Natural Language Processing (MLNLP))
- Research on Traditional Mongolian-Chinese Neural Machine Translation Based on Dependency Syntactic Information and Transformer Model(Qing-dao-er-ji Ren, Kun Cheng, Rui Pang, 2022, Applied Sciences)
- Adapting Attention-Based Neural Network to Low-Resource Mongolian-Chinese Machine Translation(Jing Wu, H. Hou, Zhipeng Shen, Jian Du, Jinting Li, 2016, Lecture Notes in Computer Science)
- Mongolian-Chinese Machine Translation Based on Text Context Information(Junjin Zhang, Yonghong Tian, Zheyu Song, Yufeng Hao, 2023, 2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP))
Transfer Learning and Multilingual Joint Modeling
These works leverage multilingual pre-trained models, cross-lingual knowledge transfer, or joint modeling to improve Mongolian-Chinese translation across language families or with the aid of higher-resource languages.
- An Enhanced Method for Mongolian-Chinese Neural Machine Translation Using Multilingual Datastores and Chinese-Centric Methods(Bailun Wang, Yatu Ji, Nier Wu, Xu Liu, Yanli Wang, Rui Mao, Chao Zhou, Yepai Jia, Chen Zhao, Qing-dao-er-ji Ren, Na Liu, 2024, Lecture Notes in Computer Science)
- Research on Mongolian-Chinese Machine Translation Based on Dual-Learning(Shuo Sun, H. Hou, Nier Wu, Ziyue Guo, 2020, 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA))
- Mongolian-Chinese Neural Machine Translation Based on Sustained Transfer Learning(Bailun Wang, Yatu Ji, Nier Wu, Xu Liu, Yanli Wang, Rui Mao, Shuai Yuan, Qing-dao-er-ji Ren, Na Liu, Xufei Zhuang, Min Lu, 2024, Lecture Notes in Computer Science)
- Multi-Round Transfer Learning for Low-Resource NMT Using Multiple High-Resource Languages(M. Maimaiti, Yang Liu, Huanbo Luan, Maosong Sun, 2019, ACM Transactions on Asian and Low-Resource Language Information Processing)
- Hot-Start Transfer Learning Combined with Approximate Distillation for Mongolian-Chinese Neural Machine Translation(Pengcong Wang, H. Hou, Shuo Sun, Nier Wu, Weichen Jian, Zongheng Yang, Yisong Wang, 2022, Communications in Computer and Information Science)
- Joint Modeling of Chinese Minority Language Translation Tasks(Yifan Guo, Hongying Zan, Hongfei Xu, 2023, 2023 International Conference on Asian Language Processing (IALP))
- Transferring Zero-shot Multilingual Chinese-Chinese Translation Model for Chinese Minority Language Translation(Ziyue Yan, Hongying Zan, Yifan Guo, Hongfei Xu, 2024, 2024 International Conference on Asian Language Processing (IALP))
- How Large Language Models Enhance Low-Resource Mongolian-Chinese Machine Translation?(Zhenjie Gao, Feilong Bao, Yuan Li, Rui Hou, Yibo Han, 2025, Data Intelligence)
Surveys and Frontier Explorations
These works provide holistic reviews and summaries of Mongolian-Chinese translation and related NLP tasks, or propose specific cross-modal and multi-task research paradigms.
- Exploiting Morpheme and Cross-lingual Knowledge to Enhance Mongolian Named Entity Recognition(Songming Zhang, Ying Zhang, Yufeng Chen, Du Wu, Jinan Xu, Jian Liu, 2022, ACM Transactions on Asian and Low-Resource Language Information Processing)
- Multimodal Neural Machine Translation for Mongolian to Chinese(Weichen Jian, H. Hou, Nier Wu, Shuo Sun, Zongheng Yang, Yisong Wang, Pengcong Wang, 2022, 2022 International Joint Conference on Neural Networks (IJCNN))
- Low-resource neural character-based noisy text normalization(Manuel Mager, Mónica Jasso Rosales, Özlem Çetinoğlu, Ivan Vladimir Meza Ruiz, 2019, Journal of Intelligent & Fuzzy Systems)
- SASP-NMT: Syntax-Aware Structured Prompting for Low-Resource Neural Machine Translation(Hao Xing, Nier Wu, Yang Liu, Yatu Ji, Shuo Sun, Min Lu, 2025, Lecture Notes in Computer Science)
- Low-Resource Noisy Transliteration Normalization Using Large-Scale Language Model(Zolzaya Byambadorj, Ulziibayar Sonom-Ochir, Munkhsukh Enkhbayar, Hyun-chul Kim, Altangerel Ayush, 2025, IEEE Access)
- A Review of Mongolian Neural Machine Translation from the Perspective of Training(Yatu Ji, Huinuan Zhang, Nier Wu, Qing-dao-er-ji Ren, Lu Min, Shi Bao, 2024, 2024 International Joint Conference on Neural Networks (IJCNN))
- A Case Study of Mongolian-Chinese Translation on NMT and SMT(Zhipeng Shen, H. Hou, Jing Wu, Jian Du, Wenting Fan, Jinting Li, Feng Wang, Hongbin Wang, 2017, Artificial Intelligence Science and Technology)
- Overview of Research on Low-Resource Language Machine Translation Based on Artificial Intelligence(Haipeng Sun, Tao Tang, Jiancheng Feng, Zukang Yang, Bohan Guo, 2025, 2025 6th International Conference on Computer Engineering and Application (ICCEA))
- Neural Machine Translation for Low-Resource Languages from a Chinese-centric Perspective: A Survey(Jinyi Zhang, Ke Su, Haowei Li, Jiannan Mao, Ye Tian, Feng Wen, Chong Guo, Tadahiro Matsumoto, 2024, ACM Transactions on Asian and Low-Resource Language Information Processing)
- Low-resource Neural Machine Translation: Methods and Trends(Shumin Shi, Xing Wu, Rihai Su, Heyan Huang, 2022, ACM Transactions on Asian and Low-Resource Language Information Processing)
- Improve Mongolian-Chinese translation by Introducing SMT Information into NMT(Wenting Fan, H. Hou, Hongbin Wang, Jinting Li, 2018, DEStech Transactions on Computer Science and Engineering)
Research on Mongolian-Chinese neural machine translation centers on the data scarcity of low-resource settings. The literature mitigates the challenges of limited corpora and complex agglutinative morphology through multi-pronged strategies such as data augmentation, multilingual transfer learning, model architecture optimization, and the integration of pre-trained models, and is gradually evolving toward LLM-assisted translation and cross-modal applications.
A total of 38 related papers.
Neural machine translation (NMT) typically relies on large-scale bilingual parallel corpora for effective training. Mongolian, as a low-resource language, has relatively few parallel corpora, resulting in poor translation performance. Data augmentation (DA) is a practical and promising way to address data sparsity and limited semantic structure by expanding the size and structure of the available data. To address data sparsity and semantic inconsistency in Mongolian–Chinese NMT, this paper proposes a new semantic-context DA method. It adds a semantic encoder to the original translation model, which uses both source and target sentences to generate distinct semantic vectors that enhance each training instance. The results show that this method significantly improves the quality of Mongolian–Chinese NMT, with a gain of approximately 2.5 BLEU points over the basic Transformer model. Compared to the basic model, it achieves the same translation quality with about half the data, greatly improving translation efficiency.
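The abstract does not spell out how the semantic vectors enter the model, so the following is only a minimal PyTorch sketch of the idea: a small Transformer encoder pools a (source, target) pair into one semantic vector, which is then fused back into the source token embeddings. The pooling and fusion choices (mean pooling, a linear `fuse` layer) are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SemanticContextAugmenter(nn.Module):
    """Sketch: derive one sentence-level semantic vector from a
    (source, target) pair and mix it into the source token embeddings,
    yielding an augmented view of the same training instance."""
    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.semantic_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, src_emb, tgt_emb):
        # Encode the concatenated pair and mean-pool to one semantic vector.
        pair = torch.cat([src_emb, tgt_emb], dim=1)          # (B, S+T, D)
        sem = self.semantic_encoder(pair).mean(dim=1)        # (B, D)
        # Broadcast the semantic vector over source tokens and fuse.
        sem = sem.unsqueeze(1).expand_as(src_emb)            # (B, S, D)
        return self.fuse(torch.cat([src_emb, sem], dim=-1))  # (B, S, D)

aug = SemanticContextAugmenter()
src = torch.randn(4, 20, 512)   # toy source embeddings
tgt = torch.randn(4, 18, 512)   # toy target embeddings
print(aug(src, tgt).shape)      # torch.Size([4, 20, 512])
```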
In neural machine translation (NMT), Mongolian-Chinese NMT (MNMT) faces significant challenges due to the limited volume and accessibility of parallel corpus data, which slows development and creates substantial hurdles. A limitation of traditional transfer learning is that knowledge is transferred only once, at the beginning of the child model's training, so the child model may fail to fully assimilate the knowledge of the parent model. This can lead to overfitting when translating …
Aiming at the problem that most current Mongolian-Chinese Neural Machine Translation (NMT) adopts autoregressive generation, which is prone to error accumulation and slow generation, we study non-autoregressive Mongolian-Chinese NMT. Non-autoregressive generation, however, generally suffers from poor translation quality, and its performance depends largely on the quantity and quality of data. We therefore use the multilingual pre-trained model CeMAT to assist the non-autoregressive Mongolian-Chinese NMT task, take CeMAT-based autoregressive Mongolian-Chinese NMT as the teacher model, and adopt multi-level knowledge distillation to train the CeMAT-based non-autoregressive model; finally, we incorporate Graph Convolutional Networks (GCN) to supplement semantic information in the word embedding layer by modeling word-to-word relationships. The experimental results show that the proposed method obtains a BLEU improvement of +9.99 and a 1.94× increase in translation speed compared to the autoregressive model, i.e., it increases translation speed while significantly increasing the quality of the translated text.
… and Chinese. In this paper we train a NMT model with Mongolian-Chinese parallel dataset from … Based on the characteristics of Mongolia, different preprocessing has been done to get a …
Neural machine translation requires training and updating hundreds of millions of parameters on large corpora. Under this constraint, Mongolian Neural Machine Translation (MNMT) needs a variety of targeted training techniques and additional strategies to alleviate the problems caused by resource scarcity; these problems run through the whole translation process. Taking the key steps of model training as a thread, this paper conducts detailed experiments and analysis of the main training content, and summarizes the relevant research and key issues along the mainstream training pipeline 'corpus processing → word embedding training → parameter pre-training → end-to-end model training → analysis of key translation problems'. On this basis, the paper reviews treatment methods and training suggestions for some long-standing stubborn problems, providing references for other researchers.
In this paper, we improved the final accuracy of statistical alignment in phrase-based statistical machine translation (PBSMT) and introduced the improved alignment result into NMT. Moreover, we investigated syntax reordering and data augmentation for improving SMT alignments, which eventually leads to better NMT performance. The experimental results show that applying our approach to Mongolian-Chinese translation yields promising improvements. With annotation alignment, we obtained up to 2.98 BLEU improvement over the NMT baseline on the low-resource Mongolian-Chinese language pair.
Neural machine translation has recently achieved promising results with large-scale corpora, but there is little research on small-scale corpora such as Mongolian. Mongolian is an agglutinative language while Chinese is logographic, so pre-processing of both Mongolian and Chinese is necessary before training the machine translation system. In this paper, we build an attention-based neural machine translation system for the CWMT2009 Mongolian-to-Chinese translation task. We apply four different pre-processing approaches to Mongolian and Chinese: segmenting Chinese into characters, separating Mongolian stems from suffixes, handling the case suffix, and converting Mongolian into Latin script. Extensive experiments evaluate these approaches. We achieve a best BLEU score of 29.56, which is 1.82 BLEU points higher than the baseline trained with the original Mongolian and general Chinese word segmentation.
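As a toy illustration of two of the four pre-processing strategies (character segmentation for Chinese, stem/suffix separation for Latin-transliterated Mongolian), here is a minimal Python sketch. The suffix inventory is hypothetical; a real system would use a morphological analyzer.

```python
# Hypothetical Latin-transliterated Mongolian suffixes, for illustration only.
MONGOLIAN_SUFFIXES = ["iin", "uud", "aas", "tai"]

def segment_chinese_to_chars(sentence: str) -> str:
    """Split a Chinese sentence into space-separated characters."""
    return " ".join(ch for ch in sentence if not ch.isspace())

def split_stem_suffix(word: str) -> str:
    """Separate a (Latin-transliterated) Mongolian suffix from its stem."""
    for suf in MONGOLIAN_SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return f"{word[:-len(suf)]} @@{suf}"  # '@@' marks a split suffix
    return word

print(segment_chinese_to_chars("今天天气好"))                          # 今 天 天 气 好
print(" ".join(split_stem_suffix(w) for w in "nomuud untaa".split()))  # nom @@uud untaa
```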
… Chinese NMT correction model to enhance the translation performance. The experiments show that the adapted Mongolian-Chinese attention-based NMT … over normal NMT baseline on …
Traditional data augmentation methods are typically computation-driven, randomly selecting words for modification with equal probability for each word. However, these methods do not take into account the linguistic information conveyed between words, which can disrupt the grammatical structure of the sentence and reduce text quality. In this paper, a language-driven data augmentation method for Mongolian-Chinese neural machine translation is proposed to address these problems. Specifically, the Stanford CoreNLP is first used to construct the dependency tree of the Chinese sentences. Next, the PageRank algorithm is employed to calculate the importance of each word in the sentence, and standard word-level data augmentation is performed on words with lower importance to generate new sentences. These new sentences maintain the same Mongolian alignment as the original sentences. Finally, the newly generated parallel corpus is combined with the original parallel corpus to form a pseudo-parallel corpus, which aids in the training of the neural machine translation model. The experimental results on the Mongolian-Chinese parallel corpus presented in this paper show that the data augmentation method proposed in this paper has a significant improvement in BLEU values over the baseline translation model.
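The importance-scoring step above is concrete enough to sketch. The following assumes dependency edges have already been produced by a parser (the paper uses Stanford CoreNLP) and uses `networkx.pagerank` for the ranking; the synonym-replacement rule is a simplification of the word-level augmentation described in the abstract.

```python
import random
import networkx as nx

def augment_low_importance(tokens, dep_edges, synonyms, k=2, seed=0):
    """Replace the k least-important words (by PageRank over the
    dependency graph) with synonyms; dependency edges are assumed to
    come from a parser such as Stanford CoreNLP."""
    g = nx.Graph(dep_edges)                 # edges between token indices
    rank = nx.pagerank(g)
    lowest = sorted(rank, key=rank.get)[:k] # least-important positions
    rng = random.Random(seed)
    out = list(tokens)
    for i in lowest:
        if tokens[i] in synonyms:
            out[i] = rng.choice(synonyms[tokens[i]])
    return out

tokens = ["我", "今天", "去", "学校"]
dep_edges = [(2, 0), (2, 1), (2, 3)]        # toy dependency tree rooted at "去"
synonyms = {"今天": ["今日"], "学校": ["校园"]}
print(augment_low_importance(tokens, dep_edges, synonyms))
```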
Neural machine translation (NMT) is a data-driven machine translation approach that has proven its superiority in large corpora, but it still has much room for improvement when the corpus resources are not abundant. This work aims to improve the translation quality of Traditional Mongolian-Chinese (MN-CH). First, the baseline model is constructed based on the Transformer model, and then two different syntax-assisted learning units are added to the encoder and decoder. Finally, the encoder’s ability to learn Traditional Mongolian syntax is implicitly strengthened, and the knowledge of Chinese-dependent syntax is taken as prior knowledge to explicitly guide the decoder to learn Chinese syntax. The average BLEU values measured under two experimental conditions showed that the proposed improved model improved by 6.706 (45.141–38.435) and 5.409 (41.930–36.521) compared with the baseline model. The analysis of the experimental results also revealed that the proposed improved model was still deficient in learning Chinese syntax, and then the Primer-EZ method was introduced to ameliorate this problem, leading to faster convergence and better translation quality. The final improved model had an average BLEU value increase of 9.113 (45.634–36.521) compared with the baseline model at experimental conditions of N = 5 and epochs = 35. The experiments showed that both the proposed model architecture and prior knowledge could effectively lead to an increase in BLEU value, and the addition of syntactic-assisted learning units not only corrected the initial association but also alleviated the long-term dependence between words.
… In order to prove the versatility of the method, we have verified the method proposed in this paper in the Mongolian-Chinese bilingual corpus of 0.2 million mixed fields provided by …
Mongolian-Chinese neural machine translation cannot make full use of context information for document-level translation. To solve this problem, a Mongolian-Chinese neural machine translation model using passage-level context information is proposed. The model makes better use of context for document-level translation by introducing local and global encoders on the encoder side and a caching mechanism in the decoder. Experiments compare the document-level Mongolian-Chinese machine translation model integrating context information with a sentence-level Transformer-based model, and the results verify the advantages of the document-level model in translation performance.
… Consequently, mBART achieves superior performance in multilingual NMT tasks, making it … cover Mongolian, we fine-tune it for the Mongolian to Chinese and Chinese to Mongolian …
The rapid development of Inner Mongolia has led to a growing demand for Mongolian-Chinese translation. However, Mongolian presents significant challenges for machine translation, such as data scarcity and its complex syntactic structures. Consequently, traditional Mongolian-Chinese machine translation methods often produce outputs with poor fluency and suffer from the loss of critical semantic information. In this paper, we propose a Mongolian-Chinese machine translation method based on large language models (LLM-CPT-SymFT). Specifically, LLM-CPT-SymFT involves continual pre-training (CPT) of the base model using Mongolian and Chinese corpora, followed by symmetrical fine-tuning (SymFT), which leverages a mix of original Mongolian-Chinese parallel data and its synthetically reversed counterpart, primarily to enhance Mongolian-to-Chinese translation performance. We evaluated our method on two datasets, achieving an average BLEU score improvement of 28.89 compared to baseline models. Results confirm our method's strong potential for improving Mongolian-Chinese machine translation.
The scarcity of parallel corpora for Mongolian and Chinese constrains the performance of Mongolian-Chinese neural machine translation (NMT), particularly manifesting in inadequate accuracy when translating specialized terminology. To address this limitation, this study adopts a lexically constrained augmentation strategy that constructs pseudo-source sentences by appending Chinese constraint words to Mongolian source texts, while enforcing the inclusion of these constraints in the output to improve translation accuracy. However, this approach has two inherent drawbacks: processing pseudo-sentences with a single encoder tends to induce semantic interference, and the introduced constraint words may exacerbate alignment errors during decoding. To overcome these limitations, this paper proposes a Constraint-Augmented Mongolian-Chinese NMT method (CANMT) based on dynamic feedback alignment. The method employs a dual-encoder architecture to isolate bilingual representations, coupled with a dynamic feedback alignment module that progressively reduces alignment errors through iterative refinement, thereby enhancing overall translation performance.
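A minimal sketch of the pseudo-source construction step, assuming `<sep>`/`<con>` marker tokens (the paper's exact tokenization scheme is not given in the abstract):

```python
def build_pseudo_source(mongolian_tokens, constraint_words,
                        sep="<sep>", con="<con>"):
    """Append Chinese constraint words to a Mongolian source sentence,
    forming the pseudo-source input used for lexically constrained
    augmentation; the <sep>/<con> markers are assumptions."""
    out = list(mongolian_tokens)
    if constraint_words:
        out.append(sep)
        for w in constraint_words:
            out += [con, w]
    return out

src = ["ᠮᠣᠩᠭᠣᠯ", "ᠬᠡᠯᠡ"]                 # toy Mongolian tokens
print(" ".join(build_pseudo_source(src, ["蒙古语", "翻译"])))
```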
To address the performance limitations caused by scarce parallel corpora in low-resource neural machine translation (e.g., Mongolian-Chinese), this paper proposes a neural machine translation method incorporating dynamic curriculum learning. First, based on the agglutinative characteristics of Mongolian, partial word segmentation is applied to preprocess Mongolian text. The original 1.27 million parallel sentence pairs are expanded to 2.54 million via back-translation, mitigating data sparsity. Further, a dynamic curriculum learning framework is proposed. Sample difficulty evaluation is performed through the model’s own translation predictions and uncertainty estimation using the Monte Carlo Dropout method, establishing a dynamic difficulty scoring mechanism. A data selection strategy is then introduced. By analyzing the impact of multiple data selection methods on the curriculum learning model—combined with the Baby Step training scheduling strategy—a dynamic curriculum learning Mongolian-Chinese neural machine translation model integrating data selection is proposed. Experiments on the CCMT2024 dataset show that the proposed method achieves a 2.16-point improvement in BLEU-4 over the baseline model, outperforming curriculum learning methods based on sentence length (1.73 points) and word frequency (1.52 points). Dynamic curriculum learning adaptively adjusts training sample distributions through model self-feedback, optimizes data utilization efficiency, and effectively resolves the mismatch between data distribution and model capabilities in low-resource scenarios. This approach provides a scalable solution for low-resource language machine translation.
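The uncertainty-estimation step can be sketched directly: keep dropout active at inference time and take the variance of per-sentence scores over several stochastic passes. The scoring interface below (`model(batch)` returning one score per sentence) is an assumption; the toy module only illustrates the mechanism.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, batch, n_passes: int = 8):
    """Score sample difficulty as the variance of per-sentence scores
    over stochastic forward passes with dropout left on (MC Dropout)."""
    model.train()                       # .train() keeps dropout active
    scores = torch.stack([model(batch).squeeze(-1) for _ in range(n_passes)])
    model.eval()
    return scores.var(dim=0)            # (B,): higher variance = harder sample

# Toy stand-in for an NMT scorer; the Dropout layer makes outputs stochastic.
toy = torch.nn.Sequential(torch.nn.Dropout(0.3), torch.nn.Linear(16, 1))
difficulty = mc_dropout_uncertainty(toy, torch.randn(5, 16))
print(difficulty)                       # one difficulty score per sample
```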
Sequence-to-sequence neural machine translation (NMT) has achieved great success with many language pairs. However, its performance remains constrained in low-resource settings such as Mongolian–Chinese translation due to its strong reliance on large-scale parallel corpora. To address this issue, we propose ILFDN-Transformer, a Mongolian–Chinese NMT model that integrates implicit language features and a deliberation network to improve translation quality under limited-resource conditions. Specifically, we leverage the BART pre-trained language model to capture deep semantic representations of source sentences and apply knowledge distillation to integrate the resulting implicit linguistic features into the Transformer encoder to provide enhanced semantic support. During decoding, we introduce a deliberation mechanism that guides the generation process by referencing linguistic knowledge encoded in a multilingual pre-trained model, therefore improving the fluency and coherence of target translations. Furthermore, considering the flexible word order characteristics of the Mongolian language, we propose a Mixed Positional Encoding (MPE) method that combines absolute positional encoding with LSTM-based dynamic encoding, enabling the model to better adapt to complex syntactic variations. Experimental results show that ILFDN-Transformer achieves a BLEU score improvement of 3.53 compared to the baseline Transformer model, fully demonstrating the effectiveness of our proposed method.
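A sketch of the Mixed Positional Encoding idea: sinusoidal absolute positions plus a bidirectional LSTM pass that encodes order dynamically. Combining the two by simple addition is an assumption; the paper may weight or gate them differently.

```python
import math
import torch
import torch.nn as nn

class MixedPositionalEncoding(nn.Module):
    """Sketch of MPE: sinusoidal absolute positions plus an LSTM-based
    dynamic encoding, combined here by simple addition (assumption)."""
    def __init__(self, d_model: int = 512, max_len: int = 1024):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.lstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                            bidirectional=True)

    def forward(self, x):                           # x: (B, T, D)
        absolute = self.pe[: x.size(1)]             # (T, D), broadcast over batch
        dynamic, _ = self.lstm(x)                   # (B, T, D) order-aware encoding
        return x + absolute + dynamic

mpe = MixedPositionalEncoding()
print(mpe(torch.randn(2, 10, 512)).shape)           # torch.Size([2, 10, 512])
```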
… -source languages, but performance of NMT for low-resource languages is … Mongolian-Chinese pseudo parallel corpus, so as to improve the translation ability of Mongolian-Chinese …
Machine translation, the automatic transformation of one natural language (the source language) into another (the target language) by computational means, occupies a central role in computational linguistics and stands as a cornerstone of research within Natural Language Processing (NLP). In recent years, the prominence of Neural Machine Translation (NMT) has grown rapidly, offering an advanced framework for machine translation research. It is noted for its superior translation performance, especially when tackling the challenges posed by low-resource language pairs with limited data resources. This article offers an exhaustive exploration of the historical trajectory and advancements of NMT, accompanied by an analysis of the underlying foundational concepts. It then delineates the characteristics of low-resource languages and reviews pertinent translation models and their applications in low-resource settings. Moreover, it delves into machine translation techniques tailored for Chinese-centric low-resource languages, and finally anticipates upcoming research directions in low-resource language translation.
Deep learning dominates machine translation by virtue of its ability to model semantics, especially for high-resource languages. For low-resource languages, however, the lack of a large-scale bilingual corpus leads to model over-fitting. In this paper, for language pairs with few data resources, Round-Trip Translation (RTT) is used to expand the scale of the pseudo-parallel corpus, while a dual learning method trains the model in a semi-supervised manner in two directions, source-to-target and target-to-source, with reward feedback guiding the parameter updates. In addition, to reduce the 'noise' effect of the pseudo-corpus, an iterative attenuation method is proposed to refine the training data. The model is tested on the CWMT2018 Mongolian-Chinese translation task. The results show that the BLEU value of the model is 2.1 points higher than that of the traditional method, demonstrating the effectiveness of the approach.
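A toy sketch of the reward-and-attenuation loop under stated assumptions: a token-overlap F1 stands in for the BLEU-style round-trip reward, and `attenuate` decays the sampling weight of pseudo pairs whose reward stays below a floor; the actual reward and decay schedule in the paper are not reproduced here.

```python
def round_trip_reward(original: str, round_trip: str) -> float:
    """Token-overlap F1 between a sentence and its round-trip translation;
    a simple stand-in for a BLEU-style dual-learning reward."""
    a, b = original.split(), round_trip.split()
    overlap = len(set(a) & set(b))
    if not overlap:
        return 0.0
    p, r = overlap / len(b), overlap / len(a)
    return 2 * p * r / (p + r)

def attenuate(weights, rewards, decay: float = 0.8, floor: float = 0.3):
    """Iterative attenuation sketch: down-weight pseudo pairs whose
    round-trip reward stays low, so noisy pairs fade out of training."""
    return [w * decay if rew < floor else w for w, rew in zip(weights, rewards)]

pairs = [("我 去 学校", "我 去 学校"), ("他 在 家", "天气 很 好")]
rewards = [round_trip_reward(src, rt) for src, rt in pairs]
print(attenuate([1.0, 1.0], rewards))   # the noisy second pair is attenuated
```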
Neural machine translation (NMT) has made remarkable progress in recent years, but its performance suffers from data sparsity since large-scale parallel corpora are only readily available for high-resource languages (HRLs). Recently, transfer learning (TL) has been widely used for low-resource language (LRL) machine translation and has become one of the main directions for addressing data sparsity in low-resource NMT. A transfer learning method in NMT is generally obtained by initializing the low-resource model (child) with the high-resource model (parent). However, the original TL for low-resource models can neither make full use of multiple highly related HRLs nor receive different parameters from the same parents. In order to exploit multiple HRLs effectively, we present a language-independent and straightforward multi-round transfer learning (MRTL) approach to low-resource NMT. Besides, with the intention of reducing the character-level differences between high-resource and low-resource languages, we introduce a unified transliteration method for various language families that are semantically and syntactically highly analogous to each other. Experiments on low-resource datasets show that our approaches are effective, significantly outperform the state-of-the-art methods, and yield improvements of up to 5.63 BLEU points.
Neural Machine Translation (NMT) brings promising improvements in translation quality, but these models rely on large-scale parallel corpora. As such corpora exist only for a handful of language pairs, translation performance is far from the desired level for the majority of low-resource languages. Developing low-resource translation techniques is therefore crucial, and it has become a popular research field in neural machine translation. In this article, we give an overall review of existing deep learning techniques in low-resource NMT. We first show the research status as well as some widely used low-resource datasets. Then, we categorize the existing methods and describe representative works in detail. Finally, we summarize their common characteristics and outline future directions in this field.
This study examines the role of artificial intelligence in low-resource language translation, focusing on preserving endangered languages and cultural heritage. It reviews foundational Neural Machine Translation (NMT) theories, evaluates advancements in key technologies like data augmentation and transfer learning, and assesses their impact on translation quality. Given the scarcity of machine translation methods for low-resource languages, the paper reviews current research in multimodal, paragraph-level, and multilingual machine translation, offering insights for future studies. Case studies are included to demonstrate these technologies' practical applications and challenges. Finally, the paper explores opportunities and challenges NMT models face in low-resource contexts, aiming to enhance translation quality and support cultural heritage preservation.
Neural Machine Translation (NMT) has achieved significant progress for high-resource language pairs but still faces challenges with low-resource pairs like Mongolian-Chinese. Mongolian presents unique grammatical and semantic modeling difficulties as an agglutinative language with rich morphology and SOV word order (Subject-Object-Verb). Taking Convolutional Neural Networks (CNNs) and Transformers as examples: CNNs struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, impose high computational costs. Existing Mongolian-Chinese NMT methods often prioritize translation quality but overlook computational efficiency, limiting their applicability on resource-constrained devices. This paper proposes a lightweight model, Dynamic Convolution with Long-Range Attention (DCLA), which balances translation quality and efficiency. DCLA uses a recognizer module to analyze Mongolian-specific features, such as sentence length, structural complexity, and low-frequency word distribution. It effectively addresses syntactic challenges caused by SOV-SVO word order differences and semantic difficulties posed by polysemous and low-frequency words. DCLA adopts two strategies based on sample complexity: applying convolution for simpler samples to reduce computational cost and multi-head attention for complex samples to enhance modeling. Experiments demonstrate that DCLA outperforms the state-of-the-art Transformer model in Mongolian-Chinese translation, with a 2.7-point BLEU improvement, a 21% reduction in parameters, and a 40% decrease in computational cost.
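A minimal sketch of the routing idea: a recognizer scores sample complexity and dispatches simple samples to a cheap convolution branch and complex ones to multi-head attention. The complexity formula below (normalized length plus rare-token ratio) is a simplified assumption, as is the plain depthwise `Conv1d` standing in for dynamic convolution.

```python
import torch
import torch.nn as nn

class DCLABlock(nn.Module):
    """Sketch of DCLA-style routing: convolution for simple samples,
    multi-head attention for complex ones."""
    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    @staticmethod
    def complexity(token_ids, rare_ids, max_len=30):
        # Simplified recognizer: normalized length + rare-token ratio.
        length = token_ids.size(0) / max_len
        rare = sum(t.item() in rare_ids for t in token_ids) / token_ids.size(0)
        return 0.5 * length + 0.5 * rare

    def forward(self, x, token_ids, rare_ids, threshold=0.5):
        if self.complexity(token_ids, rare_ids) < threshold:
            return self.conv(x.transpose(1, 2)).transpose(1, 2)  # cheap path
        out, _ = self.attn(x, x, x)                              # expressive path
        return out

block = DCLABlock()
x = torch.randn(1, 12, 256)
ids = torch.randint(0, 100, (12,))
print(block(x, ids, rare_ids={97, 98, 99}).shape)   # torch.Size([1, 12, 256])
```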
… In this study, we adopt the official test sets from the CCMT2019 evaluation tasks for Mongolian-Chinese, Uyghur-Chinese, and Tibetan-Chinese machine translation as our benchmark …
Transfer learning is an effective method to improve the performance of low-resource translation, but its effectiveness heavily relies on the specific languages involved, and transferring between similar languages usually leads to better results. Large-scale English (en) parallel data is normally easier to obtain for Chinese (zh) than for the other languages, but when translating Chinese minority languages into Chinese Han, there is a large discrepancy between English and the Chinese minority languages, and transferring from Chinese Han seems to be a better choice than from English. As Chinese Han is also the target language, training a zh→zh translation model is problematic due to the lack of such parallel data. In this paper, we propose to obtain the zh→zh translation model using Multilingual Neural Machine Translation (MNMT) by involving Chinese on both the source and the target sides of different translation tasks, and to use the zero-shot zh→zh translation direction for transfer learning. Our experiments on the CCMT 2023 Chinese minority language translation tasks show that transferring from the zh→zh model leads to significant improvements (+1.49, +4.65, and +0.82 BLEU on the Tibetan (ti), Mongolian (mn), and Uyghur (uy) to Chinese Han tasks, respectively) compared with transferring from the en→zh model.
Neural Machine Translation (NMT) normally requires a large parallel corpus to obtain good performance, which is often unavailable for minority languages. Current methods normally pre-train seq2seq models on monolingual data in a denoising manner and then fine-tune on the parallel data to improve low-resource translation. But minority languages used in adjacent areas may be related to each other, and jointly modeling them may lead to better performance. In this paper, we propose to improve Chinese minority language translation with Multilingual NMT (MNMT). As the tokens of the minority languages are covered by neither Chinese BART nor mBART, and the vocabulary size of the multilingual data exceeds that of the pre-trained model, we map the vocabulary of the minority languages to that of the pre-trained BART according to token frequency, and enlarge the BART vocabulary by repeating low-frequency tokens, respectively, to address these issues. Our experimental results on the CCMT 2023 Chinese minority language translation tasks show that joint modeling improves the Uyghur-to-Chinese and Tibetan-to-Chinese tasks by +2.85 and +1.30 BLEU respectively with BART base, and leads to BLEU scores of 55.48, 53.52, and 48.26 on the Mongolian-to-Chinese, Tibetan-to-Chinese, and Uyghur-to-Chinese tasks respectively with BART large.
… the transfer object. If the words of the subject and object can be correctly corresponding during the transfer, the performance of transfer learning … to adapt the transfer learning of neural …
With the development of natural language processing and neural machine translation, end-to-end (E2E) neural machine translation has gradually become a research focus because of its high translation accuracy and strong semantics, but problems such as limited vocabulary and low translation faithfulness remain. In this paper, a discriminant method and a Conditional Random Field (CRF) model were used to segment and label the stems and affixes of Mongolian in the preprocessing stage of the Mongolian-Chinese bilingual corpus. To address the low-faithfulness problem, a decoding model combining a Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU) was constructed, with the GRU performing target-language decoding. A global attention model was used to obtain bilingual word alignment information during alignment processing. Finally, translation quality was evaluated by Bilingual Evaluation Understudy (BLEU) and Perplexity (PPL) values. The improved model yields a BLEU value of 25.13 and a PPL value of [Formula: see text]. The experimental results show that the E2E Mongolian-Chinese neural machine translation model improves translation quality and reduces semantic confusion compared with traditional statistical methods and machine translation models based on Recurrent Neural Networks (RNN).
Mongolian named entity recognition (NER) is not only one of the most crucial and fundamental tasks in Mongolian natural language processing, but also an important step to improve the performance of downstream tasks such as information retrieval, machine translation, and dialog system. However, traditional Mongolian NER models heavily rely on the feature engineering. Even worse, the complex morphological structure of Mongolian words makes the data sparser. To alleviate the feature engineering and data sparsity in Mongolian named entity recognition, we propose a novel NER framework with Multi-Knowledge Enhancement (MKE-NER). Specifically, we introduce both linguistic knowledge through Mongolian morpheme representation and cross-lingual knowledge from Mongolian-Chinese parallel corpus. Furthermore, we design two methods to exploit cross-lingual knowledge sufficiently, i.e., cross-lingual representation and cross-lingual annotation projection. Experimental results demonstrate the effectiveness of our MKE-NER model, which outperforms strong baselines and achieves the best performance (94.04% F1 score) on the traditional Mongolian benchmark. Particularly, extensive experiments with different data scales highlight the superiority of our method in low-resource scenarios.
Transliteration normalization is a crucial task for low-resource languages, particularly for Mongolian, where noisy text from social media presents significant challenges. The frequent use of non-standard transliteration can contribute to the gradual erosion of linguistic knowledge, particularly among young users, making it harder to maintain proficiency in their native language. Therefore, developing robust methods for normalizing such text is essential. In this paper, we propose a novel approach leveraging large-scale neural models, specifically GPT-2, to normalize noisy transliterated Mongolian text. Our study explores a data-driven approach, including word pairs, sentence pairs, and synthetic data, to enhance model performance. To further improve accuracy, we introduce a post-processing module that integrates Edit Distance-based corrections with a context-aware ranking mechanism using the Mongolian BERT model. Experimental results demonstrate that our approach (M10: 16.42%) improves overall accuracy by approximately 4.91%, while achieving a 10.44% increase in out-of-vocabulary (OOV) word normalization compared to baseline models. Our proposed approach demonstrates effectiveness in normalizing noisy transliterated text under low-resource conditions.
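The edit-distance part of the post-processing module is easy to sketch. Below, corpus frequency stands in for the paper's BERT-based context-aware ranking, and the lexicon is a toy; `normalize` picks the closest in-lexicon candidate, breaking ties by frequency.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize(word, lexicon_freq, max_dist=2):
    """Pick the in-lexicon candidate closest in edit distance; frequency
    breaks ties, standing in for a context-aware BERT ranker."""
    cands = [(edit_distance(word, w), -f, w) for w, f in lexicon_freq.items()]
    dist, _, best = min(cands)
    return best if dist <= max_dist else word

lexicon = {"baina": 120, "bayna": 3, "sain": 80}   # toy transliterated lexicon
print(normalize("bain", lexicon))                  # -> 'baina'
```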
… Improving neural machine translation (NMT) for the Mongolian-Chinese language pair is challenging due to the lack of high-quality parallel data. This study explores various noise …
Multimodal Machine Translation (MMT) aims to enhance translation quality by incorporating information from other modalities (usually images). However, dominant MMT models do not consider that visual features not only provide supplementary information also introduce much noise. In this paper, we propose the visual features filter to solve this issue. Specifically, we adopt a soft-lookup function to select the visual features relevant to the text and then use these visual features as pseudo-words concatenating with a text representation. In addition, our model conducts two-pass decoding. The secondarypass decoding amounts to polishing which can identify errors in draft translations. The reason is that polishing expands the view in the process of decoding each target token, providing more contextual information. Besides, since most words in draft translations can be copied to final translations, we further equip our model with the copying mechanism to reserve those words that do not need to be corrected. MMT has achieved success in some mainstream languages at present. In order to promote the development of MMT in low-resource languages such as Mongolian, we deploy our model to the Mongolian→Chinese translation task. We expand Multi30k dataset to synthetic Mongolian and Chinese descriptions. Experiments on synthetic Mongolian and Chinese datasets demonstrate that our model can bring significant improvements.
Neural Machine Translation (NMT) has achieved state-of-the-art performance given copious parallel corpora, but for low-resource NMT tasks the scarcity of training data inevitably leads to poor translation performance. In order to relieve the dependence on large bilingual corpora and to cut down training time, we propose a novel data augmentation method named SMC for scarce-data conditions, which Samples a Monolingual Corpus containing difficult words only, during the back-translation process, for Mongolian-Chinese (Mn-Ch) and English-Chinese (En-Ch) NMT. Inspired by work in curriculum learning, our approach takes into account the varying difficulty of samples and the corresponding model capability. Experimental results show that our method improves translation quality by up to 2.4 and 1.72 BLEU points over the baselines on the En-Ch and Mn-Ch datasets respectively, while greatly reducing training time.
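A minimal sketch of the sampling criterion, under the assumption that 'difficult words' can be approximated as words unseen in the existing parallel data; the paper's actual difficulty measure also accounts for model capability.

```python
from collections import Counter

def sample_difficult(monolingual, parallel_target_side, min_freq=1):
    """Keep only monolingual sentences containing at least one 'difficult'
    word; here difficulty means frequency below `min_freq` (i.e. unseen)
    in the existing parallel data, a simplification of the paper's
    difficulty-degree criterion."""
    freq = Counter(w for s in parallel_target_side for w in s.split())
    difficult = lambda w: freq[w] < min_freq
    return [s for s in monolingual if any(difficult(w) for w in s.split())]

parallel = ["我 去 学校", "我 在 学校", "天气 很 好"]
mono = ["我 去 学校", "草原 很 辽阔", "天气 很 好"]
print(sample_difficult(mono, parallel))   # only the sentence with unseen words
```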
User generated data in social networks is often not written in its standard form. This kind of text can lead to large dispersion in the datasets and can lead to inconsistent data. Therefore, normalization of such kind of texts is a crucial preprocessing step for common Natural Language Processing tools. In this paper we explore the state-of-the-art of the machine translation approach to normalize text under low-resource conditions. We also propose an auxiliary task for the sequence-to-sequence (seq2seq) neural architecture novel to the text normalization task, that improves the base seq2seq model up to 5%. This increase of performance closes the gap between statistical machine translation approaches and neural ones for low-resource text normalization.
Back-translation has been proven effective in enhancing the performance of Neural Machine Translation (NMT), with its core mechanism relying on synthesizing parallel corpora to strengthen model training. However, while traditional back-translation methods alleviate data scarcity in low-resource machine translation, their dependence on random sampling strategies ignores the semantic quality of monolingual data. This contaminates model training through the inclusion of substantial low-quality samples in the generated corpora. To mitigate noise interference, additional training iterations or model scaling are required, significantly increasing computational costs. To address this challenge, this study proposes a Semantic Uncertainty Sampling strategy, which prioritizes sentences with higher semantic uncertainty as training samples by computationally evaluating the complexity of unannotated monolingual data. Experiments were conducted on three typical low-resource agglutinative language pairs: Mongolian-Chinese, Uyghur-Chinese, and Korean-Chinese. Results demonstrate an average BLEU score improvement of +1.7 on test sets across all three translation tasks, confirming the method's effectiveness in enhancing translation accuracy and fluency. This approach provides a novel pathway for the efficient utilization of unannotated data in low-resource language scenarios.
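One simple way to realize a semantic-uncertainty score is mean per-token predictive entropy under a language model; the sketch below uses that proxy and is not the paper's exact formulation. The `logit_fn` interface (sentence in, per-token logits out) is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def semantic_uncertainty(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token predictive entropy of a sentence, given LM logits
    of shape (T, V); a proxy for 'semantic uncertainty'."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * F.log_softmax(logits, dim=-1)).sum(dim=-1)  # (T,)
    return entropy.mean()

def select_for_back_translation(sentences, logit_fn, top_k=2):
    """Prioritize the most uncertain monolingual sentences."""
    return sorted(sentences,
                  key=lambda s: -semantic_uncertainty(logit_fn(s)).item())[:top_k]

# Toy logit function: random logits whose peakedness varies with length.
logit_fn = lambda s: torch.randn(len(s.split()), 1000) * (1 + len(s) % 3)
print(select_for_back_translation(["a b c", "d e", "f g h i"], logit_fn))
```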
In recent years, research on neural networks has brought new solutions to machine translation, and the application of sequence-to-sequence models has made a qualitative leap in machine translation performance. Training a neural machine translation model depends on a large-scale bilingual parallel corpus, and the size of the corpus directly affects performance. In this paper, the BERT (Bidirectional Encoder Representations from Transformers) model is used to compute semantic similarity for extending the training corpus. The scores of two sentences are calculated using dot product and cosine similarity, and sentence pairs with high scores are added to the training corpus, which reaches a scale of 540,000 sentence pairs. Finally, a Transformer is used to train the Mongolian-Chinese neural machine translation system, which is 0.91 BLEU points higher than the baseline experiment.
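A sketch of the similarity-filtering step, assuming a multilingual BERT checkpoint and mean pooling (the abstract specifies neither); candidate pairs are kept only if the cosine similarity of their embeddings exceeds a threshold.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint choice and mean pooling are assumptions; downloads on first use.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

@torch.no_grad()
def embed(sentence: str) -> torch.Tensor:
    enc = tok(sentence, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state      # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)        # mean-pooled (768,) vector

def keep_pair(mn: str, zh: str, threshold: float = 0.8) -> bool:
    """Retain a candidate sentence pair for corpus expansion only if the
    cosine similarity of its two embeddings is high enough."""
    sim = torch.cosine_similarity(embed(mn), embed(zh), dim=0)
    return sim.item() >= threshold

print(keep_pair("ᠮᠣᠩᠭᠣᠯ ᠬᠡᠯᠡ", "蒙古语"))
```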
We design a new two-branch gating structure for the Transformer: the attention mechanism is split into two parts, one using attention to capture global information and the other using dynamic convolution to capture local information, and the two branches are fused through a gating mechanism that replaces the attention mechanism and feed-forward network. This gives the model fewer parameters and a stronger ability to capture information. The experimental results show that our BLEU-4 value improves by 3.07 compared to the Transformer structure.
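A minimal PyTorch sketch of such a block: an attention branch for global context, a convolution branch for local context (a depthwise `Conv1d` standing in for dynamic convolution), fused by a learned sigmoid gate.

```python
import torch
import torch.nn as nn

class TwoBranchGate(nn.Module):
    """Sketch of a two-branch gated block: attention (global) and
    convolution (local) branches fused by a sigmoid gate."""
    def __init__(self, d_model: int = 256, nhead: int = 4, k: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, k, padding=k // 2,
                              groups=d_model)   # depthwise stand-in
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                                        # x: (B, T, D)
        g_branch, _ = self.attn(x, x, x)                         # global branch
        l_branch = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        g = torch.sigmoid(self.gate(torch.cat([g_branch, l_branch], dim=-1)))
        return g * g_branch + (1 - g) * l_branch                 # gated fusion

block = TwoBranchGate()
print(block(torch.randn(2, 15, 256)).shape)    # torch.Size([2, 15, 256])
```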