Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

Published in NAACL, 2024

This research addresses the limited capabilities of existing large language models in processing Vietnamese, a language spoken by over 100 million people. We developed and released five open-source Vietnamese language models by adapting well-known models such as LLaMa-2, Mixtral, and Gemma through specialized training on Vietnamese text from Wikipedia, news articles, and student essays. To assess these models properly, we created a comprehensive evaluation framework covering ten practical use cases, ranging from question answering and summarization to toxicity detection and logical reasoning. Our evaluation of 14 models revealed several important insights: training data quality matters more than model size alone, and while larger models can be more powerful, they can also exhibit more biases and be harder to control. We also found that models transfer knowledge effectively across languages, making targeted training an efficient approach for improving performance in lower-resource languages. All of our models and evaluation tools are publicly available to support further research and development in Vietnamese language technology.
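
As a rough illustration of the adaptation step described above, the sketch below shows how one might continue causal language-model training of an open base model on a Vietnamese text corpus using Hugging Face Transformers. The base checkpoint name, data file, and hyperparameters here are illustrative assumptions, not the exact recipe, data mixture, or scale used in the paper.

```python
# Minimal sketch: continued causal-LM training of an open base model on
# Vietnamese text with Hugging Face Transformers. Checkpoint name, data file,
# and hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base; Mixtral or Gemma would follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Assume a plain-text Vietnamese corpus (e.g., Wikipedia dumps) in vi_corpus.txt.
raw = load_dataset("text", data_files={"train": "vi_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="vi-llama2-adapted",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("vi-llama2-adapted")
```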

Press & Recognition: Featured in The New York Times and Stanford HAI.