
Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud

Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang


Abstract
Specializing LLMs for domain-specific tasks has emerged as a critical step towards achieving high performance. However, constructing and annotating datasets in specific domains is costly. Apart from using superior but expensive closed-source LLM APIs to construct datasets, some open-source models have become strong enough to handle dataset construction in many scenarios. Thus, we present a family of data augmentation models designed to significantly improve the efficiency of model fine-tuning. These models, trained on sufficiently small LLMs, support key functionalities at low inference cost: instruction expansion, instruction refinement, and instruction-response pair expansion. To fulfill this goal, we first construct an automatic data collection system with seed datasets generated from both public repositories and our in-house datasets. This system leverages powerful LLMs to expand, refine, and re-write instructions and responses, incorporating quality assessment techniques. Following this, we introduce the training process of our models, which effectively distills task-solving and text synthesis abilities from teacher LLMs. Finally, we demonstrate how we integrate these functionalities into a machine learning platform to support low-cost LLM fine-tuning for users, from both dataset preparation and training perspectives. Experiments and an application study demonstrate the effectiveness of our approach.
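The three augmentation functionalities named in the abstract can be illustrated as prompt-construction steps for a small LLM. The sketch below is purely hypothetical: the function names and prompt wordings are illustrative assumptions, not the authors' actual templates or system.

```python
# Hypothetical sketch of the three augmentation operations described in the
# abstract: instruction expansion, instruction refinement, and
# instruction-response pair expansion. Each function builds a prompt that
# would be sent to a small open-source LLM; all wording is assumed.

def expand_instruction(seed_instruction: str) -> str:
    """Build a prompt asking the model for a new, related instruction."""
    return (
        "Given the seed instruction below, write one new instruction for a "
        "related task with different wording and difficulty.\n\n"
        f"Seed instruction: {seed_instruction}"
    )

def refine_instruction(instruction: str) -> str:
    """Build a prompt asking the model to make an instruction clearer."""
    return (
        "Rewrite the instruction below so it is clearer, more specific, and "
        "self-contained, without changing its intent.\n\n"
        f"Instruction: {instruction}"
    )

def expand_pair(instruction: str, response: str) -> str:
    """Build a prompt asking the model for a new instruction-response pair."""
    return (
        "Given the instruction-response pair below, generate one new pair "
        "for a similar task.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
```

In the actual system these prompts would be answered by the trained augmentation models, with generated outputs filtered by quality assessment before being used for fine-tuning.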
Anthology ID:
2025.coling-industry.37
Volume:
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Kareem Darwish, Apoorv Agarwal
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
431–444
URL:
https://aclanthology.org/2025.coling-industry.37/
Cite (ACL):
Yuanhao Yue, Chengyu Wang, Jun Huang, and Peng Wang. 2025. Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 431–444, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud (Yue et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-industry.37.pdf