Optimizing Data Preprocessing and Hyperparameter Tuning for Soil Organic Carbon Content Prediction Using Large Language Models: A Case Study of the Black Soil and Windblown Sandy Soil Regions in Northeast China

Cui, Hao, Chang, Xianmin and Gang, Shuang (2026) Optimizing Data Preprocessing and Hyperparameter Tuning for Soil Organic Carbon Content Prediction Using Large Language Models: A Case Study of the Black Soil and Windblown Sandy Soil Regions in Northeast China. Applied Sciences, 16 (7). pp. 1-22.

X Chang with SYU Cuihao Shuang Gang Black Soil.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract

Soil organic carbon (SOC) content prediction currently faces two problems. Data preprocessing relies on fixed rules formulated from expert experience, so it lacks uniform standards and gives insufficient consideration to regional soil heterogeneity. Hyperparameter tuning, meanwhile, incurs high computational costs and excessively long runtimes. To address both, this study proposes an intelligent modeling workflow driven by large language models (LLMs) that optimizes two key stages of Random Forest modeling of SOC: data preprocessing and hyperparameter tuning. The results show the following. First, the LLM-defined rules achieved sample retention rates of 55.33% and 61.90% in the two regions, respectively, a larger between-region difference than under traditional hard-coded rules (56.2% and 59.3%), and the deviations in mean SOC content (30.27% and 20.05%) were both lower than those of traditional hard-coding. The mean SOC values in both regions were also close to those obtained with other methods, indicating that the LLM effectively captured regional soil differences. Second, with only a single hyperparameter evaluation, the adaptive model achieved test-set R² values of 0.394 and 0.694 in the black soil region and the windblown sandy soil region, respectively, with root mean square error (RMSE) values of 8.76 g/kg and 6.07 g/kg. This performance is comparable to that of Grid Search and Random Search, while computational efficiency improved by over 95%. Third, in performance comparisons with XGBoost and Partial Least Squares Regression (PLSR), the LLM-optimized Random Forest achieved R² = 0.394 and RMSE = 8.76 g/kg in the black soil region, and R² = 0.694 and RMSE = 6.07 g/kg in the windblown sandy soil region, demonstrating practical application value.
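The quantities reported in the abstract (sample retention rate, R², RMSE, and the efficiency gain from a single evaluation) can be illustrated with a minimal sketch. The function names and the example numbers below are hypothetical and are not taken from the paper. In particular, the efficiency function assumes that cost is proportional to the number of model fits, whereas the paper may measure runtime instead.

```python
import math


def retention_rate(n_before, n_after):
    """Percentage of samples kept after preprocessing rules are applied."""
    return 100.0 * n_after / n_before


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the units of y (g/kg for SOC content)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def evaluation_savings(n_search_evals, n_single_evals=1):
    """Percentage of model fits avoided relative to an exhaustive search.

    Assumption: every fit costs about the same, so cost scales with count.
    """
    return 100.0 * (1.0 - n_single_evals / n_search_evals)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    print(retention_rate(200, 120))
    print(evaluation_savings(25))
```

Under this counting assumption, a search over more than 20 candidate configurations replaced by a single fit avoids more than 95% of the fits.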

Item Type: Article
Keywords: soil data preprocessing, LLM, hyperparameter tuning, random forest, regional adaptability
Divisions: International
Depositing User: Professor Xianmin Chang
Date Deposited: 31 Mar 2026 14:15
Last Modified: 31 Mar 2026 14:15
URI: https://rau.repository.guildhe.ac.uk/id/eprint/17090
