A random forest long-term precipitation prediction method combined with multiple hypothesis testing and its application
Abstract:
Long-term precipitation prediction refers to forecasting precipitation more than one month ahead, and it is a crucial part of integrated water resources management. Its accuracy remains low because of various uncertainties. Traditional long-term precipitation prediction methods fall into two main groups: dynamical numerical methods and mathematical statistical methods. Dynamical numerical methods predict precipitation by simulating future weather conditions with sea-land thermodynamic models. This approach has a clear physical mechanism, but the model calculations are complex. Data-driven mathematical statistical methods describe the correlation between precipitation and predictors from a statistical perspective in order to build a long-term prediction model. However, research on statistical precipitation prediction has mainly focused on improving the model, with relatively little attention paid to how the predictors are selected, even though the choice of predictors affects prediction accuracy. The focus and the difficulty of precipitation prediction therefore lie in selecting, from the many relevant factors, the predictors that are actually needed for modeling. Random forest, a flexible, efficient, and easy-to-use machine learning algorithm, has been widely used in hydrological prediction. The random forest method calculates importance scores for the candidate factors, after which the predictors for the model are chosen by experience. Because this choice is subjective, some of the selected predictors may be false discoveries. To control the false discovery rate when random forest is used to select key predictors, this study applies the false discovery rate control method from multiple hypothesis testing as a quality control on predictor selection. Variable selection thereby shifts from being experience-dependent to being data-dependent.
Finally, the random forest algorithm is used to construct a long-term precipitation prediction model from the selected precipitation predictors. Taking the upper basin of the Parana River in Brazil as the study area, precipitation records from 54 rainfall stations and 130 climate system indices were analyzed. The "Model-X Knockoff" method was used to identify the predictors that influence precipitation in the corresponding month of the following year, and a monthly precipitation prediction model was established on these predictors. For comparison, the traditional random forest method directly selects the five predictors with the highest importance scores for modeling. The validity of the proposed method was then verified using 10-fold cross-validation and a test on the monthly precipitation predictions for 2018 to 2020. The 10-fold cross-validation results for the 54 rainfall stations show that the prediction pass rate of the proposed method is higher than that of the traditional random forest method in every month from January to December, with the highest pass rate, 77%, in June. In the 2018 to 2020 test, the proposed method achieved an average pass rate of 66% over January to December, compared with 64% for the traditional random forest method. In summary, this research combines multiple hypothesis testing with predictor selection and quality control to establish a long-term precipitation prediction model that differs from the traditional random forest method. The model shows a higher prediction pass rate and improved stability, suggesting that the approach can serve as an effective tool for long-term precipitation prediction in a basin.
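The selection step of a knockoff filter can be illustrated with a minimal sketch. The abstract does not say how the knockoff statistics are computed, so the sketch assumes a common choice: each statistic W_j is the random forest importance of candidate predictor j minus the importance of its knockoff copy. The function names, the W values, and the target level q = 0.25 are hypothetical and are not taken from the study. The threshold rule is the standard "knockoff+" rule, which may differ from the exact variant the authors used.

```python
def knockoff_plus_threshold(w, q):
    """Return the smallest t > 0 whose estimated false discovery
    proportion (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) is at most q.
    Returns infinity when no such t exists (nothing is selected)."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        # Large negative statistics estimate how many nulls exceed t.
        false_estimate = 1 + sum(1 for x in w if x <= -t)
        n_selected = sum(1 for x in w if x >= t)
        if false_estimate / max(n_selected, 1) <= q:
            return t
    return float("inf")


def select_predictors(w, q):
    """Indices of predictors whose statistic reaches the threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]


# Hypothetical statistics for 10 candidate climate indices:
# importance(original) - importance(knockoff), assumed precomputed.
w_example = [5.0, 4.0, 3.0, 2.5, 2.0, 1.5, -0.5, 0.3, -0.2, 0.1]
selected = select_predictors(w_example, q=0.25)
print(selected)  # [0, 1, 2, 3, 4, 5]
```

Unlike a fixed "top 5" rule, the number of predictors kept here is determined by the data and the chosen level q, which is the sense in which selection becomes data-dependent.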