Abstract
Sensor networks enable the collection of high-frequency, large water quality datasets that provide valuable information for managing eutrophication, such as chlorophyll a (Chl-a) concentration. Deep learning models have been successfully applied to derive useful insights from large-scale environmental data. However, sensor data often contain missing values, presenting challenges for applying deep learning models. Therefore, we employed the reverse time attention model with a decay mechanism (RETAIN-D) to simultaneously conduct feature engineering, prediction, and interpretation within a single model structure. Various environmental, hydrological, and meteorological variables were utilized as input features to predict the exceedance of Chl-a criteria. Data were collected from 2018 to 2022 at four monitoring sites along the Geum River, South Korea. RETAIN-D demonstrated strong prediction performance (accuracy = 0.84–0.90, AUC = 0.69–0.91, F-measure = 0.89–0.90 on the test set) across varying Chl-a criteria. Environmental variables were more important than hydrological and meteorological variables for predicting the exceedance of Chl-a criteria. The contribution of input features to the model prediction was generally higher in more recent time steps when the Chl-a criterion of the target site was applied. These results highlight the effectiveness of RETAIN-D in analyzing high-frequency time series data from sensor networks.
1 Introduction
Eutrophication is a natural aging process that facilitates the ecological succession of freshwater ecosystems where nutrients accumulate in a water body [1]. However, human-induced eutrophication owing to the increased input of nutrients from urban, industrial, and agricultural watersheds [2] has led to excessive algal growth and promoted anoxia leading to massive fish kills, water quality degradation, and the loss of habitat for various organisms [3, 4]. Consequently, eutrophication has become an important issue for water resources management [5, 6]. Eutrophication is predominantly associated with lentic systems where nutrients accumulate more easily [7]. However, the construction of weirs and dams has disrupted the natural flow regime of rivers, which has created a more stagnant water environment that is susceptible to eutrophication [8, 9]. Consequently, rivers are increasingly being threatened by eutrophication and its associated environmental challenges [3].
Water quality monitoring data have been used for analyzing the trophic status of water bodies. Furthermore, predictive models based on accumulated monitoring data are essential for taking proactive measures and mitigating the detrimental impacts of eutrophication. Du et al. [10] used monthly monitoring data including total nitrogen (T-N), total phosphorus (T-P), and chlorophyll a (Chl-a) concentrations with numerical simulations to evaluate the trophic status of freshwater lakes. Kim and Ahn [11] compared various machine learning algorithms to predict the Chl-a concentration, a widely used indicator of eutrophication, using daily monitoring data from the Han River basin, including T-N, T-P, water temperature, and precipitation.
However, hydrologic and water quality processes can fluctuate rapidly within timescales of minutes to hours, which can make it difficult to capture dynamic patterns using low-frequency monitoring systems that take samples at timescales of days, weeks, and even months [12, 13]. Moreover, low monitoring frequencies can cause environmental risks to be underestimated because the monitoring system may fail to capture high concentrations of pollutants [14, 15]. Sensor networks allow for automated monitoring of water quality data with a high temporal resolution for accurate detection of any problems [16]. Hence, many researchers and organizations are starting to use sensor networks for high-frequency monitoring of aquatic environments [17]. The high temporal resolution of a sensor network allows for a more accurate analysis of temporal variations in water quality by detecting sudden and rapid fluctuations in the Chl-a concentration, which offers valuable insights for effective water resources management [18, 19].
Deep learning (DL) models have been successfully applied to analyze large-scale multivariate data [20]. In particular, recurrent neural networks (RNNs) such as long short-term memory (LSTM) and the gated recurrent unit (GRU) have been increasingly applied to predict temporal variations in the Chl-a concentration [21–23]. Recently, attention mechanisms have been incorporated into DL models to concentrate on important segments of input data when making predictions, thereby addressing the vanishing gradient problem associated with time series data [24–26]. However, the feature-level interpretation provided by attention-based DL models has limited applicability for the analysis of rapid variations in the Chl-a concentration of water bodies. The reverse time attention mechanism (RETAIN) overcomes these limitations by using two parallel attention-based RNNs at the feature and time levels, thus weighing important features and time steps of multivariate time series data during the prediction process [27].
However, RETAIN has difficulty handling missing values, and environmental data from a sensor network often contain a number of missing values owing to unstable communication or errors between devices [28]. The reverse time attention model with a decay mechanism (RETAIN-D) was developed for automated feature engineering, prediction, and interpretation at both the feature and time levels within a single model structure. In previous studies, RETAIN-D was successfully applied to impute missing values in environmental data and showed high accuracy and explainability in predicting harmful algal blooms [29]. However, RETAIN-D has rarely been applied to environmental datasets with a high temporal resolution such as an hourly scale.
In this study, RETAIN-D was trained on a high-frequency environmental dataset and was applied to predict the exceedance of Chl-a criteria along Geum River, South Korea. Implementing a stricter Chl-a criterion increases the samples in exceedance and decreases the samples in compliance, which increases the class imbalance ratio (IR) within the dataset. Class imbalance is known to impede the learning process for the minority class, which ultimately degrades the overall prediction performance. Thus, the objectives of this study were (1) to assess the applicability of RETAIN-D to analyze high-frequency water quality data with a high rate of missing values; (2) to examine the impact of class imbalance on the model performance; and (3) to identify time- and space-variant contributions of environmental variables to the prediction performance.
2 Experimental Section
2.1. Study Area and Data Description
Geum River is one of the five major rivers of South Korea, and it has a length of 397 km and a watershed area of 9900 km². Geum River occupies a large basin in the center of South Korea, and it has several large tributaries including Miho Stream and Gap Stream. Sejong Weir is located in the lower reaches of Geum River, and it was constructed as part of the Four Major Rivers Restoration Project in 2009–2012. Daecheong Lake is located in the upper reaches of Geum River, and it was created by the construction of Daecheong Dam for domestic, industrial, and agricultural water supply and flood control. The study area comprised four monitoring sites along the section of Geum River between Sejong Weir and Daecheong Lake: Hyeondo (HD) and Nam-myeon (NM) on the mainstream and Gapcheon (GC) and Mihogang (MH) on tributaries (Fig. 1).
The dataset was obtained from multiple data sources (Table 1). Environmental, hydrological, and meteorological variables associated with Chl-a were included as input features. The environmental variables included the water temperature (°C), T-N concentration (mg/L), T-P concentration (mg/L), and Chl-a concentration (mg/m³) obtained from the automated monitoring network operated by the National Institute of Environmental Research. Only the water temperature and Chl-a were collected at sites MH and GC because T-N and T-P were not measured there. The water level (m) at site NM was set as the hydrological variable and was provided by the Geum River Flood Control Office. The precipitation (mm) was set as the meteorological variable and was obtained from the nearest Automated Surface Observing System station at Daejeon (36°22′19.16″ N, 127°22′19.56″ E). All data were collected from January 1, 2018, to December 31, 2022, on an hourly scale except for precipitation, which was collected every 3 hours during the winter (i.e., November–March).
2.2. Model Development
RETAIN employs RNN-based attention layers operating in parallel to extract and retain information from both the temporal and feature levels of the input time series separately [27]. In the initial step, RETAIN performs a linear embedding of the input time series, which is denoted as the modeled complete time series ($\hat{x}_t$). Subsequently, these embeddings ($v_t$) are used in the two attention layers to generate separate attention weights ($\alpha_t$ and $\beta_t$) for the temporal and feature levels. Notably, the embeddings are fed to each attention mechanism in reverse time order, which enhances computational stability during the generation of attention weights and improves the time-specific sensitivity of these weights for prediction. The context vector ($CTX_t$) is computed by integrating the generated attention weights and embeddings, and the sum of context vectors across time steps is sent to a fully connected (FC) layer to output the prediction. The contribution ($\omega$) of each input feature ($\hat{x}_{t,k}$) at each time step to the corresponding prediction value can be determined as [27]:

$$\omega(\hat{y}_n, \hat{x}_{t,k}) = \alpha_t \, W \left( \beta_t \odot W_{emb}[:, k] \right) \hat{x}_{t,k}$$

where $\hat{y}_n$ represents the corresponding prediction for $\hat{x}_{t,k}$, $W$ represents the aggregated weights of the FC layer, $W_{emb}$ represents the embedding matrix, $W_{emb}[:, k]$ is its $k$-th column, and $\odot$ denotes element-wise multiplication.
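The contribution calculation can be illustrated with a small numerical sketch. It follows the RETAIN formulation in [27], in which the contribution of feature k at time step t is α_t · W(β_t ⊙ W_emb[:, k]) · x_{t,k} and the context vector is the sum over time steps of α_t (β_t ⊙ v_t). All numbers, the function names (`contributions`, `logit`), and the fixed attention weights are hypothetical; in the actual model the attention weights come from two RNNs run in reverse time order, which is not reproduced here.

```python
import math

def softmax(scores):
    """Normalize raw scores so that they sum to 1 (time-level attention)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def embed(w_emb, x_t):
    """v_t = W_emb x_t, with W_emb of shape (m, r) and x_t of length r."""
    return [sum(w * x for w, x in zip(row, x_t)) for row in w_emb]

def contributions(alpha, beta, w_out, w_emb, x):
    """omega[t][k] = alpha_t * sum_j W[j] * beta_t[j] * W_emb[j][k] * x[t][k]."""
    m = len(w_out)
    omega = []
    for t, x_t in enumerate(x):
        row = []
        for k, x_tk in enumerate(x_t):
            proj = sum(w_out[j] * beta[t][j] * w_emb[j][k] for j in range(m))
            row.append(alpha[t] * proj * x_tk)
        omega.append(row)
    return omega

def logit(alpha, beta, w_out, w_emb, x):
    """W . c with c = sum_t alpha_t * (beta_t * v_t); the bias term is omitted."""
    m = len(w_out)
    ctx = [0.0] * m
    for t, x_t in enumerate(x):
        v_t = embed(w_emb, x_t)
        for j in range(m):
            ctx[j] += alpha[t] * beta[t][j] * v_t[j]
    return sum(w_out[j] * ctx[j] for j in range(m))

# Hypothetical toy input: 3 time steps, 2 features, embedding size 2.
x = [[0.2, 0.5], [0.4, 0.1], [0.9, 0.3]]
w_emb = [[0.5, -0.2], [0.1, 0.8]]
w_out = [1.0, -0.5]
alpha = softmax([0.1, 0.4, 1.2])                 # time-level weights
beta = [[math.tanh(0.3), math.tanh(-0.1)],
        [math.tanh(0.7), math.tanh(0.2)],
        [math.tanh(1.0), math.tanh(0.5)]]        # feature-level weights

omega = contributions(alpha, beta, w_out, w_emb, x)
total_contribution = sum(sum(row) for row in omega)
model_logit = logit(alpha, beta, w_out, w_emb, x)
```

Because every step between the input and the FC layer is linear once the attention weights are fixed, the contributions sum to the pre-activation output (excluding the bias). This additivity is what allows a prediction to be decomposed by feature and by time step.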
RETAIN-D incorporates a decay mechanism into the RETAIN architecture to facilitate the modeling of missing values and the interpretation of modeling outcomes (Figs. 2 and S1). RETAIN-D utilizes complete hourly time series data generated by the decay mechanism to train RETAIN for hourly-scale prediction and time-specific feature importance analysis. The trainable decay mechanism for feature engineering allows unmeasured values to be imputed by capturing missing patterns from multivariate time series data [30]. Consequently, the integration of trainable decays into deep learning models enables hybrid models to automatically handle sparse and irregular time series data, further enhancing temporal resolution in line with the most frequently monitored input features. The decay mechanism in RETAIN-D has been described in detail in a previous paper [29]. RETAIN-D was developed and applied by using the PyTorch [31] and scikit-learn [32] libraries in Python 3.11.7 [33].
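As an illustration of how a decay-based imputation behaves, the sketch below follows the input-decay idea of the recurrent model with trainable decays in [30]: a missing value is replaced by a mix of the last observation and the feature mean, and the weight on the last observation shrinks as the gap grows. It is a simplified, single-feature sketch and not the RETAIN-D implementation described in [29]. The decay parameters `w` and `b` are trainable in the real model but are fixed, hypothetical constants here, and the function names are illustrative.

```python
import math

def time_since_last_obs(mask, unit=1.0):
    """Elapsed time since the most recent observed value, per time step."""
    delta = [0.0]
    for t in range(1, len(mask)):
        if mask[t - 1]:
            delta.append(unit)              # previous step was observed
        else:
            delta.append(delta[-1] + unit)  # gap keeps growing
    return delta

def decay_impute(values, mask, mean, w=0.5, b=0.0):
    """Fill missing entries with gamma * last_obs + (1 - gamma) * mean."""
    delta = time_since_last_obs(mask)
    last = mean  # fallback used until the first observation arrives
    filled = []
    for x, m, d in zip(values, mask, delta):
        if m:
            last = x
            filled.append(x)
        else:
            gamma = math.exp(-max(0.0, w * d + b))  # decay rate in (0, 1]
            filled.append(gamma * last + (1.0 - gamma) * mean)
    return filled

# Hypothetical hourly Chl-a series (mg/m3) with a three-hour gap.
values = [4.0, None, None, None, 10.0]
mask = [1, 0, 0, 0, 1]
filled = decay_impute(values, mask, mean=6.0)
```

In this toy series the imputed values drift from the last observation (4.0) toward the mean (6.0) as the gap lengthens, while observed values pass through unchanged.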
2.3. Model Implementation
A high-frequency environmental dataset was used to analyze the differing effects of input features at sites HD, MH, GC, and NM on the predicted exceedance of Chl-a criteria at site NM. RETAIN-D was used for simultaneous feature engineering, prediction, and model interpretation. The Chl-a concentration at site NM was classified as in compliance or exceedance of the Korean stream water quality criteria for Class II (“slightly good water” grade) and Class Ia (“very good water” grade). The Class II criterion of 14 mg/m³ corresponded to the target water quality of site NM. The Class Ia criterion of 5 mg/m³ corresponded to the target water quality of site HD, which had the strictest criterion among the monitoring sites. Thirteen input features including the water temperature, Chl-a, T-N, T-P, water level, and precipitation were used to predict whether the Chl-a concentration at site NM would exceed the Class II and Class Ia criteria. Separate models were developed for each criterion.
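The binary targets for the two models can be derived from the measured Chl-a concentration as sketched below. The thresholds come from the text (14 and 5 mg/m³, with compliance defined as Chl-a below the criterion); the function name, the constant names, and the sample values are illustrative assumptions.

```python
CLASS_II_CRITERION = 14.0  # mg/m3, target water quality of site NM
CLASS_IA_CRITERION = 5.0   # mg/m3, target water quality of site HD

def label_exceedance(chla_values, criterion):
    """Return 1 for exceedance (Chl-a >= criterion) and 0 for compliance."""
    return [1 if value >= criterion else 0 for value in chla_values]

# Illustrative concentrations (mg/m3), not actual records.
sample = [3.3, 11.7, 15.8]
labels_class_ii = label_exceedance(sample, CLASS_II_CRITERION)
labels_class_ia = label_exceedance(sample, CLASS_IA_CRITERION)
```

The same concentration can fall in different classes under the two criteria, which is why a separate model is trained for each.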
The dataset, consisting of 29,712 samples, was divided into training (80%, January 1, 2018–December 31, 2021) and test (20%, January 1, 2022–December 31, 2022) periods. The natural logarithm transformation was applied to all input features except for the water temperature. Min-max normalization was applied to scale each input feature to the range of 0 to 1 to avoid the effects of differing scales among the features on the training process. The unit time and time step of RETAIN-D were set to 1 and 15 hours, respectively, to reflect the travel time from the upstream and tributary sites to the downstream sites. Binary cross-entropy was used as the loss function to train RETAIN-D, where the number of epochs, weight decay, and epsilon were set to 1000, 10⁻⁶, and 10⁻⁵, respectively. Hyperparameters (hidden dimension size and learning rate) of RETAIN-D for each class were optimized by grid search. The objective function for hyperparameter tuning was the F-measure from four-fold cross validation on the training set. During the cross validation, each year of the training set was used in turn as the validation set. Consequently, the model targeting the Class II criterion had a hidden dimension of 32 and learning rate of 0.0001 while the model targeting the Class Ia criterion had a hidden dimension of 64 and learning rate of 0.001.
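The preprocessing and validation scheme described above can be sketched as follows. Several details are assumptions rather than reported choices: the `+1` offset inside the logarithm (to keep zero values such as dry hours finite), fitting the min-max range on the training values only, and all function names. Daily dates are used instead of hourly timestamps only to keep the example small.

```python
import math
from datetime import date, timedelta

def log_transform(values, offset=1.0):
    """Natural log; the offset is an assumption that keeps zeros finite."""
    return [math.log(v + offset) for v in values]

def fit_min_max(train_values):
    """Return the (min, max) of the training values."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Scale values relative to the fitted range, giving [0, 1] on the fit data."""
    span = (hi - lo) or 1.0  # guard against a constant feature
    return [(v - lo) / span for v in values]

def sliding_windows(rows, n_steps=15):
    """Return every run of n_steps consecutive rows (one model input each)."""
    return [rows[i:i + n_steps] for i in range(len(rows) - n_steps + 1)]

def yearly_folds(days):
    """Yield (held_out_year, train_idx, val_idx), one fold per calendar year."""
    for year in sorted({d.year for d in days}):
        train_idx = [i for i, d in enumerate(days) if d.year != year]
        val_idx = [i for i, d in enumerate(days) if d.year == year]
        yield year, train_idx, val_idx

# Illustrative precipitation values (mm).
precip = [0.0, 1.5, 12.0]
logged = log_transform(precip)
lo, hi = fit_min_max(logged)
scaled = apply_min_max(logged, lo, hi)

# 20 consecutive hourly rows give 6 windows of 15 time steps.
windows = sliding_windows(list(range(20)), n_steps=15)

# Training period 2018-01-01 to 2021-12-31 gives four year-wise folds.
start = date(2018, 1, 1)
days = [start + timedelta(days=i) for i in range(1461)]
folds = list(yearly_folds(days))
```

Holding out whole years keeps each validation fold contiguous in time, so a full seasonal cycle is evaluated in every fold.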
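2.4. Performance Evaluation
The prediction results were divided into four categories depending on whether exceedance or compliance of Chl-a criteria was correctly predicted: 1) a true positive (TP) when exceedance was correctly predicted, 2) a true negative (TN) when compliance was correctly predicted, 3) a false positive (FP) when a sample in compliance was incorrectly predicted as exceedance, and 4) a false negative (FN) when a sample in exceedance was incorrectly predicted as compliance. In the presence of class imbalance, the prediction performance can be biased by the performance for the majority class, regardless of the performance for the minority class. Therefore, the prediction performance was evaluated based on three different metrics: the accuracy, area under the receiver operating characteristic curve (AUC), and F-measure. These evaluation metrics can be calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The AUC is the area under the curve of the true positive rate, TP/(TP + FN), plotted against the false positive rate, FP/(FP + TN), over all classification thresholds.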
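The three metrics can be computed from labels and predicted scores as in the minimal sketch below, which uses the standard definitions with exceedance as the positive class. The AUC is computed through its pairwise interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half). The labels and scores are hypothetical, and the 0.5 decision threshold is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with 1 = exceedance and 0 = compliance."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; TN does not appear."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted exceedance probabilities.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
f1 = f_measure(tp, fp, fn)
area = auc(y_true, scores)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable for datasets of the size used in this study.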
3 Results and Discussion
3.1. Missing Data and Chlorophyll a Trends
Unlike conventional RNN-based models, the automatic feature engineering of RETAIN-D is particularly effective in addressing variations in missing data proportions across different monitoring sites. By incorporating decay mechanisms into the model architecture, RETAIN-D can directly deal with irregular time series data with missing values without requiring manual temporal aggregation, matching, and interpolation [29]. In this study, the rates of missing data varied substantially across sites and variables: 0.01%–34.39% at site NM, 28.45%–28.77% at site MH, 10.38%–12.97% at site GC, and 7.15%–17.81% at site HD (Table S1). However, the precipitation data had no missing values. Large fluctuations were observed in the Chl-a concentrations across the monitoring sites (Fig. 3). In general, the Chl-a concentration was highest at site MH (median = 15.8 mg/m³), followed by sites NM (median = 11.7 mg/m³), GC (median = 10.5 mg/m³), and HD (median = 3.3 mg/m³) (Fig. 3 and Table S2). The high Chl-a concentrations at site MH may be attributed to the influence of pollution sources in the Miho River basin, which has recently experienced water quality issues [34]. The maximum Chl-a concentration was also lowest at site HD compared to the other sites (Table S2). Site HD is a representative monitoring site within the Daecheong Dam sub-basin where the water quality is rigorously managed to protect the source water. Hence, the strict water quality criteria for the Daecheong Dam sub-basin may account for the low Chl-a concentration observed at site HD.
The Chl-a concentration demonstrated a distinct seasonality at the monitoring sites except at site HD (Fig. 3 and Table S2). In general, the Chl-a concentration was higher during the summer (median = 17.3 mg/m³) than during other seasons (median = 6.3 mg/m³). This seasonal difference was particularly evident at site MH, where the median Chl-a concentration was 54.5 mg/m³ in the summer and 9.2 mg/m³ in the other seasons. In contrast, site HD showed no major seasonal differences with a median Chl-a concentration of 2.5 mg/m³ in the summer and 3.7 mg/m³ in the other seasons. As the Chl-a criteria became more rigorous, the number of samples in exceedance increased, which increased the imbalance ratio of the dataset. With the Class II criterion (Chl-a < 14 mg/m³), 16,803 samples were in compliance and 12,909 samples were in exceedance for an IR of 1.30. However, with the stricter Class Ia criterion (Chl-a < 5 mg/m³), IR showed more than a threefold increase to 4.37, with 5,532 samples in compliance and 24,180 samples in exceedance.
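The reported imbalance ratios can be reproduced from the class counts given above, taking the IR as the number of majority-class samples divided by the number of minority-class samples (a definition inferred from the reported values). The function name is illustrative.

```python
def imbalance_ratio(n_compliance, n_exceedance):
    """Majority-class count divided by minority-class count."""
    majority = max(n_compliance, n_exceedance)
    minority = min(n_compliance, n_exceedance)
    return majority / minority

ir_class_ii = imbalance_ratio(16803, 12909)  # compliance is the majority
ir_class_ia = imbalance_ratio(5532, 24180)   # exceedance is the majority

print(round(ir_class_ii, 2), round(ir_class_ia, 2))  # 1.3 4.37
```

Note that the majority class switches from compliance under the Class II criterion to exceedance under the Class Ia criterion, so the minority class that the model struggles to learn is compliance in the stricter case.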
3.2. Prediction Performance
RETAIN-D generally demonstrated a good prediction performance (Fig. 4). Throughout the test period (January 1, 2022–December 31, 2022), RETAIN-D successfully identified when the Chl-a criteria were exceeded, although there were occasional mispredictions (Fig. 5). The prediction performance slightly decreased as the IR increased when the stricter Class Ia criterion was used. With the training set, RETAIN-D demonstrated a good prediction performance with both the Class II criterion (accuracy = 0.85, AUC = 0.85, F-measure = 0.83) and Class Ia criterion (accuracy = 0.98, AUC = 0.97, F-measure = 0.99). With the test set, RETAIN-D also yielded strong performance with the Class II criterion (accuracy = 0.90, AUC = 0.91, F-measure = 0.89) but performance slightly decreased with the Class Ia criterion (accuracy = 0.84, AUC = 0.69, F-measure = 0.90). In particular, RETAIN-D predicted the measured Chl-a values that complied with the Class Ia criterion with a relatively low accuracy of 0.43 (Fig. 4d). This can be attributed to the high IR of the training set for the Class Ia criterion, which included 4,020 samples in compliance and 18,302 samples in exceedance. Previous studies have demonstrated that class imbalance can adversely affect the training process by impeding the model’s ability to learn the characteristics of the minority class and causing the results to be biased toward the majority class [35, 36]. These results suggest that an appropriate data preprocessing method is required to mitigate class imbalance in the dataset and improve the prediction performance of the model [37]. The adverse effect of the class imbalance on the model performance was primarily reflected by the AUC rather than the accuracy and F-measure. The discrepancy in model performance can be attributed to the inherent differences in the evaluation metrics. The F-measure, which is based on the recall and precision, does not reflect TNs [38]. In contrast, the AUC considers specificity and thus reflects TNs [39]. Hence, selecting appropriate evaluation metrics of the model performance is important, particularly to consider the effects of class imbalance in the dataset.
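The sensitivity of each metric to class imbalance can be illustrated with a worked example for a trivial baseline that labels every sample as exceedance. The class counts are the full-dataset Class Ia counts reported in Section 3.1 (5,532 in compliance and 24,180 in exceedance), not the test split, so the resulting values are indicative only and are not a re-evaluation of RETAIN-D.

```python
# Full-dataset Class Ia counts from Section 3.1 (not the test split).
n_compliance, n_exceedance = 5532, 24180

# Baseline: every sample is predicted as exceedance (the positive class).
tp, fn = n_exceedance, 0
fp, tn = n_compliance, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

# A constant score ties every positive-negative pair, so the AUC is 0.5.
auc_constant = 0.5

print(round(accuracy, 2), round(f_measure, 2), specificity)  # 0.81 0.9 0.0
```

Under this level of imbalance, a predictor with no skill on the minority class still attains a high accuracy and F-measure, whereas its specificity and AUC expose the failure. This is consistent with the observation that the effect of class imbalance appeared mainly in the AUC.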
Although RNN-based DL models have been successfully applied across various domains, using low-frequency data may limit their potential advancements [29]. In this study, RETAIN-D was successfully applied to an environmental dataset characterized by a high temporal resolution and substantial missing ratio and accurately predicted when samples exceeded Chl-a criteria. Furthermore, RETAIN-D exhibited a relatively consistent performance even when the dataset had a high IR [40, 41]. These results highlight the robustness of RETAIN-D at addressing the challenges associated with missing data and class imbalance, which enhances its applicability to environmental modeling and monitoring.
3.3. Relative Importance of Environmental Variables
The mean contributions of input features to the predicted exceedance of Chl-a criteria at site NM were calculated based on the overall mean absolute feature-level attention weights of each feature over all instances and time steps (t = 0 to 14 hours). The relative importance of each input feature based on the mean contributions varied depending on the adopted Chl-a criterion (Fig. 6). Overall, Chl-a was an important variable for predicting the exceedance of both the Class Ia and Class II Chl-a criteria at site NM, with the highest importance observed at site MH followed by sites HD and GC (Fig. 6). The high Chl-a concentration combined with the large temporal variation and distinct seasonality observed at site MH may have increased its importance at this site. Moreover, the close distance between sites MH and NM may have amplified the importance of Chl-a at site MH for predicting the exceedance of Chl-a criteria at site NM. These results imply that Miho Stream, which is one of the main tributaries of the Geum River, may have a major impact on the mainstream and that monitoring and management efforts here should be considered to improve Chl-a concentrations at site NM. In contrast, the precipitation had a relatively low importance (Fig. 6a and c). This suggests that meteorological factors, which are typically effective in capturing temporal variations in algal blooms over longer durations (e.g., 1–2 weeks), may not be effective at predicting rapid variations in Chl-a concentrations over timescales of several hours [42]. Interestingly, T-N was more important than T-P for predicting Chl-a exceedance at site NM. This result may be explained by Microcystis, a non-nitrogen-fixing cyanobacterial genus that requires nitrogen sources for growth and that dominates rivers and lakes during algal blooms in South Korea [43, 44]. The results of this study also align with the recent paradigm shift in freshwater nutrient management from phosphorus to nitrogen, which has been driven by the numerous algal blooms induced by additional nitrogen inflow [43, 45].
The time-variant contributions of input features for predicting the exceedance of Chl-a criteria at site NM were also evaluated. The relative importance of the environmental variables had differing temporal patterns depending on the Chl-a criterion (Fig. 7). For the Class II criterion, the relative importance of input features was generally higher in recent time steps (Fig. 7a). For the Class Ia criterion, the relative importance of input features did not exhibit a consistent trend with only slightly higher values observed for time steps 12 to 10 (Fig. 7b).
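The feature-level and time-level summaries discussed in this section can both be obtained by averaging absolute per-sample values along different axes, as in the sketch below. The nested-list layout `values[n][t][k]` (instance, time step, feature), the normalization to relative importance, and the toy numbers are assumptions for illustration and are not taken from the study's code.

```python
def mean_abs_by_feature(values):
    """Average |value| over instances and time steps for each feature k."""
    n_feat = len(values[0][0])
    count = len(values) * len(values[0])
    return [sum(abs(inst[t][k]) for inst in values for t in range(len(inst))) / count
            for k in range(n_feat)]

def mean_abs_by_time_step(values):
    """Average |value| over instances and features for each time step t."""
    n_steps = len(values[0])
    count = len(values) * len(values[0][0])
    return [sum(abs(v) for inst in values for v in inst[t]) / count
            for t in range(n_steps)]

def to_relative(scores):
    """Normalize scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical values: 2 instances, 3 time steps, 2 features.
values = [
    [[0.1, -0.3], [0.2, -0.1], [0.4, 0.5]],
    [[-0.1, 0.1], [0.0, 0.3], [-0.6, 0.3]],
]

by_feature = mean_abs_by_feature(values)
by_time = mean_abs_by_time_step(values)
relative_importance = to_relative(by_feature)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions is still ranked as important.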
Our results showed that RETAIN-D could be successfully applied to high-frequency monitoring data acquired from a sensor network, allowing exploration of both feature and time level contributions of input features in predicting the exceedance of Chl-a. RETAIN-D utilizes a parallel architecture that produces separate attention weights to provide more explicit interpretations at both the feature and time levels [29]. Furthermore, training the model in reverse time order stabilizes attention generation and improves parameter learning efficiency [27]. Hence, RETAIN-D is potentially applicable for analyzing other environmental data with a high temporal resolution and contributing to the decision-making process for mitigating environmental risks and costs.
4 Conclusions
In this study, RETAIN-D was trained on a high-frequency environmental dataset and was applied to predict the exceedance of Chl-a criteria at site NM in the Geum River, South Korea. RETAIN-D showed strong prediction performance even when the percentage of missing values was high. RETAIN-D also performed consistently even with class imbalance in the dataset. The contributions of input features could be explored at both the feature and temporal levels within a single model framework. The results showed that the Chl-a concentration from the nearest monitoring site was the most important variable for predicting the exceedance of Chl-a criteria at the target site. In addition, the contributions of input features generally increased for more recent time steps when the Chl-a criterion of the target site was applied. This study demonstrated the effectiveness of RETAIN-D in analyzing high-frequency environmental data to support necessary decision-making. Future studies should investigate data preprocessing methods to address class imbalance and improve prediction performance. Furthermore, various data imputation algorithms should be compared to further enhance the credibility of hybrid DL models.
Notes
Acknowledgments
This work was supported by the Basic Study and Interdisciplinary R&D Foundation Fund of the University of Seoul (2023).
Conflict-of-Interest Statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Author Contributions
G.L. (Researcher) conducted modeling and wrote the manuscript. J.S. (Ph.D.) conducted modeling and revised the manuscript. Y.K. (Ph.D. candidate) conducted modeling. E.H. (Researcher) and C.Y. (Research assistant) revised the manuscript. T.K. (Ph.D. candidate) conducted modeling. Y.C. (Professor) outlined and revised the manuscript.
References
1. Rast W, Thornton JA. Trends in eutrophication research and control. Hydrol. Process. 1996;10:295–313. https://doi.org/10.1002/(SICI)1099-1085(199602)10:2<295::AID-HYP360>3.0.CO;2-F
![]() 2. Conley DJ, Paerl HW, Howarth RW, et al. Ecology. Controlling eutrophication: Nitrogen and phosphorus. Science. 2009;323:1014–1015. https://doi.org/10.1126/science.1167755
![]() ![]() 3. Smith VH, Schindler DW. Eutrophication science: where do we go from here? Trends Ecol. Evol. 2009;24:201–207. https://doi.org/10.1016/j.tree.2008.11.009
![]() ![]() 4. Zhang Y, Li M, Dong J, et al. A critical review of methods for analyzing freshwater eutrophication. Water. 2021;13:225. https://doi.org/10.3390/W13020225
![]() 5. Dalu T, Wasserman RJ, Magoro ML, Froneman PW, Weyl OL. River nutrient water and sediment measurements inform on nutrient retention, with implications for eutrophication. Sci. Total Environ. 2019;684:296–302. https://doi.org/10.1016/J.SCITOTENV.2019.05.167
![]() ![]() 6. Wang X, Xu L. Unsteady multi-element time series analysis and prediction based on spatial-temporal attention and error forecast fusion. Future Internet. 2020;12:34. https://doi.org/10.3390/FI12020034
![]() 7. Houser JN, Bierman DW, Burdis RM, Soeken-Gittinger LA. Longitudinal trends and discontinuities in nutrients, chlorophyll, and suspended solids in the Upper Mississippi River: implications for transport, processing, and export by large rivers. Hydrobiologia. 2010;651:127–144. https://doi.org/10.1007/s10750-010-0282-z
![]() 8. Kakade A, Salama ES, Han H, et al. World eutrophic pollution of lake and river: biotreatment potential and future perspectives. Environ. Technol. Innov. 2021;23:101604. https://doi.org/10.1016/J.ETI.2021.101604
![]() 9. Sin Y, Lee H. Changes in hydrology, water quality, and algal blooms in a freshwater system impounded with engineered structures in a temperate monsoon river estuary. J. Hydrol. Reg. Stud. 2020;32:100744. https://doi.org/10.1016/J.EJRH.2020.100744
![]() 10. Du H, Chen Z, Mao G, et al. Evaluation of eutrophication in freshwater lakes: A new non-equilibrium statistical approach. Ecol. Indic. 2019;102:686–692. https://doi.org/10.1016/J.ECOLIND.2019.03.032
![]() 11. Kim KM, Ahn JH. Machine learning predictions of chlorophyll-a in the Han River basin, Korea. J. Environ. Manage. 2022;318:115636. https://doi.org/10.1016/J.JENVMAN.2022.115636
![]() ![]() 12. Bhurtun P, Lesven L, Ruckebusch C, Halkett C, Cornard JP, Billon G. Understanding the impact of the changes in weather conditions on surface water quality. Sci. Total Environ. 2019;652:289–299. https://doi.org/10.1016/J.SCITOTENV.2018.10.246
![]() ![]() 13. Horsburgh JS, Jones AS, Stevens DK, Tarboton DG, Mesner NO. A sensor network for high frequency estimation of water quality constituent fluxes using surrogates. Environ. Modell. Softw. 2010;25:1031–1044. https://doi.org/10.1016/J.ENVSOFT.2009.10.012
![]() 14. Brack W, Dulio V, Ågerstrand M, et al. Towards the review of the European Union water Framework Directive: recommendations for more efficient assessment and management of chemical contamination in European surface water resources. Sci. Total Environ. 2017;576:720–737. https://doi.org/10.1016/J.SCITOTENV.2016.10.104
![]() ![]() ![]() 15. Castrillo M, García ÁL. Estimation of high frequency nutrient concentrations from water quality surrogates using machine learning methods. Water Res. 2020;172:115490. https://doi.org/10.1016/J.WATRES.2020.115490
![]() ![]() 16. Adu-Manu KS, Tapparello C, Heinzelman W, Katsriku FA, Abdulai JD. Water quality monitoring using wireless sensor networks: Current Trends and Future Research Directions. ACM Trans. Sen. Netw. 2017;13:1–41. https://doi.org/10.1145/3005719
![]() 17. Li T, Xia M, Chen J, Zhao Y, De Silva C. Automated water quality survey and evaluation using an IoT platform with mobile sensor nodes. Sensors. 2017;17:1735. https://doi.org/10.3390/S17081735
![]() ![]() ![]() 18. Birgand F, Aveni-Deforge K, Smith B, et al. First report of a novel multiplexer pumping system coupled to a water quality probe to collect high temporal frequency in situ water chemistry measurements at multiple sites. Limnol. Oceanogr. Methods. 2016;14:767–783. https://doi.org/10.1002/LOM3.10122
![]() 19. Rode M, Wade AJ, Cohen MJ, et al. Sensors in the stream: the high-frequency wave of the present. Environ. Sci. Technol. 2016;50:10297–10307. https://doi.org/10.1021/acs.est.6b02155
![]() ![]() 20. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge: MIT press; 2016.
21. Busari I, Sahoo D, Jana RB. Prediction of chlorophyll-a as an indicator of harmful algal blooms using deep learning with Bayesian approximation for uncertainty assessment. J. Hydrol. 2024;630:130627. https://doi.org/10.1016/J.JHYDROL.2024.130627
![]() 22. Cho H, Choi UJ, Park H. Deep learning application to time-series prediction of daily chlorophyll-a concentration. WIT Trans. Ecol. Environ. 2018;215:157–163. https://doi.org/10.2495/EID180141
![]() 23. Wenxiang D, Caiyun Z, Shaoping S, Xueding L. Optimization of deep learning model for coastal chlorophyll a dynamic forecast. Ecol. Modell. 2022;467:109913. https://doi.org/10.1016/J.ECOLMODEL.2022.109913
![]() 24. Kwon DH, Hong SM, Abbas A, et al. Inland harmful algal blooms (HABs) modeling using internet of things (IoT) system and deep learning. Environ. Eng. Res. 2023;28:210280. https://doi.org/10.4491/EER.2021.280
![]() 25. Ni J, Liu R, Tang G, Xie Y. An improved attention-based bidirectional LSTM model for cyanobacterial bloom prediction. Int. J. Control Autom. Syst. 2022;20:3445–3455. https://doi.org/10.1007/s12555-021-0802-9
![]() 26. Niu Z, Zhong G, Yu H. A review on the attention mechanism of deep learning. Neurocomputing. 2021;452:48–62. https://doi.org/10.1016/J.NEUCOM.2021.03.091
![]() 27. Choi E, Bahadori MT, Kulas JA, Schuetz A, Stewart WF, Sun J. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. Adv Neural Inf Process Syst. 2016;29:https://doi.org/10.48550/arXiv.1608.05745
![]() 28. Choi C, Jung H, Cho J. An ensemble method for missing data of environmental sensor considering univariate and multivariate characteristics. Sensors. 2021;21:7595. https://doi.org/10.3390/S21227595
![]() ![]() ![]() 29. Kim T, Shin J, Lee D, et al. Simultaneous feature engineering and interpretation: Forecasting harmful algal blooms using a deep learning approach. Water Res. 2022;215:118289. https://doi.org/10.1016/J.WATRES.2022.118289
![]() ![]() 30. Che Z, Purushotham S, Cho K, Sontag D, Liu Y. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 2018;8:6085. https://doi.org/10.1038/s41598-018-24271-9
![]() ![]() ![]() 31. Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32:https://doi.org/10.48550/arXiv.1912.01703
![]() 32. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:
33. Van Rossum G, Drake FL. Python reference manual. Amsterdam: Centrum Voor Wiskunde en Informatica; 1995. p. 1–52.
34. Yu N, Choi B, Seo D. Analysis of water quality characteristics of the Miho River basin using multivariate statistical analysis. J. Korea Water Resour. Assoc. 2024;57:785–795. https://doi.org/10.3741/JKWRA.2024.57.10.785
35. Brown I, Mues C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst. Appl. 2012;39:3446–3453. https://doi.org/10.1016/J.ESWA.2011.09.033
36. Shin J, Yoon S, Kim YW, Kim T, Go BG, Cha YK. Effects of class imbalance on resampling and ensemble learning for improved prediction of cyanobacteria blooms. Ecol. Inform. 2021;61:101202. https://doi.org/10.1016/J.ECOINF.2020.101202
37. Felix EA, Lee SP. Systematic literature review of preprocessing techniques for imbalanced data. IET Softw. 2019;13:479–496. https://doi.org/10.1049/IET-SEN.2018.5193
38. Van Rijsbergen CJ. Information retrieval. Oxford: Butterworth-Heinemann; 1979.
39. Huang J, Ling CX. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005;17:299–310. https://doi.org/10.1109/TKDE.2005.50
40. Ghosh K, Bellinger C, Corizzo R, Branco P, Krawczyk B, Japkowicz N. The class imbalance problem in deep learning. Mach Learn. 2022;1–57. https://doi.org/10.1007/s10994-022-06268-8
41. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J. Big Data. 2019;6:1–54. https://doi.org/10.1186/s40537-019-0192-5
42. Kim TH, Shin J, Cha YK. Incorporation of feature engineering and attention mechanisms into deep learning models to develop an early warning system for harmful algal blooms. J. Clean. Prod. 2023;414:137564. https://doi.org/10.1016/J.JCLEPRO.2023.137564
43. Kim K, Mun H, Shin H, et al. Nitrogen stimulates Microcystis-dominated blooms more than phosphorus in river conditions that favor non-nitrogen-fixing genera. Environ. Sci. Technol. 2020;54:7185–7193. https://doi.org/10.1021/acs.est.9b07528
44. Srivastava A, Ahn CY, Asthana RK, Lee HG, Oh HM. Status, alert system, and prediction of cyanobacterial bloom in South Korea. BioMed Res. Int. 2015;2015:584696. https://doi.org/10.1155/2015/584696
45. Paerl HW, Gardner WS, McCarthy MJ, Peierls BL, Wilhelm SW. Algal blooms: noteworthy nitrogen. Science. 2014;346:175. https://doi.org/10.1126/science.346.6206.175-a
Fig. 1. Map of the study region and monitoring sites, including Nam-myeon (NM), Mihogang (MH), Gapcheon (GC), and Hyeondo (HD), along the Geum River, South Korea. Black arrows indicate the flow direction of the river.

Fig. 2. Modeling procedure for predicting the exceedance of Chl-a criteria at site NM. RETAIN-D: reverse time attention with a decay mechanism.

Fig. 3. Variations in Chl-a (mg/m³) during the total period of 2018–2022 (grey), summer seasons (black), and non-summer seasons (white) across four monitoring sites (HD (blue), NM (red), GC (green), and MH (yellow)). Whiskers are drawn from minimum to maximum values.

Fig. 4. Confusion matrix plot and classification performance of RETAIN-D. The color bar indicates the ratio of the classification results belonging to each category based on measurements. Numbers in parentheses indicate the number of samples classified into each category.

Fig. 5. Measured and predicted exceedance of Chl-a criteria at site NM using RETAIN-D during the test period (2022-01-01 to 2022-12-31). Blue circles indicate that the measured Chl-a values complied with the criteria, while red circles indicate that the Chl-a values exceeded the criteria. Shaded areas in blue and red depict the prediction results for compliance and exceedance, respectively.

Fig. 6. Relative importance of input features for predicting the exceedance of Chl-a criteria at site NM using RETAIN-D (Wtemp: water temperature; WL: water level). The importance of each feature represents the overall mean absolute contribution of the feature, over all instances and time steps (t = 0–14 hours), to predicting the exceedance of Chl-a criteria at site NM based on the measured training and test datasets. The circle sizes depict the relative differences in feature importance.

Fig. 7. Time-specific effects of input features for predicting the exceedance of Chl-a criteria at site NM using RETAIN-D. The symbol colors indicate the relative importance (%) of each time step across input features.

Table 1. Summary statistics (2018–2022) and data sources of input features and model output. The ranges (minimum–maximum) and medians (in parentheses) were calculated based on the measured values in the dataset.
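The Fig. 6 caption defines feature importance as the mean absolute contribution over all instances and time steps, and the Fig. 7 caption reports a relative importance (%) per time step. The sketch below shows one way such summaries could be computed from a nested list of per-instance, per-time-step, per-feature contributions. It is illustrative only: the function names, the input layout, and both normalizations (feature importances scaled to sum to 1, time-step percentages scaled to 100% within each feature) are assumptions, not the authors' implementation.

```python
def feature_importance(contrib):
    """Mean absolute contribution per feature, scaled to sum to 1.

    contrib[i][t][j] is the signed contribution of feature j at time
    step t for instance i (hypothetical layout).
    """
    n_features = len(contrib[0][0])
    totals = [0.0] * n_features
    count = 0
    for instance in contrib:
        for step in instance:
            count += 1
            for j, value in enumerate(step):
                totals[j] += abs(value)
    mean_abs = [total / count for total in totals]
    scale = sum(mean_abs)  # assumed normalization for relative importance
    return [m / scale for m in mean_abs]


def time_step_importance(contrib):
    """Relative importance (%) of each time step within each feature.

    Returns rows indexed by time step and columns by feature; each
    column sums to 100 (assumed normalization).
    """
    n_instances = len(contrib)
    n_steps = len(contrib[0])
    n_features = len(contrib[0][0])
    mean_abs = [[0.0] * n_features for _ in range(n_steps)]
    for instance in contrib:
        for t, step in enumerate(instance):
            for j, value in enumerate(step):
                mean_abs[t][j] += abs(value) / n_instances
    col_sums = [sum(row[j] for row in mean_abs) for j in range(n_features)]
    return [[100.0 * row[j] / col_sums[j] for j in range(n_features)]
            for row in mean_abs]


# Toy data: 2 instances, 2 time steps, 2 hypothetical features.
toy = [
    [[1.0, -3.0], [1.0, 1.0]],
    [[-1.0, 3.0], [1.0, -1.0]],
]
print(feature_importance(toy))
print(time_step_importance(toy))
```

On the toy data, the second feature receives twice the importance of the first (mean absolute contributions of 2 and 1), and its first time step accounts for 75% of its total contribution.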