This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
This paper investigates the improvement in organic matter classification accuracy from different aquatic environments through the application of machine learning and deep learning techniques, supplemented with data generated by an LSTM-GAN model. Samples from the Nakdong and Yeongsan Rivers in South Korea were analyzed using Orbitrap HR-MS to obtain natural organic matter (NOM) data. Classification was performed using three machine learning algorithms—random forest, support vector machine (SVM), and logistic regression—and one deep learning algorithm, a multi-layer perceptron (MLP). Due to the limited performance of deep learning with insufficient data, an LSTM-GAN-based augmentation model was proposed, improving MLP performance. The MLP with augmented data achieved the highest classification accuracy (79% for Yeongsan River, 68% for Nakdong River), demonstrating the significant potential of LSTM-GAN in enhancing deep learning models for river classification tasks. This approach provides a robust framework for improving environmental monitoring through machine learning.
Keywords: Data augmentation, Deep Learning, DOM, LSTM-GAN, Orbitrap MS
1 Introduction
Natural organic matter (NOM) is derived from the debris of animals and plants in ecosystems. Its characteristics are diverse because it decomposes in different ways depending on the environment, making NOM a heterogeneous and complex mixture of compounds. In aquatic environments, NOM plays an important role as a carbon and energy source but can also act as an aquatic pollutant, causing taste problems or forming disinfection by-products. Therefore, understanding the characteristics of NOM is very important in water quality studies.
In addition, the characteristics of NOM serve as key indicators of the unique properties of freshwater ecosystems. Rivers, lakes, and other water bodies exhibit distinct NOM compositions due to variations in geography, climate, biological activity, and anthropogenic influences. Characterizing NOM provides practical insights into the origin and condition of water sources, aiding in the design of effective water treatment processes. The presence of specific NOM compounds affects coagulation efficiency, adsorption capacity, and membrane fouling during water treatment.
Various analytical methods have been applied to characterize NOM. Total organic carbon analyzers, ultraviolet or fluorescence spectroscopy, and elemental analyzers generally reveal bulk characteristics such as carbon concentration and hydrophobicity. For molecular structure analysis, nuclear magnetic resonance (NMR) and mass spectrometry have been widely used. Among NOM characterization techniques, analysis using high-resolution mass spectrometry (HR-MS) has recently become more prevalent because it provides high mass resolution and accuracy while being more economically feasible than NMR. In particular, Orbitrap HR-MS has gained attention for characterizing NOM at the molecular level. Park et al. studied the diversity of NOM characteristics depending on sample origin [1]; they classified the identified biopolymers using Orbitrap HR-MS and compared NOM characteristics among samples. Similarly, Phungsai et al. collected NOM samples from water treatment processes and analyzed the transformation of organic matter across these processes [2]. Their research found that ozonation and chlorination steps reduced low-molecular-weight organic matter.
Despite these advantages, there are few studies applying Orbitrap HR-MS data to water quality classification, mainly due to the complexity of the analysis and the scarcity of available data. The pretreatment of samples for Orbitrap analysis involves several steps, such as filtration, concentration, and evaporation [3]. Furthermore, the large size of Orbitrap HR-MS data (approximately 200 MB per sample) increases computational complexity. For these reasons, applying Orbitrap results to water quality prediction modeling has been challenging.
Machine learning and deep learning are useful tools for identifying data patterns in complex and large-scale datasets, as they can extract essential features from the data. Several studies have applied machine learning to water quality modeling and estimation. Hong et al. estimated the E. coli concentration in an irrigation pond using machine learning models, achieving R2 values above 0.896 for the test dataset [4]. Uddin et al. proposed an innovative framework for predicting uncertainties in Water Quality Index (WQI) models using Gaussian Process Regression (GPR). Their machine learning model reduced uncertainty by 12.86% in summer and 10.27% in winter [5]. Additionally, recent studies have used machine learning to generate data for developing models. Hou et al. described the simulation of dissolved organic matter (DOM) transformations in sewer systems. To overcome limited data availability, they proposed using Generative Adversarial Networks (GANs) integrated with machine learning (ML) to improve prediction accuracy. GANs, as one of the representative data augmentation models, artificially generate new data by making slight changes to the original data [6]. Their developed model generated 1,000 virtual samples, achieving R2 of 0.5389 and RMSE of 0.0273 [7]. Similarly, Li et al. investigated the application of GANs for detecting contamination events, addressing challenges such as accurately extracting spatial and temporal features from water quality data [8].
In this paper, we aim to predict the NOM source based on molecular characteristics using machine learning with data augmentation. First, we collected samples from the Nakdong River and the Yeongsan River in South Korea. Second, the collected samples were analyzed using Orbitrap HR-MS to obtain molecular information on the NOM in each river. Finally, NOM source classification, distinguishing river origins, was performed using machine learning and deep learning with a data augmentation model.
2 Materials and Methods
2.1. Sample Collection
Riverine NOM samples were collected from the Nakdong River and the Yeongsan River in the Republic of Korea between June and November 2017. Nakdong River samples were collected 17 times from Hapcheon-Changnyeong Weir (35°N, 128°E), and Yeongsan River samples were obtained 19 times from Juksan Weir (35°N, 126°E) (Fig. 1).
2.2. Orbitrap HR-MS Analysis
As a preliminary step for Orbitrap mass spectrometry analysis, solid-phase extraction (SPE) was used to remove inorganic salts. The SPE method and the operating conditions for Orbitrap mass spectrometry followed the protocol described by Baek et al. [9]. NOM was adsorbed using manually packed SPE cartridges containing HLB (Oasis, Waters, USA), ENV+ (International Sorbent Technology, UK), Strata X-AW, and X-CW (Phenomenex, UK). The cartridges were conditioned with 5 mL of methanol and 10 mL of deionized water, followed by loading 1 L of the sample. Subsequently, the adsorbed NOM was extracted using 6 mL of ethyl acetate/methanol (50:50 v/v) with 0.5% ammonia and 3 mL of ethyl acetate/methanol (50:50 v/v) with 1.7% formic acid. This method achieved a high recovery rate of over 75%.
The pretreated samples were analyzed using an UltiMate 3000 UPLC system (Thermo Fisher Scientific, San Jose, CA, USA) coupled with an Exactive Orbitrap mass spectrometer (Thermo Fisher Scientific, San Jose, CA, USA) equipped with a heated electrospray ionization (HESI) interface. The HESI was operated under the following conditions: sheath gas flow at 45 L/min, capillary temperature at 320°C, spray voltage at 3800 V (positive mode)/3000 V (negative mode), auxiliary gas pressure at 10 arbitrary units, and ion sweep gas at 2 arbitrary units. The injection volume was 200 μL with a methanol mobile phase. Mass spectra were recorded in the range of 100 to 2000 m/z. Molecular formulas were assigned using a compound identification algorithm in MATLAB, as described by Kujawinski and Behn [10].
2.3. River Classification Based on Machine Learning and Deep Learning
From the Orbitrap HR-MS analysis, we obtained 17 Nakdong River samples and 19 Yeongsan River samples. Each Orbitrap HR-MS result contained six features (Exp. m/z, C, H, N, O, S) and approximately 2,500 data points per sample (Table 1).
The analyzed results were merged into datasets according to sample origin. The Nakdong River dataset contained 49,131 data points and the Yeongsan River dataset 42,078 data points. These datasets were divided into training and test sets at a ratio of approximately 7:3 by sample: 30,093 data points from 12 Nakdong River samples and 31,216 data points from 14 Yeongsan River samples were used as the training dataset, while 19,038 data points from five Nakdong River samples and 10,862 data points from five Yeongsan River samples were used as the test dataset. Splitting by sample keeps both rivers represented in the training and test datasets and ensures that model performance is evaluated on samples not seen during training, supporting the development of a robust and generalizable model.
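A minimal sketch of this sample-wise split is given below; the file name and the column names (river, sample_id) are illustrative assumptions and not part of the original study.

```python
import math
import pandas as pd

# Hypothetical merged peak table with columns: river, sample_id, and the six features.
df = pd.read_csv("orbitrap_peaks.csv")

train_parts, test_parts = [], []
for river, group in df.groupby("river"):
    sample_ids = sorted(group["sample_id"].unique())
    n_train = math.ceil(len(sample_ids) * 0.7)           # ~70% of samples per river
    train_ids = set(sample_ids[:n_train])
    train_parts.append(group[group["sample_id"].isin(train_ids)])
    test_parts.append(group[~group["sample_id"].isin(train_ids)])

train_df = pd.concat(train_parts, ignore_index=True)     # training dataset
test_df = pd.concat(test_parts, ignore_index=True)       # test dataset
```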
For NOM source prediction based on river origin, we employed a Multi-Layer Perceptron (MLP) as a deep learning technique alongside three machine learning techniques: Random Forest, Support Vector Machine (SVM), and Logistic Regression.
Random forest is an ensemble machine learning model that uses multiple decision trees for classification and regression. Support vector machines can be used for both classification and regression by maximizing the margin around the support vectors. Logistic regression performs binary classification by applying the sigmoid function to estimate the probability of class membership. The MLP, which includes one or more hidden layers between the input and output layers, is a deep learning technique designed to learn non-linearly distributed data. In this study, the MLP was configured with four hidden layers. The input sizes of the first, second, third, and fourth layers were 8, 64, 128, and 64, respectively, and the corresponding output sizes were 64, 128, 64, and 2. Cross-entropy was used as the loss function because the final layer has two output nodes. Adam was selected as the optimizer, with a learning rate of 0.001.
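A minimal PyTorch sketch of this MLP configuration is shown below (the stated layer widths, cross-entropy loss, and Adam with a learning rate of 0.001); the ReLU activations and any other details not stated in the text are assumptions.

```python
import torch
import torch.nn as nn

class RiverMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(8, 64), nn.ReLU(),     # first layer: 8 -> 64
            nn.Linear(64, 128), nn.ReLU(),   # second layer: 64 -> 128
            nn.Linear(128, 64), nn.ReLU(),   # third layer: 128 -> 64
            nn.Linear(64, 2),                # fourth layer: 64 -> 2 output classes
        )

    def forward(self, x):
        return self.net(x)

model = RiverMLP()
criterion = nn.CrossEntropyLoss()                            # cross-entropy for two classes
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # Adam, learning rate 0.001
```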
2.4. Data Augmentation Model
We used the PyTorch framework to implement the machine learning and deep learning techniques for river classification, as well as the Long Short-Term Memory (LSTM)-Generative Adversarial Network (GAN) model for data augmentation [11]. Model development and training were conducted on a workstation equipped with an NVIDIA RTX A6000 GPU, an Intel i9-10940X 3.30 GHz processor, and 128 GB of RAM.
Fig. 2 illustrates the overall workflow for data augmentation. Initially, the distribution of the model’s input data is thoroughly analyzed, followed by preprocessing of the previously split training dataset. Additionally, a portion of the training dataset is further partitioned into a validation dataset to evaluate the performance of the LSTM-GAN model and prevent overfitting.
This data-splitting process ensures that the model does not become overly reliant on the training dataset, while the validation dataset allows for the assessment of the model’s generalization ability. As depicted in the figure, the training data is used to train the augmentation model, while the validation data is employed for performance evaluation and optimization. Finally, the test data is used to independently evaluate the performance of the classification model described in Section 2.3.
Data augmentation was performed for each feature of the training dataset due to their distinct statistical characteristics, such as minimum and maximum values, average, and standard deviation. The training dataset, consisting of six features, was divided into individual features to generate input data for each augmentation model. Since each feature had a unique minimum and maximum value, the data range was normalized between 0 and 1 using MinMaxScaler to enhance data augmentation efficiency and consistency [12].
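The per-feature normalization can be sketched as follows using scikit-learn's MinMaxScaler [12]; the DataFrame train_df and the column names follow the split sketch above and are illustrative assumptions.

```python
from sklearn.preprocessing import MinMaxScaler

features = ["Exp. m/z", "C", "H", "N", "O", "S"]
scalers, scaled_columns = {}, {}
for name in features:
    column = train_df[[name]].to_numpy()                   # one augmentation model per feature
    scaler = MinMaxScaler(feature_range=(0, 1)).fit(column)
    scalers[name] = scaler                                 # kept for inverse scaling later
    scaled_columns[name] = scaler.transform(column)        # values rescaled to [0, 1]
```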
The LSTM network was integrated into the GAN to generate augmented data and to distinguish fake data from original data. The generator and discriminator in the LSTM-GAN architecture each consist of one LSTM layer and one fully connected (FC) layer.
Fig. 3 depicts the detailed architectures of the generator and discriminator. To generate fake data, the generator takes a noise vector as input. The LSTM layer's initial hidden state vector H0 and cell state vector C0 are zero vectors with dimensions (1, batch_size, hidden_size). The LSTM layer output vector Ht from the generator at each time step t has size (batch_size, sequence_length, hidden_size). The FC layer receives the final LSTM output vector Ht_end as input and produces a vector Yt of size (batch_size, sequence_length, 1). The discriminator D distinguishes between fake and original data using a label vector of size (batch_size, sequence_length, hidden_size). Table 2 and Table 3 show the parameter settings of the generator and discriminator, respectively. The LSTM-GAN model employs Adam as the optimizer and BCELoss as the loss function. Eqs. (1) and (2) give the loss values for the generator and discriminator, respectively. The discriminator D learns to maximize LD in order to distinguish fake data from original data, while the generator G is trained adversarially to maximize LG [13]. The sequence length was set to 50, the batch size to 64, the learning rate to 0.005, and the number of epochs to 100.
$$L_G = \mathbb{E}_{z \sim p_z(z)}\left[\log D\left(G(z)\right)\right] \tag{1}$$

$$L_D = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D\left(G(z)\right)\right)\right] \tag{2}$$
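A minimal PyTorch sketch of the one-LSTM-layer plus one-FC-layer generator and discriminator described above is given below. The hidden size, noise dimension, and the per-sequence sigmoid output are illustrative assumptions; the hyperparameters follow the text (sequence length 50, batch size 64, learning rate 0.005, BCELoss, Adam).

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=16, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)        # one scaled feature value per time step

    def forward(self, z):                          # z: (batch, seq_len, noise_dim)
        h, _ = self.lstm(z)                        # h: (batch, seq_len, hidden_size)
        return torch.sigmoid(self.fc(h))           # (batch, seq_len, 1), values in [0, 1]

class Discriminator(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):                          # x: (batch, seq_len, 1)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.fc(h[:, -1]))    # probability that the sequence is real

G, D = Generator(), Discriminator()
criterion = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=0.005)
opt_D = torch.optim.Adam(D.parameters(), lr=0.005)
```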
Finally, postprocessing is applied to the augmented data obtained from the LSTM-GAN model for each feature. Because the preprocessing step normalized the data to between 0 and 1 with MinMaxScaler, inverse scaling is applied to transform the generated values back into the original range of each feature. Since inverse scaling produces real numbers, integer-valued features (C, H, N, O, and S) are rounded to the nearest integer. The generated feature values are then concatenated to produce complete samples.
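The postprocessing step can be sketched as follows; the scalers dictionary and feature names follow the scaling sketch above, and the function name is an illustrative assumption.

```python
import numpy as np
import pandas as pd

integer_features = ["C", "H", "N", "O", "S"]

def postprocess(generated, scalers, features):
    """generated: dict of feature name -> array of generated values in [0, 1]."""
    columns = {}
    for name in features:
        values = scalers[name].inverse_transform(
            np.asarray(generated[name]).reshape(-1, 1)     # back to the original range
        ).ravel()
        if name in integer_features:
            values = np.round(values).astype(int)          # element counts must be integers
        columns[name] = values
    return pd.DataFrame(columns)                           # concatenated augmented samples
```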
This comprehensive data augmentation workflow is expected to play a critical role in supplementing the limited number of data samples by generating synthetic samples that account for the unique characteristics of the Nakdong and Yeongsan River samples derived from Orbitrap HR-MS analysis. By leveraging statistical properties, the augmentation process ensures that the generated data closely reflects the underlying distribution of the original dataset. This approach is expected not only to address data scarcity but also to enhance the model’s generalization capability by providing a more diverse and representative set of training samples. Furthermore, the expanded dataset is anticipated to strengthen the model’s learning ability in scenarios with imbalanced or limited data, making the model more robust to environmental changes and exceptional cases. Ultimately, the augmented dataset is expected to improve prediction accuracy and provide greater reliability for real-world applications.
3 Results & Discussion
3.1. Performance Evaluation of LSTM-GAN
To evaluate the performance of the proposed LSTM-GAN model, we analyzed and compared the distributions of the augmented water quality data and the originally measured data using statistical criteria such as the mean and standard deviation. The trained LSTM-GAN model generated 60,000 augmented data points for each of the Nakdong and Yeongsan Rivers. The augmented data analysis was performed on these generated datasets, while the original data analysis used the test dataset described in Section 2.3.
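A brief sketch of drawing the augmented points from a trained generator is shown below; G, the noise dimension, and the subsequent per-feature inverse scaling refer to the Section 2.4 sketches and are illustrative assumptions.

```python
import torch

n_points, seq_len, noise_dim = 60_000, 50, 16
generated = []
with torch.no_grad():
    while sum(t.numel() for t in generated) < n_points:
        z = torch.randn(64, seq_len, noise_dim)            # noise batch for the generator
        generated.append(G(z).reshape(-1))                 # scaled values in [0, 1]
values = torch.cat(generated)[:n_points].numpy()           # then inverse-scale per feature
```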
Fig. 4 illustrates the analysis results of both the original and augmented data for each feature of the Nakdong River (a) and Yeongsan River (b). In these figures, Mean, STD, 25%, 50%, and 75% represent the average, standard deviation, first quartile, median, and third quartile, respectively. The augmented data for both the Nakdong River and Yeongsan River exhibit trends similar to those observed in the original data in terms of mean, standard deviation, and quartiles. However, for the N and S features of the Nakdong River, the augmented data showed higher values compared to the original data. In contrast, discrepancies were observed in the S features of the Yeongsan River between the augmented and original data.
To evaluate the statistical similarity between the augmented and original data, we calculated the sum of normalized differences for each feature using Eqs. (3), (4), and (5).
$$\mathrm{NDM} = \frac{\left|\mu_{\mathrm{orig}} - \mu_{\mathrm{aug}}\right|}{\mu_{\mathrm{orig}}} \tag{3}$$

$$\mathrm{NSD} = \frac{\left|\sigma_{\mathrm{orig}} - \sigma_{\mathrm{aug}}\right|}{\sigma_{\mathrm{orig}}} \tag{4}$$

$$\mathrm{SND} = \mathrm{NDM} + \mathrm{NSD} \tag{5}$$
In Eqs. (3)–(5), NDM, NSD, and SND denote the normalized mean difference, the normalized standard deviation difference, and the sum of normalized differences, respectively, and μ and σ denote the mean and standard deviation of the original (orig) and augmented (aug) data.
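A short sketch of this similarity metric for one feature is given below, assuming the differences are normalized by the original statistic as in Eqs. (3)–(5) reconstructed above.

```python
import numpy as np

def sum_of_normalized_differences(original, augmented):
    """SND for one feature, following Eqs. (3)-(5) as reconstructed above."""
    original, augmented = np.asarray(original), np.asarray(augmented)
    ndm = abs(original.mean() - augmented.mean()) / abs(original.mean())  # Eq. (3)
    nsd = abs(original.std() - augmented.std()) / original.std()          # Eq. (4)
    return ndm + nsd                                                      # Eq. (5)
```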
Fig. 5 illustrates the sum of normalized differences for each feature of the Nakdong River (a) and Yeongsan River (b), with smaller values indicating higher similarity. For the Nakdong River, the Exp. m/z, H, and O features demonstrated high similarity, whereas the N and S features exhibited relatively low similarity. Similarly, for the Yeongsan River, the Exp. m/z, C, and H features showed high similarity, as indicated by small sums of normalized differences, while the S feature displayed relatively low similarity. Overall, except for the S feature, the proposed LSTM-GAN model generates augmented data that closely resembles the original data.
3.2. Accuracy Comparison of River Classification
We used Random Forest, SVM, Logistic Regression, and MLP for river classification. Recognizing that the performance of the MLP, a deep learning model, improves with larger training datasets, we supplemented the training data with augmented samples generated by the LSTM-GAN. To determine the optimal amount of augmented data to add to the original training dataset, we increased the quantity of augmented data from 0 to 60,000 in steps of 20,000 samples, drawing the added samples randomly from the generated data, and evaluated the classification accuracy, training time, and model size for each case.
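An illustrative loop over the augmented-data amounts is sketched below; train_mlp, evaluate, train_df, test_df, and augmented_df are hypothetical placeholders standing in for the training, evaluation, and dataset code described above.

```python
import time
import pandas as pd

for n_augmented in (0, 20_000, 40_000, 60_000):
    extra = augmented_df.sample(n=n_augmented, random_state=0) if n_augmented else augmented_df.iloc[:0]
    train_set = pd.concat([train_df, extra], ignore_index=True)

    start = time.time()
    model = train_mlp(train_set)                 # hypothetical training routine
    training_time = time.time() - start

    accuracy = evaluate(model, test_df)          # hypothetical evaluation routine
    print(n_augmented, accuracy, round(training_time, 1))
```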
As illustrated in Fig. 6, the classification accuracy for both rivers improves as the amount of augmented data increases. The classification accuracy reaches its maximum at 40,000 augmented data points and then declines as the amount of augmented data continues to grow.
Fig. 7 depicts the training time and model size as a function of the amount of augmented data. As the amount of augmented data increases, the size of the training dataset grows, resulting in longer training times. At 40,000 augmented data points, the model size reaches its maximum of 20.31 KB, although the overall change in model size remains minor.
Based on these findings, 40,000 augmented data points were added to the training dataset, considering the balance between classification accuracy, training time, and model size.
Fig. 8 depicts the river classification accuracy of the machine learning and deep learning models for the Nakdong River and Yeongsan River, evaluated on the five test samples from each river. Random Forest, Support Vector Machine, and Logistic Regression were trained on the original dataset. Two training datasets were used for the MLP: the first contains only the original data, while the second includes both the original and augmented data. Fig. 8(a) shows that, in the Nakdong River evaluation, the MLP using the original and augmented data performs best, at approximately 68%. The support vector machine has the second highest score, outperforming the MLP trained only on the original samples, which indicates that with insufficient training data the MLP's performance can fall below that of conventional machine learning models. Fig. 8(b) shows that, for the Yeongsan River, the MLP using the original and augmented data outperforms the other schemes, at approximately 79%. In both evaluations, the MLP using the original and augmented data outperforms the MLP using only the original data and the machine learning approaches, suggesting that the augmented data generated by the proposed LSTM-GAN improves the performance of deep learning models such as the multi-layer perceptron. Logistic Regression performs poorly in both evaluations, with classification accuracy below 50%.
4 Conclusions
This study demonstrated the effectiveness of an LSTM-GAN-based data augmentation model in enhancing the classification of DOM from different riverine environments. Samples were analyzed using Orbitrap HR-MS to obtain DOM data. By supplementing the limited original datasets with augmented samples, the classification performance of an MLP deep learning model improved significantly compared to traditional machine learning algorithms, including Random Forest, SVM, and Logistic Regression.
The results highlighted the following key findings:
Enhanced Model Accuracy: The MLP trained on both original and augmented data achieved classification accuracies of approximately 68% and 79% for the Nakdong and Yeongsan Rivers, respectively, outperforming all other models.
Impact of Data Augmentation: Augmented data generated by the LSTM-GAN model closely resembled the statistical characteristics of the original data, as validated through normalized differences. This augmentation effectively addressed the challenge of insufficient training data, a common limitation in environmental data modeling.
The findings suggest that integrating advanced data augmentation techniques, such as LSTM-GAN, can significantly improve the performance of deep learning models in complex environmental data classification tasks. By generating synthetic yet statistically similar data, LSTM-GAN enhances model training and generalization, making deep learning approaches more effective even when real-world datasets are limited or imbalanced. For instance, this methodology could be applied to air quality modeling to overcome the limitations of sensor data. Additionally, it can enhance water quality monitoring and management by enabling more accurate pollution source identification, real-time anomaly detection, and predictive modeling of seasonal water quality changes. Water treatment facilities could apply this approach to optimize purification processes by adapting treatment strategies based on machine learning models trained with augmented NOM data.
Notes
Acknowledgements
This work was supported by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea government (MOTIE) (P0017006, The Competency Development Program for Industry Specialist). Additionally, this study was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2020R1C1C1007350).
Conflicts-of-Interest Statements
The authors declare no conflicts of interest.
Author Contributions
Jinho Kim (MS student) conducted writing-original draft preparation. Junho Jeon (Professor) conducted conceptualization and data curation. Jongkwan Park (Associate Professor) wrote and revised the manuscript. Donghyeok An (Associate Professor) wrote and revised the manuscript.
References
1. Lee S, Park J. Comparison of molecular characteristics between commercialized and regional natural organic matters. Environ Eng Res. 2024;29. https://doi.org/10.4491/eer.2023.190
2. Phungsai P, Kurisu F, Kasuga I, Furumai H. Changes in dissolved organic matter during water treatment by sequential solid-phase extraction and unknown screening analysis. Chemosphere. 2021;263:128278. https://doi.org/10.1016/j.chemosphere.2020.128278
3. Jang J, Park J, Ahn S, Park K-T, Ha S-Y, Park J, Cho KH. Molecular-level chemical characterization of dissolved organic matter in the ice shelf systems of King George Island, Antarctica. Front Mar Sci. 2020;7. https://doi.org/10.3389/fmars.2020.00339
4. Hong SM, Morgan BJ, Stocker MD, Smith JE, Kim MS, Cho KH, Pachepsky YA. Using machine learning models to estimate Escherichia coli concentration in an irrigation pond from water quality and drone-based RGB imagery data. Water Res. 2024;260:121861. https://doi.org/10.1016/j.watres.2024.121861
5. Uddin MG, Nash S, Rahman A, Olbert AI. A novel approach for estimating and predicting uncertainty in water quality index model using machine learning approaches. Water Res. 2023;229:119422. https://doi.org/10.1016/j.watres.2022.119422
7. Hou F, Liu S, Yin WX, Gan LL, Pang HT, Lv JQ, Liu Y, Wang HC. Machine learning for high-precision simulation of dissolved organic matter in sewer: Overcoming data restrictions with generative adversarial networks. Sci. Total Environ. 2024;947:174469. https://doi.org/10.1016/j.scitotenv.2024.174469
8. Li Z, Liu H, Zhang C, Fu G. Generative adversarial networks for detecting contamination events in water distribution systems using multi-parameter, multi-site water quality monitoring. Environ. Sci. Ecotechnol. 2023;14:100231. https://doi.org/10.1016/j.ese.2022.100231
9. Baek SS, Choi Y, Jeon J, Pyo J, Park J, Cho KH. Replacing the internal standard to estimate micropollutants using deep and machine learning. Water Res. 2021;188:116535. https://doi.org/10.1016/j.watres.2020.116535
10. Kujawinski EB, Behn MD. Automated analysis of electrospray ionization Fourier transform ion cyclotron resonance mass spectra of natural organic matter. Anal. Chem. 2006;78:4363–4373. https://doi.org/10.1021/ac0600306
12. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, Niculae V, Prettenhofer P, Gramfort A, Grobler J, Layton R, Vanderplas J, Joly A, Holt B, Varoquaux G. API design for machine learning software: Experiences from the scikit-learn project. 2013. https://doi.org/10.48550/arXiv.1309.0238
13. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial networks. Commun. ACM. 2020;63:139–144. https://doi.org/10.1145/3422622
Fig. 1
Sampling points
Fig. 2
The overall workflow for data augmentation
Fig. 3
The architecture of generator and discriminator in LSTM-GAN
Fig. 4
Comparison between augmented and measured data for (a) Nakdong River and (b) Yeongsan River samples
Fig. 5
The sum of normalized differences of (a) Nakdong River and (b) Yeongsan River
Fig. 6
Classification accuracy with different augmented samples
Fig. 7
Training time and model size with different augmented samples
Fig. 8
River classification accuracy for (a) Nakdong River and (b) Yeongsan River