Flood Early Warning Systems using Machine Learning techniques: the case of the Tomebamba catchment in the southern Andes of Ecuador

Flood Early Warning Systems (FEWSs) using Machine Learning (ML) have gained worldwide popularity. However, determining the most efficient ML technique is still a bottleneck. We assessed FEWSs with three river states, No-alert, Pre-alert, and Alert for flooding, for lead times between 1 and 12 hours, using the most common ML techniques: Multi-Layer Perceptron (MLP), Logistic Regression (LR), K-Nearest Neighbors (KNN), Naive Bayes (NB), and Random Forest (RF). The Tomebamba catchment in the tropical Andes of Ecuador was selected as the case study. For all lead times, MLP models achieved the highest performance, followed by LR, with f1-macro (log-loss) scores of 0.82 (0.09) and 0.46 (0.20) for the 1- and 12-hour cases, respectively. The ranking of the remaining ML techniques was highly variable. According to the g-mean, LR models forecast all states correctly and with more stability, whereas MLP models perform better for the Pre-alert and Alert states. Future efforts are recommended to enhance the input data representation and to develop communication applications that raise society's awareness of floods.


Introduction
Flooding is the most common natural hazard and one of the most damaging worldwide [1]-[4]. Recent studies associate the increasing frequency and severity of flood events with changes in land use (e.g., deforestation and urbanization) and climate [2], [5]-[7]. This particularly holds for the tropical Andes region, where complex hydro-meteorological conditions result in the occurrence of intense and patchy rainfall events [8]-[10].
According to the flood generation mechanism, floods can be classified into long- and short-rain floods [11], [12]. A key to building resilience to floods is anticipating the event in a timely manner, so as to gain time for better preparedness. The response time between a rainfall event and its associated flood depends on the catchment properties and may vary from minutes to hours [13]. In this study, special attention is given to flash-floods, i.e., floods that develop within six hours of heavy rainfall, with little or no forecast lead time [14].
Flood anticipation can be achieved through the development of a Flood Early Warning System (FEWS). FEWSs have proven to be cost-efficient solutions for life preservation, damage mitigation, and resilience enhancement [15]-[17]. However, although crucial, flood forecasting remains a major challenge in mountainous regions due to the difficulty of effectively recording the areal distribution of precipitation, a consequence of the sparse density of the monitoring network and the lack of high-tech equipment caused by budget constraints [8], [9].
To date, there is no report of any operational FEWS in the Andean region at scales other than continental [17]-[19]. An alternative attempt in Peru aimed to derive daily maps of potential floods based on the spatially cumulated precipitation of the preceding days [20]. Other endeavors in Ecuador and Bolivia focused on monitoring the runoff in the upper parts of the catchment to predict the likelihood of flood events in the downstream basin area [18], [21]. However, such attempts are unsatisfactory as countermeasures against floods, and especially flash-floods, which require reliable and accurate forecasts with lead times shorter than the response time between the farthest precipitation station and the runoff control point.
There are two paradigms that drive the modeling of the precipitation-runoff response. First, the physically-based paradigm incorporates knowledge of the physical processes through physical process equations [22]. This approach requires extensive ground data and, in consequence, intensive computation, which limits the forecast lead time [23]. Moreover, it is argued that physically based models are inappropriate for real-time or short-term flood forecasting due to the inherent uncertainty of river-catchment dynamics and the over-parametrization of this type of model [24]. The second, data-driven paradigm treats floods as stochastic processes with an occurrence probability distribution derived from historical data. Here, the idea is to exploit relevant input information (e.g., precipitation, past runoff) to find relations to the target variable (i.e., runoff) without requiring knowledge of the underlying physical processes. Among the traditional data-driven approaches, statistical modeling has proven unsuitable for short-term prediction due to lack of accuracy, complexity, limited model robustness, and computational cost [23]. Previous studies encouraged the use of advanced data-driven models, e.g., machine learning (ML), to overcome the aforementioned shortcomings [7], [23], [25], [26]. Particularly during the last decade, ML approaches have gained increasing popularity among hydrologists [23].
Different ML strategies for flood forecasting have been implemented, generating either a quantitative or a qualitative runoff forecast [27]-[33]. Qualitative forecasting consists of classifying floods into distinct categories or river states according to their severity (i.e., runoff magnitude), and using these classes as the basis for flood class prediction [34]. The advantage of developing such a FEWS is the possibility to generate a semaphore-like warning system that is easy to understand by decision-makers and the public (non-hydrologists). The challenge of FEWSs is the selection of the optimal ML technique to obtain reliable and accurate forecasts with sufficient lead time for decision making. To date, the problem has received scant attention in the research literature and, as far as our knowledge extends, no previous work has examined and compared the potential and efficacy of different ML techniques for flood forecasting.
The present study compares the performance of five ML classification techniques for short-rain flood forecasting with special attention to flash floods. ML models were developed for a medium-size mountain catchment, the Tomebamba basin located in the tropical Andes of Ecuador. The ML models were tested with respect to their capacity to forecast three flood warning stages (No-alert, Pre-alert, and Alert) for varying forecast lead times of 1, 4, and 6 hours (flash-floods), but also 8 and 12 hours to further test whether the lead time can be satisfactorily extended without losing models' operational value.

Study area and dataset
The study area comprises the Tomebamba catchment delineated upstream of the Matadero-Sayausí hydrological station of the Tomebamba river (Figure 1), where the river enters the city. The Tomebamba is a tropical mountain catchment located on the southeastern flank of the Western Andean Cordillera, draining to the Amazon River. The drainage area of the catchment is approximately 300 km², spanning from 2800 to 4100 meters above sea level (m a.s.l.). Like many other mountain catchments of the region, it is primarily covered by a páramo ecosystem, which is known for its important water regulation function [8]. The Tomebamba river plays a crucial role as a drinking water source for the city of Cuenca (between 25 and 30% of the demand); other important water users are agricultural and industrial activities. Cuenca, the third-largest city of Ecuador (around 0.6 million inhabitants), is crossed by four rivers that annually flood parts of the city, causing human casualties and significant economic losses.
The local water utility, the Municipal Public Company of Telecommunications, Water, Sewerage and Sanitation of Cuenca (ETAPA-EP), defined three flood alert levels associated with the Matadero-Sayausí station for floods originating in the Tomebamba catchment: i) No-alert, when the measured runoff is less than 30 m³/s, ii) Pre-alert, when runoff is between 30 and 50 m³/s, and iii) Alert, triggered when discharge exceeds 50 m³/s. With these definitions, and as shown in Figure 2, the discharge labels of the No-alert class represent the majority of the data, whereas the Pre-alert and Alert classes comprise the minority yet most dangerous classes.
To develop and operate the forecasting models, we use data of two variables: precipitation in the catchment area and river discharge at a river gauge. For both variables, the available dataset comprises 4 years of concurrent hourly time series, from Jan/2015 to Jan/2019 (Figure 2). Precipitation information was derived from 3 tipping-bucket rain gauges, Toreadora (3955 m a.s.l.), Virgen (3626 m a.s.l.), and Chirimachay (3298 m a.s.l.), installed within the catchment along its altitudinal gradient. For discharge, we used data from the Matadero-Sayausí station (2693 m a.s.l., Figure 1). To develop the ML models, we split the dataset into training and test subsets: the training period ran from 2015 to 2017, whereas 2018 was used as the model testing phase.

Machine Learning (ML) methods for classification of flood alert levels
ML classification algorithms can be grouped in terms of their functionality. According to Mosavi et al. (2018), five of the most popular statistical method groups worldwide are commonly used for short-term flood prediction (extreme runoff):

i. Regression algorithms, which model the relationships between variables (e.g., Logistic Regression, Linear Regression, Multivariate Adaptive Regression Splines, etc.).

ii. Instance-based algorithms, which rely on memory-based learning, representing a decision problem fed with data for training (e.g., K-nearest neighbor, learning vector quantification, locally weighted learning, etc.).

iii. Decision tree algorithms, which progressively divide the whole data set into subsets based on certain feature values until all target variables are grouped in one category (e.g., Classification and regression tree, M5, Random Forest, etc.).

iv. Bayesian algorithms, based on Bayes' theorem of conditional probability (e.g., Naive Bayes, Bayesian network, Gaussian Naïve Bayes, etc.).

v. Artificial neural network algorithms, which relate inputs to outputs through layers of interconnected neurons (e.g., Multi-Layer Perceptron).

Logistic Regression

Logistic Regression (LR) is a discriminative model that models the decision boundary between classes. First, linear regressions are applied to find existing relationships between model features. Thereafter, the (conditional) probability of belonging to a class is obtained using a logistic (Sigmoid) function, which effectively deals with outliers (binary classification). From these probabilities, the LR classifies, with regularization, the dependent variables into one of the created classes. For multiclass problems, all binary classification possibilities are considered, i.e., No-alert vs. Pre-alert, No-alert vs. Alert, and Pre-alert vs. Alert. The solution is the classification with the maximum probability (multinomial LR), given by the softmax function (equation 1), which defines the predicted probability of each class [35]. The calculated probability for each class is positive and normalized across all classes:

σ(z)_i = e^(z_i) / Σ_{j=1}^{K} e^(z_j)    (1)

where z_i is the ith input of the softmax function, corresponding to class i of the K classes.
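As an illustration, the softmax of equation 1 can be sketched in a few lines of Python; the class scores below are hypothetical, not taken from the study:

```python
import numpy as np

def softmax(z):
    """Softmax over the class scores z (equation 1).
    Subtracting max(z) improves numerical stability without
    changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical scores for the three river states
# (No-alert, Pre-alert, Alert):
scores = np.array([2.0, 0.5, -1.0])
probs = softmax(scores)
print(probs)            # positive and normalized: probs.sum() == 1
print(probs.argmax())   # index 0 -> No-alert, the predicted class
```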

K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a non-parametric statistical pattern recognition algorithm for which no theoretical or analytical background exists, only an intuitive statistical procedure (memory-based learning) for classification. KNN classifies unseen data based on a similarity measure such as a distance function (e.g., Euclidean, Manhattan, Chebyshev, Hamming, etc.). The use of multiple neighbors instead of only one is recommended to avoid the wrong delineation of class boundaries caused by noisy features. In the end, the majority vote of the nearest neighbors (see the formulation in [35]) determines the classification decision. The number of nearest neighbors can be optimized to reach a global minimum, avoiding long computation times and the influence of class size. The major advantage of KNN is its simplicity; the drawback is that it is memory-intensive: all training data must be stored and compared whenever new information is to be evaluated.
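A minimal sketch of KNN classification with scikit-learn (the package used later in the study); the toy samples and the choice of three neighbors are purely illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy feature space: rows are (precipitation, past-discharge) samples;
# labels 0/1/2 stand in for No-alert/Pre-alert/Alert (illustrative only).
X_train = np.array([[0.1, 10], [0.2, 12], [3.0, 40], [3.5, 45],
                    [6.0, 70], [6.5, 80]])
y_train = np.array([0, 0, 1, 1, 2, 2])

# n_neighbors > 1 smooths class boundaries against noisy features;
# the distance metric (here Euclidean) is itself a design choice.
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_train, y_train)

print(knn.predict([[3.2, 42]]))  # majority vote of the 3 nearest samples
```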

Random Forest
Random Forest (RF) is a supervised ML algorithm that ensembles a multitude of decorrelated decision trees (DTs) voting for the most popular class (classification). In practice, a DT (a particular model) is a hierarchical analysis based on a set of conditions consecutively applied to a dataset. To assure decorrelation, the RF algorithm applies a bagging technique, growing DTs from different randomly resampled training subsets obtained from the original dataset. Each DT provides an independent output (class) for the phenomenon of interest (i.e., runoff), as opposed to the numerical labels of regression applications. The popularity of RF is due to the possibility to perform random subsampling and bootstrapping, which minimizes biased classification [36]. An extended description of the RF functioning is available in [37], [38].
The predicted class probabilities of an input sample are calculated as the mean predicted class probabilities of the trees in the forest. For a single tree, the class probability is computed as the fraction of samples of the same class in a leaf. However, it is well-known that the calculated training frequencies are not accurate conditional probability estimates due to the high bias and variance of the frequencies [39]. This deficiency can be resolved by controlling the minimum number of samples required at a leaf node, with the objective to induce a smoothing effect, and to obtain statistically reliable probability estimates.
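The leaf-smoothing idea can be sketched with scikit-learn's RandomForestClassifier, whose min_samples_leaf parameter controls the minimum number of samples per leaf; the synthetic data below are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Synthetic imbalanced data standing in for the flood classes.
X = rng.rand(300, 5)
y = (X[:, 0] > 0.8).astype(int)  # minority class roughly 20%

# min_samples_leaf > 1 smooths the per-leaf class frequencies,
# giving more statistically reliable probability estimates.
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                            random_state=0)
rf.fit(X, y)

# predict_proba averages the per-tree class probabilities.
proba = rf.predict_proba(X[:3])
print(proba.shape)   # (3, 2): one probability per class
```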

Naïve Bayes
Naïve Bayes (NB) is a classification method based on Bayes' theorem with the "naive" assumption that there is no dependence between the features in a class, even if such dependence exists [40]. Bayes' theorem can be expressed as:

P(y|X) = P(X|y) P(y) / P(X)    (2)

where P(y|X) is the probability of the hypothesis y happening, given the occurrence of the features X, with X = x_1, x_2, …, x_n. Under the independence assumption, Bayes' theorem can be written as:

P(y|x_1, …, x_n) = P(y) ∏_{i=1}^{n} P(x_i|y) / P(x_1, …, x_n)    (3)

There are different NB classifiers depending on the assumed distribution of P(x_i|y). In this matter, the study of Zhang [40] proved the optimality of NB under the Gaussian distribution even when the assumption of conditional independence is violated (real application cases). For multiclass problems, the outcome of the algorithm is the class with the maximum probability. For the Gaussian NB algorithm, no parameters have to be tuned.
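A minimal Gaussian NB sketch with scikit-learn, on toy data whose classes stand in for the three river states (illustrative values, not the study's data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two features, three classes.
X = np.array([[0.0, 1.0], [0.2, 1.1], [2.0, 3.0],
              [2.1, 3.2], [5.0, 6.0], [5.2, 6.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# Gaussian NB assumes P(x_i | y) is normal per class; it has no
# hyperparameters to tune, as noted in the text.
nb = GaussianNB()
nb.fit(X, y)

# The predicted class is the one with maximum posterior probability.
print(nb.predict([[2.05, 3.1]]))                  # nearest to class 1
print(nb.predict_proba([[2.05, 3.1]]).argmax())   # same answer via argmax
```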

Multi-layer Perceptron
The Multi-Layer Perceptron (MLP) is a class of feedforward artificial neural network (ANN). A perceptron is a linear classifier that separates an input into two categories with a straight line and produces a single outcome; its input is a feature vector multiplied by specific weights and added to a bias. Contrary to the single-layer case, the MLP can approximate non-linear functions using additional, so-called hidden, layers. The probabilities of belonging to each class are calculated through the softmax function. The MLP consists of multiple neurons in fully connected layers. Determining the number of neurons in the layers with a trial-and-error approach remains widely used [41]. Neurons in the first layer correspond to the input data, whereas all other nodes relate inputs to outputs by using linear combinations with certain weights and biases together with an activation function. To measure the performance of the MLP, the logistic loss function is used, with the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method as the optimizer for training the network. A detailed and comprehensive description of ANNs can be found in [42].

Figure 3 schematically depicts the methodology followed in this study. The complete dataset consists, as mentioned before, of precipitation and labeled discharge time series (see Figure 2). The dataset was split into two groups, for training and testing purposes, and training and test feature spaces were composed for each lead time for the tasks of model hyperparameterization and model assessment. This procedure was repeated for each of the studied ML techniques. Finally, the ML methods were ranked for every lead time based on performance metrics and a statistical significance test.
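A minimal MLP sketch with scikit-learn matching the choices named above (logistic loss, L-BFGS optimizer); the hidden-layer sizes and synthetic data are placeholders, not the study's configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(1)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# solver='lbfgs' selects the L-BFGS optimizer named in the text;
# the hidden-layer sizes are placeholders found by trial and error.
mlp = MLPClassifier(hidden_layer_sizes=(10, 10), solver='lbfgs',
                    max_iter=1000, random_state=1)
mlp.fit(X, y)

# predict_proba applies the softmax over the output layer.
print(mlp.predict_proba(X[:2]).sum(axis=1))  # each row sums to 1
```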

Feature space composition
For each lead time, we used single training and test feature spaces for all ML techniques. A feature space is composed of features (predictors) coming from two variables: precipitation and discharge. The process of feature space composition starts by defining a specific number of precipitation and discharge features (present time and past hourly lags) according to statistical analyses relying on Pearson's cross-, auto- and partial-autocorrelation functions [43]. The number of lags from each station was selected by setting a correlation threshold of 0.2 [27].
Similarly, for discharge, we used several features from past hourly time slots of discharge. It is worth noting that the number of discharge features triples, since each discharge feature is replaced by 3 features (one per flood warning class) in a process known as one-hot encoding or binary encoding; each created feature thus takes the value 0 or 1 when the corresponding alarm stage is false or true, respectively. Finally, we performed a feature standardization before the computation stage of the KNN, LR, NB, and MLP algorithms. Standardization subtracts the mean and scales to unit variance, resulting in a distribution with a mean of 0 and a standard deviation of 1.
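The composition steps above (lagged precipitation features, one-hot-encoded discharge states, standardization) can be sketched as follows; the toy series, the two-lag choice, and the column names are illustrative, not the study's actual feature space:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy hourly series standing in for one rain gauge and the
# labeled discharge classes (0=No-alert, 1=Pre-alert, 2=Alert).
df = pd.DataFrame({'precip': [0.0, 0.5, 2.0, 4.0, 1.0, 0.2],
                   'state':  [0,   0,   1,   2,   1,   0]})

n_lags = 2  # chosen here for illustration; the study selects lags
            # from correlation analyses with a 0.2 threshold

# Past hourly lags of precipitation become separate features.
for lag in range(1, n_lags + 1):
    df[f'precip_lag{lag}'] = df['precip'].shift(lag)

# One-hot-encode the past discharge state: each class becomes a
# 0/1 feature, tripling the number of discharge features.
onehot = pd.get_dummies(df['state'].shift(1), prefix='state')
features = pd.concat([df, onehot], axis=1).dropna()

# Standardize: zero mean, unit variance.
X = StandardScaler().fit_transform(
    features[['precip', 'precip_lag1', 'precip_lag2']])
print(np.round(X.mean(axis=0), 6))  # approximately 0 per column
```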

Model hyperparameterization
After the composition of the feature space, the optimal architecture of each ML forecasting model, for each lead time, was set up. The optimal architectures were defined by the combination of hyperparameters that balances accuracy and computational cost. However, finding optimal architectures requires an exhaustive search of all combinations of hyperparameters. To overcome this issue, we relied on a Randomized Grid Search (RGS) with a 10-fold cross-validation scheme. The RGS procedure randomly explores the search space of discretized continuous hyperparameters based on a cross-validation evaluation. Moreover, we selected the f1-macro score (see section 2.3.4) as the objective function.
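A minimal sketch of the RGS procedure using scikit-learn's RandomizedSearchCV with 10-fold cross-validation and the f1-macro objective; the estimator, search space, and data are illustrative stand-ins:

```python
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.6).astype(int)

# Randomized search over a continuous hyperparameter space, scored
# with f1-macro under 10-fold cross-validation, as in the text.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': uniform(0.01, 10)},  # regularization strength
    n_iter=5, cv=10, scoring='f1_macro', random_state=0)
search.fit(X, y)

print(sorted(search.best_params_))  # the tuned hyperparameter names
```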

Principal Component Analysis
ML applications generally require the analysis of high-dimensional and complex data, involving substantial amounts of memory and computational cost. Dimensionality reduction was realized through Principal Component Analysis (PCA), enabling the exclusion of correlated features that do not add information to the model. PCA was applied after feature scaling and normalization.
This method finds the dimensions of maximum variance and reduces the feature space to those dimensions, so that model performance remains as intact as possible compared to performance with the full feature space. However, considering that each ML technique assimilates data differently, we did not fix the number of principal components at a threshold of explained variance (e.g., 80-90%), but performed an exploratory analysis to evaluate its influence on each model. As such, the number of principal components was treated as an additional hyperparameter and optimized for each specific model (lead time and ML technique), with the objective of finding the best possible model for each case.
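Treating the number of principal components as a hyperparameter can be sketched with a scikit-learn Pipeline; the grid search below is a simplified stand-in for the RGS used in the study, and the data and estimator are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.rand(150, 6)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# n_components is treated as one more hyperparameter rather than
# being fixed by an explained-variance threshold, as described above.
pipe = Pipeline([('pca', PCA()),
                 ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe,
                    {'pca__n_components': [2, 3, 4, 5]},
                    cv=5, scoring='f1_macro')
grid.fit(X, y)

print(grid.best_params_)  # the selected number of components
```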
All ML techniques and the RGS procedure were implemented through the scikit-learn package for ML in Python® [44]. Table 1 presents the relevant hyperparameters for each ML technique and their search space for tuning [45]; default values were employed for the hyperparameters not listed in Table 1.

Forecasting hydrological extremes such as floods is an imbalanced classification problem, and becomes even more complex when the interest lies in the minority classes of the data (flood alerts). This is because most ML classification algorithms focus on the minimization of the overall error rate, which is dominated by the majority class [46]. One solution is to resample the class distribution of the data to obtain an equal number of samples per class. In this study, we used another approach, which trains the ML models on the imbalanced data but penalizes mistakes in samples of the minority classes, rather than under-sampling or over-sampling the data. In practice, this implies that for a given efficiency metric (see section 5.1), the overall score is the result of averaging the performance metric of each class multiplied by its corresponding weight factor. The weight factor of each class was calculated as inversely proportional to the class frequency, using equation 4:

w_j = N / (K · n_j)    (4)

where w_j is the weight of class j, N is the total number of observations, K is the number of classes, and n_j is the number of observations in class j. This implies that higher weights are obtained for minority classes.
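The inverse-frequency weighting of equation 4 can be sketched directly; the label frequencies below are hypothetical and merely mimic the dominance of the No-alert class:

```python
import numpy as np

def class_weights(y):
    """Inverse-frequency weights, equation 4: w_j = N / (K * n_j)."""
    classes, counts = np.unique(y, return_counts=True)
    N, K = len(y), len(classes)
    return {c: N / (K * n) for c, n in zip(classes, counts)}

# Hypothetical imbalanced labels (class 0, No-alert, dominates):
y = np.array([0] * 95 + [1] * 4 + [2] * 1)
w = class_weights(y)
print(w)  # minority classes get the higher weights
```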

Performance metrics
The metrics for the performance assessment were derived from the well-known confusion matrix, which is especially suitable for imbalanced datasets and multiclass problems; they are the f1 score, the geometric mean, and the logistic regression loss score [46]-[51]. Since no single metric is adequate on its own, it is suggested to use a compendium of complementary metrics to properly characterize model performance.

F1 score
The f score is a metric that relies on precision and recall, which makes it effective for imbalanced problems. When precision and recall are combined as an equally weighted harmonic mean, the score is called the f1 score, calculated with equation 5:

f1 = 2 · (precision · recall) / (precision + recall)    (5)

where precision and recall are defined as:

precision = TP / (TP + FP)    (6)

recall = TP / (TP + FN)    (7)

where TP stands for True Positives, FP for False Positives, and FN for False Negatives.

The f1 score ranges from 0 to 1, where 1 indicates perfect precision and recall. The advantage of the f1 score over an arithmetic or geometric mean is that it penalizes models most when either precision or recall is low. However, classifying a No-alert flood warning as Alert might have a different impact on decision-making than the opposite error, and this limitation scales up when there is an additional state, e.g., Pre-alert; thus, the interpretation of the f1 score must be taken with care. For multiclass problems, the f1 score is commonly averaged across all classes, and the result, called the f1-macro score, indicates the overall model performance.
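Per-class and macro-averaged f1 scores can be computed with scikit-learn; the labels below are illustrative:

```python
from sklearn.metrics import f1_score

# Hypothetical forecasts for three river states (0/1/2):
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

# Per-class f1 (equation 5) and its unweighted average (f1-macro):
print(f1_score(y_true, y_pred, average=None))     # one score per class
print(f1_score(y_true, y_pred, average='macro'))  # about 0.72
```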

Geometric mean
The geometric mean (g-mean) simultaneously measures the balanced performance of the TP and True Negative (TN) rates. This metric gives equal importance to the classification of both the majority (No-alert) and minority (Pre-alert and Alert) classes. The g-mean is an evaluation measure that can be used to maximize accuracy while balancing TP and TN examples with a good trade-off [48]. It can be calculated using equation 8:

g-mean = √(sensitivity · specificity)    (8)

where sensitivity and specificity are defined by:

sensitivity = TP / (TP + FN)    (9)

specificity = TN / (TN + FP)    (10)

The g-mean ranges from 0 to 1, where low values indicate deficient performance in the classification of the majority class even if the minority classes are correctly classified.
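A minimal sketch of the binary g-mean of equations 8-10, written directly with numpy; the labels are illustrative:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """sqrt(sensitivity * specificity) for a binary problem (eq. 8)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)   # equation 9
    specificity = tn / (tn + fp)   # equation 10
    return np.sqrt(sensitivity * specificity)

# Alert (1) vs No-alert (0), illustrative labels:
print(g_mean([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))  # about 0.816
```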

Logistic regression loss
The logistic regression loss (log-loss) metric measures the performance of a classification model whose output is a probability value between 0 and 1. It accounts for the uncertainty of the forecast based on how much it diverges from the actual label. For multiclass classification, a separate log-loss term is calculated for each class label (per observation), summed, and averaged over all observations:

log-loss = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} y_ij · ln(p_ij)    (11)

where N is the number of samples, K the number of classes, y_ij equals 1 when observation i belongs to class j and 0 otherwise, and p_ij is the predicted probability that observation i belongs to class j. Starting from 0 (best score), the log-loss increases as the predicted probabilities diverge from the actual labels. It punishes large errors more harshly, promoting conservative predictions: for probabilities close to 1, the log-loss decreases slowly, but as the predicted probability of the true class decreases, the log-loss increases rapidly.
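The behavior of the log-loss can be sketched with scikit-learn's log_loss; the probability tables below are illustrative:

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 2]                       # observed classes
p_good = [[0.9, 0.05, 0.05],             # confident, correct forecasts
          [0.05, 0.9, 0.05],
          [0.05, 0.05, 0.9]]
p_bad = [[0.4, 0.3, 0.3],                # hesitant forecasts
         [0.3, 0.4, 0.3],
         [0.3, 0.3, 0.4]]

# The score grows as predicted probabilities diverge from the labels.
print(log_loss(y_true, p_good) < log_loss(y_true, p_bad))  # True
```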

Statistical significance test for comparing ML algorithms
Although we can directly compare performance metrics of ML alternatives and claim to have found the best one based on the score, it is not certain whether the difference in metrics is real or the result of statistical chance. Different statistical frameworks are available allowing to compare the performance of classification models (e.g., a difference of proportions, paired comparison, binomial test, etc.).
Among them, Raschka [52] recommends using the chi-squared test to quantify the likelihood that the observed samples of skill scores come from the same distribution. This assumption is known as the null hypothesis, and the test assesses whether there is a statistically significant difference between the error rates of two models. If the null hypothesis is rejected, it can be concluded that any observed difference in performance metrics is due to a difference between the models and not to statistical chance. In our study, we used the chi-squared test to assess whether the difference in the observed proportions of the contingency tables of a pair of ML algorithms (for a given lead time) is significant.
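Such a test can be sketched with scipy's chi2_contingency on a hypothetical contingency table of correct/incorrect forecasts from two models (the counts are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table over the same test samples:
table = np.array([[820, 30],    # model A: correct, incorrect
                  [760, 90]])   # model B: correct, incorrect

stat, p_value, dof, expected = chi2_contingency(table)
print(p_value < 0.05)  # True -> reject the null hypothesis: the
                       # difference is not due to statistical chance
```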
For the model comparison, we assessed the statistical significance of improvements/degradations for all lead times (training and test subsets) at a significance level of 0.05 (chi-squared test). In all cases, the MLP model was used as the base model against which the other models were compared.

The following sections present the results of the feature space composition and model hyperparameterization for all models, followed by an evaluation and ranking of the performance of the ML techniques.

Feature space composition
Figures 4 and 5 show the results of the discharge and precipitation lag analyses for the 1-hour-ahead flood forecasting model. Figure 4(a) depicts the discharge autocorrelation function (ACF) and the corresponding 95% confidence interval from lag 1 up to 600 hours. We found a significant correlation up to a lag of 280 h (maximum correlation at the first lag); thereafter, the correlation fell within the confidence band. Figure 4(b) presents the discharge partial-autocorrelation function (PACF) and its 95% confidence band from lag 1 to 30 h. We found a significant correlation up to lag 8 h (first lags outside the confidence band). As a result, based on the interpretation of the ACF and PACF analyses, and following Muñoz et al. [27], we decided to include 8 discharge lags (hours) for the case of 1-hour flood forecasting in the Tomebamba catchment. The same procedure was applied for the remaining lead times (i.e., 4, 6, 8, and 12 hours). Table 2 presents the input data composition and the resulting total number of features obtained from the lag analyses for each forecasting model. For instance, for the 1-hour case, the feature space contains 67 features, of which 43 are derived from precipitation (past lags and one present-time feature per station) and 24 from discharge (one-hot encoding).

Model hyperparameterization
The results of the hyperparameterization, including the number of PCA components employed to achieve the best model efficiencies, are presented in Table 3. No evident relation was found between the number of principal components and either the ML technique or the lead time. In fact, for some models the difference in f1-macro score between a low and a high number of principal components was smaller than 0.01. See, for instance, the KNN models, where the optimal number of components decayed significantly for lead times greater than 4 hours: for the 1-hour lead time, 96% of the components were used, whereas for the remaining lead times less than 8% were.
Turning to the evolution of model complexity with lead time (Table 3), more complex ML architectures are needed to forecast at greater lead times. This is underpinned by the fact that, for greater lead times, the corresponding optimal models require stronger regularization (lower values of C) for LR, a greater number of neighbors (n_neighbors) for KNN, more specific trees (lower values of min_samples_split) for RF, and more hidden layers (hidden_layers) for MLP.

Model performance evaluation
As mentioned before, model performances calculated with the f1-score, g-mean, and log-loss score were weighted according to the class frequencies. Table 4 presents the frequency distribution of the complete dataset for the training and test subsets. The dominance of the No-alert class is evident, with more than 95% of the samples in both subsets. With this information, the class weights for the training period were calculated with equation 4. The results of the model performance evaluation for all ML models and lead times (test subset) are summarized in Table 5. For all models, we verified that the differences in performance metrics for a given lead time were due to differences between the ML techniques rather than to statistical chance. As expected, the ability of the ML models to forecast floods decreased for longer lead times. For instance, for 1-hour forecasting, we found a maximum f1-macro score of 0.88 (MLP) for the training and 0.82 (LR) for the test subset, whereas for the 12-hour case the maximum f1-macro score was 0.71 (MLP) for the training and 0.46 (MLP) for the test subset. The extensive hyperparameterization (RGS scheme), powered by 10-fold cross-validation, helped assure the robustness of all ML models and reduced overfitting. We found only small differences between the performance values of the training and test subsets: for all models, the maximum differences were lower than 0.27 for the f1-macro score and 0.19 for the g-mean.
In general, for all lead times, the MLP technique obtained the highest f1-macro score, followed by the LR algorithm. This dominance was confirmed by the ranking of the models according to the log-loss score. The ranking of the remaining models was highly variable and therefore not conclusive. For instance, the KNN models obtained the second-highest score for the training subset, but the lowest for the test subset, especially for longer lead times. This is because KNN is a memory-based algorithm and therefore more sensitive than the remaining ML techniques to the inclusion of information that differs from the training subset; this can be noted in Table 4, where the training and test frequency distributions differ for the Pre-alert and Alert classes.
On the other hand, the g-mean score yielded a different ranking of the methods. We found the highest scores for the LR algorithm, followed by the RF and MLP models. Moreover, the g-mean values were higher than the f1-macro scores for all lead times and subsets. This is because the f1 score relies on the harmonic mean and therefore penalizes a low precision or recall more than a metric based on a geometric or arithmetic mean does. The g-mean results served to identify LR as the most stable method in terms of correctly classifying both the majority (No-alert) and the minority (Pre-alert and Alert) flood warning classes, while the MLP technique could be used to focus on the minority (flood alert) classes.
To extend the last idea, we analyzed the individual f1 scores of each flood warning class. This unveils the ability of the models to forecast the main classes of interest, i.e., Pre-alert and Alert. Figure 6 presents the evolution of the f1-score of each ML algorithm with lead time. We found that, for all ML techniques, the Alert class is clearly the most difficult to forecast when the f1-macro score is selected as the metric for the hyperparameterization task. An additional exercise consisted of choosing the individual f1-score of the Alert class as the hyperparameterization target for all models; although this yielded comparable results for the Alert class, the scores of the Pre-alert class deteriorated strongly, even reaching values near zero. The most interesting aspect of Figure 6 is that the most efficient and stable models across lead times (test subset) were those based on the MLP and LR techniques. It is also evident that all forecasting models lack robustness for the Pre-alert warning class, i.e., there are major differences between the f1-scores of the training and test subsets. An explanation might be that an Alert implies a Pre-alert warning, but not the opposite; this might mislead the learning process, causing overfitting during training and poor performance on unseen data during the test phase.
Moreover, although we added a notion of the class frequency distribution (weights) to the performance evaluation task, for all models the majority class was classified almost perfectly. This is because the No-alert class arises from low-to-medium discharge magnitudes, which simplifies the learning process of the ML techniques since these magnitudes can be related to normal conditions (present time and past lags) of precipitation and discharge.

Discussion
In this study, we developed and evaluated five different FEWSs relying on the most common ML techniques for flood forecasting, for short-term lead times of 1, 4, and 6 hours (flash floods) as well as 8 and 12 hours to assess the models' operational value at longer lead times. Historical runoff data were used to define and label the three flood warning classes to be forecasted (No-alert, Pre-alert, and Alert). We constructed the feature space of the models based on statistical analyses of the precipitation and discharge data, followed by a PCA analysis embedded in the hyperparameterization.
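As an illustration of the labeling step, the sketch below maps discharge values to the three warning classes; the threshold values are hypothetical placeholders, not those derived from the Tomebamba record:

```python
# Hypothetical discharge thresholds (m3/s) separating the warning
# classes; the actual values were derived from the historical
# runoff record of the Tomebamba catchment.
PRE_ALERT_Q = 40.0
ALERT_Q = 70.0

def label_warning(discharge_m3s: float) -> str:
    """Map a discharge value to one of the three flood warning classes."""
    if discharge_m3s >= ALERT_Q:
        return "Alert"
    if discharge_m3s >= PRE_ALERT_Q:
        return "Pre-alert"
    return "No-alert"

labels = [label_warning(q) for q in (12.5, 48.0, 95.3)]
print(labels)  # ['No-alert', 'Pre-alert', 'Alert']
```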
This aimed to better exploit the learning algorithm of each ML technique. For model assessment, we proposed an integral scheme based on the f1-score, the geometric mean, and the log-loss score to deal with the imbalanced and multiclass character of the data. Finally, the assessment was complemented with a statistical analysis to provide a performance ranking of the ML techniques. For all lead times, the models based on the LR, RF, and MLP techniques produced the best forecasts for both the majority and minority classes (g-mean). The two most suitable models for the dangerous warning classes (Pre-alert and Alert) were MLP and LR (f1 and log-loss scores). This finding has important implications for developing FEWSs, since real-time applications must be capable of dealing with both the majority and minority classes. It can therefore be suggested that the most appropriate forecasting models are based on the MLP technique.
The evolution of model performance across lead times suggests that the models are acceptable for lead times up to 6 hours, i.e., suitable for flash-flood applications in the Tomebamba catchment. For lead times greater than 6 hours, we found a strong decay in model performance; in other words, the 8- and 12-hour forecasting models have limited operational value. This is because, in the absence of rainfall forecasts, the assumption of future rain is based solely on runoff measurements at past and present times. The resulting forecasts are not accurate enough for horizons beyond the concentration time of the catchment, which was estimated at between 2 and 6 hours according to the equations of Kirpich, Giandotti, Ven Te Chow, and Temez; a summary of these equations can be found in Almeida et al. [53]. This adds a further performance decay for the 8- and 12-hour cases on top of the modeling error.
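For reference, the Kirpich equation among those cited is straightforward to evaluate; the sketch below uses illustrative channel geometry, not the actual Tomebamba parameters:

```python
def kirpich_tc_hours(length_m: float, slope: float) -> float:
    """Kirpich (1940) concentration time.
    length_m: main channel length in metres; slope: mean channel
    slope (m/m). The formula yields minutes; we return hours."""
    tc_min = 0.0195 * length_m**0.77 * slope**-0.385
    return tc_min / 60.0

# Illustrative channel geometry (not the actual Tomebamba values):
# a 35 km channel with a 4 % mean slope
print(f"tc = {kirpich_tc_hours(35_000, 0.04):.1f} h")
```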
The study of Furquim et al. [28] is comparable. These authors analyzed the performance of different ML classification algorithms for flash-flood nowcasting (3 hours) in a river located in an urban area of Brazil. They found that models based on neural networks and decision trees outperformed those based on the Naive Bayes technique. However, that study only evaluated the percentage of correctly classified instances, which is a simplistic evaluation. We therefore recommend a more integral assessment of model performance, like the one in the current study, which better supports decision making.
Other studies on quantitative forecasting revealed that neural-network-based models usually outperform the remaining techniques considered in our study [29]-[31]. Nevertheless, in certain cases, computationally less expensive techniques produce comparable results, as in [33]; this is also the case in our short-rain and flash-flood classification problem.
As a further step, we propose the development of ensemble models to improve on the performance of the individual models. This can be accomplished by combining the outcomes of the ML models with weights obtained, for instance, from the log-loss scores. Another alternative that is becoming popular is the construction of hybrid models, combining ML algorithms for more accurate and efficient models [23], [32], [33]. Moreover, as stated by Solomatine and Xue [33], inaccuracies in flood forecasting are mainly due to data-related problems. In this regard, Muñoz et al. [9] reported a deficiency in precipitation-driven models due to rainfall heterogeneity in mountainous areas, where orographic rainfall formation occurs. In most cases, rainfall events are only partially captured by point measurements, and even entire storms can be missed.
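A minimal sketch of such a log-loss-weighted ensemble (soft voting over the class probabilities of hypothetical fitted models, with inverse validation log-loss as the weight; all numbers are illustrative):

```python
import numpy as np

def weighted_ensemble(probas: list, loglosses: list) -> np.ndarray:
    """Blend class-probability forecasts, weighting each model by the
    inverse of its validation log-loss (lower log-loss, larger weight)."""
    weights = 1.0 / np.asarray(loglosses)
    weights /= weights.sum()
    blended = sum(w * p for w, p in zip(weights, probas))
    return blended.argmax(axis=1)  # predicted warning class per sample

# Toy example: two models, three samples, three warning classes
p_mlp = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5]])
p_lr  = np.array([[0.6, 0.3, 0.1], [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])
print(weighted_ensemble([p_mlp, p_lr], [0.09, 0.20]))
```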
In general, precipitation-runoff models will at a certain point reach an effectiveness threshold that cannot be exceeded without incorporating new types of data such as soil moisture [54], [55]. In humid areas, the rainfall-runoff relation also depends on other variables such as evapotranspiration, soil moisture, and land use, which lead to significant spatial variations in water storage. However, these variables are difficult to measure or estimate.

Conclusions
The current study set out to develop and integrally evaluate five of the most common ML classification techniques for short-rain flood forecasting, with special attention to flash floods. The developed models forecast three flood warning classes, No-alert, Pre-alert, and Alert, for the Tomebamba catchment in the tropical Andes of Ecuador.
From the results, the following conclusions can be drawn:
- Results related to the model comparison are statistically significant. This is important because such testing is not usually performed in other studies, and it validates the performance comparison and ranking hereby presented.
- For all lead times, the most suitable models for flood forecasting are based on the MLP technique, followed by LR. From the integral evaluation (i.e., several performance metrics), we suggest LR models as the most efficient and stable option for classifying both the majority (No-alert) and the minority (Pre-alert and Alert) classes, whereas we recommend MLP when the interest lies in the minority classes.
- The forecasting models we developed are robust: differences in the averaged f1, g-mean, and log-loss scores between training and test are consistent across all models. However, we limit the utility of the models to flash-flood applications (lead times up to 6 hours). For longer lead times, we encourage improving the representation of precipitation, and even forecasting this variable for lead times longer than the concentration time of the catchment.
- A more detailed model assessment (individual f1 scores) unveiled the difficulty of forecasting the Pre-alert and Alert flood warnings. This was evidenced when the hyperparameterization was driven by the optimization of the Alert-class forecast, which nevertheless did not improve the model performance for this specific class.
This study can be extended with a deeper exploration of the effect of input data composition, precipitation forecasting, and feature engineering strategies for both the MLP and LR techniques. Feature engineering pursues data representation strategies that could, for example, provide spatial and temporal information on the precipitation in the study area. This can be done by spatially discretizing precipitation in the catchment with the use of remotely sensed imagery. With this additional knowledge, it would be possible to improve the performance of the models hereby developed at longer lead times.
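As a simple starting point for such feature engineering, lagged predictors can be built with pandas; the column names and lag depth below are illustrative, not the study's exact configuration:

```python
import pandas as pd

def make_lag_features(df: pd.DataFrame, lags: int) -> pd.DataFrame:
    """Append lagged copies of every column (t-1 ... t-lags) and drop
    the leading rows that lack a complete lag history."""
    out = df.copy()
    for col in df.columns:
        for lag in range(1, lags + 1):
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    return out.dropna()

# Toy hourly record (hypothetical values)
hourly = pd.DataFrame({"precip_mm": [0.0, 1.2, 3.4, 0.5],
                       "discharge_m3s": [20.0, 21.5, 30.2, 28.7]})
features = make_lag_features(hourly, lags=2)
print(features.columns.tolist())
```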
We recommend that future efforts focus on applying the methodology and assessment framework hereby proposed to other tropical Andean catchments, and/or on benchmarking the results obtained in this study against the outputs of physically based forecasting models, which was not possible here due to a lack of data.
Finally, for FEWSs, the effectiveness of the models is strongly linked to the speed of communication to the public after a flood warning is triggered. Therefore, future efforts should focus on the development of a web portal and/or mobile application as tools to boost the preparedness of society against the floods that currently threaten people's lives, possessions, and environment in Cuenca and other comparable tropical Andean cities.

Funding: This research was funded by the project "Desarrollo de modelos para pronóstico hidrológico a partir de datos de radar meteorológico en cuencas de montaña", funded by the Research Office of the University of Cuenca (DIUC) and the Empresa Pública Municipal de Telecomunicaciones, Agua Potable, Alcantarillado y Saneamiento de Cuenca (ETAPA-EP). Our thanks go to these institutions for their generous funding.
Data Availability Statement: All data, models, and code that support the findings of this study are available from the corresponding author upon reasonable request.