Blue Whale Migration Pattern Prediction - 11/2023

Introduction

Climate change significantly threatens marine life. This threat is particularly evident in cetaceans, including whales, dolphins, and porpoises, which, like other animals, face challenges due to rising sea surface temperatures. These environmental changes disrupt their distribution, habitats, and migration patterns (van Weelden et al., 2021). Understanding and predicting the impact on cetacean migrations is crucial for practical conservation efforts and further research in marine biology.

This study aims to investigate the hypothesis that significant shifts in the migration patterns of blue whales (Balaenoptera musculus) can be accurately predicted through the application of advanced machine learning algorithms, analyzing historical data in conjunction with sea surface temperature changes due to climate change.

This study employed three models: a linear regression model, a random forest model, and an XGBoost model. These models were coupled with multi-output regression to enhance prediction accuracy.

Data And Visualisation

The data used in this research comes from the OBIS-SEAMAP (Ocean Biodiversity Information System Spatial Ecological Analysis of Mega Vertebrate Populations) database made possible by the Marine Geospatial Ecology Lab at Duke University.

For this research, a subset of the total data was used, focusing on sightings from 1990 to the present and totaling 12,073 data points out of approximately 17,000. This selection was intended to make future predictions as accurate as possible by focusing on recent trends rather than a century's worth of data.

Data exploration and visualizations were done as a first step to understand the data better. A global map with all recorded sightings served as a macroscopic view of the data's distribution in space and time (Figure 1). To delve deeper into migration patterns, a Kernel Density Estimation (KDE) plot was generated, highlighting the regions with the highest frequency of sightings in all the data (Figure 2).

Figure 1: Blue Whale Sightings Map From 1903 Till 2023

Figure 2: Blue Whale Sightings Kernel Density Estimation Map From 1903 Till 2023

From both of these plots and other initial data exploration, we can draw some conclusions that will help with the predictions. Firstly, the Kernel Density Estimation map highlights the areas with the highest volume of sightings. When cross-referenced with the scatter plot, we also see that these areas persist throughout the years, implying that there might be underlying migration patterns.

Then, to better understand the data I wanted to use for my project, I generated another Kernel Density Estimation plot for the range that I would be using, in order to find more recent patterns (Figure 3). The results differed from the full-range plot, leading me to believe that a change in pattern was already under way.

Figure 3: Blue Whale Sightings Kernel Density Estimation Map From 1990 Till 2023
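
A density surface like those in Figures 2 and 3 can be approximated with a Gaussian kernel density estimate over the sighting coordinates. The following is only a minimal sketch, not the code used in this study: the coordinates are synthetic, the function name gaussian_kde_grid is hypothetical, and the bandwidth and grid resolution are illustrative assumptions. It also treats degrees as planar units, which ignores the Earth's curvature.

```python
import numpy as np

def gaussian_kde_grid(lons, lats, grid_lon, grid_lat, bandwidth=2.0):
    """Evaluate a 2D Gaussian KDE (bandwidth in degrees) on a lon/lat grid."""
    lons = np.asarray(lons, dtype=float)
    lats = np.asarray(lats, dtype=float)
    glon, glat = np.meshgrid(grid_lon, grid_lat)  # each has shape (n_lat, n_lon)
    # Squared planar distance from every grid cell to every sighting.
    d2 = (glon[..., None] - lons) ** 2 + (glat[..., None] - lats) ** 2
    kernel = np.exp(-d2 / (2.0 * bandwidth ** 2))
    # Normalise so the surface integrates to roughly 1 over the plane.
    norm = 2.0 * np.pi * bandwidth ** 2 * lons.size
    return kernel.sum(axis=-1) / norm

# Synthetic sightings clustered around two made-up hotspots (assumption).
rng = np.random.default_rng(0)
lons = np.concatenate([rng.normal(-120, 1.5, 200), rng.normal(80, 1.5, 50)])
lats = np.concatenate([rng.normal(34, 1.5, 200), rng.normal(6, 1.5, 50)])

grid_lon = np.linspace(-180, 180, 181)  # 2-degree cells
grid_lat = np.linspace(-90, 90, 91)
density = gaussian_kde_grid(lons, lats, grid_lon, grid_lat)

# Locate the grid cell with the highest estimated density.
i_lat, i_lon = np.unravel_index(np.argmax(density), density.shape)
peak_lon, peak_lat = grid_lon[i_lon], grid_lat[i_lat]
```

Running the same function on the full record and on the 1990 onward subset, then comparing where the peaks fall, is one way to make the difference between two density maps concrete.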

Subsequently, I utilized sea surface temperature data from the National Oceanic and Atmospheric Administration (NOAA) archives. An algorithm was developed to filter through NOAA's accessible dataset, aligning it with the closest coordinates in the whale sighting records. This feature engineering process significantly enhanced the ability to discover more distinct patterns, particularly the impacts of climate change.
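
The matching step described above can be sketched as a nearest-neighbour lookup by great-circle distance. This is an illustrative assumption about how such an algorithm might work, not the implementation used in this study: the record layout, the field names (lat, lon, sst_c), and the sample values are all hypothetical, and the sketch ignores the time dimension entirely.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))

def attach_nearest_sst(sightings, sst_cells):
    """Return copies of the sightings with the SST of the nearest cell added."""
    enriched = []
    for s in sightings:
        nearest = min(
            sst_cells,
            key=lambda c: haversine_km(s["lat"], s["lon"], c["lat"], c["lon"]),
        )
        enriched.append({**s, "sst_c": nearest["sst_c"]})
    return enriched

# Hypothetical SST grid cells and sightings (made-up values).
sst_cells = [
    {"lat": 34.0, "lon": -120.0, "sst_c": 16.2},
    {"lat": 36.0, "lon": -122.0, "sst_c": 14.1},
    {"lat": 6.0, "lon": 80.0, "sst_c": 28.7},
]
sightings = [
    {"lat": 34.3, "lon": -120.4, "year": 2019},
    {"lat": 5.8, "lon": 80.5, "year": 2021},
]
enriched = attach_nearest_sst(sightings, sst_cells)
```

The brute-force search compares every sighting with every cell, which is fine for a sketch but would likely need a spatial index for a full global SST grid.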

Methodology

This study encompasses a step-by-step, multi-model approach, incorporating a baseline linear regression model, a random forest model, and an XGBoost model, all equipped with multi-output regressors. By gradually transitioning from more straightforward to more intricate models, this methodology enables a thorough analysis of the data, aiming to improve the precision and dependability of the predictions.

In selecting the appropriate AI methodologies for this study, we focused on machine learning models known for their robustness and adaptability in analyzing ecological data. Linear regression was chosen for its simplicity and interpretability as a baseline for understanding basic trends. The random forest model, an ensemble learning method, is particularly effective in handling the complexities of ecological datasets, offering insights into non-linear relationships and interactions. XGBoost, known for its efficiency on large datasets, was selected for its ability to manage the multifaceted nature of our data, including non-linear trends and high-dimensional spaces. Together, these methods provide a comprehensive approach to deciphering complex patterns in blue whale migration data, making them suitable for addressing the challenges presented by climate change impacts on marine ecosystems.

Linear Regression Model

The linear regression model, chosen for its simplicity and interpretability, was the first step in the analysis. It provided a baseline understanding of the historical whale sighting data by modeling the relationship between time, sighting frequencies, and sea surface temperature. This approach established a fundamental trend analysis and set the stage for more complex models.

Random Forest Model

Building upon linear regression, the Random Forest model was employed to handle complex, non-linear ecological data relationships. As an ensemble of decision trees, it offers a detailed analysis by averaging multiple deep decision trees, which reduces the risk of overfitting while capturing subtle patterns and interactions among variables. This model proved valuable in examining the factors influencing whale migration, such as sea surface temperature.

In the Random Forest model, the parameters n_estimators=100 and max_depth=None were used; they are stated here so that the configuration can be reproduced. The choice of n_estimators=100 establishes a forest of 100 trees, striking a balance between model accuracy and computational efficiency. The max_depth=None parameter allows the trees to grow to their full depth, enabling the model to capture more complex patterns.
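
As a rough illustration of this configuration, the sketch below fits scikit-learn's RandomForestRegressor with the two stated parameters. The data are synthetic, the feature and target meanings are hypothetical, and random_state=0 is an assumption added here so that repeated runs give the same forest; the original text does not state a seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Synthetic features (assumption): sighting year and sea surface temperature.
X = np.column_stack([rng.integers(1990, 2024, 300), rng.uniform(5, 30, 300)])
# Synthetic target (assumption): a latitude-like value that falls as SST rises.
y = 60.0 - 1.5 * X[:, 1] + rng.normal(0, 1.0, 300)

forest = RandomForestRegressor(
    n_estimators=100,  # number of trees, as stated in the text
    max_depth=None,    # no depth limit: trees grow until they can no longer split
    random_state=0,    # assumption: fixed seed for repeatable runs
)
forest.fit(X, y)

n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
```

Inspecting deepest shows how far unrestricted trees grow on even a small dataset, which is the flexibility that max_depth=None provides.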

XGBoost Model

Finally, the XGBoost model was utilized for its exceptional performance and efficiency in handling large and complex datasets. Chosen for its advanced capabilities in handling diverse features and superior predictive accuracy, XGBoost was instrumental in refining the predictions of whale migration patterns. Its efficiency in processing complex, non-linear data, robustness against overfitting, feature importance analysis, and overall performance and flexibility make it an excellent choice for predicting blue whale migration patterns in the face of climate change. This project's choice of multi-output regressors was crucial due to the multidimensional nature of whale migration patterns and their predictions.

In the XGBoost model for multi-output regression, the objective was set with objective='reg:squarederror'. This objective suits regression tasks because it minimizes the squared error between predicted and actual values. Squaring penalizes larger errors disproportionately, so the model concentrates on reducing the biggest deviations, which makes the objective effective for predicting continuous variables such as those describing migration patterns.
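
To make the effect of a squared-error objective concrete, the sketch below computes the per-sample loss together with its first and second derivatives with respect to the prediction, which are the quantities a gradient boosting method works with. It is a plain-Python illustration with a hypothetical function name and made-up values, not XGBoost's own code, and the 0.5 scaling is a convention chosen here for clean derivatives.

```python
def squared_error_terms(y_true, y_pred):
    """Per-sample loss, gradient, and hessian of 0.5 * (pred - true) ** 2."""
    losses = [0.5 * (p - t) ** 2 for t, p in zip(y_true, y_pred)]
    grads = [p - t for t, p in zip(y_true, y_pred)]  # first derivative w.r.t. pred
    hess = [1.0 for _ in y_true]                     # second derivative is constant
    return losses, grads, hess

y_true = [10.0, 10.0, 10.0]
y_pred = [11.0, 12.0, 20.0]  # errors of 1, 2, and 10 units
losses, grads, hess = squared_error_terms(y_true, y_pred)

# An error 10 times larger produces a loss 100 times larger.
ratio = losses[2] / losses[0]
```

The gradient grows linearly with the error, so the samples with the largest deviations pull hardest on each boosting step.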

Multi-Output Regressor

Whale migration is described by several dependent variables at once. Multi-output regression models can predict these variables together, providing a holistic view of migration patterns. This approach allows multiple aspects of whale migration, such as location, timing, and density, to be modeled in a single workflow, which makes the analysis more efficient and gives a fuller picture of how different factors collectively influence whale migration patterns.
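
A minimal sketch of the wrapper, assuming scikit-learn's MultiOutputRegressor around a LinearRegression base model, is shown below. The two targets (a latitude-like and a longitude-like value), the features, and all numbers are synthetic assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(1)
# Synthetic features (assumption): year and sea surface temperature.
X = np.column_stack([rng.uniform(1990, 2023, 200), rng.uniform(5, 30, 200)])
# Two synthetic targets (assumption): latitude-like and longitude-like values.
Y = np.column_stack([
    60.0 - 1.5 * X[:, 1] + rng.normal(0, 0.5, 200),
    -150.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, 200),
])

model = MultiOutputRegressor(LinearRegression())
model.fit(X, Y)

# One row in, one value per target out.
pred = model.predict(np.array([[2024.0, 18.0]]))
n_fitted = len(model.estimators_)
```

MultiOutputRegressor fits one independent copy of the base estimator per target, so it produces all outputs in one call but does not itself model correlations between the targets.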

Results

In this study, I used the R² score, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) to evaluate the performance of the regression models. The R² score measures how well the model replicates the observed outcomes, expressed as the proportion of the total variation in outcomes that the model explains. MSE is the average squared difference between the predicted and actual values; RMSE is its square root, which is easier to interpret because it is in the same units as the response variable. MAE is the average absolute error, disregarding the direction of the error. Together, these metrics give a comprehensive picture of the model's accuracy and error magnitude, which is essential for the reliability of ecological predictions.
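
The four metrics can be written out directly from their definitions. The sketch below is a plain-Python illustration on made-up values, with a hypothetical function name; it also shows how R² becomes negative when a model's predictions are worse than simply predicting the mean of the observed values.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R², MSE, RMSE, and MAE for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

y_true = [1.0, 2.0, 3.0, 4.0]

# Predicting the mean everywhere gives R² = 0 by construction.
mean_baseline = regression_metrics(y_true, [2.5] * 4)

# Predictions that run against the trend give R² = -3 here.
poor_model = regression_metrics(y_true, [4.0, 3.0, 2.0, 1.0])
```

Because R² has no lower bound, strongly negative scores indicate predictions far from the observed values relative to the spread of those values.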

Model              | Average R² score | Test MSE   | Test RMSE | Test MAE
Linear Regression  | -139.19          | 408566.63  | 639.19    | 361.16
Random Forest      | -92.35           | 701792.79  | 837.73    | 353.28
XGBoost            | 910.1            | 917403.1   | 633.04    | 452.62

The results from the linear regression model reveal significant limitations. The strongly negative R² score indicates that the linear model fails to account for the complexities inherent in the ecological data. Such a score means that the model performs worse than a simple horizontal line representing the mean of the target values, implying a lack of fit to the data. This inadequacy is further reinforced by the high values of Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE), indicating substantial deviations between the model’s predictions and the actual data points.

The results from the Random Forest model do not show a clear improvement over the Linear Regression model. An average R² score of -92.35 implies that the Random Forest model is less accurate than a simple horizontal line representing the mean of the dependent variable, indicating a failure to capture the underlying patterns effectively. Its Test Mean Squared Error (701,792.79) and Root Mean Squared Error (837.73) are higher than those of the Linear Regression model (408,566.63 and 639.19), although its R² is less negative and its Mean Absolute Error is slightly lower (353.28 versus 361.16). Neither model comes close to an R² of 1, so neither shows reliable predictive capability in this context.

The results of the XGBoost model show similar outcomes. Notably, the negative R² scores underscore a significant deviation of the model's predictions from the actual data, indicating that the model may not adequately capture the underlying trends and patterns essential for accurate predictions. Furthermore, the high values of Mean Squared Error (917,403.0915), Root Mean Squared Error (633.0458), and Mean Absolute Error (452.6239) reinforce this perspective, highlighting a considerable average error in the predictions.

This critical analysis reflects the complexities inherent in ecological data modeling. It underscores the necessity for continuous refinement of predictive models to enhance their accuracy and reliability, especially in the context of climate change and its impact on marine life.

Discussion

Figure 4: Kernel Density Estimations For Years 2020, 2021, 2022 and 2023 For Comparison

Figure 5: Model Estimations For 2024

Figure 6: Model Estimations For 2025

The Linear Regression predictions for 2024 and 2025 show some degree of variability, with specific areas indicating higher densities of blue whale sightings. The model, known for its simplicity, may not capture complex non-linear patterns well, but it does provide a baseline understanding. The relatively conservative range of predictions suggests that linear regression may not fully account for the more subtle or complex interactions between whales and their changing environment.

The Random Forest predictions display a broader range of values and more distinct geographical patterns, likely due to its ability to capture non-linear relationships. The variability in predictions from 2024 to 2025 suggests that the Random Forest model is sensitive to the changes in the underlying data, potentially reflecting a better grasp of the ecological dynamics at play. However, the negative R² score implies that the model, despite its complexity, does not necessarily align well with the observed data.

For 2024, the XGBoost predictions are somewhat similar to those of the Linear Regression model, while the 2025 predictions show a drastic change, indicating a potential shift in predicted whale densities or migration patterns. This drastic change could suggest that the model is either highly sensitive to the input features or capturing a significant underlying trend that the other models are not.

The potential widespread application of our AI-driven approach carries both positive and negative implications. Positively, it could significantly advance marine conservation efforts, enabling more precise predictions of whale migration patterns and informing better strategies to protect these majestic creatures amidst changing environmental conditions. However, there are also concerns to be considered. Over-reliance on predictive models might lead to overlooking unmodeled factors, such as unforeseen ecological changes or human activities. Additionally, policy decisions based solely on model predictions could inadvertently result in misallocation of conservation resources or failure to address other critical factors in marine ecosystem health. Therefore, while our approach offers promising advancements, it must be integrated thoughtfully with other ecological insights and conservation strategies.

Conclusion

In the context of marine conservation and understanding whale behavior under the impact of climate change, the models present varied interpretations of potential future migration patterns. The negative R² scores across all models indicate challenges in predicting complex ecological phenomena with high accuracy. However, the differing predictions highlight the importance of considering multiple modeling approaches to capture the full spectrum of possible outcomes. The variation in predictions year over year, especially the drastic changes suggested by the XGBoost model for 2025, may reflect the models' sensitivity to environmental variables, such as sea surface temperatures, which are known to be affected by climate change. The variability and potential extremity of the XGBoost predictions underscore the urgency of addressing climate change to preserve marine habitats.

From a conservation perspective, these models, despite their current limitations as indicated by the negative R² scores, are valuable in informing conservation efforts. They allow for scenario planning and highlight the need for robust, adaptable conservation strategies to accommodate various possible conditions.

The study emphasizes the complexity of predicting blue whale migrations and the impacts of climate change. While the models used in this study provide a starting point, the results suggest a need for further model refinement. Incorporating additional variables, increasing model complexity, or utilizing different modeling techniques could improve predictive performance. Moreover, ongoing model validation with new data and collaboration with marine biologists will be crucial to enhancing the accuracy and reliability of predictive models used in marine conservation.

References

  1. Van Weelden, M., Reijnders, P. J., & Van der Hiele, T. (2021). Climate change impact on cetaceans: A review. Environmental Research Letters, 16(9), 094005. https://doi.org/10.1088/1748-9326/ac1e62

  2. OBIS-SEAMAP. (n.d.). Ocean Biodiversity Information System Spatial Ecological Analysis of Mega Vertebrate Populations. Retrieved from https://seamap.env.duke.edu/

  3. NOAA National Centers for Environmental Information. (n.d.). Sea Surface Temperature (SST). Retrieved from https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00516

  4. Analytics Vidhya. (n.d.). MAE, MSE, RMSE, Coefficient of Determination, Adjusted R Squared — Which Metric is Better? Medium. Retrieved from https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e

  5. Scikit-learn developers. (n.d.). sklearn.linear_model.LinearRegression. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

  6. Scikit-learn developers. (n.d.). sklearn.ensemble.RandomForestRegressor. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

  7. XGBoost developers. (n.d.). XGBoost Documentation. Retrieved from https://xgboost.readthedocs.io/en/stable/get_started.html

  8. Scikit-learn developers. (n.d.). sklearn.multioutput.MultiOutputRegressor. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
