STEM Program

Decoding the Environment: Applied Machine Learning and Explainable AI for Environmental Applications

Faculty Advisor: Adjunct Professor, Data Science and Machine Learning, Duke University

Research Program Introduction

In an era defined by data and environmental urgency, the ability to not only predict environmental outcomes but to understand the forces driving them is paramount. This research practicum guides students to the cutting edge of data science, where machine learning (ML) and environmental stewardship converge. 

Students will move beyond the theoretical to engage in hands-on, applied research, addressing critical questions in areas like air and water quality, climate change, and biodiversity. The core of this program is its focus on Explainable AI (XAI). While standard ML models can act as "black boxes", a significant barrier to their use in policy and science, we will use state-of-the-art techniques like SHAP (SHapley Additive exPlanations) to illuminate why our models make specific predictions.

Students will learn to identify a research question, source and critically evaluate a relevant dataset, and develop a predictive model using Python in Google Colab. They will then employ SHAP to uncover the hidden drivers within their data, for example by identifying the most significant contributors to urban air pollution or the key factors determining wildfire risk.
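To give a flavor of the idea behind SHAP, the sketch below computes exact Shapley values for a single prediction using only the Python standard library. It is an illustration under stated assumptions, not the program's actual workflow: the model `toy_pm25_model`, its coefficients, the feature names, and the sample values are all hypothetical, and a single baseline point stands in for the background dataset that SHAP tooling typically averages over. In practice, students would likely rely on an established library such as `shap` with a trained model rather than enumerating feature subsets by hand.

```python
from itertools import combinations
from math import factorial


def toy_pm25_model(features):
    """Hypothetical stand-in for a trained model predicting PM2.5 (ug/m3)."""
    traffic = features["traffic"]
    wind = features["wind_speed"]
    humidity = features["humidity"]
    # Made-up coefficients; the last term is a traffic-wind interaction.
    return 10.0 + 0.5 * traffic - 2.0 * wind + 0.1 * humidity - 0.05 * traffic * wind


def shapley_values(model, instance, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features inside a coalition take their value from `instance`;
    features outside it take their value from `baseline`.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of `name` when added to coalition s.
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical observation and reference point (illustrative values only).
instance = {"traffic": 80.0, "wind_speed": 2.0, "humidity": 60.0}
baseline = {"traffic": 40.0, "wind_speed": 5.0, "humidity": 50.0}

phi = shapley_values(toy_pm25_model, instance, baseline)

# Rank features by the size of their contribution to this prediction.
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.2f}")
```

The contributions sum to the difference between the prediction for the observation and the prediction for the baseline, which is the property that makes Shapley-based explanations easy to interpret. Exact enumeration grows exponentially with the number of features, which is why practical tools use approximations or model-specific shortcuts.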

Program Deliverables

Students will complete a Final Research Report of 8-10 pages, formatted as a professional academic paper. This tangible product will be the capstone of their research, integrating the narrative of their research question, the technical details of their model, and the crucial insights derived from their XAI analysis. The report will include sections for an Abstract, Introduction, Data and Methodology, Results, Explainable AI Insights, Discussion, Conclusion, and a link to a reproducible Google Colab notebook.

Possible Topics for Final Project:

  • Identifying the Key Drivers of Urban Air Pollution: What are the most impactful meteorological and chemical precursors to high PM2.5 levels?

  • Predicting Water Potability: Which chemical properties most strongly influence a water source's classification as non-potable?

  • Explaining Wildfire Occurrence Risk: What are the subtle, interacting factors (e.g., drought index, wind) that contribute most to a fire starting?

  • Modeling the Impact of Climate Change on Local Temperatures: How do national-level factors like CO₂ emissions and renewable energy adoption contribute to predicting local temperature anomalies?

  • Classifying Species Conservation Status: Which factors (e.g., park ecosystem, nativeness) are most associated with a species' endangered status?

  • Linking Air Quality to Public Health Outcomes: Which specific pollutants have the strongest predictive impact on adverse health events like hospital admissions?

  • Forecasting Deforestation Risk: Using satellite-derived features, what land-use characteristics are the most powerful predictors of a forest patch being cleared?

  • Assessing the Factors Behind National CO₂ Emissions: What is the relative importance of population, energy mix, and economic activity in predicting a country's total CO₂ emissions?

  • Or another topic in this subject area that interests you and that your professor approves after discussing it with you.

Program Details

  • Cohort size: 3 to 6 students

  • Workload: Around 4 to 5 hours per week (including class and homework time)

  • Target students: 9th to 12th graders interested in Computer Science, Data Science, Environmental Science, Environmental Studies, Engineering (especially Environmental, Civil, and Systems), Statistics, Applied Mathematics, Public Policy, Economics, or other related areas.

  • Prerequisites

    • Academic: Successful completion of high school Algebra II. Concurrent or prior enrollment in a statistics course (including AP Statistics) is strongly recommended.

    • Technical: Demonstrable experience with programming in at least one structured language (Python strongly preferred), through either formal coursework (e.g., AP Computer Science) or significant personal projects.

    • No prior machine learning experience is required.

  • Schedule: TBD. Meetings will take place for around one hour per week, with a weekly meeting day and time to be determined a few weeks before the start date.