Example Projects

All projects and faculty mentors are selected based on alignment with the program goals, potential for academic or industry impact, and committed resources to support student success within the summer timeframe and beyond. Each project is designed to be challenging yet accessible for undergraduates. REU students will work with curated datasets, starter code, and pre-trained models, focusing on building and interpreting embeddings, benchmarking against baselines, and producing reproducible analyses, while graduate mentors handle more advanced technical tasks. This tiered structure ensures authentic research contributions and tangible outcomes such as models, visualizations, reports, posters, and publications.

P1: Embedding-Based Machine Learning for Wildfire Risk Prediction (Dr. Ting Xiao)

Machine learning models powered by multi-source embeddings will be developed to predict wildfire risk and spread potential across geographic regions. Students will integrate diverse public datasets, including satellite imagery, historical wildfire records, weather patterns, vegetation indices, and topographic data, into unified embedding representations of fire-prone areas. These embeddings will be used to train and evaluate models for short- and medium-term wildfire risk, with performance compared to traditional feature-based approaches. To ensure feasibility for undergraduates, curated datasets, scaffolding code, and step-by-step tutorials will be provided, enabling students to focus on analysis and interpretation rather than complex data engineering. Graduate mentors will guide students through model pipelines while undergraduates contribute to benchmarking and interpreting results.

P2: Transforming Human Connection Through Intelligent Cooperation Embeddings (Dr. Junhua Ding)

Vector embeddings derived from dialogue data will be used to detect and enhance patterns of human cooperation in conversation. Students will generate cooperation-focused embeddings by fine-tuning an open-source LLM (e.g., Llama 4) on dialogue datasets such as DailyDialog and EmpatheticDialogues, then evaluate similarity metrics to distinguish cooperative from uncooperative exchanges. Building on these embeddings, they will design small-scale Retrieval-Augmented Generation (RAG) agents that provide theory-informed feedback and track skill development over time. Starter code and pre-processed datasets will lower technical barriers, while graduate mentors provide support for model fine-tuning, allowing undergraduates to focus on designing evaluation experiments and analyzing outcomes.
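
As one illustration of the similarity-metric step, the sketch below scores a new exchange against the average embedding of each class using cosine similarity. It is a minimal sketch under stated assumptions, not the project's method: the three-dimensional vectors are made up for illustration (real embeddings would come from the fine-tuned model), and the function names and the nearest-centroid rule are hypothetical choices.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(embedding: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is most similar to the embedding."""
    return max(
        centroids,
        key=lambda label: cosine_similarity(embedding, centroids[label]),
    )


# Hypothetical embeddings of labeled dialogue exchanges (illustrative values only).
labeled = {
    "cooperative": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
    "uncooperative": [[0.1, 0.9, 0.3], [0.2, 0.8, 0.4]],
}
centroids = {label: centroid(vecs) for label, vecs in labeled.items()}

new_exchange = [0.7, 0.3, 0.2]  # made-up embedding of an unlabeled exchange
predicted = classify(new_exchange, centroids)
print(predicted)
```

Other metrics (for example Euclidean distance) could be swapped in for `cosine_similarity` and compared in the same harness.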

P3: Multimodal Large Language Models for Specialized Healthcare Applications (Dr. Mark Albert)

Multimodal large language models (MLLMs) that integrate text, images, and video into shared embedding spaces will be applied and adapted for specialized healthcare tasks. Students will begin by evaluating pre-trained MLLMs on real-world needs from active clinical collaborations, such as generating embeddings from tongue ultrasound images for speech pathology assessment, creating video-based functional embeddings for scoring arthrogryposis multiplex congenita, and developing embedding-powered decision-support tools for surgeons using clinical data and visual reports. Undergraduates will be provided with curated datasets and simplified starter notebooks to ensure accessibility, while graduate mentors manage technical setup. Students will focus on comparing embedding-based approaches with baselines and preparing interpretive visualizations of results.

P4: Stock Price Movement Prediction Using Embeddings (Dr. Zinat Alam)

Students will apply embedding techniques to historical stock market data to capture patterns in price movements and trading activity. They will collect and clean publicly available datasets, compute technical indicators such as Moving Average (MA), Relative Strength Index (RSI), and Moving Average Convergence Divergence (MACD), and integrate these features into embedding vectors for short-term (daily or weekly) price movement prediction. Using accessible machine learning algorithms, including logistic regression and decision trees, and leveraging ChatGPT-assisted coding to reduce technical barriers, students will build and evaluate predictive models through backtesting and accuracy metrics. The project will provide hands-on experience in data preprocessing, embedding construction, and applied machine learning in finance, while also prompting discussion on the ethical and practical implications of algorithmic trading.
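
The indicators named above can be computed in a few lines. The following is a minimal sketch, assuming daily closing prices in a plain list; the function names, the default window lengths, and the choice to concatenate the three indicators into one vector are illustrative assumptions, since the description does not specify how the embedding vectors are built. The RSI here uses simple averages of gains and losses over the window rather than Wilder's smoothing.

```python
from typing import List, Optional


def sma(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window : i + 1]) / window)
    return out


def ema(prices: List[float], span: int) -> List[float]:
    """Exponential moving average with alpha = 2 / (span + 1), seeded with the first price."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1.0 - alpha) * out[-1])
    return out


def macd_line(prices: List[float], fast: int = 12, slow: int = 26) -> List[float]:
    """MACD line: fast EMA minus slow EMA."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]


def rsi(prices: List[float], window: int = 14) -> List[Optional[float]]:
    """RSI from simple averages of gains and losses over the last `window` changes."""
    out: List[Optional[float]] = [None] * len(prices)
    for i in range(window, len(prices)):
        changes = [prices[j] - prices[j - 1] for j in range(i - window + 1, i + 1)]
        avg_gain = sum(c for c in changes if c > 0) / window
        avg_loss = sum(-c for c in changes if c < 0) / window
        if avg_loss == 0.0:
            out[i] = 100.0  # no losses in the window (convention also applied to flat windows)
        else:
            out[i] = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out


def feature_vector(prices: List[float], day: int,
                   ma_window: int = 5, rsi_window: int = 14) -> Optional[List[float]]:
    """Concatenate MA, RSI, and MACD for one day; None if any indicator is undefined."""
    ma = sma(prices, ma_window)[day]
    strength = rsi(prices, rsi_window)[day]
    if ma is None or strength is None:
        return None
    return [ma, strength, macd_line(prices)[day]]


def direction_labels(prices: List[float]) -> List[int]:
    """1 if the next close is higher than today's, else 0 (the last day has no label)."""
    return [1 if prices[i + 1] > prices[i] else 0 for i in range(len(prices) - 1)]
```

Pairs of `feature_vector(prices, day)` and `direction_labels(prices)[day]` could then be fed to a classifier such as logistic regression, taking care that each feature uses only prices up to and including `day`.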

P5: Embeddings for Organizing and Analyzing Large-Scale Archival Collections (Dr. Haihua Chen)

This project develops AI/ML techniques to improve access to vast, multi-modal archival collections containing text, images, audio, and video. Students will use deep learning and MLLMs to create cross-modal embeddings that connect related items across different formats, enabling more effective metadata generation, content-based search, and integrated analysis. Example tasks include linking objects in historical photographs with relevant textual records and matching spoken-word segments from interviews to corresponding visual or written materials. To make the project accessible, students will begin with curated archival subsets and scaffolded code for simple cross-modal matching tasks before progressing to larger retrieval experiments. Graduate mentors will guide them in running experiments, while undergraduates contribute by analyzing relationships and preparing prototype demonstrations.

P6: Embedding-Aware Fairness and Calibration in Personalized Ranking Systems (Dr. Jing Yuan)

Personalized ranking systems, such as those used for job, news, or product recommendations, can be enhanced by embedding-driven approaches that improve both fairness and calibration across diverse user groups. Using synthetic and real-world datasets, user and item embeddings will be generated for ranking models, followed by evaluation of how these representations interact with fairness and calibration objectives. The work will include implementing baseline methods such as collaborative filtering and learning-to-rank, applying fairness-aware strategies like personalized re-ranking and adversarial fairness learning, and incorporating calibration techniques including Platt scaling, isotonic regression, and multi-calibration. Undergraduates will work with starter datasets and pre-coded fairness and calibration methods, focusing their contributions on designing experiments to measure trade-offs and preparing visualizations and reports of their findings.
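
To make one of the calibration techniques concrete, the sketch below fits isotonic regression with the pool-adjacent-violators algorithm, mapping raw ranking scores to calibrated probabilities. It is a minimal sketch, assuming binary relevance labels (0 or 1) and unweighted examples; the function names and the toy scores are hypothetical, and a real study would more likely use a library implementation.

```python
from typing import List, Sequence, Tuple


def pool_adjacent_violators(values: Sequence[float]) -> List[float]:
    """Least-squares non-decreasing fit to `values`, taken in the given order."""
    blocks: List[List[float]] = []  # each block is [mean, count]
    for value in values:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            later_mean, later_n = blocks.pop()
            earlier_mean, earlier_n = blocks.pop()
            total = earlier_n + later_n
            merged = (earlier_mean * earlier_n + later_mean * later_n) / total
            blocks.append([merged, total])
    fitted: List[float] = []
    for mean, count in blocks:
        fitted.extend([mean] * int(count))
    return fitted


def isotonic_calibrate(scores: Sequence[float],
                       labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (score, calibrated_probability) pairs sorted by score."""
    pairs = sorted(zip(scores, labels))
    fitted = pool_adjacent_violators([label for _, label in pairs])
    return [(score, prob) for (score, _), prob in zip(pairs, fitted)]


# Toy example: four scored items with made-up relevance labels.
for score, prob in isotonic_calibrate([0.9, 0.1, 0.4, 0.6], [1, 0, 1, 0]):
    print(score, prob)
```

Running the same fit separately on each user group is one simple way to compare how well scores are calibrated across groups.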

P7: Deep Autoencoders and Transfer Learning for Estimation of Quantum Bits from Crystallographic Defects (Dr. Yuanxi Wang)

Embedding representations of atomic structures learned through deep autoencoders will be combined with transfer learning to improve prediction of defect formation energies in crystalline solids. Students will work with small datasets of defect properties in oxides and larger datasets of pristine solid formation energies, leveraging models pre-trained on approximately 10,000 crystal structures to enhance accuracy for the smaller defect dataset. The deep autoencoder will generate low-dimensional embeddings that capture key structural and energetic features, which will then be fine-tuned for defect prediction. Graduate mentors will provide code templates and pre-trained embeddings, while students focus on comparative analysis with baseline ML methods, interpreting embedding spaces, and presenting their findings through figures and reports.

P8: Embedding-Driven Reinforcement Learning for Recommender Systems (Dr. Yang Zhang)

Embedding-based state representations will be combined with reinforcement learning (RL) techniques to improve adaptability, personalization, and long-term user satisfaction in recommender systems. Students will investigate RL approaches such as contextual bandits, policy gradient methods, and deep Q-learning, framing recommendation as a sequential decision-making process that adapts to evolving user preferences. Work will include designing reward functions that capture long-term engagement, developing scalable embeddings for large-scale user-item interactions, and implementing off-policy evaluation methods to ensure safe experimentation. Undergraduates will work with simplified datasets and scaffolded RL frameworks provided by the faculty mentor, contributing by tuning reward functions, testing embedding-based state representations, and evaluating outcomes in reproducible experiments.
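
Off-policy evaluation estimates how a new recommendation policy would have performed using only logged interactions from an existing one. The sketch below shows inverse propensity scoring (IPS), one standard estimator; the description does not say which estimator the project will use, so this choice, the `LoggedInteraction` record, and the toy log are assumptions for illustration. It also assumes the logging policy's probability for each shown item was recorded and is nonzero.

```python
from typing import Callable, Hashable, List, NamedTuple


class LoggedInteraction(NamedTuple):
    context: Hashable     # user or session state when the item was shown
    action: int           # item that the logging policy recommended
    reward: float         # observed feedback, e.g. 1.0 for a click
    logging_prob: float   # probability the logging policy gave to `action`


def ips_estimate(logs: List[LoggedInteraction],
                 target_prob: Callable[[Hashable, int], float]) -> float:
    """Average reward reweighted by target probability over logging probability."""
    total = 0.0
    for entry in logs:
        weight = target_prob(entry.context, entry.action) / entry.logging_prob
        total += weight * entry.reward
    return total / len(logs)


# Toy log: two items shown uniformly at random (probability 0.5 each).
logs = [
    LoggedInteraction("u1", 1, 1.0, 0.5),
    LoggedInteraction("u2", 0, 0.0, 0.5),
    LoggedInteraction("u3", 1, 1.0, 0.5),
    LoggedInteraction("u4", 0, 1.0, 0.5),
]


def always_item_one(context: Hashable, action: int) -> float:
    """Hypothetical target policy that always recommends item 1."""
    return 1.0 if action == 1 else 0.0


print(ips_estimate(logs, always_item_one))
```

Because the weights divide by the logging probability, rarely logged actions can dominate the estimate, which is why clipped or self-normalized variants are often compared alongside plain IPS.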

P9: Embedding-Based Machine Learning for Glass Materials Design (Dr. Jincheng Du)

This project uses machine learning coupled with molecular dynamics (MD) simulations to accelerate the design of novel glass materials with optimized properties. Students will work with datasets containing chemical compositions and structural features from MD simulations to develop embeddings that capture structure-property relationships, focusing on key properties such as glass transition temperature (Tg), elastic moduli, and density. These embeddings will be incorporated into predictive models to improve accuracy over traditional feature-based approaches, enabling faster evaluation of candidate compositions. Curated datasets and model notebooks will be provided, with graduate mentors handling advanced simulations so students can focus on constructing embeddings, comparing predictive models, and visualizing structure-property trends.

P10: Genome and Proteome Embeddings for Detecting Antibiotic Resistance (Dr. Rajeev Azad)

In this project, students will apply vector embedding techniques to genome and proteome sequence data to address the urgent challenge of antibiotic resistance. They will use genome and proteome language models to treat biological sequences as “biological language,” generating embeddings that capture sequence patterns linked to resistance. Students will apply these embeddings to classify resistant versus susceptible bacterial strains, predict underlying resistance mechanisms, and identify mobile genetic elements involved in spreading resistance. Graduate mentors will scaffold pre-trained models and pipelines, while students contribute by running classification experiments, applying interpretability tools to identify key genetic factors, and preparing short reports on findings.

P11: AI-Guided Metasurfaces for Controlling Thermal Radiation (Dr. Yuzhe Xiao)

Embedding representations of metasurface geometries and their optical-thermal properties will be used to guide the design of structures that control thermal radiation across broad wavelength ranges for applications in energy harvesting, infrared imaging, and thermal camouflage. Students will generate datasets of structure-property pairs, create low-dimensional embeddings that capture key physical features, and use these embeddings to inform Generative Adversarial Network (GAN) exploration for novel, high-performance metasurface designs. Generated structures will be evaluated through simulation, with comparisons of wavelength coverage and directional control against conventional designs. To ensure accessibility, students will begin with simplified metasurface datasets and pre-trained GAN frameworks. Their primary contributions will be to generate embeddings, interpret structure-property trends, and evaluate prototype designs with mentor support.

NSF REU Program
