About Me

I am Mengyi Yan, currently an Assistant Professor at the School of Artificial Intelligence, Shandong University. In June 2025, I received my Ph.D. degree from the School of Computer Science and Engineering, Beihang University (BUAA), under the supervision of Prof. Jianxin Li.

Before that, I received my bachelor's degree from the School of Mathematical Science, Beihang University (BUAA).

My work focuses on Databases, Data Quality, Data Cleaning, and AI4DB with LLMs.

I am actively seeking self-motivated Master's students (M.Sc.) and Research Assistants (RAs) in AI. Research topics mainly include, but are not limited to, the directions listed under Research Interests below.

Feel free to reach out via email: yanmy@sdu.edu.cn, yanmy1008@buaa.edu.cn, or yanmy1008@gmail.com.

News

Selected Publications

(*: Corresponding author)

Mengyi Yan, Wenfei Fan, Yaoshu Wang, and Min Xie, Enriching Relations with Additional Attributes for ER, Proceedings of the VLDB Endowment (VLDB), 2024. Link

Mengyi Yan, Yaoshu Wang, Yue Wang, Xiaoye Miao, and Jianxin Li, GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models, ACM SIGMOD International Conference on Management of Data (SIGMOD), 2025. Link Camera-Ready

Mengyi Yan, Yaoshu Wang, Kehan Pang, Min Xie, and Jianxin Li, Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2024. Link

Mengyi Yan, Weilong Ren, Yaoshu Wang, and Jianxin Li, A Retrieval-Augmented Framework for Tabular Interpretation with Large Language Model, Database Systems for Advanced Applications (DASFAA), 2024. Link

A full publication list is available on the Publications page above.

Research Interests

My research focuses on Databases, Data Quality, Data Cleaning, and AI4DB with LLMs, with publications in SIGMOD, VLDB, KDD, DASFAA, EMNLP, etc. A brief summary of my past work can be found below.

Cost-Efficient Data Preprocessing for Data-Centric AI

I have worked on building data- and cost-efficient data preprocessing pipelines for LLMs, minimizing labelling and computing costs (most pipelines run on consumer-level hardware with offline LLMs) while achieving comparable results in various data preprocessing scenarios, e.g., Entity Resolution, Tabular Representation Learning, and Relation Extraction. Relevant results were published in [KDD’24, DASFAA’24, BigData’24, and EMNLP’25].

Data Cleaning

I have worked on improving the performance of data cleaning systems by leveraging knowledge-enhanced approaches, e.g., LLM-based agents and Knowledge Graphs. Relevant results were published in [SIGMOD’25, SIGMOD’24, VLDB’24].

Data Evaluation and Synthesis

My research focuses on evaluating and optimizing synthetic data for Large Language Models (LLMs). By applying a suite of machine learning tools (e.g., Uncertainty Quantification, Influence Functions, and Submodular Optimization), I assess the redundancy and value of data assets throughout the pre-training, fine-tuning, and domain-specific adaptation stages. The primary goal is to determine the optimal composition of training datasets for various specialized tasks, thereby maximizing model performance and training efficiency. Relevant results were published in [FCS’25, AIJ’23, and arXiv].

Talks

MELD: Efficient Mixture of Experts based on LLM for Low-Resource Data Preprocessing

  • KDD Conference, August 2024, Barcelona, Spain

External Reviewer

KDD, NeurIPS, ICLR, AAAI, ICDE, etc.