About Me

I am Mengyi Yan, currently an Assistant Professor at the School of Artificial Intelligence, Shandong University. In June 2025, I received my Ph.D. degree from the School of Computer Science and Engineering, Beihang University (BUAA), under the supervision of Prof. Jianxin Li.

Before that, I received my bachelor's degree from the School of Mathematical Science, Beihang University (BUAA).

My work focuses on Databases, Data Quality, Data Cleaning, and AI4DB with LLMs.

I am actively seeking self-motivated Master's students (M.Sc.) and Research Assistants (RAs) in AI. Research topics mainly include, but are not limited to, the directions listed under Research Interests below.

Feel free to reach out via email: yanmy@sdu.edu.cn, yanmy1008@buaa.edu.cn, or yanmy1008@gmail.com.

News

Selected Publications

(*: Corresponding author)

Mengyi Yan, Wenfei Fan, Yaoshu Wang, and Min Xie, Enriching Relations with Additional Attributes for ER, Proceedings of the VLDB Endowment (VLDB), 2024. Link

Mengyi Yan, Yaoshu Wang, Yue Wang, Xiaoye Miao, and Jianxin Li, GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models, ACM SIGMOD International Conference on Management of Data (SIGMOD), 2025. Link Camera-Ready

Mengyi Yan, Yaoshu Wang, Kehan Pang, Min Xie, and Jianxin Li, Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2024. Link

Mengyi Yan, Weilong Ren, Yaoshu Wang, and Jianxin Li, A Retrieval-Augmented Framework for Tabular Interpretation with Large Language Model, Database Systems for Advanced Applications (DASFAA), 2024. Link

A full publication list is available on the Publications page above.

Research Interests

My research focuses on Databases, Data Quality, Data Cleaning, and AI4DB with LLMs, with publications in SIGMOD, VLDB, KDD, DASFAA, EMNLP, etc. A brief summary of my past work can be found below.

Cost-Efficient Data Preprocessing for Data-Centric AI

I have worked on building data- and cost-efficient data preprocessing pipelines for LLMs, minimizing labelling and computing costs (most pipelines run on consumer-level hardware with offline LLMs) while achieving comparable results in various data preprocessing scenarios, e.g., Entity Resolution, Tabular Representation Learning, and Relation Extraction. Relevant results were published in [KDD’24, DASFAA’24, BigData’24, and EMNLP’25].

Data Cleaning

I have worked on improving the performance of data cleaning systems by leveraging knowledge-enhanced approaches, e.g., LLM-based agents and Knowledge Graphs. Relevant results were published in [SIGMOD’25, SIGMOD’24, VLDB’24].

Data Evaluation and Synthesis

My research focuses on evaluating and optimizing synthetic data for Large Language Models (LLMs). By applying a suite of machine learning tools (e.g., Uncertainty Quantification, Influence Functions, and Submodular Optimization), I assess the redundancy and value of data assets throughout the pre-training, fine-tuning, and domain-specific adaptation stages. The primary goal is to determine the optimal composition of training datasets for various specialized tasks, thereby maximizing model performance and training efficiency. Relevant results were published in [FCS’25, AIJ’23, and arXiv].

Talks

MELD: Efficient Mixture of Experts based on LLM for Low-Resource Data Preprocessing

  • KDD Conference, August 2024, Barcelona, Spain

External Reviewer

KDD, NeurIPS, ICLR, AAAI, ICDE, etc.