About Me

I am Mengyi Yan, currently an Assistant Professor at the School of Artificial Intelligence, Shandong University. In June 2025, I received my Ph.D. degree from the School of Computer Science and Engineering, Beihang University (BUAA), under the supervision of Prof. Jianxin Li.

Before that, I received my bachelor's degree from the School of Mathematical Science, Beihang University (BUAA).

My work focuses on databases, data quality, data cleaning, and AI4DB with LLMs.

Feel free to reach out via email at yanmy@sdu.edu.cn, yanmy1008@buaa.edu.cn, or yanmy1008@gmail.com.

News

MAY 2026 Our Accelerating Influence Function Estimation for Large Language Models: A Practical Design paper was accepted to KDD 2026.
FEB 2026 Our SPARQ: A Cost-Efficient Framework for Offline Table Question Answering via Adaptive Routing paper was accepted to ICDE 2026.
AUG 2025 Our PUER: Boosting Few-shot Positive-Unlabeled Entity Resolution with Reinforcement Learning paper was accepted to EMNLP 2025.
JUL 2025 I joined the School of Artificial Intelligence, Shandong University (SDU) as an assistant professor.
JUN 2025 Our Towards uncertainty-calibrated structural data enrichment with large language model for few-shot entity resolution paper was accepted to Frontiers of Computer Science, 2025.
DEC 2024 Our Unsupervised Domain Adaptation for Entity Blocking Leveraging Large Language Models paper was accepted to IEEE Big Data 2024.
NOV 2024 Our GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models paper was accepted to SIGMOD 2025.
AUG 2024 Our Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing paper was accepted to KDD 2024 as an oral paper.
JUN 2024 Our Enriching Relations with Additional Attributes for ER paper was accepted to VLDB 2024.
MAR 2024 Our A Retrieval-Augmented Framework for Tabular Interpretation with Large Language Model paper was accepted to DASFAA 2024.
OCT 2023 Our Splitting Tuples of Mismatched Entities paper was accepted to SIGMOD 2024.
DEC 2021 I joined the Shenzhen Institute of Computing Science (SICS) as a research intern.

Selected Publications

(*: Corresponding author)

Mengyi Yan, Yaoshu Wang, Guangyi Zhang, Kehan Pang, and Haoyi Zhou*, Accelerating Influence Function Estimation for Large Language Models: A Practical Design, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2026.

Yang Liu, Mengyi Yan*, Jiao Xue, Weilong Ren, Yutong Ye, Haoyi Zhou, Jianxin Li*, and Zhumin Chen, SPARQ: A Cost-Efficient Framework for Offline Table Question Answering via Adaptive Routing, IEEE International Conference on Data Engineering (ICDE), 2026.

Mengyi Yan, Wenfei Fan, Yaoshu Wang, and Min Xie, Enriching Relations with Additional Attributes for ER, Proceedings of the VLDB Endowment (VLDB), 2024.

Mengyi Yan, Yaoshu Wang, Yue Wang, Xiaoye Miao, and Jianxin Li, GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models, ACM SIGMOD International Conference on Management of Data (SIGMOD), 2025.

Mengyi Yan, Yaoshu Wang, Kehan Pang, Min Xie, and Jianxin Li, Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2024.

Mengyi Yan, Weilong Ren, Yaoshu Wang, and Jianxin Li, A Retrieval-Augmented Framework for Tabular Interpretation with Large Language Model, Database Systems for Advanced Applications (DASFAA), 2024.

The full list is available on the Publication page.

Research Interests

My research focuses on databases, data quality, data cleaning, and AI4DB with LLMs, with publications in SIGMOD, VLDB, KDD, DASFAA, EMNLP, and other venues. A brief summary of my past work can be found below.

Cost-Efficient Data Preprocessing and Reasoning for Data-Centric AI

I have worked on building data- and cost-efficient pipelines on top of LLMs that minimize labeling and computing costs (most can run on consumer-level hardware with offline LLMs) while achieving comparable results across data preprocessing and tabular reasoning scenarios, including Entity Resolution, Tabular Representation Learning, Relation Extraction, and Table Question Answering. Relevant results were published in [ICDE’26, KDD’24, DASFAA’24, BigData’24, and EMNLP’25].

Data Cleaning

I have worked on improving the performance of data cleaning systems by leveraging knowledge-enhanced approaches, e.g., LLM-based agents and knowledge graphs. Relevant results were published in [SIGMOD’25, SIGMOD’24, VLDB’24].

Data Evaluation, Influence Estimation, and Synthesis for LLMs

This line of work focuses on evaluating and optimizing data for Large Language Models (LLMs). By applying a suite of machine learning tools (e.g., Uncertainty Quantification, Influence Functions, and Submodular Optimization), I assess the redundancy and value of data assets throughout the pre-training, fine-tuning, and domain-specific adaptation stages. A particular emphasis is placed on making influence-function-style attribution practical and scalable for modern LLMs, so that it can drive the composition of training datasets for specialized tasks and maximize model performance and training efficiency. Relevant results were published in [KDD’26, FCS’25, AIJ’23, and arXiv].

Talks

MELD: Efficient Mixture of Experts based on LLM for Low-Resource Data Preprocessing

KDD conference, August 2024, Barcelona, Spain

External Reviewer

KDD, NeurIPS, ICLR, AAAI, ICDE, etc.

Mengyi Yan 晏梦懿