About Me
I am Mengyi Yan, currently an Assistant Professor at the School of Artificial Intelligence, Shandong University. In June 2025, I received my Ph.D. degree from the School of Computer Science and Engineering, Beihang University (BUAA), under the supervision of Prof. Jianxin Li.
Before that, I received my bachelor's degree from the School of Mathematical Science, Beihang University (BUAA).
My work focuses on Database, Data Quality, Data Cleaning, and AI4DB with LLM.
Feel free to reach out via email at yanmy@sdu.edu.cn, yanmy1008@buaa.edu.cn, or yanmy1008@gmail.com.
News
Selected Publications
(*: Corresponding author)
Mengyi Yan, Yaoshu Wang, Guangyi Zhang, Kehan Pang, and Haoyi Zhou*, Accelerating Influence Function Estimation for Large Language Models: A Practical Design, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2026.
Yang Liu, Mengyi Yan*, Jiao Xue, Weilong Ren, Yutong Ye, Haoyi Zhou, Jianxin Li*, and Zhumin Chen, SPARQ: A Cost-Efficient Framework for Offline Table Question Answering via Adaptive Routing, IEEE International Conference on Data Engineering (ICDE), 2026. Paper Code
Mengyi Yan, Wenfei Fan, Yaoshu Wang, and Min Xie, Enriching Relations with Additional Attributes for ER, Proceedings of the VLDB Endowment (VLDB), 2024. Link
Mengyi Yan, Yaoshu Wang, Yue Wang, Xiaoye Miao, and Jianxin Li, GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models, ACM SIGMOD International Conference on Management of Data (SIGMOD), 2025. Link Camera-Ready
Mengyi Yan, Yaoshu Wang, Kehan Pang, Min Xie, and Jianxin Li, Efficient Mixture of Experts based on Large Language Models for Low-Resource Data Preprocessing, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2024. Link
Mengyi Yan, Weilong Ren, Yaoshu Wang, and Jianxin Li, A Retrieval-Augmented Framework for Tabular Interpretation with Large Language Model, Database Systems for Advanced Applications (DASFAA), 2024. Link
The full list is available on the Publication page above.
Research Interests
My research focuses on Database, Data Quality, Data Cleaning, and AI4DB with LLM, with publications in SIGMOD, VLDB, KDD, DASFAA, EMNLP, etc. A brief summary of my past work can be found below.
Cost-Efficient Data Preprocessing and Reasoning for Data-Centric AI
I have worked on building data- and cost-efficient pipelines on top of LLMs that minimize labelling and computing costs (most can run on consumer-level hardware with offline LLMs) while achieving comparable results across data preprocessing and tabular reasoning scenarios, including Entity Resolution, Tabular Representation Learning, Relation Extraction, and Table Question Answering. Relevant results were published in [ICDE’26, KDD’24, DASFAA’24, BigData’24 and EMNLP’25].
Data Cleaning
I have worked on improving the performance of data cleaning systems by leveraging knowledge-enhanced approaches, e.g., LLM-based agents and Knowledge Graphs. Relevant results were published in [SIGMOD’25, SIGMOD’24, VLDB’24].
Data Evaluation, Influence Estimation, and Synthesis for LLMs
My research focuses on evaluating and optimizing data for Large Language Models (LLMs). By applying a suite of machine learning tools (e.g., Uncertainty Quantification, Influence Functions, and Submodular Optimization), I assess the redundancy and value of data assets throughout the pre-training, fine-tuning, and domain-specific adaptation stages. A particular emphasis is placed on making influence-function-style attribution practical and scalable for modern LLMs, so that it can drive the composition of training datasets for specialized tasks and maximize model performance and training efficiency. Relevant results were published in [KDD’26, FCS’25, AIJ’23 and arXiv].
Talks
MELD: Efficient Mixture of Experts based on LLM for Low-Resource Data Preprocessing
- KDD conference, August 2024, Barcelona, Spain
External Reviewer
KDD, NeurIPS, ICLR, AAAI, ICDE, etc.
