I am a research scientist/senior director in the Data Analytics and Intelligence Lab (DAIL) at Alibaba Group. Prior to joining Alibaba, I was a researcher at Microsoft Research. I completed my Ph.D. in Computer Science at the University of Illinois at Urbana-Champaign under the supervision of Prof. Jiawei Han, my M.Phil. at The Chinese University of Hong Kong, advised by Jeffrey Xu Yu, and my B.S. at Renmin University of China, advised by Shan Wang and Qing Zhu.

We are hiring research scientists, engineers, and research interns for our lab! Please drop me a line if you are interested.

Research Interests

My recent research focuses on using machine learning and optimization techniques to make systems intelligent and efficient, including both physical computing systems (such as databases and machine learning systems) and social systems. More specifically,

  • systems, programming frameworks, and algorithms for tuning large language models and building LLM applications;
  • empowering databases and data analytics with machine learning and large language models;
  • data privacy, pricing, and federated learning;
  • the theory behind large language models and the relationship between LLMs and traditional algorithms.

Projects

Together with the team, I am actively working on the following projects:


Data-Juicer (open-source project)

Pipelines for training and tuning large language models start with data pre-processors, which continuously optimize the mixture of training data based on feedback from model evaluators. Data-Juicer is a one-stop multimodal data processing system that makes data higher-quality, juicier, and more digestible for training and tuning large language models. Data-Juicer features a fine-grained abstraction of the pipeline for constructing data recipes, with over 50 built-in operators that can be freely composed and extended. By incorporating visualization and auto-evaluation capabilities, it enables a timely feedback loop after data processing for both LLM pre-training and fine-tuning, so that data recipes can be continuously improved. Data-Juicer can be integrated with various ecosystems for LLM training and evaluation, and is optimized for distributed computing.

Representative publications
the Data-Juicer system [SIGMOD 2024], ...


AgentScope (open-source project)

We are developing a programming framework called AgentScope, which abstracts each LLM instance, together with auxiliary components and functionalities such as memory and RAG, as an LLM agent. AgentScope allows developers to easily assemble multiple LLM agents into a traditional or AI-driven application. From a programming-language perspective, AgentScope aims to offer agent-oriented programming (AOP) as a new programming model for organizing the design and implementation of next-generation LLM-empowered applications. AgentScope also offers the necessary native services for agents, including communication between agents (locally or remotely), web search, and code execution.

Representative publications
the technical report about AgentScope [manuscript], ...


LLM4Analytics

Large language models make various difficult tasks (e.g., writing code) more accessible than they used to be. This project focuses on developing tools and systems for data analytics tasks with the help of large language models, which, for example, help i) translate natural language into executable actions, ii) find the right pieces of code for these actions, iii) augment data with knowledge in LLMs, and iv) assemble the data analytics pipeline. The following are some preliminary results; more are forthcoming.

Representative publications
Text2SQL [VLDB 2024],
UniDM: LLM for data manipulation [MLSys 2024],
SMARTFEAT: LLM for feature augmentation [CIDR 2024], ...


PilotScope (open-source project)

Is it possible for models to learn to be a statistician, a database administrator, an index, a query processor, or a query optimizer? We develop a series of "learning-to-be" techniques for different database components, as well as a middleware system, called PilotScope, to deploy these learned components into databases. PilotScope offers a programming framework to bridge the gaps between AI4DB (Artificial Intelligence for Databases) algorithms and database systems, with which researchers can develop and train their own learned components, deploy them without modifying the code in database engines, and compare them fairly with alternatives in actual database systems.

Representative publications
the PilotScope system [VLDB 2024],
learned cardinality estimator [VLDB 2022, VLDB 2024, SIGMOD 2024],
Lero: learned query optimizer [VLDB 2023],
Eraser: eliminating performance regression in AI4DB [VLDB 2024], ...


FederatedScope (open-source project)

We have open-sourced an easy-to-use federated learning package, FederatedScope (media coverage 1, media coverage 2, ...). It provides comprehensive functionality, including privacy protection, personalization, and auto-tuning of federated machine learning models, built on a programming framework with which one can conveniently develop and deploy their own federated models in settings that differ in how data and models are distributed, e.g., vertically/horizontally collaborative learning, and cross-silo/cross-device federated learning services. We have enriched its functionality for some specific scenarios, e.g., FederatedScope-GNN for federated learning of graph neural networks, FederatedScope-Real for federated learning across many devices, and FederatedScope-LLM for tuning large language models in federated learning settings. We also consider fundamental questions about how to price data/models in federated learning (e.g., via truthful bidding).

Representative publications
the FederatedScope system [VLDB 2023],
FederatedScope for GNN [KDD 2022],
FederatedScope for devices [VLDB 2023, KDD 2023],
data pricing in federated learning [VLDB 2024], ...


AI4SocialScience

We conduct interdisciplinary research between Social Science (starting from, e.g., Economics) and Machine Learning. Besides traditional theoretical studies, we conduct simulation studies by building multi-agent platforms based on reinforcement learning and large language models, in order to investigate scenarios that are too complicated to be formulated as clean and solvable theoretical/mathematical problems. For example, in recommendation systems, we study how users would react to different data privacy policies, and how the utility of eCommerce platforms and the utility of individual users are affected accordingly.

Representative publications
privacy vs. utility on eCommerce platforms [TOIS 2023],
competitive information design [SODA 2023], ...