Bolin Ding

Technical
Notes

Occasional writings on research, systems, and ideas in LLMs, agents, machine learning, data management, and beyond.

1
data pricing federated learning mechanism design

Performance-Based Pricing for Federated Learning via Auction

Zitao Li, Liuyi Yao, Yaliang Li, Xiaokui Xiao, Bolin Ding, Jingren Zhou

VLDB 2024

A truthful bidding mechanism for data exchanging and sharing in federated learning. In FL, data owners contribute local model updates of varying quality, yet they need economic incentives to participate honestly. This paper designs an auction where data buyers and sellers bid truthfully — no party gains by misreporting — and payment is tied to the actual performance gain each contributor delivers.

2
query optimization AI4DB learning to rank

Lero: A Learning-to-Rank Query Optimizer

Rong Zhu, Wei Chen, Xingguang Chen, Andreas Pfadler, Ziniu Wu, Bolin Ding, Jingren Zhou

VLDB 2023

A learning-to-rank query optimizer that treats plan selection as a pairwise ranking problem rather than absolute cost estimation. Traditional optimizers suffer from inaccurate cardinality estimates; Lero sidesteps this by learning which plan is better than another from execution feedback, achieving robust performance across diverse workloads without expensive re-calibration.

3
data privacy recommender systems simulation

Studying the Impact of Data Disclosure Mechanism in Recommender Systems via Simulation

Ziqian Chen, Fei Sun, Yifan Tang, Haokun Chen, Jinyang Gao, Bolin Ding

TOIS 2023

A multi-agent simulation study of how data disclosure policies reshape social welfare in recommender systems. Using reinforcement learning agents to model users' strategic privacy choices, the paper reveals that the platform's disclosure mechanism — what data it shares and with whom — has outsized effects on both recommendation quality and user privacy, effects that are hard to measure without simulation.

4
federated learning systems platform

FederatedScope: A Flexible Federated Learning Platform for Heterogeneity

Yuexiang Xie, Zhen Wang, Dawei Gao, Daoyuan Chen, Liuyi Yao, Weirui Kuang, Yaliang Li, Bolin Ding, Jingren Zhou

VLDB 2023

A flexible federated learning platform built around a message-passing abstraction that naturally handles statistical, model, and system heterogeneity. FederatedScope lets practitioners compose FL algorithms from reusable event-driven components, making it straightforward to benchmark, extend, and deploy FL in settings where data distributions, model architectures, and device capabilities all differ.

5
federated learning graph learning systems

FederatedScope-GNN: Towards a Unified, Comprehensive and Efficient Package for Federated Graph Learning

Zhen Wang, Weirui Kuang, Yuexiang Xie, Liuyi Yao, Yaliang Li, Bolin Ding, Jingren Zhou

KDD 2022

A unified package for federated graph neural network learning, addressing the unique challenge that graph data carries structural information that cannot be trivially partitioned or anonymized. FederatedScope-GNN covers diverse graph FL settings — from subgraph federation to global graph aggregation — and provides efficient communication-aware protocols for training GNNs across distributed data silos.

6
machine learning optimization regularization

iFlood: A Stable and Effective Regularizer

Yuexiang Xie, Zhen Wang, Yaliang Li, Ce Zhang, Bolin Ding, Jingren Zhou

ICLR 2022

A regularizer that enforces a minimum margin between training loss and perfect fit, stabilizing optimization across a wide range of learning tasks. Unlike L2 or dropout, iFlood targets the loss landscape itself: by penalizing configurations that bring loss too close to zero, it prevents overconfident flat minima and consistently improves generalization, especially under distribution shift.

7
federated learning data privacy collaborative filtering

Federated Matrix Factorization with Privacy Guarantee

Zitao Li, Ce Zhang, Ninghui Li, Bolin Ding, Jingren Zhou

VLDB 2022

Matrix factorization under local differential privacy in a fully decentralized federated setting — no trusted server required. The paper shows how to decompose the MF objective so that clients only exchange privacy-protected gradients, and proves that the resulting noise can be calibrated to give formal DP guarantees without destroying the recommendation accuracy that makes MF valuable.

8
query processing machine learning cardinality estimation

Learning to be a Statistician: Learned Estimator for Number of Distinct Values

Renzhi Wu, Xu Chu, Zhewei Wei, Xiening Dai, Tao Guan, Bolin Ding, Jingren Zhou

VLDB 2022

A distribution-independent learned estimator for counting distinct values (NDV) in a column — a fundamental statistic for query optimization. By framing NDV estimation as a learning problem over sub-samples, the model generalizes across arbitrary data distributions without requiring a parametric assumption, substantially outperforming classical estimators like Good-Turing on skewed real-world data.

9
data privacy algorithms local DP

Linear and Range Counting under Metric-based Local Differential Privacy

Zhuolun Xiang, Xi He, Bolin Ding, Jingren Zhou

ISIT 2020

Differentially private range counting queries that exploit the geometric structure of the domain to reduce sensitivity. By defining privacy budget in terms of a metric distance rather than a flat ε-ball, the mechanism allocates less noise to nearby domain values — yielding tighter answers for range queries while still satisfying a rigorous local DP guarantee for each user.

10
data privacy query processing analytics

Answering Multi-Dimensional Analytical Queries under Local Differential Privacy

Tianhao Wang, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, Somesh Jha

SIGMOD 2019

Efficient multi-dimensional range queries under local differential privacy, where each user perturbs their own data before sending it to the server. The key challenge is that naive composition of 1-D LDP protocols blows up variance in high dimensions; this paper derives optimal protocols and variance-reduction strategies that make OLAP-style analytics practical under LDP.

11
data privacy systems local DP

Collecting Telemetry Data Privately

Bolin Ding, Janardhan Kulkarni, Sergey Yekhanin

NeurIPS 2017

Repeated multi-dimensional telemetry collection at population scale under local differential privacy — the model deployed in Windows 10. The paper addresses the challenge that users submit data many times over their device lifetime, and shows how to design protocols that compose gracefully over repeated collections while keeping per-report privacy budgets small enough for real deployment.

12
query processing databases approximate query processing

Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee

Silu Huang, Bolin Ding, Surajit Chaudhuri, Kaushik Chakrabarti, Chi Wang

SIGMOD 2016

Approximate query processing that synergistically combines random samples and sorted indexes to answer aggregate queries with distribution-level precision guarantees. Instead of choosing between fast-but-imprecise sampling and slow-but-exact index traversal, Sample+Seek adaptively blends both, guaranteeing not just an error bound on the estimate but a tight characterization of the full answer distribution.

13
data privacy databases data cubes

Differentially Private Data Cubes: Optimizing Noise Sources and Consistency

Bolin Ding, Marianne Winslett, Jiawei Han, Zhenhui Li

SIGMOD 2011

A mechanism for releasing entire OLAP data cubes under differential privacy, where the budget must be split across exponentially many cells and marginals that are algebraically constrained to be consistent. The paper formulates this as an optimization problem over noise allocation and proposes an efficient post-processing step to enforce consistency, enabling accurate drill-down queries on sensitive aggregate data.

14
algorithms databases hardware

Fast Set Intersection in Memory

Bolin Ding, Arnd Christian König

VLDB 2011

How to compute the intersection (or union) of two sets as quickly as hardware allows — using bitwise operations and SIMD parallelism. The paper systematically evaluates merge-based, hash-based, and bitset-based strategies across density regimes, derives when each dominates, and combines them into an adaptive algorithm used in inverted index evaluation and join processing.

15
databases algorithms keyword search

Finding Top-k Min-Cost Connected Trees in Databases

Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin, Xiao Zhang, Xuemin Lin

ICDE 2007

A fixed-parameter tractable algorithm for the group Steiner tree problem as it arises in keyword search over relational databases — finding the smallest connected subgraph that spans all keyword matches. By exploiting that the query keyword count k is small in practice, the algorithm achieves polynomial time in data size, making it the first practical exact solver for this NP-hard problem in a database context.

16
databases XML pattern matching

TwigList: Make Twig Pattern Matching Fast

Lu Qin, Jeffrey Xu Yu, Bolin Ding

DASFAA 2007

Reminding me the good old times in Hong Kong with my buddy, Lu Qin. Efficient twig pattern matching in XML by decomposing a twig query into path lists and merging them with a stack-based algorithm. Twig queries — tree-structured XPath expressions — are the core operation in XML query processing; TwigList accelerates them by converting the branching structure into linear list operations, significantly reducing intermediate result size compared to prior approaches.