Performance-Based Pricing for Federated Learning via Auction
VLDB 2024A truthful bidding mechanism for data exchanging and sharing in federated learning. In FL, data owners contribute local model updates of varying quality, yet they need economic incentives to participate honestly. This paper designs an auction where data buyers and sellers bid truthfully — no party gains by misreporting — and payment is tied to the actual performance gain each contributor delivers.
Lero: A Learning-to-Rank Query Optimizer
VLDB 2023A learning-to-rank query optimizer that treats plan selection as a pairwise ranking problem rather than absolute cost estimation. Traditional optimizers suffer from inaccurate cardinality estimates; Lero sidesteps this by learning which plan is better than another from execution feedback, achieving robust performance across diverse workloads without expensive re-calibration.
Studying the Impact of Data Disclosure Mechanism in Recommender Systems via Simulation
TOIS 2023A multi-agent simulation study of how data disclosure policies reshape social welfare in recommender systems. Using reinforcement learning agents to model users' strategic privacy choices, the paper reveals that the platform's disclosure mechanism — what data it shares and with whom — has outsized effects on both recommendation quality and user privacy, effects that are hard to measure without simulation.
FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
VLDB 2023A flexible federated learning platform built around a message-passing abstraction that naturally handles statistical, model, and system heterogeneity. FederatedScope lets practitioners compose FL algorithms from reusable event-driven components, making it straightforward to benchmark, extend, and deploy FL in settings where data distributions, model architectures, and device capabilities all differ.
FederatedScope-GNN: Towards a Unified, Comprehensive and Efficient Package for Federated Graph Learning
KDD 2022A unified package for federated graph neural network learning, addressing the unique challenge that graph data carries structural information that cannot be trivially partitioned or anonymized. FederatedScope-GNN covers diverse graph FL settings — from subgraph federation to global graph aggregation — and provides efficient communication-aware protocols for training GNNs across distributed data silos.
iFlood: A Stable and Effective Regularizer
ICLR 2022A regularizer that enforces a minimum margin between training loss and perfect fit, stabilizing optimization across a wide range of learning tasks. Unlike L2 or dropout, iFlood targets the loss landscape itself: by penalizing configurations that bring loss too close to zero, it prevents overconfident flat minima and consistently improves generalization, especially under distribution shift.
Federated Matrix Factorization with Privacy Guarantee
VLDB 2022Matrix factorization under local differential privacy in a fully decentralized federated setting — no trusted server required. The paper shows how to decompose the MF objective so that clients only exchange privacy-protected gradients, and proves that the resulting noise can be calibrated to give formal DP guarantees without destroying the recommendation accuracy that makes MF valuable.
Learning to be a Statistician: Learned Estimator for Number of Distinct Values
VLDB 2022A distribution-independent learned estimator for counting distinct values (NDV) in a column — a fundamental statistic for query optimization. By framing NDV estimation as a learning problem over sub-samples, the model generalizes across arbitrary data distributions without requiring a parametric assumption, substantially outperforming classical estimators like Good-Turing on skewed real-world data.
Linear and Range Counting under Metric-based Local Differential Privacy
ISIT 2020Differentially private range counting queries that exploit the geometric structure of the domain to reduce sensitivity. By defining privacy budget in terms of a metric distance rather than a flat ε-ball, the mechanism allocates less noise to nearby domain values — yielding tighter answers for range queries while still satisfying a rigorous local DP guarantee for each user.
Answering Multi-Dimensional Analytical Queries under Local Differential Privacy
SIGMOD 2019Efficient multi-dimensional range queries under local differential privacy, where each user perturbs their own data before sending it to the server. The key challenge is that naive composition of 1-D LDP protocols blows up variance in high dimensions; this paper derives optimal protocols and variance-reduction strategies that make OLAP-style analytics practical under LDP.
Collecting Telemetry Data Privately
NeurIPS 2017Repeated multi-dimensional telemetry collection at population scale under local differential privacy — the model deployed in Windows 10. The paper addresses the challenge that users submit data many times over their device lifetime, and shows how to design protocols that compose gracefully over repeated collections while keeping per-report privacy budgets small enough for real deployment.
Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee
SIGMOD 2016Approximate query processing that synergistically combines random samples and sorted indexes to answer aggregate queries with distribution-level precision guarantees. Instead of choosing between fast-but-imprecise sampling and slow-but-exact index traversal, Sample+Seek adaptively blends both, guaranteeing not just an error bound on the estimate but a tight characterization of the full answer distribution.
Differentially Private Data Cubes: Optimizing Noise Sources and Consistency
SIGMOD 2011A mechanism for releasing entire OLAP data cubes under differential privacy, where the budget must be split across exponentially many cells and marginals that are algebraically constrained to be consistent. The paper formulates this as an optimization problem over noise allocation and proposes an efficient post-processing step to enforce consistency, enabling accurate drill-down queries on sensitive aggregate data.
Fast Set Intersection in Memory
VLDB 2011How to compute the intersection (or union) of two sets as quickly as hardware allows — using bitwise operations and SIMD parallelism. The paper systematically evaluates merge-based, hash-based, and bitset-based strategies across density regimes, derives when each dominates, and combines them into an adaptive algorithm used in inverted index evaluation and join processing.
Finding Top-k Min-Cost Connected Trees in Databases
ICDE 2007A fixed-parameter tractable algorithm for the group Steiner tree problem as it arises in keyword search over relational databases — finding the smallest connected subgraph that spans all keyword matches. By exploiting that the query keyword count k is small in practice, the algorithm achieves polynomial time in data size, making it the first practical exact solver for this NP-hard problem in a database context.
TwigList: Make Twig Pattern Matching Fast
DASFAA 2007Reminding me the good old times in Hong Kong with my buddy, Lu Qin. Efficient twig pattern matching in XML by decomposing a twig query into path lists and merging them with a stack-based algorithm. Twig queries — tree-structured XPath expressions — are the core operation in XML query processing; TwigList accelerates them by converting the branching structure into linear list operations, significantly reducing intermediate result size compared to prior approaches.