Federated learning (FL) offers a collaborative framework for training foundation models (FMs) and other AI models across distributed computing infrastructures and datasets while incorporating privacy-preserving techniques to protect privacy-sensitive datasets. This proposal addresses the challenges inherent in adapting FL to the “pre-train” and “fine-tune” paradigms of FMs with billions or trillions of parameters. These challenges include increased communication costs, computation burdens on clients, and the handling of massive model parameters and multi-modal datasets. Moreover, existing privacy-preserving techniques, such as differential privacy (DP), face scalability issues with such large models and must accommodate differing privacy requirements across clients. As synthetic data emerges as a promising alternative, new challenges are anticipated in managing both privacy-sensitive and synthetic data within the privacy-preserving FL (PPFL) framework.
The project develops efficient communication, memory, and energy optimization techniques for FL algorithms, particularly for large-scale FMs, while ensuring fairness and incentivizing participation. It advances DP techniques to address scalability and heterogeneity challenges, creates and manages synthetic data that preserves privacy while maintaining data utility, and integrates these efforts into a cohesive data management framework to enhance the scalability and performance of PPFL systems. Specifically, the research is structured around four main thrusts: (1) improving communication, memory, and energy efficiency; (2) addressing continual learning with incentives and fairness; (3) developing scalable and heterogeneous DP techniques; and (4) creating synthetic data as a privacy-preserving alternative. A crosscutting thrust integrates these efforts, providing efficient model and data management schemes using tools such as Mofka and ProxyStore to handle the access, sharing, versioning, control, and evolution of large datasets and models.
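To make the privacy-preserving aggregation idea concrete, the sketch below illustrates one round of differentially private federated averaging in the style of DP-FedAvg (per-client update clipping followed by Gaussian noise at the server). This is a minimal illustration under simplifying assumptions, not the project's implementation; the client updates, clipping bound, and noise multiplier are hypothetical placeholders.

```python
"""Minimal sketch of one differentially private federated averaging round (illustrative only)."""
import numpy as np

def clip_update(update: np.ndarray, clip_norm: float) -> np.ndarray:
    """Scale a client's model update so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_fedavg_round(global_model: np.ndarray,
                    client_updates: list[np.ndarray],
                    clip_norm: float = 1.0,
                    noise_multiplier: float = 0.5) -> np.ndarray:
    """Average clipped client updates and add Gaussian noise before applying them."""
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    mean_update = np.mean(clipped, axis=0)
    # Gaussian mechanism: noise scale tied to the per-client sensitivity (clip_norm / n).
    sigma = noise_multiplier * clip_norm / len(client_updates)
    noise = np.random.normal(0.0, sigma, size=global_model.shape)
    return global_model + mean_update + noise

# Hypothetical usage: three clients each send a local update for a 10-parameter model.
rng = np.random.default_rng(0)
model = np.zeros(10)
updates = [rng.normal(size=10) for _ in range(3)]
model = dp_fedavg_round(model, updates)
```

The clipping bound controls each client's contribution (sensitivity), and the noise multiplier trades privacy for model utility; scaling such mechanisms to FMs with billions of parameters is exactly the challenge the DP thrust targets.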
This research effort significantly advances the field of PPFL by enhancing the scalability and efficiency of training large FMs, establishing fairness and incentive structures for client participation in FL, developing scalable DP techniques that maintain model utility while ensuring privacy, and creating high-quality synthetic data as a proxy for sensitive datasets. The integration of these thrusts is demonstrated through specific scientific use cases in X-ray image science and electric grids, focusing on efficiently training large FMs on substantial data streams subject to privacy constraints. The outcomes support the sustainable and trustworthy training and deployment of FMs for science, benefiting a wide range of applications and advancing the state of the art in AI and FL.
Funding Sources
- Advancements in Artificial Intelligence for Science Program, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy
Participating Institutions
- Argonne National Laboratory
- Brookhaven National Laboratory
- Oak Ridge National Laboratory
- Arizona State University
- Rutgers University