In the era of Artificial Intelligence (AI) and big data, predictive models have become an essential tool in various industries, including healthcare, finance, and genomics. These models rely heavily on processing sensitive information, making data privacy a critical concern. The key challenge lies in maximizing the utility of the data without compromising the confidentiality and integrity of the information involved. Achieving this balance is essential to the continued advancement and acceptance of AI technologies.
Machine learning technology leader at Zama.
Collaboration and open source
Creating a robust data set for training machine learning models presents significant challenges. While AI systems like ChatGPT have thrived by collecting large amounts of data freely available on the Internet, healthcare data cannot be compiled as freely due to privacy concerns. Building a healthcare dataset involves integrating data from multiple sources, including doctors and hospitals, often across borders.
The health sector stands out because of its social importance, but the principles apply widely. For example, even a smartphone autocorrect feature, which personalizes predictions based on user data, faces similar privacy issues. The financial sector also encounters obstacles in data sharing due to its competitive nature.
Therefore, collaboration emerges as a crucial element in safely harnessing the potential of AI in our societies. However, one aspect that is often overlooked is the actual execution environment of the AI and the underlying hardware that powers it. Today's advanced AI models require robust hardware: ample CPU/GPU resources, substantial amounts of RAM, and even more specialized technologies such as TPUs, ASICs, and FPGAs. At the same time, easy-to-use interfaces with simple APIs are gaining popularity. This combination highlights the importance of solutions that allow AI to operate on third-party platforms without sacrificing privacy, and the need for open source tools that make these privacy-preserving technologies accessible.
Privacy solutions for training machine learning models
To address privacy challenges in AI, several sophisticated solutions have been developed, each focusing on specific needs and scenarios.
Federated learning (FL) enables the training of machine learning models across multiple decentralized devices or servers, each holding local data samples, without ever exchanging the raw data. Similarly, secure multiparty computation (MPC) allows multiple parties to jointly compute a function on their inputs while keeping those inputs private, ensuring that sensitive data never leaves its original environment.
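To make the federated learning idea concrete, here is a minimal sketch of federated averaging in Python with NumPy. The toy linear model, the three simulated clients, and the learning rate are all illustrative assumptions, not a production protocol: the point is simply that clients share updated weights, never their raw data.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One gradient step of least-squares regression on a client's local data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(weights, clients):
    """Average locally updated weights, weighted by each client's data size."""
    sizes = np.array([len(y) for _, y in clients])
    updates = np.stack([local_update(weights, X, y) for X, y in clients])
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Three clients, each holding a private shard of data that never leaves them.
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(100):
    w = federated_round(w, clients)
print(w)  # approaches [2.0, -1.0] without ever pooling the raw data
```

Only the model weights cross the network in each round, which is exactly the property (and, as discussed below, the residual risk) of FL.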
Another set of solutions focuses on transforming the data itself to preserve privacy while enabling useful analytics. Differential privacy (DP) adds carefully calibrated noise so that individual identities are protected while aggregate statistics remain accurate. Data anonymization (DA) removes personally identifiable information from data sets, providing a degree of anonymity and mitigating the risk of data breaches.
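As a rough illustration of the DP idea, the sketch below answers a count query through the Laplace mechanism; the epsilon value and the example data are illustrative choices, not recommendations.

```python
import numpy as np

def dp_count(data, predicate, epsilon=0.5):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon masks any single individual's contribution.
    """
    true_count = sum(1 for x in data if predicate(x))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 45, 29, 62, 51, 38, 70, 41]
# Each query releases a noisy aggregate, never individual rows.
print(dp_count(ages, lambda a: a >= 40))  # the true count, 5, plus noise
```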
Finally, homomorphic encryption (HE) allows operations to be performed directly on encrypted data, producing an encrypted result that, when decrypted, matches the result of the same operations performed on the plaintext.
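To see that property in action, the sketch below uses the open source python-paillier (`phe`) package, which implements Paillier encryption, an additively homomorphic scheme: a sum computed on ciphertexts decrypts to the sum of the plaintexts.

```python
# pip install phe  (python-paillier, an additively homomorphic scheme)
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Encrypt two values; only the public key is needed for this step.
a = public_key.encrypt(3)
b = public_key.encrypt(4)

# Add the ciphertexts directly -- no decryption happens here.
encrypted_sum = a + b

# Decrypting the result matches the operation on the plaintexts.
assert private_key.decrypt(encrypted_sum) == 3 + 4
print(private_key.decrypt(encrypted_sum))  # 7
```

Paillier supports only addition on ciphertexts; fully homomorphic schemes, discussed below, support arbitrary computation.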
The perfect fit
Each of these privacy solutions has its own set of advantages and trade-offs. FL, for example, still exchanges model updates with a third-party server, and those updates can leak information about the underlying data. MPC rests on cryptographic principles that are sound in theory but can create significant bandwidth demands in practice.
DP requires careful manual configuration: the noise must be calibrated to protect privacy while preserving the usefulness of the data, which limits the types of operations that can be performed on it. DA, although widely used, often provides the weakest privacy protection. Since anonymization typically occurs on a third-party server, cross-referencing with other data sets risks re-identifying individuals hidden within the data.
HE, and specifically fully homomorphic encryption (FHE), is notable for allowing computations on encrypted data that closely mimic those performed on plaintext. This makes FHE highly compatible with existing systems and straightforward to adopt thanks to accessible, open source libraries and compilers like Concrete ML, designed to give developers easy-to-use tools for building a wide range of applications. The main drawback today is computational overhead: FHE operations run slower than their plaintext equivalents, which can affect performance.
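As an example of how approachable this can be, here is a minimal sketch using Concrete ML's scikit-learn-style interface (API details may vary between versions): a model is trained in the clear, compiled to an FHE circuit, and then run on encrypted inputs.

```python
# pip install concrete-ml
from concrete.ml.sklearn import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train in the clear, exactly like scikit-learn.
model = LogisticRegression()
model.fit(X_train, y_train)

# Compile the model into an FHE circuit using representative data.
model.compile(X_train)

# Predictions now run on encrypted inputs end to end.
y_pred = model.predict(X_test, fhe="execute")
print((y_pred == y_test).mean())
```

The server executing `predict` only ever sees ciphertexts; the slowdown mentioned above is the price of that guarantee.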
While all the solutions and technologies reviewed here encourage collaboration and joint efforts, FHE's stronger data privacy protection can drive innovation and bring about a scenario in which no trade-off is needed between enjoying services and products and protecting personal data.
This article was produced as part of TechRadarPro's Expert Insights channel, where we feature the best and brightest minds in today's tech industry. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc.