Here’s how machine learning can violate your privacy

This article originally appeared on The Conversation.

Machine learning has pushed the boundaries in several fields, including personalized medicine, self-driving cars and customized advertisements. However, research has shown that these systems memorize aspects of the data they were trained with in order to learn patterns, which raises privacy concerns.

In statistics and machine learning, the goal is to learn from past data to make new predictions or inferences about future data. To achieve this goal, the statistician or machine learning expert selects a model to capture the suspected patterns in the data. A model applies a simplifying structure to the data, making it possible to learn patterns and make predictions.

Complex machine learning models have some inherent advantages and disadvantages. The upside is that they can learn much more complex patterns and work with richer data sets for tasks such as image recognition and predicting how a specific person will respond to a treatment.

However, they also run the risk of overfitting to the data. This means that they make accurate predictions about the data they were trained with, but they also start to learn additional aspects of the data that are not directly related to the task at hand. This leads to models that do not generalize, meaning they perform poorly on new data that is of the same type as the training data but not exactly the same.
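As a rough illustration of overfitting, the sketch below compares two hypothetical "models" on made-up data: a lookup table that memorizes every training example, and a simple threshold rule. The data generator, the 20% label noise and all names are assumptions chosen for the example, and the lookup table is a deliberate caricature of an overly flexible model.

```python
import random

random.seed(0)

def make_data(n):
    """Hypothetical data: x in [0, 1); label is 1 when x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:
            label = 1 - label  # label noise that carries no real pattern
        data.append((x, label))
    return data

train = make_data(200)
new = make_data(2000)

# Overfit "model": memorizes every training example, noise included.
table = {x: label for x, label in train}
majority = round(sum(label for _, label in train) / len(train))

def memorizer(x):
    # Falls back to the majority label for any input it has never seen.
    return table.get(x, majority)

def simple_rule(x):
    # A model with far less capacity: a single threshold.
    return int(x > 0.5)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, "train:", accuracy(model, train), "new:", accuracy(model, new))
```

The memorizer scores perfectly on the data it has stored and close to chance on new data, while the simple rule performs about the same on both.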

While there are techniques to address the predictive error associated with overfitting, there are also privacy concerns that stem from a model being able to learn so much from the data.

How machine learning algorithms draw conclusions

Each model has a certain number of parameters. A parameter is an element of a model that can be changed. Each parameter has a value, or setting, that the model infers from the training data. Parameters can be thought of as the different knobs that can be turned to influence the performance of the algorithm. While a straight-line pattern has only two knobs, the slope and the intercept, machine learning models have a great many parameters. The language model GPT-3, for example, has 175 billion.
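To make the two "knobs" of a straight line concrete, here is a minimal sketch that infers a slope and an intercept from a handful of made-up points using ordinary least squares. The function name `fit_line` and the data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Infer the two parameters of a straight line (slope, intercept) by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data that lies exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

A large model is fitted in the same spirit, only with vastly more knobs to set.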

To choose the parameters, machine learning methods use training data with the aim of… predictive error on the training data. For example, if the goal is to predict whether a person would respond well to a particular medical treatment based on their medical history, the machine learning model would make predictions about data where the model’s developers already know whether each person responded well or poorly. The model is rewarded for predictions that are correct and penalized for incorrect predictions, which causes the algorithm to adjust its parameters (that is, turn some of the “knobs”) and try again.
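The predict, score and adjust cycle can be sketched with a tiny logistic regression trained by gradient descent. This is one common way to fit such a model, not necessarily what any particular system uses. The single feature (a made-up score derived from a medical history), the labels and the learning rate are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weight, bias, xs, ys):
    """Average penalty: small when predictions match the known outcomes."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(weight * x + bias)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def train(xs, ys, learning_rate=0.1, steps=500):
    weight, bias = 0.0, 0.0  # the two "knobs" of this model
    history = []
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(weight * x + bias) - y  # prediction minus known outcome
            grad_w += error * x
            grad_b += error
        # Turn each knob a little in the direction that lowers the penalty.
        weight -= learning_rate * grad_w / len(xs)
        bias -= learning_rate * grad_b / len(xs)
        history.append(log_loss(weight, bias, xs, ys))
    return weight, bias, history

# Hypothetical data: a score per patient and whether they responded well (1) or not (0).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

weight, bias, history = train(xs, ys)
print("loss after first step:", history[0], "loss after last step:", history[-1])
```

Each pass makes predictions, measures how wrong they are, and nudges the parameters before trying again.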

To avoid overfitting the training data, machine learning models are also checked against a validation dataset. The validation dataset is a separate dataset that is not used in the training process. By checking the performance of the machine learning model on this validation dataset, developers can ensure that the model is able to generalize what it learns beyond the training data, avoiding overfitting.
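A minimal sketch of a validation check, assuming a hypothetical noisy dataset and a k-nearest-neighbor classifier standing in for the model. The helper names, the split ratio and the choice of k values are illustrative assumptions.

```python
import random

def make_data(n, rng):
    """Hypothetical data: label is 1 when x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data

def split(data, validation_fraction=0.25, seed=0):
    """Hold out part of the data; the held-out part is never used for fitting."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

def knn_predict(train_set, x, k):
    nearest = sorted(train_set, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)  # majority vote among the k nearest points

def accuracy(train_set, data, k):
    return sum(knn_predict(train_set, x, k) == y for x, y in data) / len(data)

train_set, validation_set = split(make_data(400, random.Random(1)))

for k in (1, 15):
    print("k =", k,
          "train:", accuracy(train_set, train_set, k),
          "validation:", accuracy(train_set, validation_set, k))
```

With k = 1 the model reproduces its training labels exactly, so only the validation score reveals how it behaves on data it has not seen.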

While this process succeeds in ensuring good performance of the machine learning model, it does not directly prevent the machine learning model from memorizing information in the training data.

Privacy concerns

Due to the large number of parameters in machine learning models, there is a potential that the machine learning method memorizes some of the data it was trained on. In fact, this is a widespread phenomenon, and users can extract the memorized data from the machine learning model by using queries tailored to retrieve the data.
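A toy version of such an extraction can be sketched with a very small n-gram text model. Real attacks on large models are far more involved. The corpus, the fictional record about "jane doe" and the function names are all made up for this example.

```python
from collections import Counter, defaultdict

def train_ngram(corpus, n=3):
    """Count which token follows each (n-1)-token context in the training text."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model

def complete(model, prompt, n=3, max_tokens=10):
    """Greedily extend the prompt with the most frequent next token."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        context = tuple(tokens[-(n - 1):])
        if context not in model:
            break
        tokens.append(model[context].most_common(1)[0][0])
    return " ".join(tokens)

# Hypothetical training text; the last line stands in for a sensitive record.
corpus = [
    "the patient responded well to treatment",
    "the patient responded poorly to treatment",
    "jane doe tested positive for condition X",
]

model = train_ngram(corpus)

# A query tailored to the record: the model fills in the rest verbatim.
print(complete(model, "jane doe"))
```

Because the record appears only once and nothing else shares its context, the model has no choice but to reproduce it when prompted with its opening words.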

If the training data contains sensitive information, such as medical or genomic data, the privacy of the people whose data was used to train the model could be at risk. Recent research has shown that it is actually necessary for machine learning models to memorize aspects of the training data in order to obtain optimal performance on certain problems. This indicates that there may be a fundamental trade-off between the performance of a machine learning method and privacy.

Machine learning models also make it possible to predict sensitive information using seemingly non-sensitive data. Target, for example, was able to predict which customers were likely to be pregnant by analyzing the purchasing habits of customers who had registered with the Target baby registry. Once the model was trained on this dataset, it was able to send pregnancy-related ads to customers it suspected were pregnant because they had purchased items such as supplements or unscented lotions.

Is privacy protection even possible?

Although many methods have been proposed to reduce memorization in machine learning methods, most have been largely ineffective. Currently, the most promising solution to this problem is to guarantee a mathematical limit on the privacy risk.

The state-of-the-art method for formal privacy protection is differential privacy. Differential privacy requires that a machine learning model not change much when an individual’s data in the training dataset is changed. Differential privacy methods achieve this guarantee by introducing additional randomness into the algorithm that “hides” the contribution of any given individual. Once a method is protected with differential privacy, no possible attack can violate that privacy guarantee.
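The idea of added randomness can be sketched with the Laplace mechanism applied to a simple count, which is one of the most basic differentially private methods. This is not a private training algorithm for a full machine learning model. The records, the privacy parameter `epsilon` and the function names are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise.

    Changing one person's record changes the true count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)

# Hypothetical records: whether each of 300 patients responded to a treatment.
records = [{"responded": i % 3 == 0} for i in range(300)]

noisy = private_count(records, lambda r: r["responded"], epsilon=0.5, rng=rng)
print("noisy count:", round(noisy, 2))
```

The released number stays close to the true count on average, but the noise makes it hard to tell whether any one person's record was changed.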

However, even if a machine learning model is trained using differential privacy, this does not prevent it from making sensitive inferences, as in the Target example. To prevent these privacy violations, all data sent to the organization must be protected. This approach is called local differential privacy, and Apple and Google have implemented it.
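A classic, simple instance of local differential privacy is randomized response, where each person perturbs their own answer before it leaves their device. The sketch below is an illustration under assumed parameters and is not a description of the systems Apple or Google deploy. The function names and the 30% true rate are hypothetical.

```python
import math
import random

def randomized_response(truth, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1), otherwise the flipped bit."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return truth if rng.random() < p_truth else 1 - truth

def estimate_rate(reports, epsilon):
    """Undo the known flipping probability to estimate the population rate."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
epsilon = 1.0

# Hypothetical population: 30% of 20,000 people have the sensitive attribute.
truths = [1] * 6000 + [0] * 14000

# The organization only ever receives the perturbed reports.
reports = [randomized_response(t, epsilon, rng) for t in truths]

print("estimated rate:", round(estimate_rate(reports, epsilon), 3))
```

No single report can be trusted, yet the organization can still recover an accurate estimate for the population as a whole.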


Because differential privacy limits how much the machine learning model can rely on one individual’s data, it prevents memorization. Unfortunately, it also limits the performance of the machine learning methods. Because of this trade-off, the usefulness of differential privacy has been criticized, as it often results in a significant drop in performance.
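One way to see this trade-off in numbers is to take the Laplace mechanism as an illustrative assumption, since the article does not name a specific method. Here $f$ is the quantity being released, $\Delta$ is the most that one individual's data can change it, and $\varepsilon$ is the privacy parameter, where smaller values mean stronger privacy. All three symbols are introduced here for the example.

```latex
% Laplace mechanism for a query f with sensitivity \Delta at privacy level \varepsilon
\tilde{f}(D) = f(D) + Z, \qquad Z \sim \mathrm{Laplace}\!\left(0, \tfrac{\Delta}{\varepsilon}\right)

% Typical size of the added noise
\mathbb{E}\,|Z| = \frac{\Delta}{\varepsilon}, \qquad \mathrm{Var}(Z) = \frac{2\Delta^{2}}{\varepsilon^{2}}
```

Under this mechanism, halving $\varepsilon$ doubles the expected error, so stronger privacy is paid for directly in accuracy.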

Moving forward

Because of the tension between inferential learning and privacy concerns, there is ultimately a social question as to which is more important in which contexts. When data does not contain sensitive information, it is easy to recommend using the most powerful machine learning methods available.

However, when working with sensitive data, it is important to weigh the consequences of privacy leaks, and it may be necessary to sacrifice some machine learning performance to protect the privacy of the people whose data trained the model.

Disclosure Statement: Jordan Awan receives funding from the National Science Foundation and the National Institutes of Health. He also serves as a privacy consultant for the federal nonprofit MITRE.
