Privacy-Preserving ML for DPOs: Federated Learning, Differential Privacy and Synthetic Data

This article accompanies Hour 2: Privacy-Preserving ML and LLM Privacy Risks in our full-day CPD programme on XpertAcademy. Completion of the full one-hour session, including the related learning materials, contributes to the one-hour CPD certificate issued for that session. You can access the course here: CPD Event B: Full-Day AI, Technical Privacy & Emerging Technology Training.

"We can use synthetic data and differential privacy, so GDPR should not be a blocker."

That sentence sounds reassuring in a data science meeting. It may also be completely premature.

Privacy-preserving machine learning techniques can be valuable. Federated learning can reduce the need to centralise training data. Differential privacy can reduce the risk that an individual can be singled out from statistical outputs. Synthetic data can support testing, development, research and model exploration without using the original dataset in every environment.

But none of those techniques is magic. Each protects against certain risks and leaves others untouched. Each depends on implementation detail. Each can be weakened by poor governance, small datasets, excessive querying, overfitting, weak access controls, unsafe linkage, poor parameter choices or unclear reuse.

For a DPO, the practical question is not "is this a privacy-preserving technique?" It is: what problem is the technique meant to solve, what personal data is still in the pipeline, what residual risk is left, and what evidence shows that the control works for this use case?

The operational problem

Assume a data science team wants to build a churn prediction model using customer transaction history, service contacts, complaints, product holdings, location region, vulnerability indicators and some demographic fields. The team wants to move fast because a commercial deadline is approaching. It proposes three controls:

generate synthetic data for early model development;
apply differential privacy to summary outputs used for reporting;
explore federated learning with two group companies so raw customer data does not move into one central training environment.

Those ideas may be sensible. They are also not a replacement for a DPIA, lawful basis analysis, transparency review, role mapping, minimisation, security controls, retention rules or vendor due diligence.

The privacy team should resist two bad reactions. The first is to dismiss the techniques as technical window-dressing. The second is to accept the labels as proof that the project is safe. The better route is to ask what each technique changes in the risk assessment.

What the techniques can and cannot solve

Privacy-preserving ML is a broad label. In this article, the focus is on three techniques DPOs are likely to meet in AI and analytics projects.

Technique	What it can help with	What it does not automatically solve
Federated learning	Training across separate datasets without routinely copying all raw training data into one central location.	It does not remove all leakage risk from model updates, gradients or outputs. It does not by itself settle controller, processor or joint controller roles.
Differential privacy	Adding carefully calibrated noise so statistical outputs reveal less about any one individual.	It does not make every dataset anonymous. It depends on parameter choices, query limits, context and whether outputs can be linked with other data.
Synthetic data	Creating artificial records with similar statistical properties to an original dataset.	It does not guarantee anonymity. The generator may memorise and reproduce rare records or preserve outliers, and the output may be linkable or remain personal data depending on how it is generated and used.

The ICO's PETs guidance is useful because it frames these techniques as tools that can support data protection compliance, not as automatic exemptions. Its AI guidance also reminds organisations to assess data minimisation, security and privacy attacks on models in context.

For DPOs, that means the conversation should move from labels to evidence.

Worked example: the shortcut proposal

In the churn model scenario, the project team starts with a real business purpose: identify customers at risk of leaving so the organisation can improve service, retention and support. The team wants to use five years of customer data and link records from the CRM, billing system, support platform and complaints system.

The privacy team starts with several facts:

the training data includes identifiable customer records;
complaint text and vulnerability indicators may include sensitive or higher-impact information;
two group companies may contribute data;
the proposed outputs may influence customer targeting and service interventions;
the first development environment is separate from production;
data scientists want realistic development data quickly.

Several points are still unknown:

whether all proposed fields are necessary;
whether the lawful basis and transparency position supports this use;
whether vulnerable customer data should be excluded, transformed or subject to additional safeguards;
whether group companies are separate controllers, joint controllers or processors for any part of the project;
whether synthetic records can be linked back to rare customers;
what differential privacy parameters would be used and who would approve them;
whether federated learning updates could leak information about local datasets;
what access, retention and deletion rules apply to intermediate files and model artefacts.

The decision question is therefore not whether to bless "PPML". It is whether, with these techniques in place, the proposed processing is necessary and proportionate, and whether there is evidence that the controls work as claimed.

Mini example one: synthetic data for development

The data science team proposes using synthetic data so model developers do not need direct access to raw customer records during early build and testing. That is a reasonable starting point.

Synthetic data can reduce exposure where developers need realistic structure but not real individuals. It may help with sandbox testing, pipeline development, quality checks, training exercises and demonstrations. It can reduce the number of people and environments that touch the original dataset.

However, the DPO should not assume that synthetic data is automatically anonymous. If the original dataset contains rare combinations of attributes, unusual complaints, small subgroups or high-value outliers, a synthetic generator may reproduce patterns close enough to allow re-identification or singling out. If the synthetic data is created from personal data, the generation process itself is processing personal data. The controls around the original data still matter.

The privacy team asks for:

the generation method and tool;
the source fields used to generate the synthetic data;
whether rare categories and outliers are suppressed, generalised or tested;
re-identification and membership inference testing where appropriate;
access controls for the original dataset and synthetic output;
a rule for whether synthetic data can be exported, shared with vendors or retained longer than source data;
documentation explaining whether the synthetic data is treated as anonymous, pseudonymous or personal data for this project.
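One of the tests above can be sketched in a few lines. The Python below is a minimal illustration, not a complete re-identification test: it flags synthetic records whose combination of attributes matches a combination held by only a few real customers. The field names, the threshold of two and the sample records are assumptions made for illustration. An exact-match check of this kind will miss near-matches and says nothing about membership inference, so it is a starting point for the conversation with the data science team rather than evidence of anonymity.

```python
from collections import Counter

# Hypothetical quasi-identifier fields; a real project would choose its own.
QUASI_IDENTIFIERS = ("region", "age_band", "product", "vulnerability_flag")


def combo(record, fields=QUASI_IDENTIFIERS):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[f] for f in fields)


def flag_rare_matches(real_records, synthetic_records, max_real_count=2):
    """Return synthetic records whose attribute combination is held by
    at least one and at most `max_real_count` real customers."""
    real_counts = Counter(combo(r) for r in real_records)
    flagged = []
    for s in synthetic_records:
        count = real_counts.get(combo(s), 0)
        if 0 < count <= max_real_count:
            flagged.append(s)
    return flagged


common = {"region": "A", "age_band": "35-44", "product": "B", "vulnerability_flag": False}
rare = {"region": "A", "age_band": "75+", "product": "B", "vulnerability_flag": True}
unseen = {"region": "C", "age_band": "18-24", "product": "D", "vulnerability_flag": False}

real = [rare, common, common, common]   # the rare combination appears once
synthetic = [dict(rare), dict(common), dict(unseen)]

flagged = flag_rare_matches(real, synthetic)
print(len(flagged), flagged[0]["age_band"])
```

In this toy run only the synthetic record that reproduces the one-off "75+" combination is flagged. A flagged record is a prompt for suppression, generalisation or further testing, not proof that a customer has been re-identified.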

The decision might be to permit synthetic data in the development environment, but not to treat it as anonymous unless the team provides stronger evidence. That is not a failure. It is a defensible control position.

Mini example two: differential privacy for reporting

The team also wants to publish internal dashboards showing model performance by region, product, complaint type and vulnerability segment. It proposes differential privacy for aggregated reporting.

Differential privacy can be valuable where repeated statistical outputs could otherwise reveal information about individuals. It can help reduce singling-out and inference risk by adding controlled noise. It is particularly relevant where small groups, sensitive categories or repeated queries create disclosure risk.

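The DPO still needs the details. Differential privacy is not a switch labelled "private". The level of protection depends on the privacy budget, noise mechanism, query limits, group sizes, release process and whether outputs can be combined with other internal data. The privacy budget, usually written as epsilon, caps the cumulative privacy loss across releases: a smaller budget means more noise and stronger protection, and every additional query spends part of it.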
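A short sketch may help show why the budget and query limits matter. The Python below assumes a simple counting query, the Laplace mechanism and basic sequential composition, where the epsilon values of successive releases add up against a fixed total. The class and function names are hypothetical, and the floating-point noise sampling is for illustration only; a production system would normally rely on a vetted differential privacy library.

```python
import random


class PrivacyBudget:
    """Tracks epsilon spent across releases, assuming basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted: release refused")
        self.spent += epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, budget, rng):
    """Release a count with Laplace noise.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    budget.spend(epsilon)
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(2024)  # fixed seed so the sketch is repeatable
budget = PrivacyBudget(total_epsilon=1.0)

first = noisy_count(120, epsilon=0.5, budget=budget, rng=rng)
second = noisy_count(120, epsilon=0.5, budget=budget, rng=rng)

try:
    noisy_count(120, epsilon=0.5, budget=budget, rng=rng)
    third_release_allowed = True
except RuntimeError:
    third_release_allowed = False

print(round(first, 1), round(second, 1), third_release_allowed)
```

The third release is refused because the first two have used the whole budget. That refusal is the point: averaging repeated noisy answers to the same question gets closer to the true figure, so every release has to be counted against one budget, and someone has to own the decision to reset or enlarge it.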

For example, a dashboard showing churn risk for "customers in Region A, with Product B, who complained twice, who have a vulnerability flag and are over 75" may have a very small cell size. Noise may help, but the better answer may also require suppression, aggregation, access restrictions or removal of certain segment combinations.
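The shrinking-cell effect can be shown with a small sketch. The Python below counts customers per segment and separates out cells that fall under a minimum size. The field names, the sample population and the threshold of 10 are assumptions for illustration. Suppression decided on the true count is a separate disclosure control and does not by itself carry a differential privacy guarantee, which is one reason the order of suppression and noise is worth asking about.

```python
from collections import Counter


def segment_counts(customers, dimensions):
    """Count customers in each combination of the chosen dimensions."""
    return Counter(tuple(c[d] for d in dimensions) for c in customers)


def split_small_cells(counts, min_cell_size=10):
    """Return (publishable cells, suppressed cell keys)."""
    publishable = {k: n for k, n in counts.items() if n >= min_cell_size}
    suppressed = sorted(k for k, n in counts.items() if n < min_cell_size)
    return publishable, suppressed


# Hypothetical population: 12 customers in Region A, 2 with a vulnerability flag.
customers = [{"region": "A", "vulnerable": i < 2} for i in range(12)]

by_region = segment_counts(customers, ["region"])
by_region_and_flag = segment_counts(customers, ["region", "vulnerable"])

ok_coarse, hidden_coarse = split_small_cells(by_region)
ok_fine, hidden_fine = split_small_cells(by_region_and_flag)

print(ok_coarse)    # one cell of 12, large enough to publish
print(hidden_fine)  # the flagged cell of 2 falls below the threshold
```

Adding a single extra dimension turns one publishable cell into a cell of two people. Each further filter on a dashboard has the same effect, which is why the permitted segment combinations belong in the approval record.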

The privacy team asks:

what outputs will be released and to whom;
whether users can create arbitrary queries or only approved dashboards;
what privacy budget is proposed and who controls it;
whether small cells are suppressed before or after noise is added;
whether outputs can be linked with other reports to narrow individuals;
whether dashboards will be used for customer-level decisions or only aggregate monitoring;
how the organisation will monitor changes in the privacy budget and report design over time.

The decision may be to approve differentially private aggregate reporting for specified dashboards, with suppression rules and restricted access, while prohibiting customer-level action from noisy aggregate outputs. That keeps the technique tied to the purpose.

Mini example three: federated learning across group companies

The third proposal is federated learning. Two group companies have related customer datasets but do not want to pool raw data into one central environment. The data science team proposes training local models and sharing model updates to build a better global model.

Federated learning can reduce the need to move raw training data. That may reduce some security, transfer and access risks. It may also support minimisation where centralisation would otherwise be excessive.

It does not remove every GDPR question. The local training data remains personal data. Model updates, gradients or parameters may still leak information in some circumstances, especially where datasets are small, imbalanced or sensitive. The group companies still need to define their roles, lawful basis, transparency, security responsibilities, audit rights, incident process and decision-making around the model.
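To make concrete what "model updates" means, here is a minimal sketch of one round of federated averaging for a one-feature linear model, written in plain Python. The function names, learning rate and toy data are assumptions; real federated learning frameworks are considerably more involved. The sketch shows that only weight changes leave each company, and also that a participant holding a single record sends an update from which that record's values can be partly worked back.

```python
LEARNING_RATE = 0.1  # assumed value, known to every participant


def local_update(weights, local_data, lr=LEARNING_RATE):
    """One least-squares gradient step on local data.

    Returns only the change in weights; the raw records stay local.
    """
    grad = [0.0] * len(weights)
    for features, target in local_data:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        for i, x in enumerate(features):
            grad[i] += error * x
    n = len(local_data)
    return [-lr * g / n for g in grad]


def federated_average(updates, sizes):
    """Average the updates, weighted by each participant's record count."""
    total = sum(sizes)
    return [
        sum(u[i] * s for u, s in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


global_weights = [0.0]
company_a = [([1.0], 2.0), ([2.0], 4.0)]  # two records
company_b = [([1.0], 2.0)]                # a single record

update_a = local_update(global_weights, company_a)
update_b = local_update(global_weights, company_b)
combined = federated_average([update_a, update_b], [len(company_a), len(company_b)])
new_weights = [w + d for w, d in zip(global_weights, combined)]

# With one record and starting weights of zero, dividing the update by the
# learning rate exposes feature * target for that record.
leaked_product = update_b[0] / LEARNING_RATE
print(update_a, update_b, new_weights, leaked_product)
```

The single-record case is deliberately extreme, but it illustrates why small or unusual local datasets deserve attention in the threat model even though no raw record is transmitted.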

The privacy team asks:

which entity determines the model purpose and training design;
whether the group companies are independent controllers, joint controllers or processors for the relevant processing;
what information leaves each local environment;
whether secure aggregation, encryption, access controls or update clipping are used;
whether updates can be attacked or inverted to infer local data;
whether one entity can use the resulting model for purposes not shared by the others;
how data subject rights and objections will be handled across the arrangement;
what happens if one participant changes its data quality, purpose or withdrawal position.
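Two of the safeguards named in the questions above, update clipping and secure aggregation, can be sketched briefly. The Python below is illustrative only: the function names and values are assumptions, and the pairwise mask is a toy stand-in for real secure aggregation protocols, which use cryptographic key agreement and handle participants dropping out. Clipping limits how much any one update can move the model; masking means the coordinator works with the sum rather than with each participant's update.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update down so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    if norm <= max_norm:
        return list(update)
    return [v * max_norm / norm for v in update]


def mask_pair(update_a, update_b, rng):
    """Toy pairwise masking: A adds a shared random mask, B subtracts it.

    Each masked update is obscured on its own, but the masks cancel
    when the coordinator adds the two together.
    """
    mask = [rng.uniform(-100.0, 100.0) for _ in update_a]
    masked_a = [u + m for u, m in zip(update_a, mask)]
    masked_b = [u - m for u, m in zip(update_b, mask)]
    return masked_a, masked_b


rng = random.Random(7)
raw_a = [3.0, 4.0]  # L2 norm 5.0, will be clipped
raw_b = [0.3, 0.4]  # L2 norm 0.5, left unchanged

clipped_a = clip_update(raw_a, max_norm=1.0)
clipped_b = clip_update(raw_b, max_norm=1.0)

masked_a, masked_b = mask_pair(clipped_a, clipped_b, rng)
aggregate = [a + b for a, b in zip(masked_a, masked_b)]
expected = [a + b for a, b in zip(clipped_a, clipped_b)]
print(clipped_a, aggregate, expected)
```

With only two participants, as in the churn scenario, each company can subtract its own update from the aggregate and recover the other's. Masking of this kind therefore protects against the coordinator rather than against the other participant, which is worth raising when the team describes the aggregation design.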

The decision might be to continue design work only after role mapping, a data-sharing or joint-controller arrangement, a federated learning threat model and technical leakage testing are complete. Again, the point is not to block the technique. The point is to make the governance match the risk.

What the DPO or privacy team should check

The core check is whether the technique solves the risk the project says it solves.

Define the purpose before choosing the technique. A PET cannot fix a vague or drifting purpose.
Map the personal data across the full ML lifecycle: source data, extraction, feature engineering, training, validation, synthetic generation, model updates, logs, outputs and dashboards.
Identify affected individuals and sensitive contexts, including vulnerable customers, employees, children, patients, complainants or small populations.
Check whether the proposed technique reduces data exposure, inference risk, centralisation risk, identifiability risk or something else.
Confirm whether personal data remains in the source data, intermediate files, model, synthetic output, gradients, logs or dashboards.
Test lawful basis, transparency and fairness for the processing, not just for the technical method.
Map controller, joint controller and processor roles, especially in group, supplier, research or collaborative training settings.
Ask for technical evidence: parameter choices, privacy budget, leakage testing, re-identification testing, model attack assessment and access controls.
Confirm retention and deletion rules for original data, synthetic data, intermediate datasets, model versions, logs and reports.
Decide who can approve changes to privacy budgets, features, dashboard outputs, training participants and reuse of the model.
Record residual risk and escalation triggers, including use with sensitive data, small groups, high-impact decisions or external sharing.

The checklist should be applied with judgement. A low-risk internal test using synthetic data may need lighter evidence than a cross-organisation model trained on sensitive health or vulnerability data. The evidence should scale with the risk.

Evidence and records

The evidence record should show why the privacy-preserving technique was chosen and what it actually controls.

For the churn model, the DPIA or AI assessment should include the project purpose, data categories, affected individuals, lawful basis, transparency position, expected outputs and decision impact. It should then address each technique separately.

The synthetic data record should identify the source data, generation method, transformation rules, outlier handling, re-identification testing, access controls, sharing limits and classification decision. If the organisation treats the output as anonymous, the evidence threshold should be much higher than if it treats it as controlled pseudonymous or personal data.

The differential privacy record should include the release purpose, approved outputs, privacy budget, noise mechanism, query limits, small-cell rules, access model, approval owner and monitoring approach. It should also identify who can change the settings and how changes are logged.

The federated learning record should include participant roles, data locations, local training controls, information shared between participants, aggregation design, security measures, leakage testing, governance agreement, incident route and rights-handling position.

The organisation should also retain:

Evidence item	Why it matters
Decision record	Shows what was approved, what was excluded and why the technique was proportionate.
DPIA or AI assessment	Connects technical controls to lawful basis, fairness, transparency, rights, minimisation and risk.
Technical validation note	Shows that privacy claims were tested, not simply asserted.
Data-flow and model-flow map	Shows where personal data, synthetic data, gradients, model artefacts and outputs move.
Access and retention evidence	Shows who can use each dataset or output and when it is deleted.
Vendor or group governance agreement	Shows role mapping, support access, security, audit, transfer and change controls.
Residual risk note	Shows remaining risks, owner acceptance and review triggers.
Review schedule	Shows when parameters, datasets, purposes and release outputs will be checked again.

This is the evidence that lets the DPO say: the technique reduces these risks, does not reduce these others, and has been approved within these limits.

Escalation and review triggers

Privacy-preserving ML controls should be reviewed when the facts change.

Escalate or reopen the assessment if:

the dataset expands;
the model purpose changes;
vulnerable groups or special category data are added;
query access becomes broader;
synthetic data is shared outside the project;
differential privacy parameters change;
the privacy budget is exhausted or reset;
federated participants change;
a model is exposed through an API;
a vendor gains access to model artefacts;
outputs start influencing individual decisions.
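Some of these triggers can be checked mechanically if the approved scope is recorded in a structured form. The sketch below is one hypothetical way to do that in Python. The field names and trigger wording are assumptions, it covers only a handful of the triggers listed above, and it supports rather than replaces the judgement of the person reviewing the change.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ApprovedScope:
    """Hypothetical record of what was approved for the project."""
    purpose: str
    data_fields: frozenset
    total_epsilon: float
    participants: frozenset
    influences_individual_decisions: bool


def review_triggers(approved, current):
    """Compare the current project state with the approved scope."""
    triggers = []
    if current.purpose != approved.purpose:
        triggers.append("model purpose changed")
    if not current.data_fields <= approved.data_fields:
        triggers.append("dataset expanded with new fields")
    if current.total_epsilon != approved.total_epsilon:
        triggers.append("differential privacy parameters changed")
    if current.participants != approved.participants:
        triggers.append("federated participants changed")
    if current.influences_individual_decisions and not approved.influences_individual_decisions:
        triggers.append("outputs now influence individual decisions")
    return triggers


approved = ApprovedScope(
    purpose="churn prediction",
    data_fields=frozenset({"product_holdings", "service_contacts", "region"}),
    total_epsilon=1.0,
    participants=frozenset({"Company A", "Company B"}),
    influences_individual_decisions=False,
)

# Current state: a new field has been added and the budget has been doubled.
current = replace(
    approved,
    data_fields=approved.data_fields | {"vulnerability_flag"},
    total_epsilon=2.0,
)
triggers = review_triggers(approved, current)
print(triggers)
```

A check like this is only as good as the record it compares against, which is another reason to keep the decision record and parameter settings current.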

The same applies where a technique is used to justify wider access than originally approved. A common warning sign is "because it is synthetic, we can let more people use it" or "because it is federated, the group can reuse the model anywhere". Those statements may be true in some limited cases. They are not safe as assumptions.

What this means for CPD

For CPD purposes, the key skill is being able to ask the right governance questions without pretending to be the ML engineer.

A DPO does not need to derive the differential privacy mechanism by hand. The DPO does need to know that the privacy budget matters, that small groups can still be risky, that synthetic data is not automatically anonymous, and that federated learning does not remove role mapping or leakage risk.

After this topic, a privacy or governance lead should be able to move a meeting from "we are using PETs" to "which risks do these PETs mitigate, what risks remain, what evidence supports the claim, and what conditions apply to approval?"

That is the practical value of privacy-preserving ML in a governance setting. It gives the organisation better options, but it still needs disciplined decisions.

This article is intended to support the learning covered in Hour 2 of our XpertAcademy CPD programme. The relevant CPD certificate is issued for completion of the full one-hour session on XpertAcademy, rather than for reading this article on its own. You can return to the course here: CPD Event B: Full-Day AI, Technical Privacy & Emerging Technology Training.

Sources

Information Commissioner's Office, Privacy-enhancing technologies (PETs): https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-sharing/privacy-enhancing-technologies/
Information Commissioner's Office, Guidance on AI and data protection – security and data minimisation in AI: https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/how-should-we-assess-security-and-data-minimisation-in-ai/
Information Commissioner's Office, AI and data protection risk toolkit: https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/ai-and-data-protection-risk-toolkit/
European Data Protection Board, Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models: https://www.edpb.europa.eu/documents/opinion-of-the-board-art-64/opinion-282024-on-certain-data-protection-aspects-related-to_en
European Data Protection Board, AI Privacy Risks and Mitigations – Large Language Models, Support Pool of Experts report: https://www.edpb.europa.eu/system/files/2025-04/ai-privacy-risks-and-mitigations-in-llms.pdf

Publication verification notes:

ICO PETs guidance re-checked on 2026-06-25. It includes differential privacy, synthetic data and federated learning as PET topics and is aimed at DPOs and others using large personal data sets.
ICO AI guidance re-checked on 2026-06-25. It addresses AI security, minimisation, privacy attacks, model inversion, membership inference and PETs. The page carried a Data (Use and Access) Act 2025 banner and should be re-checked before publication.
EDPB Opinion 28/2024 and the EDPB-published SPE LLM report were used cautiously. The SPE report is practical but should not be described as a formal EDPB position.