# Pseudonymisation, Anonymisation and Data Minimisation in AI Systems

Canonical URL: https://xpertdpo.com/pseudonymisation-anonymisation-data-minimisation-ai-systems/

Content type: Article

Published: 2026-06-25T14:16:32+01:00

Updated: 2026-06-25T16:13:55+01:00

Author: Philipa Jane Farley, Head of Legal and Operations

Summary: Pseudonymisation, anonymisation and data minimisation can reduce privacy risk in AI systems, but only where the controls match the use case, the evidence and the AI lifecycle.

## Article

*This article accompanies **Hour 1: Data Minimisation, Pseudonymisation & Anonymisation** in our full-day CPD programme on [XpertAcademy](https://xpertacademy.com/). Completion of the full one-hour session, including the related learning materials, contributes to the one-hour CPD certificate issued for that session. You can access the course here: [CPD Event B: Full-Day AI, Technical Privacy & Emerging Technology Training](https://xpertacademy.com/cpd-event-b-ai-technical/).*

 Organisations often reach for three privacy controls when an AI use case starts to feel uncomfortable: minimise the data, pseudonymise the dataset, or anonymise it completely. Those are sensible controls. They are also easy to overstate.

 Removing obvious identifiers from a dataset does not necessarily make an AI system low risk. A model may still learn patterns from personal data. A retrieval tool may still expose source records. A prompt log may still contain sensitive information. A vendor may describe data as “de-identified” or “aggregated” without giving enough evidence to show what that means in practice.

 For DPOs, privacy leads and governance teams, the practical question is not whether the right words appear in a DPIA. The question is whether the control chosen actually reduces the risk in the context of the AI system being built or bought.

 This is general guidance, not legal advice. It is intended to help teams ask better questions, record better evidence and avoid treating pseudonymisation, anonymisation or minimisation as magic words.

## Why these controls matter more in AI governance

 AI systems can create privacy risk at more points than a conventional database or workflow tool. Data may be used for training, testing, retrieval, prompting, output generation, monitoring, support, analytics and future model improvement. Each stage can involve different people, systems, processors, subprocessors, retention periods and access routes.

 That matters because privacy controls are stage-specific. A control that is effective for a static analytics extract may not be enough for a system that generates free-text responses, stores user prompts, allows administrators to inspect logs, or uses customer data to improve a vendor-managed model.

The European Data Protection Board has been particularly clear that AI model assessment depends on context. Its Opinion 28/2024 addresses when an AI model may be considered anonymous and notes that the issue turns on whether information relating to identifiable individuals can be obtained from the model using means reasonably likely to be used. That is a question to be answered with evidence, not a labelling exercise.

 The same discipline applies under UK GDPR. The ICO’s AI guidance recognises that AI systems may need significant data, but still expects organisations to consider data minimisation from the design phase and through procurement where third-party AI systems are involved. “AI needs lots of data” is not a minimisation analysis. It is the start of one.

## The difference between the three controls

 The terms are often used together, but they do different jobs.

| Control | Practical meaning | What it does not prove by itself |
| --- | --- | --- |
| Data minimisation | The organisation uses only personal data that is adequate, relevant and necessary for the stated purpose. | It does not mean “no personal data”. It also does not justify collecting everything now in case it becomes useful later. |
| Pseudonymisation | Identifiers are removed, replaced or transformed, with additional information kept separately and protected so the data cannot be attributed to individuals without controlled access. | It does not usually take the data outside GDPR or UK GDPR by itself. Pseudonymised data is still normally personal data. |
| Anonymisation | Data is processed so individuals are not identifiable, taking account of realistic re-identification risks in the relevant context. | It is not a permanent sticker. It must be assessed against the data, the environment, the recipients and reasonably likely means of re-identification. |

 The distinction is not academic. If a team says it is using anonymous data when it is actually using pseudonymised personal data, the governance record may be wrong at every later step: lawful basis, transparency, DPIA threshold, processor terms, retention, international transfers and individual rights. If a team says it has minimised data because it removed names, it may miss the location histories, rare job titles, complaint narratives or transaction patterns that identify people in practice.

## Data minimisation is a design question, not a slogan

 Data minimisation requires a purpose. Without a clear purpose, the team cannot explain what data is necessary.

 That sounds basic, but AI projects often begin with a vague aim: improve customer support, make recruitment faster, identify risk, summarise case files, personalise learning, monitor quality, detect fraud, or make internal knowledge easier to search. Those aims are too broad for a proper minimisation assessment.

 A better starting point is to state the decision, recommendation, output or support function the AI system is intended to provide. Then the team can test each data category against that purpose. Do we need the full free-text record, or would labelled categories be enough? Do we need precise dates, or would month and year be sufficient? Do we need direct identifiers in the model feature set, or only in a separate operational system? Do we need historic or special category data, or is it being pulled in because it sits in the source system?
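The field-by-field test described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended implementation: the field names, the `ALLOWED_FIELDS` allow-list and the `minimise_record` helper are all hypothetical, and the real allow-list would come from the documented purpose.

```python
from datetime import date

# Hypothetical allow-list: fields judged necessary for the stated purpose.
ALLOWED_FIELDS = {"case_category", "opened_on", "region"}


def minimise_record(record: dict) -> dict:
    """Keep only allow-listed fields and coarsen dates to year and month."""
    minimised = {}
    for field, value in record.items():
        if field not in ALLOWED_FIELDS:
            continue  # dropped: not justified for this purpose
        if isinstance(value, date):
            value = value.strftime("%Y-%m")  # month and year only
        minimised[field] = value
    return minimised


raw = {
    "customer_name": "A. Example",
    "email": "a.example@example.com",
    "case_category": "billing",
    "opened_on": date(2025, 3, 14),
    "region": "North West",
    "free_text": "Full complaint narrative...",
}
print(minimise_record(raw))
```

Here the name, email address and free-text narrative never reach the model feature set, and the precise date becomes `2025-03`. Whether that coarser record is still sufficient for the purpose is the judgement the minimisation assessment has to record.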

 The minimisation analysis should also include prompts, uploaded documents, embeddings, vector databases, logs, telemetry and generated outputs. In a retrieval-augmented generation system, the source documents and retrieved snippets may be more privacy-relevant than the base model. In a staff-facing AI assistant, prompt logs may reveal employee concerns, commercial information or personal data about third parties.

 A practical minimisation review should ask whether the organisation can achieve the same purpose with less personal data, less granular data, shorter retention, narrower access, synthetic test data, redaction, aggregation, local deployment, or a vendor configuration that excludes customer data from model improvement.
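Redaction before logging is one of the options listed above that can be shown concretely. The sketch below assumes a hypothetical `redact_prompt` step applied before a prompt is written to a log. The two patterns are illustrative only: pattern matching will miss names, addresses and identifying narrative, so it should be treated as one layer rather than proof that a log is free of personal data.

```python
import re

# Illustrative patterns only; real identifiers are far more varied.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact_prompt(text: str) -> str:
    """Replace email addresses and phone-like numbers before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


prompt = "Customer jane.doe@example.com called from +44 20 7946 0958 about a refund."
print(redact_prompt(prompt))
```

The logged text becomes `Customer [EMAIL] called from [PHONE] about a refund.` The reason for the call is still recorded, which is a reminder that redaction reduces exposure without settling the retention and access questions.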

 The answer will not always be “use less data at all costs”. Some AI systems need enough data to be accurate, fair and safe. Data minimisation is not a ban on useful data. It is the discipline of explaining why each category is necessary and why a less intrusive alternative would not be sufficient.

## Pseudonymisation that deserves the name

 Pseudonymisation is one of the most useful controls in AI governance, but it is often described too casually.

Under Article 4(5) GDPR, pseudonymisation means processing personal data so that it can no longer be attributed to a specific person without additional information, provided that the additional information is kept separately and protected by technical and organisational measures. The EDPB’s Guidelines 01/2025 on Pseudonymisation emphasise both sides of the control: the transformation of the data and the measures that prevent unauthorised attribution.

 For AI systems, that means the DPO should look beyond whether names were replaced with reference numbers. The review should consider who can access the mapping table or key, whether the transformation can be reversed, whether the same pseudonym is used across multiple datasets, whether attributes left in the dataset still identify people, and whether data can be linked with other internal or external sources.

 Weak pseudonymisation can create false comfort. If the AI team keeps the pseudonym key in the same project folder as the transformed dataset, the separation may be more cosmetic than real. If multiple teams can use stable pseudonyms across HR, productivity, absence and performance datasets, linkage risk may increase. If the dataset contains rare combinations such as location, role and incident narrative, a person may still be identifiable without the original name.
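To make the separation and linkage points concrete, here is a minimal sketch using a keyed hash (HMAC-SHA256) from the Python standard library. It is one possible transformation, not the method any particular guidance prescribes. The `pseudonymise` helper, the key values and the identifier format are hypothetical, and the keys are hard-coded only so the example runs.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability only


# Hypothetical keys. In practice each would be loaded from a separate,
# access-controlled secret store, never stored beside the dataset.
HR_KEY = b"example-key-for-hr-dataset"
ABSENCE_KEY = b"example-key-for-absence-dataset"

hr_id = pseudonymise("employee-10432", HR_KEY)
absence_id = pseudonymise("employee-10432", ABSENCE_KEY)

print(hr_id == pseudonymise("employee-10432", HR_KEY))  # same key: stable
print(hr_id == absence_id)  # different keys: not directly linkable
```

Using a different key per dataset means the same employee does not carry one pseudonym across HR, absence and performance data. Anyone who holds a key can regenerate the pseudonyms, so the control is only as strong as the protection around the key, and the remaining attributes in the dataset may still identify people.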

 This is particularly important where AI systems rely on feature-rich data. A recruitment model, employee analytics tool, fraud model, patient triage assistant or complaints classifier may not need a person’s name to produce an effect about that person. Pseudonymisation can reduce exposure and support security, purpose limitation and data protection by design, but it does not remove the need to consider fairness, lawful basis, transparency, individual rights or residual risk.

 For vendor-managed AI tools, the evidence should also cover the vendor environment. It is not enough to say that the customer sends pseudonymised data if the vendor can combine it with account information, support tickets, usage telemetry or prompt logs. The contract and technical design should align: what is sent, what is stored, who can access it, what the vendor may use it for, whether it can be used for model improvement, and how it is deleted.

## Anonymisation is an outcome to evidence

 Anonymisation can be powerful. If data is genuinely anonymous, data protection law will generally not apply to the anonymous information itself. But anonymisation is not achieved merely by deleting names, hashing an identifier or calling a dataset “de-identified”.

 The ICO’s anonymisation guidance frames anonymisation around whether information relates to identified or identifiable individuals. The assessment depends on identifiability risk, including who may have access to the data, what other information may be available, and what means are reasonably likely to be used. The EDPB’s AI models opinion takes the same contextual approach for AI models trained on personal data.

 In AI, anonymisation claims need particular care for three reasons.

 First, AI systems are good at finding patterns. Attributes that appear harmless separately can become identifying in combination. A dataset that includes postcode district, job role, event date and narrative text may identify a person in a small organisation or local community.
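A simple way to test for identifying combinations is to count how many records share each combination of attributes, in the style of a k-anonymity check. The sketch below is illustrative: the choice of `QUASI_IDENTIFIERS`, the `small_groups` helper and the sample rows are assumptions, and passing such a check does not by itself establish anonymity.

```python
from collections import Counter

# Hypothetical choice of attributes that could identify someone in combination.
QUASI_IDENTIFIERS = ("postcode_district", "job_role", "event_month")


def small_groups(rows: list[dict], k: int) -> dict:
    """Return quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[q] for q in QUASI_IDENTIFIERS) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


rows = [
    {"postcode_district": "M1", "job_role": "nurse", "event_month": "2025-01"},
    {"postcode_district": "M1", "job_role": "nurse", "event_month": "2025-01"},
    {"postcode_district": "M1", "job_role": "nurse", "event_month": "2025-01"},
    {"postcode_district": "M4", "job_role": "radiographer", "event_month": "2025-02"},
]
print(small_groups(rows, k=3))
```

With `k=3`, the single radiographer record is flagged as a group of one. Narrative text is not covered by a check like this and needs separate review.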

Second, AI models and associated systems may expose information in unexpected ways. Risks for machine learning models, including LLMs, include memorisation, regurgitation, membership inference, model inversion and extraction attacks. These risks vary by model type, training data, access controls, release context and monitoring. They should be considered where an organisation is claiming that a model or dataset is anonymous.
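Of the risks listed, verbatim regurgitation is the easiest to probe with a simple test. The sketch below compares generated output against known source records and reports shared word sequences. The `verbatim_overlaps` helper and its six-word threshold are assumptions for illustration. A check like this says nothing about membership inference, inversion or paraphrased leakage, which need dedicated testing.

```python
def verbatim_overlaps(output: str, records: list[str], min_words: int = 6) -> list[str]:
    """Return word sequences of min_words length found in both output and a record."""
    out_words = output.lower().split()
    out_ngrams = {
        " ".join(out_words[i:i + min_words])
        for i in range(len(out_words) - min_words + 1)
    }
    hits = []
    for record in records:
        rec_words = record.lower().split()
        for i in range(len(rec_words) - min_words + 1):
            ngram = " ".join(rec_words[i:i + min_words])
            if ngram in out_ngrams and ngram not in hits:
                hits.append(ngram)
    return hits


records = ["The complainant reported a fall on ward seven in March"]
output = "Summary: the complainant reported a fall on ward seven last spring."
for hit in verbatim_overlaps(output, records):
    print(hit)
```

In this example three overlapping six-word sequences are reported, which would be a prompt to investigate how the source text reached the output.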

 Third, anonymity may be undermined by the environment. A dataset released publicly has a different risk profile from one held in a restricted analytics environment. A vendor with broad support access, future enrichment, retraining, export, sharing or integration with other systems may change the assessment.

 An anonymisation assessment for AI should therefore be specific about scope. Is the organisation claiming that the training dataset is anonymous, that the model is anonymous, that outputs are anonymous, that logs are anonymous, or that only a reporting extract is anonymous? Each claim needs its own evidence.

 Good evidence will usually include a description of the original data, the anonymisation method, the remaining attributes, expected users and recipients, the access environment, likely external data sources, linkage scenarios, small group risks, testing, residual risk and review triggers. Where the claim is material to the lawful basis or to the decision that a DPIA is not required, the evidence should be robust enough to withstand challenge.

## Where these controls often fail in AI systems

 In practice, the problem is rarely that a team has ignored privacy completely. The problem is usually that the control is applied to one part of the system while personal data risk remains elsewhere.

 Training data may be over-collected because it is available in the data lake. Test datasets may contain live customer records because creating representative synthetic data would take longer. Prompt logs may be retained by default. A retrieval tool may expose full source documents when a smaller excerpt would be enough. A vendor may reserve rights to use customer inputs for service improvement. Outputs may reproduce personal data. Model updates may change behaviour after the original DPIA has been signed off.

 Another common failure is relying on a single control for multiple risks. Pseudonymisation may reduce the impact of unauthorised disclosure, but it will not necessarily address unfair use, excessive retention, lack of transparency, inaccurate outputs or unlawful repurposing. Anonymisation may reduce the scope of data protection obligations for a particular dataset if properly achieved, but it does not automatically make every later use ethical, accurate or contractually permitted. Minimisation may reduce data volume and exposure, but it does not prove that the model is fair, explainable or secure.

 Controls need to be mapped to risks. That is the heart of the DPIA discipline.

## The evidence a DPO should expect

 An AI privacy control is only as strong as the evidence behind it. For a higher-risk or uncertain AI use case, the evidence pack should answer the following questions:

1. What is the specific purpose of the AI use case, and what output or decision support is the system intended to provide?
2. What personal data categories are used at each stage, including training, testing, prompting, retrieval, outputs, logs, monitoring and support?
3. Why is each category necessary for the stated purpose, and what less intrusive alternatives were considered?
4. If data is pseudonymised, what transformation is used, where is the additional information held, who can access it, and how is unauthorised attribution prevented?
5. If data or a model is claimed to be anonymous, what re-identification assessment has been completed, for which environment, and with what residual risk?
6. What retention and deletion rules apply to datasets, prompts, uploaded files, embeddings, logs, outputs, audit records and vendor-held data?
7. What access controls, monitoring and incident processes apply internally and at vendor level?
8. What do the vendor terms say about model improvement, support access, subprocessors, international transfers, telemetry and deletion?
9. What DPIA, AI assessment, legitimate interest assessment or procurement record captures the judgement?
10. What change events will trigger review, such as new data categories, wider deployment, model updates, new integrations, new recipients or evidence of unexpected outputs?
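Question 6 is one of the few items in this list that can be checked mechanically. As a minimal sketch, assuming a hypothetical `RETENTION_DAYS` schedule and an inventory of stored items with a creation date, a periodic job could flag anything that has outlived its rule or has no rule at all:

```python
from datetime import date, timedelta

# Hypothetical retention periods in days, per artefact type.
RETENTION_DAYS = {"prompt_log": 30, "uploaded_file": 90, "embedding": 180}


def needs_attention(items: list[dict], today: date) -> list[str]:
    """Return IDs of items past their retention period or with no rule.

    An item type with no defined retention period is flagged as well,
    since an undefined rule is itself a gap in the evidence.
    """
    flagged = []
    for item in items:
        limit = RETENTION_DAYS.get(item["kind"])
        if limit is None or today - item["created"] > timedelta(days=limit):
            flagged.append(item["id"])
    return flagged


items = [
    {"id": "log-1", "kind": "prompt_log", "created": date(2025, 1, 1)},
    {"id": "file-1", "kind": "uploaded_file", "created": date(2025, 2, 20)},
    {"id": "tel-1", "kind": "telemetry", "created": date(2025, 3, 1)},
]
print(needs_attention(items, today=date(2025, 3, 10)))
```

The prompt log is flagged because it is older than 30 days, and the telemetry item is flagged because nobody has defined a period for it. Vendor-held copies would need the equivalent assurance from the vendor.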

 This is not paperwork for its own sake. It is the record that allows a DPO, risk committee, audit team or regulator to see why the organisation believed the control was appropriate at the time.

## How this connects to DPIAs

 Pseudonymisation, anonymisation and minimisation should not be pasted into a DPIA as generic mitigations. They should be connected to specific risks.

 If the risk is unauthorised access to identifiable customer records, pseudonymisation may reduce impact if the key is properly separated and access is controlled. If the risk is excessive use of employee monitoring data, minimisation may require the team to remove direct identifiers, reduce granularity, shorten retention or redesign the output. If the risk is that a vendor may use prompts for model improvement, the control may be contractual and technical rather than anonymisation alone. If the risk is re-identification from a small dataset, anonymisation may require aggregation, suppression, generalisation, access restriction or a decision not to proceed with that dataset.
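Aggregation with small-count suppression, mentioned above for small datasets, can be sketched as follows. The `aggregate_with_suppression` helper and the threshold of five are assumptions for illustration. Suppression alone can sometimes be undone by subtracting published cells from a published total, so it is usually combined with other measures.

```python
from collections import Counter


def aggregate_with_suppression(values: list[str], threshold: int) -> dict:
    """Count occurrences per category, suppressing counts below threshold."""
    counts = Counter(values)
    return {
        category: (n if n >= threshold else None)  # None marks a suppressed cell
        for category, n in counts.items()
    }


departments = ["sales"] * 12 + ["finance"] * 7 + ["legal"] * 2
print(aggregate_with_suppression(departments, threshold=5))
```

The two-person legal team is reported as a suppressed cell rather than a count of 2, while the larger groups are reported as normal.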

 Good DPIA analysis should show the risk, the control, why that control reduces the risk, what risk remains, who owns the control and when it will be reviewed. This is where many AI DPIAs become thin. They list controls, but they do not explain the link between control and risk.
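One way to keep that link visible is to record each mitigation as a structured entry and check for blanks. The sketch below is a hypothetical data structure, not a prescribed DPIA format: the `MitigationEntry` fields simply mirror the six elements named above.

```python
from dataclasses import dataclass, fields


@dataclass
class MitigationEntry:
    risk: str
    control: str
    rationale: str       # why this control reduces this risk
    residual_risk: str
    owner: str
    review_date: str     # ISO date, e.g. "2026-01-31"


def missing_fields(entry: MitigationEntry) -> list[str]:
    """Return the names of fields left blank in a DPIA mitigation entry."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


thin = MitigationEntry(
    risk="Unauthorised access to identifiable customer records",
    control="Pseudonymisation",
    rationale="",
    residual_risk="",
    owner="Data platform lead",
    review_date="",
)
print(missing_fields(thin))
```

The example entry names a risk and a control but is reported as missing its rationale, residual risk and review date, which is the pattern of a thin DPIA. A completed field is not necessarily a good answer, so the check supports review rather than replacing it.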

 That link becomes especially important where an AI system changes over time. A pilot may begin with a restricted dataset and no vendor model improvement. Six months later, the same tool may be connected to new source systems, rolled out to more users, configured to keep longer logs, or upgraded to a different model. The original assessment may no longer be accurate.
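Drift of this kind can be detected by comparing the configuration that was assessed with the configuration now deployed. The sketch below assumes both are available as simple dictionaries. The setting names and the `review_triggers` helper are hypothetical, and in practice the deployed values would come from the system or the vendor rather than being typed in.

```python
def review_triggers(baseline: dict, current: dict) -> list[str]:
    """List the settings that differ from the assessed baseline."""
    keys = sorted(set(baseline) | set(current))
    return [k for k in keys if baseline.get(k) != current.get(k)]


assessed = {
    "model_version": "v1",
    "source_systems": ["helpdesk"],
    "log_retention_days": 30,
    "vendor_model_improvement": False,
}
deployed = {
    "model_version": "v2",
    "source_systems": ["helpdesk", "crm"],
    "log_retention_days": 30,
    "vendor_model_improvement": False,
}
print(review_triggers(assessed, deployed))
```

Here the model version and the connected source systems have changed since sign-off, so both would be raised as review triggers against the original assessment.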

 For organisations building an AI governance process, this is why AI review should sit inside a lifecycle rather than a one-off sign-off. XpertDPO’s [AI Governance and DPIA Lifecycle Support](https://xpertdpo.com/ai-governance-dpia-lifecycle-support/) keeps DPIA evidence, vendor review and change control connected as AI use cases evolve. For a single assessment, [DPIA Support](https://xpertdpo.com/data-protection-impact-assessment-dpia-support/) can help structure the evidence and residual risk. Where a third-party AI tool is involved, [Vendor / Third-Party Privacy Governance](https://xpertdpo.com/vendor-third-party-privacy-governance/) is often part of the same control picture.

## Practical examples

 A customer service chatbot may not train on customer data, but the DPO should still ask what customer data is retrieved into prompts, whether transcripts are retained, whether staff can inspect conversations and whether the vendor uses logs for improvement. An HR analytics tool may pseudonymise employee identifiers before development, but the model is still about employees and may affect management decisions. An internal knowledge assistant may not look like an AI privacy project at first, yet indexed documents may contain employee names, complaints, disciplinary information and third-party personal data.

 These examples show why the same terms cannot be assessed in the abstract. The control must match the use case.

## When specialist support is sensible

 Most organisations can handle straightforward minimisation decisions internally if the purpose is clear, the data is low risk and the system is well understood. Specialist support becomes more useful when the data is sensitive, the model behaviour is difficult to explain, the vendor terms are unclear, or the judgement needs to be defensible for board, audit or regulator scrutiny.

 It is particularly sensible to pause where an organisation is relying on an anonymisation claim to conclude that GDPR or UK GDPR does not apply; where high-value or sensitive data is used for training or testing; where the system involves employees, children, patients, financial vulnerability, complaints or location data; where a vendor’s “de-identified” language is doing too much work; or where the project depends on international transfers and a complex subprocessor chain.

## The governance test

 The simplest test is this: if someone asked why this AI system uses this data, in this form, for this purpose, with these recipients and these retention periods, could the organisation answer clearly?

 If the answer is yes, the controls are probably being treated as part of governance. If the answer is no, the organisation may only have labels.

 Pseudonymisation should come with evidence of transformation, separation and protection against unauthorised attribution. Anonymisation should come with a context-specific assessment of identifiability and re-identification risk. Data minimisation should come with a reasoned explanation of necessity, alternatives and retention across the AI lifecycle.

 For CPD, board and DPO purposes, that is the key point. Privacy-enhancing controls are valuable, but they are not self-executing. They need to be chosen, documented, maintained and reviewed against the actual system.

## General Information Only

This article is provided for general information and does not constitute legal, regulatory, or professional advice. Data protection obligations depend on the specific facts, context, and jurisdiction involved. You should not rely on this content as a substitute for advice tailored to your organisation.

If you would like support with a specific issue, please contact us: https://xpertdpo.com/contact/
