This article accompanies Hour 1: Data Minimisation, Pseudonymisation and Anonymisation in our full-day CPD programme on XpertAcademy. Completion of the full one-hour session, including the related learning materials, contributes to the one-hour CPD certificate issued for that session. You can access the course here: CPD Event B: Full-Day AI, Technical Privacy & Emerging Technology Training.
A team proposes pseudonymising customer data before model testing and treats that as automatic risk removal. The design replaces names, emails and customer IDs with tokens. The model team says the data is no longer identifiable and asks for approval to use it in a shared analytics workspace and with a specialist AI vendor.
Pseudonymisation may be a strong and sensible control. It can reduce risk, support data minimisation, limit access to direct identifiers and make AI testing more proportionate. But it does not usually take the data outside data protection law. If someone can still identify people using additional information or other reasonably available means, the data remains personal data.
The EDPB Guidelines 01/2025 on pseudonymisation, adopted in January 2025 as a version for public consultation, are useful for privacy teams because they push the discussion away from labels and towards evidence. A pseudonymised dataset should have a designed transformation, separately protected additional information, technical and organisational measures, and a clear view of the domain in which the control is expected to work.
This article is general guidance, not legal advice on a specific pseudonymisation design, AI project or vendor arrangement. The right answer depends on the data, key-management design, access model, recipients, purpose, auxiliary data, sector context and consequences for individuals.
The practical test is not "have we replaced identifiers?" It is "can we evidence how pseudonymisation reduces risk, where the additional information sits, who can use it, and what limits still apply?"
Why pseudonymisation is often misunderstood in AI projects
AI teams often reach for pseudonymisation because they need useful data but do not want direct identifiers in the model environment. That instinct is usually good. Removing names, email addresses, account numbers and direct customer IDs from a testing dataset can materially reduce exposure if the data is accessed by the wrong person, copied to a development workspace, or shared with a vendor.
The problem comes when pseudonymisation is treated as a legal endpoint. A pseudonymous token may still relate to a person. The dataset may include dates, locations, product behaviour, support histories, rare categories, free text, device IDs or account patterns that make people recognisable. The mapping table or key may be accessible to the same team. A vendor may receive enough context to link records back to a customer. The model output may reveal something about a person even if the input identifier was replaced.
For DPOs and privacy teams, the question is therefore not whether pseudonymisation is "good" or "bad". It is what the control is doing, how strong it is in context, and what evidence exists to show its limits.
The legal and governance issue in plain language
Under Article 4(5) GDPR, pseudonymisation means processing personal data in such a way that the data can no longer be attributed to a specific person without additional information, provided that the additional information is kept separately and is protected by technical and organisational measures that prevent attribution to an identified or identifiable person. That separation is not a decorative requirement. It is central to the control.
The guidance on anonymisation and pseudonymisation from the Irish Data Protection Commission (DPC) makes the practical distinction clear: anonymisation aims to prevent identification; pseudonymisation replaces or separates identifiers but may still leave personal data in scope. The EDPB Guidelines 01/2025 add useful emphasis on design, context and the protection of additional information.
For AI model testing, this means a privacy review should look at more than the tokenisation script. It should cover the whole control environment: who creates the pseudonyms, where the mapping data or keys are held, who can access them, whether the testing dataset contains other identifying patterns, what the vendor receives, how long the dataset remains, and what happens if the purpose expands.
Worked example: customer data for model testing
Assume a financial services organisation wants to test a model that predicts which customers may need additional support during a product migration. The project team proposes using two years of customer service data, product usage data and account status history. Direct identifiers will be replaced with random tokens before the dataset enters the model testing workspace.
The privacy team starts with several facts. The project has a defined testing purpose. The dataset includes adults only. Direct identifiers are removed from the testing copy. The source data includes support categories, product usage, arrears flags, complaint status, contact frequency and account-change history. A specialist AI vendor may support feature engineering. The team says no automated decisions will be made during the testing phase.
Important facts are still unknown. The privacy team does not know whether the token mapping is stored in the same cloud environment as the testing data. It is unclear whether product managers, analysts and vendor staff can all access the pseudonymised dataset. The team has not explained whether rare combinations of arrears, complaint history and usage make some customers identifiable. The retention period for testing extracts is not defined. The transparency position has not been checked against the migration purpose.
The risk question is: can the organisation approve the pseudonymised dataset for model testing, and if so, what evidence and limits are needed before the control can be relied on?
The privacy team performs six checks.
First, it checks the transformation. The pseudonyms are randomly generated and do not encode the customer number, email, account type or date. The same pseudonym is reused across a customer's records only where longitudinal analysis is necessary. Separate environments use separate pseudonym sets unless linkage is justified.
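As an illustration of this first check, the sketch below assigns random tokens and keeps a separate pseudonym set per environment. It is a minimal example, not the organisation's actual design: the function name `pseudonymise`, the environment labels and the in-memory `mapping` dictionary are all hypothetical, and in practice the mapping would be stored outside the testing workspace.

```python
import secrets

def pseudonymise(customer_ids, environment, mapping):
    """Return one random token per customer ID, scoped to a single environment.

    `mapping` is the additional information: {environment: {customer_id: token}}.
    """
    env_map = mapping.setdefault(environment, {})
    tokens = []
    for cid in customer_ids:
        if cid not in env_map:
            # 128 random bits: the token is not derived from the customer ID,
            # so it carries no meaning about the customer.
            env_map[cid] = secrets.token_hex(16)
        tokens.append(env_map[cid])
    return tokens

# Hypothetical example data and environment names.
mapping = {}
test_tokens = pseudonymise(["C-1001", "C-1002", "C-1001"], "model-testing", mapping)
vendor_tokens = pseudonymise(["C-1001"], "vendor-sandbox", mapping)
```

Within `model-testing` the repeated customer receives the same token, which supports longitudinal analysis. The `vendor-sandbox` token for the same customer is generated independently, so the two datasets cannot be joined on the pseudonym alone.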
Second, it checks the additional information. The mapping table is held outside the model testing workspace. Access is limited to a small operations group that is not part of the vendor's project team. Access requests are logged, time-limited and approved. The mapping is not exported with the test dataset.
Third, it checks key management and separation. If cryptographic or deterministic methods are used, key storage, rotation, access, backup and deletion must be documented. If the same key or salt is reused across projects, the team assesses whether cross-project linkage creates extra risk.
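Where a deterministic method is chosen, one common option is a keyed hash (HMAC) with a separate secret key per project. The sketch below is an assumption about how such a design might look, not a description of the organisation's system: the function `keyed_pseudonym` and the key names are hypothetical, and real keys would be held in a managed key store rather than generated inline.

```python
import hashlib
import hmac
import secrets

def keyed_pseudonym(customer_id: str, key: bytes) -> str:
    """Derive a deterministic pseudonym from an identifier using HMAC-SHA256."""
    return hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical per-project keys; generated here only so the example runs.
migration_key = secrets.token_bytes(32)
marketing_key = secrets.token_bytes(32)

first = keyed_pseudonym("C-1001", migration_key)
repeat = keyed_pseudonym("C-1001", migration_key)
other_project = keyed_pseudonym("C-1001", marketing_key)
```

With the same key, the same customer always maps to the same pseudonym, so records link within the project. With a different key the pseudonym differs, which is why reusing one key or salt across projects deserves a linkage assessment. An unkeyed hash of a customer ID is weaker, because anyone who can enumerate plausible IDs can recompute the hashes.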
Fourth, it checks the dataset itself. Pseudonymisation of direct identifiers does not solve rare combinations, free-text notes or unusual events. The team tests whether combinations such as region, product, complaint type, arrears marker and migration date create small groups. Rare categories are grouped or excluded where they are not necessary for testing.
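The small-group test can be sketched as a simple count of records per combination of indirect identifiers, in the style of a k-anonymity check. The column names, sample records and the threshold `k=3` below are illustrative assumptions; the appropriate threshold depends on the context and the sensitivity of the data.

```python
from collections import Counter

def small_groups(records, quasi_identifiers, k):
    """Return each combination of quasi-identifier values shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical pseudonymised records with direct identifiers already removed.
records = [
    {"region": "North", "product": "Loan", "arrears": False},
    {"region": "North", "product": "Loan", "arrears": False},
    {"region": "North", "product": "Loan", "arrears": False},
    {"region": "West", "product": "Mortgage", "arrears": True},
]

risky = small_groups(records, ["region", "product", "arrears"], k=3)
```

Here the single West, Mortgage, arrears record stands alone, so it is a candidate for grouping into a broader category or exclusion. A check like this does not cover free-text notes, which need separate review.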
Fifth, it checks purpose limits and access. The model testing workspace is approved only for migration-support testing. Users cannot reuse the dataset for marketing segmentation, collections strategy or product scoring without a new review. Vendor access is limited to the agreed project, with contract terms, security evidence, role mapping and transfer position reviewed where applicable.
Sixth, it checks outputs and re-identification routes. If the model flags a pseudonymous customer as needing support, the operational team may need to re-identify the person later. That route must be controlled, justified and logged. The team records who may trigger re-identification, for what purpose, and whether the model output can be merged back into live customer systems.
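A controlled re-identification route can be pictured as a small gatekeeper around the mapping data that checks who is asking and why, and logs every attempt. The class `MappingVault`, the user names and the purpose labels below are hypothetical and exist only to show the shape of the control; a real implementation would sit behind the organisation's own access management and logging tools.

```python
from datetime import datetime, timezone

class MappingVault:
    """Holds the token-to-customer mapping and logs every re-identification attempt."""

    def __init__(self, token_to_customer, approved_purposes, authorised_users):
        self._token_to_customer = dict(token_to_customer)
        self._approved_purposes = set(approved_purposes)
        self._authorised_users = set(authorised_users)
        self.audit_log = []

    def reidentify(self, token, user, purpose):
        allowed = user in self._authorised_users and purpose in self._approved_purposes
        # Record the attempt before acting on it, whether or not it is allowed.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "purpose": purpose,
            "token": token,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError("re-identification not approved for this user and purpose")
        return self._token_to_customer[token]

# Hypothetical usage: one approved request and one refused request.
vault = MappingVault({"tok-1": "C-1001"}, {"migration-support-outreach"}, {"ops.lead"})
customer = vault.reidentify("tok-1", "ops.lead", "migration-support-outreach")
try:
    vault.reidentify("tok-1", "vendor.analyst", "marketing")
    denied = False
except PermissionError:
    denied = True
```

Both attempts appear in `audit_log`, including the refused one, which gives the privacy team a record of who tried to re-identify a customer and for what stated purpose.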
The outcome is not a simple yes or no. The organisation approves a narrowed testing phase. It requires separate key storage, restricted access, exclusion of unnecessary rare categories, a defined retention period, vendor controls, a DPIA update and a review before any operational deployment or re-identification route is activated.
Evidence that should exist afterwards
The evidence should show both the control and its limits. A useful pseudonymisation evidence pack may include:
- a purpose statement explaining why pseudonymised data is needed for the AI testing activity;
- a transformation design note explaining how direct identifiers are replaced and whether linkage across records is necessary;
- a record of where additional information, mapping tables, salts or keys are stored;
- access controls for the pseudonymised dataset and for the additional information, showing separation between teams where possible;
- key-management evidence covering creation, storage, rotation, backup, access logging and deletion where relevant;
- a re-identification risk note covering rare combinations, free text, auxiliary data, vendor access and output linkage;
- a DPIA, DPIA screening note or AI assessment update;
- vendor, transfer and security evidence where a third party processes the data;
- a retention and deletion record for extracts, mapping data, logs, outputs and test environments;
- a decision record identifying the approver, purpose, limits, residual risk and review date.
This evidence does not need to be over-engineered. It does need to be specific enough that someone else can understand why pseudonymisation was treated as a meaningful control rather than a comforting label.
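One item in the pack, the retention record, lends itself to a simple mechanical check. The sketch below flags extracts that have outlived their retention period; the field names, extract names, dates and the 90-day period are assumptions made for illustration only.

```python
from datetime import date, timedelta

def overdue_extracts(extracts, today):
    """Return the names of extracts whose retention period has already ended."""
    return [
        e["name"]
        for e in extracts
        if e["created"] + timedelta(days=e["retention_days"]) < today
    ]

# Hypothetical retention register entries.
extracts = [
    {"name": "test-extract-a", "created": date(2025, 1, 10), "retention_days": 90},
    {"name": "test-extract-b", "created": date(2025, 5, 1), "retention_days": 90},
]

overdue = overdue_extracts(extracts, today=date(2025, 6, 1))
```

The same register could also list mapping data, logs and model outputs, so that deletion of each can be evidenced rather than assumed.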
What the DPO or privacy team should check
The privacy team should start by asking what pseudonymisation is expected to achieve. Different purposes need different control strength. A low-risk internal prototype may not need the same design as a vendor-supported model using sensitive customer support data.
| Area | Practical check |
|---|---|
| Purpose | What is the approved AI testing purpose, and what uses are outside scope? |
| Data categories | Does the dataset include sensitive data, financial data, complaint data, employee data, free text, location, device data or rare categories? |
| Transformation | Are direct identifiers removed or replaced without embedding meaning into the pseudonym? |
| Additional information | Where is the mapping table, key, salt or other re-identification information stored, and who can access it? |
| Separation | Is the additional information kept separately from the testing dataset in practice, not just in a diagram? |
| Access controls | Who can access the pseudonymised data, who can access the additional information, and are those groups appropriately separated? |
| Re-identification risk | Can people still be identified through rare combinations, auxiliary data, vendor knowledge, free text or outputs? |
| Vendor and transfer | Does any third party process the pseudonymised data, and do the contract, security and transfer records match that reality? |
| Retention | How long are the pseudonymised dataset, mapping information, logs and model outputs retained? |
| Review triggers | What requires re-review before deployment, reuse, re-identification, vendor sharing or new linkage? |
The review should also be honest about what pseudonymisation cannot do. It does not remove the need for a lawful basis. It does not erase fairness or transparency questions. It does not justify using more data than the purpose needs. It does not by itself authorise a vendor to use the data for its own model improvement. It is one control in a wider governance design.
When pseudonymisation should trigger escalation
Escalation is sensible where pseudonymised data includes special category data, financial vulnerability, complaint data, employment data, children's data, location trails, behavioural monitoring, voice, images, free text or small populations. It is also sensible where the project wants to share data with a vendor, combine datasets, reuse the same pseudonyms across projects, reconnect outputs to live customer systems, or use the data for a model that may influence important decisions about people.
Escalation might lead to a full DPIA, stronger separation, a smaller dataset, a controlled research environment, a ban on re-identification during testing, a shorter retention period, tighter vendor terms, transfer review, senior sign-off or a decision to use synthetic or aggregated data for early-stage work.
The point is not to make pseudonymisation unattractive. The point is to avoid overclaiming it. A well-designed pseudonymisation control can make AI testing more proportionate and safer. A weakly evidenced control can create false comfort and poor decisions.
How XpertDPO supports pseudonymisation governance
XpertDPO supports organisations that need to evidence privacy controls in AI and data projects. For pseudonymisation, that can mean helping the team test whether the control is suitable, document separation and key management, review vendor or transfer issues, update DPIAs, and define re-identification limits before a pilot moves towards production.
The AI Governance and DPIA Lifecycle Support route is relevant where pseudonymisation forms part of a broader AI approval process. Where the project involves prompts, logs, embeddings or outputs from large language models, XpertDPO's article on LLM privacy risks for DPOs and privacy teams can also help identify additional routes back to identifiability.
What this means for CPD
After working through this topic, a privacy professional should be able to ask for evidence rather than accept the phrase "it is pseudonymised". They should be able to identify the additional information, test separation, question access, check purpose limits, challenge re-identification risk and record the residual risk clearly.
That is the practical value of Hour 1. Pseudonymisation becomes a designed and evidenced control, not an automatic permission slip.