Pseudonymization guidelines

Beáta Megyesi, Lisa Rudebeck, Elena Volodina (June 2018 – May 2019)

Online version of this document: https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines


1. Basic principles

2. Supra-categories

3. Pseudonymization

4. Text example

5. Resources (lists) for pseudonyms and automatic pseudonymization scripts

6. SweLL annotation tool

7. SweLL publications on the topic

The purpose of pseudonymization is to de-identify all information that can reveal the identity of the person who wrote the text. This information can include person names, age, addresses and phone numbers, city names and other geographical names, etc.

On top of this, some information is also marked as “potentially sensitive” during the pseudonymization process. This is information which does not in itself disclose the identity of the writer, but which would be particularly harmful to reveal if the writer were identified in spite of the de-identification efforts. Sensitive information includes, for instance, information on the political or religious views of the writer. The information marked as potentially sensitive will be reviewed before publication of the corpus to evaluate whether it needs to be hidden or not.

Your task as an assistant is 1) to identify all information that can relate to the specific person who wrote the text, and to categorize what type of information it is so that the person can be de-identified by changing/hiding the specific information, and 2) to mark potentially sensitive information related to the writer. The replacement of the personal information is performed automatically, based on the assigned label.

This document contains instructions for how to proceed.

1. Basic principles

  1. Remove/change the information that can reveal a person behind the essay(s), yet keep to the minimal change rule. The data should be usable in research scenarios.

  2. Data on deviations from standard Swedish (e.g. misspellings) will be lost for the pseudonymized strings. This also holds for text segments whose form depends on the pseudonymized string (for instance prepositions preceding pseudonymized city and country names, e.g. in Germany -> in Cuba).

  3. Annotators have to make the assessment of the risks and needs for pseudonymization (an element of subjectivity).

  4. Tokens should not be pseudonymized solely on the basis of them belonging to a specific category listed among the pseudonymization categories, but on the basis of them potentially revealing the identity of the writer. For instance, not all country or city names are pseudonymized, but only those which, together with the context, 1) may be connected to the writer (e.g. because the city may be identified as the writer’s home town), and 2) reveal information which is specific enough to be used to identify the writer. Accordingly, in a text where Istanbul is mentioned as a city where the writer has lived or as a city where a family member of the writer lives (etc.), Istanbul should be pseudonymized. But not so in a text providing general information about Istanbul. And while the information that the writer stems from the Baltic countries may be reason to pseudonymize Baltikum (as a region), the information that the writer stems from Europe does not necessitate pseudonymization, since Europe is such a large region which may be assumed to be the home region for a large number of potential writers.

  5. Keep track of whether the token is “original” or “masked”. (This is done automatically by the annotating tool.)

  6. Categories that need to be marked in the texts, but not necessarily replaced. An assessment should be made later, once enough statistics have been collected on the learners behind the essays, as well as on the assembled texts and metadata for each particular writer:
    • country: the same pseudonymization tag, < country >, is used for:
      • country of origin (Jag kommer från Syrien versus Jag kommer från Luxembourg) - depending upon how many subjects in the database are from the named countries
      • country of “intermediate” residence (Vi har stannat en månad i Turkiet)
      • Note: Mentions of Sweden as a country of origin or residence are not marked.
    • number of family members (Jag har fem bröder och fyra systrar -> Eng. “I have five brothers and four sisters”) - an estimate is needed of whether this is a common pattern across many essays. If it is, no masking/suppression is necessary
    • professions (Jag är webbutvikler)
    • education
  7. Categories that can be used for discrimination, such as political views, religious convictions or sexual orientation, should also be marked (with the tag < sensitive >) without being masked right away. A decision needs to be made later in the process, before publication. E.g. I en dag såg vi en stor demstration det var för mycket människor vill inte Turkiets statsminister Ardogan och vi kände mycket glad för att det var första dag ser vi en fri demstration.

  8. Although information about languages spoken by the writer may help identify the writer, such information is not pseudonymized, since it is in any case included in the metadata which will be available to the corpus users.
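
The later assessment described in principle 6 for country of origin can be sketched as a simple count over writer metadata. This is only an illustration, not the procedure used in the project: the function name, the metadata layout (writer ID mapped to country) and the threshold `min_writers` are all assumptions made for this example.

```python
from collections import Counter


def rare_countries(writer_countries, min_writers=5):
    """Return the countries represented by fewer than `min_writers` writers.

    `writer_countries` maps a writer ID to a country of origin.
    Both this layout and the default threshold are hypothetical.
    """
    counts = Counter(writer_countries.values())
    return {country for country, n in counts.items() if n < min_writers}


# Hypothetical metadata: three writers from Syrien, one from Luxembourg.
metadata = {"w1": "Syrien", "w2": "Syrien", "w3": "Syrien", "w4": "Luxembourg"}
flagged = rare_countries(metadata, min_writers=3)
```

With this toy input only Luxembourg falls below the threshold, so it would be the candidate for replacement, while a country shared by many writers could stay marked but unmasked.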

2. Supra-categories

May be applied on top of other categories, as (extra)linguistic information.

2.1 Running numbers

Applies to all personally identifiable information (PII) and their @placeholders. Each unique PII type (e.g. name) should get its own running number, starting with 1 in each individual essay. If the same PII is repeated in the text, the same running number is assigned to it. This is done automatically, but the automatically assigned running number may be changed manually. A manual change of the running number is necessary when the same PII (for instance the same city) is referred to by non-identical strings (for instance due to misspelling).
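
A minimal sketch of this numbering scheme, under stated assumptions, is given below. It is not the SVALA implementation. It assumes that numbering restarts for each category within an essay, that the category labels (`city`, `firstname`) are placeholders, and that manual corrections are supplied as a hypothetical `aliases` mapping from a variant spelling to the string it should be merged with.

```python
from collections import defaultdict


def assign_running_numbers(mentions, aliases=None):
    """Assign running numbers to the PII mentions of one essay.

    `mentions` is a list of (category, surface string) pairs in text order.
    `aliases` maps (category, variant) to a canonical string; it stands in
    for the manual correction step and is an assumption of this sketch.
    """
    aliases = aliases or {}
    seen = defaultdict(dict)  # category -> {canonical string: number}
    numbered = []
    for category, surface in mentions:
        canonical = aliases.get((category, surface), surface)
        numbers = seen[category]
        if canonical not in numbers:
            # First occurrence in this essay: next free number, starting at 1.
            numbers[canonical] = len(numbers) + 1
        numbered.append((category, surface, numbers[canonical]))
    return numbered


# Hypothetical essay with a misspelled repetition of the same city.
essay = [
    ("city", "Istanbul"),
    ("firstname", "Ahmed"),
    ("city", "Istambul"),
    ("city", "Ankara"),
]
numbered = assign_running_numbers(
    essay, aliases={("city", "Istambul"): "Istanbul"}
)
```

Here both spellings of Istanbul receive running number 1 because of the alias entry, Ankara receives 2, and the first name starts its own count at 1. Without the alias entry the misspelled form would be treated as a new city, which is the case where a manual change is needed.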

2.2 Morphology

3. Pseudonymization

3.1 Personal Names

3.2 Geographic data

3.3 Institutions

3.4 Transportation

3.5 Age

3.6 Dates

3.7 Phone numbers

3.8 Email addresses

3.9 Web pages

3.10 Social security numbers

3.11 Account numbers

3.12 Certificate, licence numbers

3.13 Other sequence of numbers

3.14 Extra (something else, not covered in the previous categories)


3.15 Mark up but do not pseudonymize


4. Text example

5. Resources (lists) for pseudonyms and automatic pseudonymization scripts

Resources and lists are collected in a private repository here: https://github.com/SamirYousuf/LR_project

6. SweLL annotation tool

SVALA is used both for manual annotation and for supportive automatic pseudonymization. A demo version of the tool can be found here: https://spraakbanken.gu.se/swell/dev/#

7. SweLL publications on the topic