swell-project

Pseudonymization guidelines

The purpose of pseudonymization is to de-identify all information that can reveal the identity of the person who wrote the text. This information can include person names, age, addresses and phone numbers, city names and other geographical names, etc.

On top of this, some information is also marked as “sensitive” during the pseudonymization process. This is information which does not in itself disclose the identity of the writer, but which would be particularly harmful to reveal if the writer were identified in spite of the de-identification efforts. Sensitive information includes, for instance, information on the political or religious views of the writer. The information marked as sensitive will be reviewed before publication of the corpus to evaluate whether it needs to be hidden or not.

Your task is 1) to identify all information that can be related to the specific person who wrote the text, and to categorize what type of information it is, so that the person can be de-identified by changing or hiding that information, and 2) to mark potentially sensitive information related to the writer. The personal information is then replaced automatically, based on the assigned label.

This document contains instructions for how to proceed.

Basic principles

  1. Remove/change the information that can reveal the person behind the essay(s), yet keep to the “minimal change” rule. The data should remain usable in research scenarios. Example?

  2. Data on deviations from standard Swedish (misspellings etc.) will be lost for the pseudonymized strings. This also holds for text segments whose form depends on the pseudonymized string (for instance prepositions preceding pseudonymized city and country names).

  3. Annotators have to assess the risks and the need for pseudonymization themselves; this involves an element of subjectivity.

  4. Tokens should not be pseudonymized solely on the basis of them belonging to a specific category listed among the pseudonymization categories, but on the basis of them potentially revealing the identity of the writer. For instance, not all country or city names are pseudonymized, but only those which, together with the context, 1) may be connected to the writer (e.g. because the city may be identified as the writer’s home town), and 2) reveal information which is specific enough to be used to identify the writer. Accordingly, in a text where Istanbul is mentioned as a city where the writer has lived or as a city where a family member of the writer lives (etc.), Istanbul should be pseudonymized. But not so in a text providing general information about Istanbul. And while the information that the writer stems from the Baltic countries may be reason to pseudonymize Baltikum (as a region), the information that the writer stems from Europe does not necessitate pseudonymization, since Europe is such a large region which may be assumed to be the home region for a large number of potential writers.

  5. Keep track of whether the token is “original” or “masked”. (This is done automatically by the annotating tool.)

  6. Categories that need to be marked in the texts, but not necessarily replaced. We will make an assessment later, when we have enough statistics on the learners behind the essays, as well as on the assembled texts and metadata for each particular writer:
    • country: the same pseudonymization tag, < country >, is used for:
        • country of origin (Jag kommer från Syrien versus Jag kommer från Luxembourg) - depending upon how many subjects in our database are from the named countries
        • country of “intermediate” residence (Vi har stannat en månad i Turkiet)
        • Note: Mentions of Sweden as a country of origin or residence are not marked.
    • number of family members (Jag har fem bröder och fyra systrar) - we will need to see whether it is a normal pattern in many essays. If yes - no masking/suppression is necessary
    • professions (Jag är webbutvikler)
    • education
  7. Categories that can be used for discrimination, such as political views, religious convictions or sexual orientation, should also be marked (with the tag < sensitive >) without being masked right away. A decision will be made later in the process, before publication. E.g. I en dag såg vi en stor demstration det var för mycket människor vill inte Turkiets statsminister Ardogan och vi kände mycket glad för att det var första dag ser vi en fri demstration.
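
The bookkeeping described in principles 5 to 7 can be pictured with a minimal Python sketch. It is not the annotation tool's actual implementation: the `Token` class, the `mask` function and the `MARK_ONLY` set are hypothetical names introduced here, and the label names `city`, `profession` and `education` are assumptions (only `country` and `sensitive` appear as tags in the principles above).

```python
from dataclasses import dataclass, field

# Assumption: labels that are marked up but not replaced right away
# (principles 6 and 7). The exact tag inventory is illustrative.
MARK_ONLY = {"country", "sensitive", "profession", "education"}


@dataclass
class Token:
    text: str                                   # surface form shown in the text
    labels: list = field(default_factory=list)  # assigned pseudonymization labels
    status: str = "original"                    # "original" or "masked" (principle 5)


def mask(token, replacement):
    """Return a masked copy of the token, unless it is unlabelled or mark-only."""
    if not token.labels or all(label in MARK_ONLY for label in token.labels):
        return token  # left as written; any labels stay attached for later review
    return Token(replacement, list(token.labels), "masked")


# Usage: only the token labelled "city" is replaced; "muslim" keeps its
# < sensitive > label but is not masked, and the unlabelled token is untouched.
tokens = [Token("Borlänge", ["city"]), Token("muslim", ["sensitive"]), Token("bor")]
result = [mask(t, "Göteborg") for t in tokens]
```

Keeping the status on the token itself means a later decision to mask a mark-only category needs only a change to `MARK_ONLY`, not a second annotation pass.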

Supra-categories

May be applied on top of other categories, as (extra)linguistic information.

Running numbers

Applies to all named entities (NE) and their @placeholders. Each unique named entity of a given type (e.g. name) should get its own running number, starting with 1. If the same NE is repeated in the text, the same running number is assigned to it. This is done automatically, but the automatically assigned running number may be changed manually. A manual change of the running number is necessary when the same entity (for instance the same city) is referred to by non-identical strings (for instance due to misspelling).
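
The automatic numbering can be sketched roughly as follows in Python. This is an illustration, not the annotation tool's code: the function name `assign_running_numbers` is hypothetical, and the sketch assumes that numbering restarts at 1 for each label and that "the same NE" means an identical string.

```python
def assign_running_numbers(entities):
    """Assign running numbers to (label, string) pairs given in text order.

    Assumption: numbering is kept separately per label, and two mentions
    count as the same entity only if their strings are identical.
    """
    seen = {}      # (label, string) -> running number already assigned
    counters = {}  # label -> highest running number used so far
    numbers = []
    for label, string in entities:
        key = (label, string)
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = counters[label]
        numbers.append(seen[key])
    return numbers


# The misspelled "Stokholm" is treated as a new entity by string matching.
mentions = [("city", "Stockholm"), ("city", "Göteborg"),
            ("city", "Stockholm"), ("city", "Stokholm")]
numbers = assign_running_numbers(mentions)  # [1, 2, 1, 3]

# Manual correction by the annotator: "Stokholm" refers to the same city.
numbers[3] = numbers[0]
```

The last line corresponds to the manual change described above: exact string matching cannot know that a misspelled form refers to an entity already numbered.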

<NEED A PICTURE HERE>

Morphology:

<NEED A PICTURE HERE>

Pseudonymize:

1. Personal Names:

2. Geographic data (country, city, zip codes, area names, …)


3. Institution: < school > , < work > , < other_institution >

4. Transportation: < transport_name >, < transport_nr >

5. Age: < age_digits >, < age_string >

7. Phone numbers < phone_nr >

8. Email addresses < email >

9. [personal] web pages (URL) < url >

10. Social security numbers < personid_nr >

11. Account numbers < account_nr >

12. Certificate/licence numbers (e.g. vehicle) < license_nr >

13. Other sequence of numbers < other_nr_seq >

14. Extra (something else, not covered by the previous categories)

15. Mark up but do not pseudonymize:


I en dag såg vi en stor demstration det var för mycket människor vill inte Turkiets statsminister Ardogan och vi kände mycket glad för att det var första dag ser vi en fri demstration.

–>

I en dag såg vi en stor demstration det var för mycket människor vill inte Turkiets < country, genitive > statsminister Ardogan < surname > och vi kände mycket glad < sensitive > för att det var första dag ser vi en fri < sensitive > demstration < sensitive >.

16. Comments < OBS! >, < Com! >, document comments

For < sensitive > we need to evaluate if subgroups are needed:

religion, ethnicity, sexual orientation, political views, physical and mental disabilities

Information about languages spoken by the writer is not pseudonymized

Although information about the languages spoken by the writer may help identify the writer, such information is not pseudonymized, since it is in any case included in the metadata that will be available to the corpus users.

To be included:

Examples

Original: Jag heter Ali och bor i Borlänge. Jag flyttade till Sverige för 1 år sedan. Jag har flytt från Afghanistan med min familj 2015. Jag har fem bröder och tre systrar. Vi bor på Tegelvägen 32. Jag vill jobba. Jag vill bli arkitekt. Sverige är skön. Jag är muslim.

Marked-up: Jag heter och bor i . Jag flyttade till Sverige för 1 år sedan. Jag har flytt från , med min familj . Jag har bröder och systrar. Vi bor på . Jag vill jobba. Jag vill bli . Sverige är skön. Jag är .

Pseudonymized*: Jag heter Mohammed och bor i Göteborg. Jag flyttade till Sverige för 1 år sedan. Jag har flytt från Afghanistan med min familj 2013. Jag har två bröder och två systrar. Vi bor på Gustavsgatan 1. Jag vill jobba. Jag vill bli . Sverige är skön. Jag är .

Details about geonames (from an email exchange with the SweLLers, 3 April 2018):

To pseudonymize the “places”, we can use http://www.geonames.org

There is quite a lot of information and many zip files there, so click the “info” link (or use the link below) to read the explanations: http://download.geonames.org/export/dump/readme.txt

To download the files, click Free Gazetteer Data (or use the link below): http://download.geonames.org/export/dump/readme.txt

It seems that the most useful lists (for us) are:

We need to play around with these files a bit. According to the info link, there are also codes for continents, for example. The files are very large, so a computer can easily freeze if you try to open them… I am pasting this information into our anonymization document.
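
Because the files are large, one option is to read them line by line instead of opening them in an editor. The Python sketch below assumes the tab-separated column layout described in the GeoNames readme (name in column 2, feature class in column 7, country code in column 9, population in column 15, counting from 1). The function `iter_places` and the helper `sample_row` are hypothetical names, and the sample rows contain made-up values, not real GeoNames data.

```python
# 0-based column positions in the tab-separated "geoname" table
# (assumption: layout as described in the GeoNames readme).
NAME, FEATURE_CLASS, COUNTRY_CODE, POPULATION = 1, 6, 8, 14


def iter_places(lines, country_code, min_population=0):
    """Yield (name, population) for populated places (feature class "P")
    in one country, reading one line at a time to keep memory use low."""
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= POPULATION:
            continue  # skip malformed or truncated lines
        if row[FEATURE_CLASS] != "P" or row[COUNTRY_CODE] != country_code:
            continue
        population = int(row[POPULATION] or 0)
        if population >= min_population:
            yield row[NAME], population


def sample_row(name, feature_class, country, population):
    """Build one 19-column row with made-up values, for illustration only."""
    fields = [""] * 19
    fields[NAME], fields[FEATURE_CLASS] = name, feature_class
    fields[COUNTRY_CODE], fields[POPULATION] = country, str(population)
    return "\t".join(fields) + "\n"


sample = [
    sample_row("Exempelstad", "P", "SE", 50000),
    sample_row("Exempelsjön", "H", "SE", 0),      # a lake, not a populated place
    sample_row("Testby", "P", "SE", 300),
    sample_row("Beispielstadt", "P", "DE", 80000),
]
places = list(iter_places(sample, "SE", min_population=1000))

# With a real download, a file object can be passed directly, e.g.:
# with open("SE.txt", encoding="utf-8") as f:   # file name is an assumption
#     for name, population in iter_places(f, "SE", min_population=1000):
#         ...
```

Filtering on population is one possible way to pick replacement cities that are common enough not to be identifying; the threshold is a design choice to be decided separately.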

Details about personal names (from an email exchange with Markus, 28 March 2018):

We have the data from SCB with first names and surnames here: https://svn.spraakbanken.gu.se/sb-arkiv/lexikon/scb-namn

(slightly messy data; we started work on creating a name SALDO, but it stalled. saldo-namn.xml contains the latest version of the name morphology)

About workplaces (category 3) and professions (category 15):

(3.) Companies ( < work > ): https://sv.wikipedia.org/wiki/Lista_över_företag_på_Stockholmsbörsen_–_medelstora_företag http://www.largestcompanies.se/topplistor/sverige/de-storsta-foretagen-efter-omsattning

Universities and schools ( < institution > , < school > ): https://sv.wikipedia.org/wiki/Lista_över_universitet_och_högskolor_i_Sverige

< Other_institutions > ?

(15.) < profession > : e.g. https://www.gymnasium.se/yrkesguiden/alla-yrken-10957 https://www.saco.se/studieval/yrken-a-o/

< education > ?

More notes: