Bitcoin

Datasets for Evaluating Text Sanitization Techniques

Authors:

(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23b, 0373 Oslo, Norway and corresponding author ([email protected]));

(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23a, 0373 Oslo, Norway;

(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23a, 0373 Oslo, Norway;

(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23b, 0373 Oslo, Norway;

(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.

Summary and 1 Introduction

2 context

2.1 Definitions

2.2 NLP approaches

2.3 Data edition preserving confidentiality

2.4 Differential confidentiality

3 data sets and 3.1 Benchmark Anonymization of the text (tab)

3.2 Wikipedia biographies

4 Recognize private life entities

4.1 Wikidata properties

4.2 Silver corpus and fine model adjustment

4.3 Evaluation

4.4 Disagreement labeled

4.5 Various semantic type

5 risk of confidentiality indicators

5.1 LLM probabilities

5.2 SPAN classification

5.3 Disturbances

5.4 Sequences labeling and 5.5 Web search

6 Analysis of confidentiality risk indicators and 6.1 evaluation metrics

6.2 Experimental results and 6.3 discussion

6.4 Combination of risk indicators

7 conclusions and future work

Statements

References

Annexes

A. Wikidata human properties

B. Training parameters for recognition of entities

C. Label Agreement

D. LLM probabilities: Basic models

E. Size and performance of the training

F. Disturbance thresholds

3 data sets

There are only a few generic data sets (non -medical) which are devoted to the evaluation of text disinfection approaches. We present below the two data sets used for the training, evaluation and analysis of errors throughout this article.

3.1 Benchmark Anonymization of the text (tab)

The Tab Corpus (Pil´an et al., 2022) is a collection of legal boxes from the European Court of Human Rights 1268, manually annotated to protect the identity of the individual mentioned in the text. These judicial cases are documents rich in PII which are also available for free in use. The documents are in English and each judicial affair is annotated, among others, with a person to be protected as well as with pii spans detected containing the following information:

Table 1: Semantic type as well as selected explanations and examplesTable 1: Semantic type as well as selected explanations and examples

• The semantic type of duration, according to the 8 categories established in Pil´an et al. (2022), described in Table 1.

• that this scope must be masked in order to protect the confidentiality of the specified individual, with three possible values: direct, almost or no mask.

• that duration is a confidential attribute of the individual, such as religious or philosophical beliefs, political opinions, sexual orientation or sexual life, racial or ethnic origin and health, genetic and biometric data.

• Co-referencing links between the relevant expanses refer to the same entity.

Example 1 illustrates a short extract where the ranges are underlined black were marked by the annotators and those underlined dark green were classified by the annotator as indicating a direct or almost identifying mask.

Example 1. The case is from a request (n ° 44521/04) against the Republic of Poland was filed with the Court under article 34 of the Convention for the Protection of Human Rights and Fundamental Freedoms (“The Convention”) by a Polish national, Mr. Leszek Ko Lodzi´nski (“The applicant”), on August 19, 2004.

The documents are on average 1,442 long tokens. The majority of tokens are labeled as almost identifiers (63%), with fewer masked tokens like direct identifiers (4.4%) and the rest of the text annotated by the expanses left unmasked by annotators. The majority of identifiers belonged to the Datetime categories, Org and people who comply with the texts.

Some of the judicial affairs (274) have also been multi-annotated to allow an evaluation against different solutions, because the task is subjective in nature. The interannotist agreement on the type of identifier (direct, almost, without mask), calculated both on the range (K = 0.46) and the level of character (K ​​= 0.79), shows a moderate agreement which is to be expected from a task where several correct solutions may exist. We follow the training, development and tests divided the authors' version with documents of 1,014, 127 and 127 respectively.

3.2 Wikipedia biographies

Papadopoulou et al. (2022) published a collection of 553 biographies of Wikipedia manually annotated for the anonymization of the text. The annotation task was similar to that of the TAB corpus, with the exception of the lack of annotation for confidential attributes, which was not relevant for this set of data.

Example 2 shows an annotated text manually, both for detected PII doors (highlighted in black), and also for the masking decision (underlined in dark green).

Example 2. David Sherwood is a British tennis coach and a retired tennis player. In his only live match from the Davis Cup, Sherwood played in double with Andy Murray by beating the double n ° 4 team from the Israeli world of Jonathan Erlich and Andy Ram.

The documents for this data set are much shorter than those of the TAB corpus. A large majority of PII racks have been marked by annotators as almost to hide identifiers (56%) or direct identifiers (14%), while 30%of the litters were left as in the text without having to be masked, because they were deemed less specific by annotators and therefore less risk. Most of the identifiers of this data set belong to the types of Datetime, Org and Person.

Of the 553 documents, 22 were also doubly annotated to take into account several correct solutions for this task. Cohen k calculated both in the span (k = 0.44) and the character level (k = 0.81) On the masking decision showed a moderate agreement, highlighting the subjective nature of the task. We follow the division offered in Olstad et al. (2023), which includes all the documents annotated in doubles in the set of tests (100 documents), while 453 documents are available for training purposes.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button

Adblocker Detected

Please consider supporting us by disabling your ad blocker