Future Directions for Text Sanitization Research

Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway (corresponding author, [email protected]);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
Table of Links
Abstract and 1 Introduction
2 Background
2.1 Definitions
2.2 NLP Approaches
2.3 Privacy-Preserving Data Publishing
2.4 Differential Privacy
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
3.2 Wikipedia Biographies
4 Privacy-oriented Entity Recognizer
4.1 Wikidata Properties
4.2 Silver Corpus and Model Fine-tuning
4.3 Evaluation
4.4 Label Disagreement
4.5 MISC Semantic Type
5 Privacy Risk Indicators
5.1 LLM Probabilities
5.2 Span Classification
5.3 Perturbations
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
7 Conclusions and Future Work
Declarations
References
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
C. Label Agreement
D. LLM probabilities: base models
E. Training size and performance
F. Perturbation thresholds
7 Conclusions and Future Work
This paper presented a thorough empirical analysis of two key components of text sanitization methods, namely entity recognition and privacy risk indicators.
Our experimental results, based on two complementary corpora, indicate that the entity detection task is the most straightforward step to automate. We highlighted the need to go beyond the mere detection of named entities and identify other types of text spans that may provide identifying information. To this end, we showed how to extract large lists of person-related attributes from a knowledge graph such as Wikidata, and then apply those lists, along with a standard NER model, to automatically annotate a silver corpus of PII spans. This silver corpus can, in turn, serve as training data to fine-tune a domain-specific, privacy-oriented entity recognizer.
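The list-based part of this silver annotation step can be illustrated with a minimal sketch. The gazetteer entries, the labels, and the function name `annotate_silver` are hypothetical and chosen only for illustration; the NER model that the paper combines with the lists is left out here, and the actual pipeline may differ.

```python
import re

# Hypothetical gazetteer. In the paper such lists are derived from Wikidata
# properties; the entries and labels below are invented for illustration.
GAZETTEER = {
    "DEM": ["nurse", "norwegian"],
    "ORG": ["university of oslo"],
}


def annotate_silver(text, gazetteer):
    """Return non-overlapping (start, end, label) spans, preferring longer matches."""
    candidates = []
    for label, terms in gazetteer.items():
        for term in terms:
            # Whole-word, case-insensitive match of each gazetteer term
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                candidates.append((match.start(), match.end(), label))
    # Consider the longest candidates first, then the leftmost ones
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    chosen = []
    for start, end, label in candidates:
        # Keep a candidate only if it does not overlap an already chosen span
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)


text = "She is a Norwegian nurse who studied at the University of Oslo."
spans = annotate_silver(text, GAZETTEER)
for start, end, label in spans:
    print(label, text[start:end])
```

Spans produced this way are only silver labels: they inherit the coverage gaps and ambiguities of the lists, which is why they serve as training data rather than as a final annotation.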
Determining text spans associated with a high re-identification risk is the most challenging part of the task. A system that masks all detected spans leads to a low privacy risk, but also a comparatively low data utility. We presented five distinct indicators of privacy risk, respectively based on LLM probabilities, span classification, perturbations, sequence labelling and web search. Evaluation results comparing the outputs of those approaches with human annotations demonstrate the difficulty of the task. Sequence labelling leads to masking decisions that are most in line with expert annotations, but this approach hinges upon the availability of labelled training data, a requirement that is rarely satisfied in text sanitization tasks. The use of web search also constitutes a promising direction, but suffers from technical constraints arising from the reliance on search engine APIs.
The present paper did not elaborate on how masking should be performed. A common option is to simply replace the selected spans with a black box or a generic string such as “***”. However, to mitigate the loss of data utility, it is often preferable to replace those PII spans with other, less specific spans, such as changing “Oslo” into “[city in Norway]” or “January 6, 2022” into “2022”. Olstad et al. (2023) explore an approach for selecting suitable replacements for different types of PII spans, which we evaluated against human decisions. Such an approach could be extended to different domains and different types of PII spans.
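The difference between the two masking options can be sketched as follows. This is a minimal illustration that assumes non-overlapping character spans and a hand-written replacement table; the helper names `find_span` and `mask_spans` are hypothetical, and the sketch does not reproduce the replacement selection method of Olstad et al. (2023).

```python
def find_span(text, substring):
    """Return the (start, end) character offsets of the first occurrence."""
    start = text.index(substring)
    return start, start + len(substring)


def mask_spans(text, spans, replacements=None, default="***"):
    """Replace each span with its generalization if one is given, else with `default`.

    Assumes the spans do not overlap.
    """
    replacements = replacements or {}
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # unmasked text before the span
        pieces.append(replacements.get(text[start:end], default))
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)


text = "He moved to Oslo on January 6, 2022."
spans = [find_span(text, "Oslo"), find_span(text, "January 6, 2022")]

# Option 1: generic string for every masked span
blackbox = mask_spans(text, spans)

# Option 2: less specific replacements (hand-written here for illustration)
generalized = mask_spans(
    text,
    spans,
    replacements={"Oslo": "[city in Norway]", "January 6, 2022": "2022"},
)

print(blackbox)
print(generalized)
```

The second output keeps the country and the year, which is the kind of residual information that preserves some data utility while removing the most specific details.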
Evaluating text sanitization tasks is a challenging problem, notably due to the presence of multiple equally correct solutions for a given document (Lison et al., 2021). In this work we evaluate against manually annotated documents, using several expert annotators for each document to ensure a reasonable set of possible solutions. Relying on manual annotations is, however, not always feasible, so alternative evaluation methods suitable for the task should also be explored. Re-identification constitutes a promising approach (Scaiano et al., 2016; Mozes and Kleinberg, 2021; Manzanares-Salor et al., 2022), and operates by carrying out an attack aiming to determine whether an individual was part of the sanitized collection of documents, often having access to additional, external resources. Future work will look at the use of such re-identification attacks as an alternative measure of text sanitization strength.
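One simple way to account for several acceptable solutions is to score a system against each annotator separately and keep the closest match. The sketch below is an assumption made for illustration, not the evaluation metric used in this paper; the span offsets, the annotator names, and the function names are hypothetical.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall between two sets of (start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def best_annotator_match(predicted, annotations):
    """Return the annotator whose masking decisions are closest to the system's."""
    scored = {
        name: precision_recall(predicted, gold)
        for name, gold in annotations.items()
    }
    best = max(scored, key=lambda name: f1(*scored[name]))
    return best, scored[best]


# Hypothetical masking decisions for one document, as character offsets
annotations = {
    "annotator_1": {(0, 4), (10, 15)},
    "annotator_2": {(0, 4), (20, 28)},
}
predicted = {(0, 4), (20, 28), (30, 33)}

best, (precision, recall) = best_annotator_match(predicted, annotations)
print(best, round(precision, 2), round(recall, 2))
```

Taking the best match treats each annotator's output as one valid solution. Other aggregation choices, such as averaging over annotators, are equally possible and would yield different scores.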
Declarations
Author Contribution
A.P: conceptualization, implementation of privacy risk indicators (apart from web search), evaluation of privacy-aware entity recognition, evaluation of privacy risk indicators alone and in combination, creation of figures, creation of table, writing and editing of the final manuscript. P.L: supervision, writing and editing of the final manuscript. M.A: web search risk indicator, writing. L.Ø: supervision, writing. I.P: supervision, writing. All authors reviewed the manuscript prior to submission.
Funding
The research leading to these results received funding from the Research Council of Norway (CLEANUP project) under Grant nr. 308904.
Conflict of Interest
The authors have no conflict of interest to declare.
Data Availability Statement
No datasets were generated during the current study.
Acknowledgment
We acknowledge support from the Norwegian Research Council (CLEANUP project, grant nr. 308904).
References
Anandan, B., C. Clifton, W. Jiang, M. Murugesan, P. Pastrana-Camacho, and L. Si. 2012, December. T-plausibility: Generalizing words to desensitize text. Transactions on Data Privacy 5 (3): 505–534.
Beltagy, I., M.E. Peters, and A. Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
Chakaravarthy, V.T., H. Gupta, P. Roy, and M.K. Mohania 2008. Efficient techniques for document sanitization. In Proceedings of the 17th ACM conference on Information and knowledge management, pp. 843–852. ACM.
Chevrier, R., V. Foufi, C. Gaudet-Blavignac, A. Robert, and C. Lovis. 2019, May. Use and understanding of anonymization and de-identification in the biomedical literature: Scoping review. Journal of Medical Internet Research 21. https://doi.org/10.2196/13484.
Chiu, J.P. and E. Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4: 357–370.
Clark, K., M. Luong, Q.V. Le, and C.D. Manning 2020. ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Cumby, C. and R. Ghani. 2011, August. A machine learning based system for semi-automatically redacting documents. Proceedings of the AAAI Conference on Artificial Intelligence 25 (2): 1628–1635. https://doi.org/10.1609/aaai.v25i2.18851.
Dernoncourt, F., J.Y. Lee, O. Uzuner, and P. Szolovits. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association 24 (3): 596–606.
Devlin, J., M.W. Chang, K. Lee, and K. Toutanova 2019, June. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Linguistics.
Ding, S., H. Xu, and P. Koehn 2019, August. Saliency-driven word alignment interpretation for neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), Florence, Italy, pp. 1–12. Association for Computational Linguistics.
Dwork, C., F. McSherry, K. Nissim, and A. Smith 2006. Calibrating noise to sensitivity in private data analysis. In S. Halevi and T. Rabin (Eds.), Theory of Cryptography, Berlin, Heidelberg, pp. 265–284. Springer Berlin Heidelberg.
Elliot, M., E. Mackey, K. O’Hara, and C. Tudor. 2016. The anonymisation decision-making framework. UKAN Manchester.
Erickson, N., J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. 2020. Autogluon-tabular: Robust and accurate automl for structured data. arXiv preprint arXiv:2003.06505.
Fernandes, N., M. Dras, and A. McIver 2019. Generalised differential privacy for text document processing. In F. Nielson and D. Sands (Eds.), Principles of Security and Trust – 8th International Conference, POST 2019, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2019, Prague, Czech Republic, April 6-11, 2019, Proceedings, Volume 11426 of Lecture Notes in Computer Science, pp. 123–148. Springer.
Feyisetan, O., T. Diethe, and T. Drake 2019. Leveraging hierarchical representations for preserving privacy and utility in text. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 210–219. IEEE.
Goldsteen, A., G. Ezov, R. Shmelkin, M. Moffie, and A. Farkash. 2021, September. Data minimization for GDPR compliance in machine learning models. AI and Ethics 2 (3): 477–491. https://doi.org/10.1007/s43681-021-00095-8.
Golle, P. 2006. Revisiting the uniqueness of simple demographics in the US population. In Proceedings of the 5th ACM workshop on Privacy in electronic society, pp. 77–80. ACM.
He, X., K. Zhao, and X. Chu. 2021, January. AutoML: A survey of the state-of-the-art. Knowledge-Based Systems 212: 106622.
HIPAA. 2004. The Health Insurance Portability and Accountability Act. U.S. Dept. of Labor, Employee Benefits Security Administration.
Igamberdiev, T. and I. Habernal 2023. Dp-bart for privatized text rewriting under local differential privacy. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, pp. (to appear). Association for Computational Linguistics.
Johnson, A.E., L. Bulgarelli, and T.J. Pollard 2020. Deidentification of free-text medical records using pre-trained bidirectional transformers. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 214–221.
Juez-Hernandez, R., L. Quijano-Sánchez, F. Liberatore, and J. Gómez. 2023. Agora: An intelligent system for the anonymization, information extraction and automatic mapping of sensitive documents. Applied Soft Computing 145: 110540. https://doi.org/10.1016/j.asoc.2023.110540.
Kindermans, P.J., S. Hooker, J. Adebayo, M. Alber, K.T. Schütt, S. Dähne, D. Erhan, and B. Kim 2019. The (Un)reliability of Saliency Methods, pp. 267–280. Cham: Springer International Publishing.
Krishna, S., R. Gupta, and C. Dupuy 2021, April. ADePT: Auto-encoder based differentially private text transformation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 2435–2439. Association for Computational Linguistics.
Lample, G., M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270.
Lebret, R., D. Grangier, and M. Auli 2016, November. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1203–1213. Association for Computational Linguistics.
Li, J., X. Chen, E. Hovy, and D. Jurafsky 2016, June. Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 681–691. Association for Computational Linguistics.
Li, J., W. Monroe, and D. Jurafsky. 2017. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.
Lison, P., I. Pilán, D. Sánchez, M. Batet, and L. Øvrelid 2021, August. Anonymisation models for text data: State of the art, challenges and future directions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 4188–4203. Association for Computational Linguistics.
Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
Liu, Z., B. Tang, X. Wang, and Q. Chen. 2017. De-identification of clinical notes via recurrent neural network and conditional random field. Journal of Biomedical Informatics 75: S34–S42.
Liu, Z., X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, W. Liu, D. Shen, Q. Li, et al. 2023. DeID-GPT: Zero-shot medical text de-identification by GPT-4. arXiv preprint arXiv:2303.11032.
Manzanares-Salor, B., D. Sánchez, and P. Lison 2022. Automatic evaluation of disclosure risks of text anonymization methods. In Privacy in Statistical Databases: International Conference, PSD 2022, Paris, France, September 21–23, 2022, Proceedings, Berlin, Heidelberg, pp. 157–171. Springer-Verlag.
Mozes, M. and B. Kleinberg. 2021. No intruder, no validity: Evaluation criteria for privacy-preserving text anonymization. arXiv preprint arXiv:2103.09263.
Neamatullah, I., M.M. Douglass, H.L. Li-wei, A. Reisner, M. Villarroel, W.J. Long, P. Szolovits, G.B. Moody, R.G. Mark, and G.D. Clifford. 2008. Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making 8 (1): 32.
Olstad, A.W., A. Papadopoulou, and P. Lison 2023, May. Generation of replacement options in text sanitization. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), Tórshavn, Faroe Islands, pp. 292–300. University of Tartu Library.
Papadopoulou, A., P. Lison, L. Øvrelid, and I. Pilán 2022, June. Bootstrapping text anonymization models with distant supervision. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, pp. 4477–4487. European Language Resources Association.
Papadopoulou, A., Y. Yu, P. Lison, and L. Øvrelid 2022, November. Neural text sanitization with explicit measures of privacy risk. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online only, pp. 217–229. Association for Computational Linguistics.
Pilán, I., P. Lison, L. Øvrelid, A. Papadopoulou, D. Sánchez, and M. Batet. 2022, December. The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization. Computational Linguistics 48 (4): 1053–1101.
Ribeiro, B., V. Rolla, and R. Santos 2023, May. INCOGNITUS: A toolbox for automated clinical notes anonymization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Dubrovnik, Croatia, pp. 187–194. Association for Computational Linguistics.
Samarati, P. and L. Sweeney 1998. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International.
Sasada, T., M. Kawai, Y. Taenaka, D. Fall, and Y. Kadobayashi 2021. Differentially-private text generation via text preprocessing to reduce utility loss. In 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 042–047.
Scaiano, M., G. Middleton, L. Arbuckle, V. Kolhatkar, L. Peyton, M. Dowling, D.S. Gipson, and K. El Emam. 2016. A unified framework for evaluating the risk of re-identification of text de-identification tools. Journal of Biomedical Informatics 63: 174–183.
Serrano, S. and N.A. Smith 2019, July. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2931–2951. Association for Computational Linguistics.
Shi, X., J. Mueller, N. Erickson, M. Li, and A. Smola 2021. Multimodal automl on structured tables with text fields. In 8th ICML Workshop on Automated Machine Learning (AutoML).
Sweeney, L. 1996. Replacing personally-identifying information in medical records, the scrub system. In Proceedings of the AMIA annual fall symposium, pp. 333–337. American Medical Informatics Association.
Sánchez, D. and M. Batet. 2016. C-sanitized: A privacy model for document redaction and sanitization. Journal of the Association for Information Science and Technology 67 (1): 148–163.
Sánchez, D., L. Martínez-Sanahuja, and M. Batet. 2018. Survey and evaluation of web search engine hit counts as research tools in computational linguistics. Information Systems 73: 50–60.
Tjong Kim Sang, E.F. and F. De Meulder 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147.
Weitzenboeck, E.M., P. Lison, M. Cyndecka, and M. Langford. 2022, March. The GDPR and unstructured data: is anonymization possible? International Data Privacy Law 12 (3): 184–206. https://doi.org/10.1093/idpl/ipac008.
Yang, H. and J.M. Garibaldi. 2015. Automatic detection of protected health information from clinic narratives. Journal of Biomedical Informatics 58: 30–38. https://doi.org/10.1016/j.jbi.2015.06.015.
Yermilov, O., V. Raheja, and A. Chernodub 2023, July. Privacy- and utility-preserving NLP with anonymized data: A case study of pseudonymization. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Toronto, Canada, pp. 232–241. Association for Computational Linguistics.
Yogarajan, V., M. Mayo, and B. Pfahringer. 2018. A survey of automatic de-identification of longitudinal clinical narratives. arXiv preprint arXiv:1810.06765.
Zarcone, A., M. Van Schijndel, J. Vogels, and V. Demberg. 2016. Salience and attention in surprisal-based accounts of language processing. Frontiers in psychology 7: 844.