Annotation Guidelines for Personal Information in Medical Texts.

Luis Miranda (1), Jocelyn Dunstan (1), Fredy Núñez (2)

  1. Department of Computer Science, School of Engineering. Pontificia Universidad Católica de Chile.
  2. Department of Language Sciences, Faculty of Linguistics and Literature. Pontificia Universidad Católica de Chile.

Spanish version: Guía Deidentificación Clínica

Introduction

In an increasingly digitized world, where information constantly flows through many channels and platforms, protecting privacy and securing sensitive data have become unavoidable priorities. Correctly annotating and labeling sensitive information is therefore essential to meet high standards of confidentiality and to comply with the legal regulations that protect personal and medical data. This guide provides precise, well-defined instructions for manually annotating key entities related to clinical personal information.

This project aims to create a textual dataset (corpus), based on authentic texts, labeled with clinical personal information. This guide supports the work of the annotators. The INCEpTION annotation software was used in its development.

[Figure: INCEpTION user interface]

Law No. 19.628 on the Protection of Private Life

The privacy of patient data is a central concern in the health field in Chile. In this regard, Chilean law establishes specific regulations to protect the confidentiality of medical and personal information of patients. Law No. 19.628 on the protection of private life regulates the collection and processing of personal data. This law distinguishes the following concepts:

  • Personal information: refers to any data or set of data from which it is possible to directly or indirectly identify a natural person. This includes data such as names, surnames, identification numbers, addresses, phone numbers, email addresses, birth dates, among others. In short, it is any information that allows identifying or relating to a specific person.

  • Sensitive information: refers to a subset of personal information that encompasses particularly delicate data or data that can have a more significant impact on a person’s privacy. This category may include information related to health status or condition, sexual orientation, religious or philosophical beliefs, criminal records, union affiliations, or other data that, if disclosed or misused, could cause harm, discrimination, or negative impact on a person’s life.

Law No. 20.584 on the Rights and Duties of Patients

In the medical field, Law No. 20.584 on the rights and duties of patients, along with its respective regulation, establishes clear guidelines for handling sensitive patient information so that patients' rights and privacy are respected:

  • Patient rights: The law guarantees patients a series of fundamental rights in relation to their medical care and the privacy of their clinical data. These rights include access to information from their medical history, obtaining a copy of their medical records, informed consent before any procedure or treatment, and the right to confidentiality and security of their medical data.

  • Informed consent: The law emphasizes the importance of informed consent as the right of patients to receive complete and understandable information, before undergoing any medical procedure, about their diagnosis, treatment options, risks, and benefits. This consent must be given voluntarily and explicitly by the patient, after receiving adequate information from a healthcare professional.

  • Privacy and Confidentiality: The law establishes strict rules for the confidentiality of patients’ clinical data. Healthcare professionals and medical institutions must ensure that clinical information is handled securely and shared only with those who have the corresponding authorization. As cited in article 3:

    “For the purposes of personal data processing, it is understood that the provider is responsible for keeping the records or databases of patients generated during the management of health support systems, and providers will have the responsibilities of a representative, as provided in Law No. 19.628 on the protection of private life.”

Health Insurance Portability and Accountability Act (HIPAA) in the USA

Within the medical field, protected health information (PHI) refers to medical, mental health, or clinical history information protected by the Health Insurance Portability and Accountability Act (HIPAA) in the United States. This law sets rigorous standards for protecting the privacy and security of patients’ medical and personal data. HIPAA defines a set of 18 data categories, known as the “18 PHI Identifiers”, that could be used to identify a particular individual and must therefore receive additional protection. The identifiers are:

    1. Name
    2. Address (all geographic subdivisions smaller than a state, including street address, city, county, and zip code)
    3. All elements (except years) of dates related to an individual (including birth date, admission date, discharge date, death date, and exact age if over 89 years)
    4. Phone numbers
    5. Fax numbers
    6. Email addresses
    7. Social Security numbers
    8. Medical record numbers
    9. Health plan beneficiary numbers
    10. Account numbers
    11. Certificate or license numbers
    12. Vehicle identifiers and serial numbers, including license plate numbers
    13. Device identifiers and serial numbers
    14. Web URLs
    15. Internet Protocol (IP) addresses
    16. Fingerprints or voiceprints
    17. Photographic images: photographic images are not limited to images of the face
    18. Any other characteristic that could uniquely identify an individual
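As an illustration of the date and age item in the list above, the following Python sketch keeps only the year of a date and groups ages over 89 into a single category. The function names and the "90+" output string are assumptions made for this example; they are not prescribed by HIPAA or by this guide.

```python
from datetime import date


def generalize_date(d: date) -> str:
    """Keep only the year, dropping day and month (hypothetical helper)."""
    return str(d.year)


def generalize_age(age: int) -> str:
    """Report ages over 89 as one group; "90+" is an assumed notation."""
    return "90+" if age > 89 else str(age)


print(generalize_date(date(2018, 10, 2)))
print(generalize_age(89), generalize_age(93))
```

Ages of 89 or less are left unchanged, because the list only names exact ages over 89 as an identifier.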

HIPAA establishes rigorous guidelines for the protection of health information in general and these identifiers in particular. This includes the implementation of physical, administrative, and technical security measures to prevent unauthorized access, misuse, and disclosure of medical and personal data. Compliance with these regulations is essential to safeguard the privacy and confidentiality of patients and their medical information in the United States.

These 18 PHI identifiers have been adapted for application in the Chilean context, especially in the field of medical texts linked to the Asociación Chilena de Seguridad (ACHS), which include admission reports and medical histories of work or commute accidents. These texts, which are fundamental for the evaluation and follow-up of clinical cases, must be treated with the same rigor regarding the protection of personal/sensitive data. The adoption of appropriate security measures, both physical and technological, is essential to preserve the privacy and confidentiality of patients, ensuring that their medical information is protected and aligned with the highest standards of security and privacy.

Medical Texts, Medical History, and Admission Report

In the Asociación Chilena de Seguridad (ACHS), medical texts are highly relevant in situations related to work and commute accidents. One of these texts is the medical history, defined as the set of questions directed to the patient to obtain relevant medical information about their health history, symptoms, and pre-existing conditions. An example of a medical history is shown below:

  • AM:- RAM:- TODAY APPROX 9:30 SLIPPED ON SLIPPERY FLOOR MOVING A FEW METERS BUT REGAINED BALANCE WITHOUT FALLING TO THE GROUND, STARTING WITH SUDDEN PROGRESSIVE PAIN IN THE POSTERIOR ASPECT OF THE RIGHT THIGH. PLAN: DUE TO CLINIC AND REFRACTORY PAIN REQUEST URGENT DOPPLER ULTRASOUND. DVT VS SEMIMEMBRANOSUS TEAR FOLLOW-UP -DOPPLER ULTRASOUND (-) FOR DVT. (+) FOR TEAR WITH HEMATIC COLLECTION -PERSISTS WITH PAIN AND FUNCTIONAL IMPAIRMENT. PLAN 2: -PARTIAL REFERRAL. INSUFFICIENT MECHANISM-CONTROL THROUGH THEIR INSURANCE SYSTEM. FOLLOW-UP AT 15:43 DR. JARA.

Another medical text is the admission report. In the case of the ACHS, the admission report consists of a detailed record of the events that caused an accident, providing a clear context for case analysis and follow-up. An example text for the admission report is as follows:

  • AT THE TIME OF THE ACCIDENT, I WAS ON MY WAY TO WORK. WHAT HAPPENED WAS THAT WHILE GOING DOWN THE STAIRS OF THE METRO STATION, I SLIPPED AND FELL TO THE FLOOR. THE ACCIDENT OCCURRED WITH OTHERS. THERE ARE NO WITNESSES TO MY ACCIDENT, I NOTIFIED THE COMPANY, THE NAME AND POSITION OF THE PERSON IS ANDREA CARRASCO, SUPERVISOR, DATE AND TIME I NOTIFIED THE COMPANY ABOUT THE ACCIDENT: 02.10.2018 AT 09:40:00

Each example used in this annotation guide was extracted from the corpus or created for explanatory purposes from the same corpus (all personal information was modified).

Manual Entity Annotation Rules

The fundamental premises governing all annotation rules in this protocol are:

  • Annotate the shortest and most general possible expression, considering that it must still fully describe the entity. In this sense, modifying words should be excluded. This helps preserve the coherence and flow of the text after anonymization.

  • Maintain the clinical context, ensuring that the annotations of named entities in clinical texts retain the medical and contextual information accompanying those entities, as these texts often contain detailed medical information that may be crucial for proper understanding and treatment of patients.

The annotation rules can be classified into 4 types:

  • General Rules (Rules-G): positive and negative rules that apply to all mention labels (including general orthotypographic rules).
  • Positive Rules (Rules-P): rules that specify the entities that should be annotated.
  • Negative Rules (Rules-N): rules that specify the entities that should NOT be annotated.
  • Multiword Rules (Rules-M): rules that specify whether a group of words should be annotated under a single label or not.

General Rules

  • Do not include spaces or orthotypographic signs that appear before or after each mention in the label.

    • Correct:
FECHA Y HORA EN QUE AVISO A SU EMPRESA SOBRE EL ACCIDENTE: 09.03.2023 A LAS 08:30:00. ACCIDENTE OCURRIO EN...
T1 Time 76 84

    • Incorrect (the span includes the space before the time and the period after it):
FECHA Y HORA EN QUE AVISO A SU EMPRESA SOBRE EL ACCIDENTE: 09.03.2023 A LAS 08:30:00. ACCIDENTE OCURRIO EN...
T1 Time 75 85

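The rule above can be sketched in Python as a function that shrinks a span until it neither starts nor ends with whitespace or a punctuation sign. The offsets are treated as zero-based character positions with an exclusive end, which matches the time example above. The name `trim_span` and the exact set of trimmed characters are assumptions of this sketch, not part of the guide or of INCEpTION.

```python
import string

# Characters removed from both ends of a mention: whitespace, ASCII
# punctuation, and a few Spanish signs (an assumed set, not from the guide).
TRIM_CHARS = string.whitespace + string.punctuation + "¿¡«»"


def trim_span(text, start, end):
    """Shrink [start, end) so it has no leading or trailing trimmed character."""
    while start < end and text[start] in TRIM_CHARS:
        start += 1
    while end > start and text[end - 1] in TRIM_CHARS:
        end -= 1
    return start, end


text = ("FECHA Y HORA EN QUE AVISO A SU EMPRESA SOBRE EL ACCIDENTE: "
        "09.03.2023 A LAS 08:30:00. ACCIDENTE OCURRIO EN...")
start, end = trim_span(text, 75, 85)
print(start, end, repr(text[start:end]))  # 76 84 '08:30:00'
```

Only the boundaries are trimmed; signs inside the mention, such as the colons in the time, are kept.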
  • Annotate each mention of a named entity that can be generally associated with any of the entities defined in this guide as a single entity, as in the correct example below, and avoid overspecification, as in the incorrect example.

    • Correct:
...Paciente estuvo en Jumbo La Reina, en donde mientras trabajaba se accidentó...
T1 Company 22 36

    • Incorrect:
...Paciente estuvo en Jumbo La Reina, en donde mientras trabajaba se accidentó...
T1 Company 22 27
T2 Location 28 36
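A possible way to spot overspecification automatically is to flag annotations that are separated only by whitespace, since they may be pieces of one entity. The sketch below is a heuristic under that assumption. The function name and the tuple layout `(id, label, start, end)` are hypothetical, and legitimate neighbors, such as a First_Name followed by a Last_Name, will also be flagged and need manual review.

```python
def adjacent_pairs(text, annotations):
    """Return id pairs of annotations separated only by whitespace.

    Each annotation is a tuple (id, label, start, end) with a zero-based
    start and an exclusive end.
    """
    ordered = sorted(annotations, key=lambda a: a[2])
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = text[left[3]:right[2]]
        if gap.strip() == "":
            pairs.append((left[0], right[0]))
    return pairs


text = "...Paciente estuvo en Jumbo La Reina, en donde mientras trabajaba se accidentó..."
name = "Jumbo La Reina"
start = text.index(name)

# Overspecified: the store name is split into a Company and a Location.
split = [("T1", "Company", start, start + len("Jumbo")),
         ("T2", "Location", start + len("Jumbo "), start + len(name))]
# Preferred: one annotation covers the whole mention.
single = [("T1", "Company", start, start + len(name))]

print(adjacent_pairs(text, split))   # [('T1', 'T2')]
print(adjacent_pairs(text, single))  # []
```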

Specific Rules

The rules associated with each of the entities to be annotated are defined and explained in:

  • Entities

    • Age: Label for a person’s age.

    • Company: Label for institution names.

    • Date_Part: Label for component(s) of a date.

    • Email: Label for email.

    • First_Name: Label for the first name of a patient, doctor, or any other person mentioned.

    • Full_Date: Label for an exact date.

    • Health_Care_Unit: Label for names of health institutions.

    • Last_Name: Label for a person’s last name.

    • Location: Label for any findable geographical subdivision.

    • Occupation: Label for a person’s job, profession, or occupation.

    • Phone_Number: Label for phone number.

    • RUN: Label for the Rol Único Nacional (RUN), the Chilean national identification number.

    • Queries for problematic cases. If you identify doubtful or ambiguous cases, you are encouraged to search the internet for company names, surnames, first names, etc. If the doubt persists or relates to the annotation rule, write to the researcher Luis Miranda.
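For checking exported annotations, the entity labels listed above could be held in a small Python dictionary and used to report labels that are not part of this guide. This is a minimal sketch: the names `ENTITY_LABELS` and `unknown_labels` are hypothetical, the date-component label is assumed to be written with an underscore like the other multiword labels, and labels that appear only in examples (such as Time) are not included.

```python
# Labels and short descriptions copied from the entity list in this guide.
# "Date_Part" with an underscore is an assumption of this sketch.
ENTITY_LABELS = {
    "Age": "a person's age",
    "Company": "institution names",
    "Date_Part": "component(s) of a date",
    "Email": "email",
    "First_Name": "first name of any person mentioned",
    "Full_Date": "an exact date",
    "Health_Care_Unit": "names of health institutions",
    "Last_Name": "a person's last name",
    "Location": "any findable geographical subdivision",
    "Occupation": "a person's job, profession, or occupation",
    "Phone_Number": "phone number",
    "RUN": "Rol Único Nacional (RUN)",
}


def unknown_labels(annotation_lines):
    """Return labels used in 'ID Label start end' lines but absent from the guide."""
    unknown = []
    for line in annotation_lines:
        label = line.split()[1]
        if label not in ENTITY_LABELS and label not in unknown:
            unknown.append(label)
    return unknown


print(unknown_labels(["T1 Company 0 5", "T2 Place 9 17"]))  # ['Place']
```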