This glossary was created by the consortium partners of the European project HealthyCloud with the aim of establishing agreed definitions of commonly used terms to harmonise the work of the project. The aim is that the glossary can also be used by other projects working in similar areas. The glossary was created through a series of glossary working group calls, during which each definition was discussed and agreed upon. This is a living document that will be updated regularly with the addition of new terms and modifications, if needed, of the terms and definitions currently there.

For the source of the specific definitions, please check the original HealthyCloud glossary listed below.

A few terms are specific to PHIRI's activities and are not part of the HealthyCloud glossary; these are marked with an asterisk (*).


Irene Kesisoglou, Shona Cosgrove, Pascal Derycke, Petronille Bogaert, Annika Jacobsen, Marco Roos, Anna Niemeyer, Alicia Martinez Garcia, Adrian Thorogood, Petr Holub, Irene Schluender, Salvador Capella-Gutierrez, Juan Gonzalez Garcia, Celia Alvarez Romero, Teresa D'Altri, Amy Curwin, Laura Portell Silva, & Lidia López Cuesta. (2022). Glossary of commonly used terms in the field of health data research - developed by the EU project HealthyCloud (1.0). Zenodo.

Zenodo link to the source



Aggregated data: Aggregated data is pooled data. Statistical data about several individuals that have been combined to show general trends or values within the data. Aggregated data are not necessarily anonymised data.
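As a minimal sketch of the idea, assuming a hypothetical list of individual-level records (the field names `region` and `age` and all values are illustrative, not taken from the glossary), pooling can be expressed as grouping and summarising:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level records (illustrative values only).
records = [
    {"region": "North", "age": 34},
    {"region": "North", "age": 58},
    {"region": "South", "age": 41},
]

def aggregate_by_region(rows):
    """Pool individual rows into one summary row per region."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["region"]].append(row["age"])
    return {
        region: {"count": len(ages), "mean_age": mean(ages)}
        for region, ages in groups.items()
    }

summary = aggregate_by_region(records)
```

In this sketch the "South" group contains a single person, so its summary row still describes one individual. This is one way to see why aggregated data are not necessarily anonymised.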

Anonymisation: The processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject. Removing personally identifiable information, so as to definitively not allow the identification of the data subjects. The methods used to anonymise the data are context dependent.

Cloud: Network of computing facilities providing remote data storage and processing services through the internet.

Cloud computing: Paradigm for enabling network access to a scalable and elastic pool of shareable physical or virtual resources with administration on-demand.

Comparative population health research: Comparisons of health status, determinants and service use across countries, and/or over time.

Consent: An individual’s agreement, e.g. to participate in research, to undergo a healthcare procedure, or to personal data processing. Within the context of personal data, the GDPR defines consent as: Any freely given, specific, informed and unambiguous indication of the data subject's wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her.

Data: Discrete observations of attributes or events that carry little meaning when considered alone.

  • Data can be defined as the recorded factual material that is commonly accepted in the scientific community as information that is required to support research findings.
  • Refers to any digital representation of acts, facts or information and any compilation of such acts, facts or information, including in the form of sound, visual or audiovisual recording.
  • There are four major types of data, categorised by where the data comes from: observational, experimental, simulated and derived.
  • Data is information available for processing.


Data access: The processing of data by a data user, which was provided by a data holder, in accordance with specific technical, legal, or organisational requirements, without necessarily implying the transmission or downloading of such data.

Data access right: The ability, right or permission to act on data in a defined location.

Data altruism: Consent by data subjects to process personal data pertaining to them, or permissions of other data holders to allow the use of their non-personal data without seeking a reward, for purposes of general interest, such as scientific research purposes or improving public services.

Data centric health research computational infrastructure: This infrastructure provides data as a service. This infrastructure includes services, such as data visualisation, hosting and processing of data. In particular, it can process health-related sensitive data. Technological infrastructures for data analysis, exploitation and/or processing.

Data controller: Under Regulation (EU) 2018/1725, as well as under the GDPR, the data controller is the party that, alone or jointly with others, determines the purposes and means of the processing of personal data. The actual processing may be delegated to another party, called the data processor. The controller is responsible for the lawfulness of the processing, for the protection of the data, and respecting the rights of the data subject. The controller is also the entity that receives requests from data subjects to exercise their rights.

Data curator: A person who is responsible for the quality and FAIRness of the health-related data, and for making sure the value of the data is discovered and accessible. This role also considers the possibility of enriching data when increasing its quality. Importantly, data curators might act as processors, e.g. be responsible for the data at hand.

Data discoverability: The ability or a mechanism to browse and locate available data relevant to a specific user’s purpose (e.g., research project) in a non-targeted search. Data is more discoverable if the data collection has metadata and the metadata is publicly accessible. Discoverability is related to findability from the FAIR principles.

Data model: In a distributed research infrastructure, data models are a formal description of data sources (entities, their attributes and their relationships) and metadata specific to a scientific study, that are the basis for semantic interoperability, thus allowing reliable comparative research.

Data governance: Assembly of policies and processes, coordination aspects, data usage and accessibility principles and data management procedures for a certain health data infrastructure to ensure legal compliance, consistency and good data quality throughout the different stages of the data life cycle.

Data processor: According to Article 3 (12) of Regulation (EU) 2018/1725, a processor shall mean "a natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller." The essential element is therefore that the processor only acts "on behalf of the controller" and thus only subject to his instructions. In some cases, the processor may choose not to process the data himself, but may have recourse to a subcontractor who processes the data on his behalf. In practice, this will depend upon the processor agreement entered into with the controller.

Data provider/holder: Any natural or legal person, which is an entity or a body in the health or care sector, or performing research in relation to these sectors, as well as European Union institutions, bodies, offices and agencies who has the right or obligation, or the ability to make available, including to register, provide, restrict access or exchange certain data.

Data quality: The degree to which a set of inherent characteristics of data fulfills requirements. Notes: The requirements are defined by the purpose of the processing and hence data quality can be viewed in other words also as a “fitness for purpose”. The purpose can be any use of the data, including primary use or secondary use. For the purpose of data protection, data quality refers to a set of principles laid down in Article 5 of the GDPR and Article 4 of Regulation (EU) 2018/1725, namely:

  • Lawfulness, fairness and transparency
  • Purpose limitation
  • Data minimisation
  • Accuracy
  • Storage limitation
  • Integrity and confidentiality


Data reuse: The simplest form of data reuse is using the same data in the same way more than once (e.g. starting with an original dataset, and drawing different research inferences). The data can also be repurposed and used for another intent (e.g. using data from health insurance registries for health monitoring, or for research).

Data sharing: Provision of data by a data controller to a data user for the purpose of joint or individual use of the shared data, based on conditions of use, directly or through an intermediary.

Data subject: As defined in the GDPR, a data subject is a person who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

Data user: A natural or legal person/organisation who has lawful access to certain personal or non-personal data and is authorised to use that data for commercial or noncommercial purposes.

Dataset: Collection of data that is represented in a particular form. Datasets will vary depending upon the type of intended use, and how the collecting organization has decided to organize its data upon collection. Dataset is essentially a heterogeneous term that could be made up of any type of collection for any type of data.

Dataset catalogue: A collection of datasets descriptions, which is arranged in a systematic manner and consists of a user-oriented public part, where information concerning individual dataset parameters is accessible by electronic means through an online portal.

Data steward: A person who has an administrative role; they do not really use the data. They create guidelines to make data FAIR and advise on how to do it. Stewards might have direct responsibility for the data at hand (processors) or not.

Distributed research infrastructure: A decentralized and organised network of resources.

FAIR Data principles: Principles to define the Findability, Accessibility, Interoperability, and Reuse of resources for humans and computers at the source. For example, the principles emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.

  • Findable: Data and supplementary materials have sufficiently rich metadata and a unique and persistent identifier.
  • Accessible: Metadata and data are understandable to humans and machines. Data is deposited in a trusted repository.
  • Interoperable: Metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • Re-usable: Data and collections have clear usage licenses and provide accurate information on provenance.


Federated data analysis: Federated data analysis describes an analysis that is performed on multiple (often geographically) separated datasets. During this analysis, the data is not exchanged and can stay, for example, behind a given institution’s firewall. Only the interim results of a local analysis are exchanged between the data-hosting sites. The aggregated non-identifiable results from each local analysis are pooled and returned to the data user.
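The flow described above can be sketched as follows. This is an illustrative example only: the site names, the values, and the choice of a pooled mean as the analysis are assumptions, and a real deployment would add authentication, disclosure checks and network transport.

```python
# Hypothetical local datasets held at three sites; raw values never leave a site.
site_data = {
    "site_a": [120, 135, 128],
    "site_b": [140, 132],
    "site_c": [125, 130, 138, 127],
}

def local_analysis(values):
    """Run at each site: return only interim aggregates, not the rows."""
    return {"n": len(values), "total": sum(values)}

def pool(interim_results):
    """Run by the coordinator: combine interim results into a pooled mean."""
    n = sum(r["n"] for r in interim_results)
    total = sum(r["total"] for r in interim_results)
    return total / n

# Only the small interim dictionaries are exchanged between sites.
interim = [local_analysis(values) for values in site_data.values()]
pooled_mean = pool(interim)
```

The data user receives `pooled_mean` only. The individual measurements stay within `site_data` at each site.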

Federated database: A federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized.

Federated learning: This is a specific case of federated data analysis, for machine learning purposes. It is a learning technique that allows users to collectively reap the benefits of shared models trained from rich data collections. The learning task is conducted across multiple separate sites coordinated centrally. Each site has a local training dataset which is never shared. Instead, each site computes an update to the current global model maintained centrally, and only this updated model is communicated.
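A toy sketch of one such scheme is given below. It assumes a one-parameter model y = w * x, a single gradient step per site per round, and a central average weighted by local sample count. These choices, the data and the function names are illustrative assumptions and are not prescribed by the glossary.

```python
# Hypothetical local training sets of (x, y) pairs; they are never shared.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.0, 2.0), (4.0, 8.0), (2.0, 4.0)],
]

def local_update(w, data, lr=0.05):
    """Run at a site: one gradient step on squared error for y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_round(w, sites):
    """Run centrally: average the site updates, weighted by sample count."""
    total = sum(len(data) for data in sites)
    return sum(len(data) * local_update(w, data) for data in sites) / total

w = 0.0  # initial global model
for _ in range(50):
    w = federated_round(w, sites)
```

Only the updated value of `w` travels between each site and the coordinator. With this toy data the global model approaches w = 2.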

Health data: Personal data related to the physical or mental health of a natural person, including the provision of health care services, which reveal information about his or her health status.

Health data collection: A technical infrastructure that holds datasets, makes datasets available for use, and organises data in a logical manner. The datasets may come from different sources, hospitals and/or research institutes from the same country (national data repositories) or different countries (international data repositories). Data collections may also cover appropriate, subject-specific locations where researchers can submit their data. Data collections may have specific requirements concerning subject or research domain; data reuse and access; file format and data structure; and the types of metadata that can be used.

Minimal inclusion criteria:

  • A digital platform that receives and stores data
  • It receives data from a single source and/or multiple sources
  • Allows discovery of the stored health data
  • It must have control over the data stored


Other possible characteristics of a data collection:

  • It could have a specific theme or data type that it collects (e.g. a particular disease, or a particular data type such as genomic data, clinical data, EHRs…)
  • It could be part of one or more overarching data hubs
  • It could generate data


Health data hub:

Minimal inclusion criteria:

  • A digital technical infrastructure with the core mission of enabling health data sharing
  • It provides health data from different sources
  • It allows discovery of health datasets
  • It has a metadata discovery service
  • It has a data accessibility mechanism in accordance with existing regulation 
  • It has an authorization functionality, provided by the same Data Hub or by an external institution.


Health information: All organised and contextualised data on population health and health service activities and performance, individual or aggregated, that improves health promotion, prevention, care and policy-making.

Health information systems: A health information system is the total of resources, stakeholders, activities and outputs enabling evidence-informed health policy-making. The health information system manages all types of health data, from EHRs to imaging data and population health data. HIS activities include: data collection, interpretation (analysis and synthesis), health reporting, and knowledge translation, i.e. stimulating and enhancing the uptake of health information into policy and practice. Health information system governance relates to the mechanisms and processes to coordinate and steer all elements of a health information system. 

Infodemic: An “infodemic” is an overabundance of information – some accurate and some not – that occurs during an epidemic. It spreads between humans in a similar manner to an epidemic, via digital and physical information systems. It makes it hard for people to find trustworthy sources and reliable guidance when they need it. 

Information: Data which is contextualised, i.e. reduced, summarized and adjusted for variations such as the age and sex of the population so that comparisons over time and place are possible.

Infrastructure provider: The responsible organisation to support the physical management of health-related data following existing regulations. Parent definition for data hub, data collection and secure processing environment.

Input Data: Data provided to or directly acquired by an AI system on the basis of which the system produces an output.

Intelligence: The product of information being transformed through integration and processing with experience and perceptions based on social and political values.

Interoperability: Following the European Interoperability Framework, interoperability refers to a) full compliance with the legal and ethical provisions in each constituent node; b) an organisation that supports knowledge exchange and software transference across nodes; c) a compatible technological environment that supports the communication between nodes and allows the deployment of the computational tasks; and d) the existence of common data models that enable semantic standardisation across data sources. In a distributed research infrastructure, interoperability is a key feature for its governance and achievements.

Knowledge translation: The appropriate exchange, synthesis and ethically sound application of knowledge to interventions that strengthen the healthcare system and improve health.

Machine learning: A subset of AI techniques based on the use of statistical and mathematical modelling techniques to define and analyse patterns in data. Such learned patterns are then applied to perform or guide certain tasks and make predictions.

Metadata: A set of data that defines and describes a resource (e.g., data, dataset, sample...) so that it can be understood, discovered and reused. There are different levels of metadata. Since metadata can be used to describe different aspects of data, we can group metadata properties in terms of quality, availability, provenance, processing, among others. Then there are metadata catalogues that can be developed to describe the available data collections in a repository or hub. Metadata is important to make data understandable, and can contribute to increase the findability, accessibility, interoperability and reusability of the data. Metadata can be collected or compiled in repositories to improve the FAIRness level of the data collections.

Mock-up data: Low-quality synthetic data that do not contain patterns such as correlations between variables (e.g. a social factor depending on area). Mock-up data can be used to see what the real data look like, or to create analysis scripts. However, they cannot be used for real analysis (e.g. machine learning) as they do not reflect the real data well*.

National Node: A National Node (NN) is an organisational entity, often linked to a national institution or governmental unit that functions as a national liaison and brings together relevant national stakeholders in the country in a systematic way. The relevant stakeholders may include, for example, the national statistical office, the national public health institutes, representatives from ministries of health, research and/or science, and others. In addition, the NN may function as a discussion and advisory forum in matters of health data and information both for national or international matters. Examples include aspects of the governance of data, indicators and health reporting at the international level and health information stakeholders at national level.

Non-personal data: All data other than personal data.

Open data: Data that is freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. An open license is a license agreement which contains provisions that allow other individuals to reuse another creator's work, giving them four major freedoms. Without a special license, these uses are normally prohibited by copyright law or commercial license. Most free licenses are worldwide, royalty-free, non-exclusive, and perpetual (see copyright durations). Free licenses are often the basis of crowdsourcing and crowdfunding projects.

Open science: The movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of an inquiring society, amateur or professional. Open science is transparent and accessible knowledge that is shared and developed through collaborative networks. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open-notebook science, and generally making it easier to publish and communicate scientific knowledge. 

Personal data: According to Article 3 (1) of Regulation (EU) 2018/1725: "‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person". Name and the social security number are two examples of personal data which relate directly to a person. However, the definition extends further and also encompasses for instance e-mail addresses and the office phone number of an employee. Other examples of personal data can be found in information on physical disabilities, in medical records and in an employee's evaluation. Personal data which is processed in relation to the work of the data subject remain personal/individual in the sense that they continue to be protected by the relevant data protection legislation, which strives to protect the privacy and integrity of natural persons. As a consequence, data protection legislation does not address the situation of legal persons (apart from the exceptional cases where information on a legal person also relates to a physical person).

Personal data breach: A breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorised disclosure of, or access to, personal data transmitted, stored or otherwise processed.

Population Health Data/Health information: All organised and contextualised data on health and health service activities and performance, at individual or population level, that is fit-for-use and contributes to health promotion, prevention, care and policy-making.

Primary use of data: The use of any data for the purpose for which it was originally collected.

Processing (personal and non-personal): Any operation or set of operations which is performed on data or on datasets, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.

Profiling: Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyse or predict aspects concerning that natural person's performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements.

Pseudonymisation: The processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.
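One common way to implement this, shown here as a sketch rather than as the method the glossary prescribes, is to replace a direct identifier with a keyed hash. In this example the secret key plays the role of the "additional information" that must be kept separately. The record fields, the key value and the 16-character truncation are illustrative assumptions.

```python
import hashlib
import hmac

# Assumption: the secret key is the "additional information" and would be
# stored separately from the data, under its own access controls.
SECRET_KEY = b"example-key-kept-elsewhere"

def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed, repeatable pseudonym."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical record; only the identifier field is replaced.
record = {"patient_id": "NL-000123", "diagnosis": "J45"}
pseudonymised = {**record, "patient_id": pseudonymise(record["patient_id"])}
```

The same identifier always maps to the same pseudonym under a given key, so records can still be linked. Whoever holds the key can recompute the pseudonym for a known identifier, which is why the result remains personal data rather than anonymised data.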

Public health monitoring and reporting: The activities necessary to obtain health data and information and bring this information into health policy and practice.  

Research Network: A Research Network (RN) is an active network of national and/or regional experts from several countries that perform comparative research in a specific health area (information domain).

Restriction of processing (personal and non-personal data): As defined by the GDPR, methods by which to restrict the processing of data could include, inter alia, temporarily moving the selected data to another processing system, making the selected personal data unavailable to users, or temporarily removing published data from a website. In automated filing systems, the restriction of processing should in principle be ensured by technical means in such a manner that the personal data are not subject to further processing operations and cannot be changed. The fact that the processing of data is restricted should be clearly indicated in the system. 

Secondary use of data/data re-use: Secondary use refers to using data for a different purpose than the one it was originally collected for (i.e. than the primary use). According to the European Data Governance Act 2020 ‘re-use’ means the use by natural or legal persons of data held by public sector bodies, for commercial or noncommercial purposes other than the initial purpose within the public task for which the data were produced, except for the exchange of data between public sector bodies purely in pursuit of their public tasks. Clinical definition: Secondary use of health data applies personal health information (PHI) for uses outside of direct health care delivery.

Secure processing environment: The physical or virtual environment and organisational means to provide the opportunity to re-use data in a manner that allows for the operator of the secure processing environment to determine and supervise all data processing actions, including the display, storage, download and export of the data and the calculation of derivative data through computational algorithms.

Sensitive data: Information that is regulated by law due to possible risk for plants, animals, individuals and/or communities and for public and private organisations. Sensitive personal data include information related to racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership and data concerning the health or sex life of an individual. These are data that could be identifiable and potentially cause harm through their disclosure.

Synthetic data: The concept of synthetic data generation is to take an original data source (dataset) and create new, artificial data, with similar statistical properties from it. Keeping the statistical properties means that anyone analysing the synthetic data, a data analyst for example, should be able to draw the same statistical conclusions from the analysis of a given dataset of synthetic data as he/she would if given the real (original) data. The use of synthetic data is growing in many fields: from training of artificial intelligence models within the health sector to computer vision, image recognition and robotics fields.

*Synthetic data should not be confused with "mockup data", which, although close, is not synonymous. Mockup data is primarily used for initial testing purposes, e.g. in design phases to prototype applications. It does not necessarily maintain the statistical properties of real data. Learn more about the difference between synthetic and mockup data here.

Testing Data: Data used for providing an independent evaluation of the trained and validated AI system in order to confirm the expected performance of that system before its placing on the market or putting into service.

Training Data: Data used for training machine learning algorithms (e.g., an artificial intelligence (AI) system) through fitting its learnable parameters. 

Validation Data: Data used for providing an evaluation of the trained AI system and for tuning its non-learnable parameters and its learning process.
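The three roles defined above (testing, training and validation data) are often obtained by partitioning one dataset. The sketch below assumes a simple random split with illustrative proportions of 70/15/15. The function name, the proportions and the fixed seed are assumptions, not part of the glossary.

```python
import random

def split_dataset(rows, train=0.7, validation=0.15, seed=0):
    """Shuffle rows and cut them into training, validation and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # fits the learnable parameters
        shuffled[n_train:n_train + n_val],  # tunes non-learnable parameters
        shuffled[n_train + n_val:],         # held back for independent evaluation
    )

training, validation, testing = split_dataset(range(100))
```

Each row lands in exactly one of the three sets, so the testing data stay independent of the data used to fit and tune the system.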

Use case: A software and system engineering term that describes how a user interacts within a system to accomplish a particular goal. A use case acts as a software modeling technique that defines the features to be implemented and the resolution of any errors that may be encountered.


* = Definitions not part of the HealthyCloud Glossary