
OntoDM-core - Ontology of Core Data Mining Entities


In data mining, the data used for analysis are organized in the form of a dataset. Every dataset consists of data examples. The task of data mining is to produce some type of generalization from a given dataset, where generalization is a broad term denoting the output of a data mining algorithm. A data mining algorithm is an algorithm, implemented as a computer program, that is designed to solve a data mining task: when executed, it takes a dataset as input and produces a generalization as output.

In this context, the OntoDM-core sub-ontology formalizes the key data mining entities needed for the representation of mining structured data in the context of a general framework for data mining (Dzeroski, 2006).


OntoDM-core is expressed in OWL-DL, a de facto standard for representing ontologies. The ontology is being developed using the Protégé ontology editor. It is freely available at this page and at BioPortal.

In order to ensure the extensibility of OntoDM-core and its interoperability with other resources, in particular with biomedical applications, the ontology follows the Open Biomedical Ontologies (OBO) Foundry design principles, such as:

  • the use of an upper-level ontology,
  • the use of formal ontology relations,
  • single inheritance, and
  • the re-use of already existing ontological resources where possible.

The application of these design principles enables cross-domain reasoning, facilitates wide re-usability of the developed ontology, and avoids duplication of ontology development efforts. Consequently, OntoDM-core imports the upper-level classes from BFO version 1.1, as well as formal relations from the OBO Relational Ontology (RO) and an extended set of RO relations.

Following best practices in ontology development, the OntoDM-core ontology reuses appropriate classes from a set of ontologies that act as mid-level ontologies for OntoDM-core, such as the Information Artifact Ontology (IAO) and the Ontology for Biomedical Investigations (OBI).

For representing the mining of structured data, we import the OntoDT ontology of datatypes. Classes that are referenced and reused in OntoDM-core are imported into the ontology by using the Minimum Information to Reference an External Ontology Term (MIREOT) principle and extracted using the OntoFox web service.

Ontology Structure

For the domain of DM, we propose a horizontal description structure that includes three layers:

  • a specification layer,
  • an implementation layer, and
  • an application layer.

Having all three layers represented separately in the ontology will facilitate different uses of the ontology. For example, the specification layer can be used to reason about data mining algorithms; the implementation layer can be used for search over implementations of data mining algorithms and to compare various implementations; and the application layer can be used for searching through executions of data mining algorithms.

This description structure is based on the use of the upper-level ontology BFO and the extensive reuse of classes from the mid-level ontologies OBI and IAO. The proposed three-layer description structure is orthogonal to the vertical ontology architecture, which comprises:

  • an upper level,
  • a mid-level, and
  • a domain level.

This means that each vertical level contains all three description layers.


The specification layer contains BFO: generically dependent continuants at the upper-level, and IAO: information content entities at the mid-level. In the domain of data mining, example classes are data mining task and data mining algorithm.

The implementation layer describes BFO: specifically dependent continuants, such as BFO: realizable entities (entities that are executable in a process). At the domain level, this layer contains classes that describe the implementations of algorithms.

The application layer contains classes that aim at representing processes, e.g., extensions of BFO: processual entity. Examples of (planned) process entities in the domain of data mining are the execution of a data mining algorithm and the application of a generalization on new data, among others.

Relations between layers

The entities in each layer are connected using general relations, which are layer-independent, and layer-specific relations. Examples of general relations are is-a and part-of: they can only be used to relate entities from the same description layer. For example, an information entity (a member of the specification layer) cannot have processual entities (members of the application layer) as parts. Layer-specific relations can be used only with entities from a specific layer. For example, the relation precedes is only used to relate two processual entities. The description layers themselves are connected using cross-layer relations: an entity from the specification layer is-concretized-as an entity from the implementation layer; an implementation entity is-realized-by an application entity; and, finally, an application entity (e.g., a planned process) achieves-planned-objective an objective, which is a specification entity.
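The chain of cross-layer relations can be sketched with plain Python classes. This is a minimal illustration, not part of OntoDM-core itself; all class and attribute names below are hypothetical stand-ins for the ontology's relations.

```python
# Hypothetical sketch of the three description layers and their
# cross-layer relations (is-concretized-as, is-realized-by,
# achieves-planned-objective), modeled as plain Python classes.

class SpecificationEntity:
    """Specification layer, e.g., a data mining algorithm or task."""
    def __init__(self, name):
        self.name = name

class ImplementationEntity:
    """Implementation layer: a concretization of a specification entity."""
    def __init__(self, name, concretizes):
        self.name = name
        # cross-layer relation: the specification is-concretized-as this entity
        self.is_concretization_of = concretizes

class ApplicationEntity:
    """Application layer: a process that realizes an implementation."""
    def __init__(self, name, realizes, achieves):
        self.name = name
        # cross-layer relation: the implementation is-realized-by this process
        self.realizes = realizes
        # cross-layer relation back to the specification layer
        self.achieves_planned_objective = achieves

# Example chain across the three layers:
spec = SpecificationEntity("decision tree algorithm specification")
impl = ImplementationEntity("a decision tree implementation", concretizes=spec)
run = ApplicationEntity("execution on a dataset", realizes=impl, achieves=spec)
```

Following the cross-layer relations from an execution leads back through its implementation to the original specification, mirroring how the three layers are linked in the ontology.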

Key OntoDM-core classes

The ontology includes the representation of the following entities: data specification and dataset, data mining task, generalization, data mining algorithm, constraints and constraint-based data mining tasks and algorithms, and data mining scenario.


The main ingredient in the process of data mining is the data. In OntoDM-core, we model the data with a data specification entity that describes the datatype of the underlying data. For this purpose, we import the mechanism for representing arbitrarily complex datatypes from the OntoDT ontology.

Descriptive and output data specification

In OntoDM-core, we distinguish between a descriptive data specification, which specifies the data used for descriptive purposes (e.g., in clustering and pattern discovery), and an output data specification, which specifies the data used for output purposes (e.g., classes/targets in predictive modeling). A tuple of primitives or a graph with boolean edges and discrete nodes are examples of data specified only by a descriptive specification. Feature-based data with primitive output and feature-based data with structured output are examples of data specified by both descriptive and output specifications.
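This distinction can be sketched as a simple data structure. The sketch below is hypothetical (it is not the OWL encoding used by OntoDM-core), and the datatype strings are informal labels rather than OntoDT identifiers.

```python
# Hypothetical sketch: a data specification pairs a descriptive datatype
# with an optional output datatype (absent for purely descriptive data).
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataSpecification:
    descriptive_datatype: str               # e.g., "tuple of primitives", "graph"
    output_datatype: Optional[str] = None   # only present for predictive data

# Data for clustering/pattern discovery: descriptive specification only
clustering_data = DataSpecification(descriptive_datatype="tuple of primitives")

# Data for predictive modeling: both descriptive and output specifications
predictive_data = DataSpecification(descriptive_datatype="tuple of primitives",
                                    output_datatype="discrete")
```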


OntoDM-core imports the IAO class dataset (defined as "a data item that is an aggregate of other data items of the same type that have something in common") and extends it by further specifying that a DM dataset has data examples as parts.

OntoDM-core also defines the class dataset specification to enable reasoning about data and datasets. It specifies the type of the dataset based on the type of data it contains. Using data specifications and the taxonomy of datatypes from the OntoDT ontology, in OntoDM-core we build a taxonomy of datasets.

Data mining task

The task of data mining is to produce a generalization from given data. In OntoDM-core, we use the term generalization to denote the outcome of a data mining task. A data mining task is defined as a sub-class of the IAO class objective specification: it specifies the objective that a data mining algorithm needs to achieve when executed on a dataset to produce a generalization as output.

Taxonomy of data mining tasks

The definition of a data mining task depends directly on the data specification, and indirectly on the datatype of the data at hand. This allows us to form a taxonomy of data mining tasks based on the type of data. Dzeroski (2006) proposes four basic classes of data mining tasks, based on the generalizations that are produced as output: clustering, pattern discovery, probability distribution estimation, and predictive modeling. These classes of tasks are included as the first level of the OntoDM-core data mining task taxonomy. They are fundamental and can be defined on an arbitrary type of data; an exception is the predictive modeling task, which is defined on a pair of datatypes (for the descriptive and the output data separately). At the next levels, the taxonomy of data mining tasks depends on the datatype of the descriptive data (in the case of predictive modeling, also on the datatype of the output data).

Taxonomy of predictive modeling tasks

If we focus only on the predictive modeling task and use the output data specification as a criterion, we distinguish between the primitive output prediction task and the structured output prediction task. In the first case, the output datatype is primitive (e.g., discrete, boolean or real); in the second case, it is some structured datatype (such as a tuple, set, sequence or graph).

Primitive output prediction tasks

Primitive output prediction tasks can be feature-based or structure-based, depending on the datatype of the descriptive part. The feature-based primitive output prediction tasks have a tuple of primitives (a set of primitive features) on the description side and a primitive datatype on the output side. This is the most exploited data mining task in traditional single-table data mining, described in all major data mining textbooks. If we specify the output datatype in more detail, we have the binary classification task, the multi-class classification task and the regression task; where the output datatype is boolean, discrete or real, respectively. Structure-based primitive output prediction tasks operate on data that have some structured datatype (other than tuple of primitives) on the description side and a primitive datatype on the output side.

Structured output prediction tasks

In a similar way, structured output prediction tasks can be feature-based or structure-based. Feature-based structured output prediction tasks operate on data that have a tuple of primitives on the description side and a structured datatype on the output side. Structure-based structured output prediction tasks operate on data that have structured datatypes both on the description side and the output side.

If we focus just on feature-based structured output tasks and further specify a structured output datatype, we can represent a variety of structured output prediction tasks. For example, we can represent the following tasks: multi-target prediction (which has as output datatype tuple of primitives), multi-label classification (having as output datatype set of discrete), time-series prediction (having as output datatype sequence of real) and hierarchical classification (having as output datatype labeled graph with boolean edges and discrete nodes). Multi-target prediction can be further divided into: multi-target binary classification, multi-target multi-class classification, and multi-target regression.
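The mapping from output datatype to feature-based predictive modeling task described above can be sketched as a simple lookup. This is an informal illustration of the taxonomy, assuming string labels for datatypes; the names are not OntoDT or OntoDM-core identifiers.

```python
# Informal sketch of the feature-based predictive modeling task taxonomy:
# the task is determined by the output datatype.

PRIMITIVE_OUTPUT_TASKS = {
    "boolean": "binary classification",
    "discrete": "multi-class classification",
    "real": "regression",
}

STRUCTURED_OUTPUT_TASKS = {
    "tuple of primitives": "multi-target prediction",
    "set of discrete": "multi-label classification",
    "sequence of real": "time-series prediction",
    "labeled graph with boolean edges and discrete nodes": "hierarchical classification",
}

def feature_based_task(output_datatype):
    """Return the predictive modeling task for a given output datatype."""
    if output_datatype in PRIMITIVE_OUTPUT_TASKS:
        return PRIMITIVE_OUTPUT_TASKS[output_datatype]
    if output_datatype in STRUCTURED_OUTPUT_TASKS:
        return STRUCTURED_OUTPUT_TASKS[output_datatype]
    return "structured output prediction"  # generic fallback for other structured types
```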


We take generalization to denote the outcome of a data mining task. In OntoDM-core, we consider and model three different aspects of generalizations, each aligned with a different description layer:

  • the specification of a generalization,
  • a generalization as a realizable entity, and
  • the process of executing a generalization.

Many different types of generalizations have been considered in the data mining literature. The most fundamental types of generalizations, as proposed by Dzeroski (2006), are in line with the data mining tasks. These include clusterings, patterns, probability distributions, and predictive models.

Generalization specification

In OntoDM-core, the generalization specification class is a sub-class of the OBI class data representational model. It specifies the type of the generalization and includes as parts the data specification (for the data used to produce the generalization) and the generalization language (the language in which the generalization is expressed). Examples of generalization language formalisms for the case of a predictive model include the languages of trees, rules, Bayesian networks, graphical models, neural networks, etc.

As in the case of datasets and data mining tasks, we can construct a taxonomy of generalizations. In OntoDM-core, at the first level, we distinguish between a single generalization specification and an ensemble specification. Ensembles of generalizations have as parts single generalizations. We can further extend this taxonomy by taking into account the data mining task and the generalization language.
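This first taxonomy level, a single generalization specification versus an ensemble specification that has single specifications as parts, can be sketched as follows; the class and field names are hypothetical.

```python
# Hypothetical sketch: an ensemble specification has single generalization
# specifications as parts.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SingleGeneralizationSpec:
    """Specifies one generalization via its task and language."""
    task: str        # e.g., "predictive modeling"
    language: str    # e.g., "trees"

@dataclass
class EnsembleSpec:
    """An ensemble specification with single specifications as parts."""
    parts: List[SingleGeneralizationSpec] = field(default_factory=list)

# Example: a specification for an ensemble of ten tree-based predictive models
forest_spec = EnsembleSpec(parts=[SingleGeneralizationSpec("predictive modeling", "trees")
                                  for _ in range(10)])
```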

Dual nature of generalizations

Generalizations have a dual nature. On the one hand, they can be treated as data structures and, as such, represented, stored and manipulated. On the other hand, they act as functions and are executed, taking as input data examples and giving as output the result of applying the function to a data example. In OntoDM-core, we define a generalization as a sub-class of the BFO class realizable entity; it is an output of a data mining algorithm execution.

The dual nature of generalizations in OntoDM-core is represented with two classes that belong to two different description layers: generalization representation, which is a sub-class of information content entity and belongs to the specification layer, and generalization execution, which is a subclass of planned process and belongs to the application layer.

A generalization representation is a sub-class of the IAO class information content entity. It represents a formalized description of the generalization, for instance in the form of a formula or text. For example, the output of a decision tree algorithm execution in any data mining software usually includes a textual representation of the generated decision tree. A generalization execution is a sub-class of the OBI class planned process that has as input a dataset and has as output another dataset. The output dataset is a result of applying the generalization to the examples from the input dataset.
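The dual nature can be illustrated with a toy predictive model. This is a hypothetical sketch, not OntoDM-core terminology: the same object yields a textual generalization representation (the data-structure view) and, when called, performs a generalization execution, turning an input dataset into an output dataset.

```python
# Toy predictive model illustrating the dual nature of a generalization.
class DecisionStump:
    """A one-level decision tree over a single feature."""
    def __init__(self, feature, threshold, low_label, high_label):
        self.feature = feature
        self.threshold = threshold
        self.low_label = low_label
        self.high_label = high_label

    def representation(self):
        """The generalization as a data structure: a textual representation."""
        return (f"if {self.feature} <= {self.threshold} "
                f"then {self.low_label} else {self.high_label}")

    def __call__(self, example):
        """The generalization as a function: executed on one data example."""
        if example[self.feature] <= self.threshold:
            return self.low_label
        return self.high_label

stump = DecisionStump("temperature", 20.0, "cold", "warm")
input_dataset = [{"temperature": 12.5}, {"temperature": 27.0}]
output_dataset = [stump(example) for example in input_dataset]  # ["cold", "warm"]
```

Applying the stump to every example in the input dataset produces the output dataset, matching the definition of a generalization execution as a planned process with a dataset as input and another dataset as output.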

Versions and Download

Release version 1


Panov, P., Soldatova, L., Džeroski, S.: Ontology of core data mining entities. Data Mining and Knowledge Discovery 28(5-6):1222–1265, 2014. DOI: 10.1007/s10618-014-0363-0
