OntoDM-core - Ontology of Core Data Mining Entities

Background

In data mining, the data used for analysis are organized in the form of a dataset. Every dataset consists of data examples. The task of data mining is to produce some type of generalization from a given dataset. Generalization is a broad term that denotes the output of a data mining algorithm. A data mining algorithm is an algorithm that is implemented as a computer program and is designed to solve a data mining task. When executed, such a program takes a dataset as input and gives a generalization as output.
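The flow described above (a dataset goes in, a generalization comes out) can be sketched in a few lines of Python. This is only an illustration: the type aliases and the function `majority_class_algorithm` are hypothetical names chosen here, not OntoDM-core identifiers, and the algorithm is deliberately trivial.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration; these are not OntoDM-core class names.
Example = Tuple[float, str]               # one data example: (feature, label)
Dataset = Sequence[Example]               # a dataset consists of data examples
Generalization = Callable[[float], str]   # here the generalization is a predictive model


def majority_class_algorithm(dataset: Dataset) -> Generalization:
    """Toy data mining algorithm: always predict the most frequent label."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    majority_label, _count = Counter(label for _, label in dataset).most_common(1)[0]

    def model(_feature: float) -> str:
        return majority_label

    return model


dataset = [(1.0, "spam"), (2.5, "spam"), (0.3, "ham")]
model = majority_class_algorithm(dataset)   # executing the algorithm on a dataset
print(model(7.0))                           # applying the generalization to new data
```

The point of the sketch is the separation of roles: the dataset, the algorithm, and the generalization are distinct objects, which is the distinction the ontology formalizes.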

In this context, the OntoDM-core sub-ontology formalizes the key data mining entities needed to represent the mining of structured data within a general framework for data mining (Džeroski, 2006).

Design

OntoDM-core is expressed in OWL-DL, a de facto standard for representing ontologies. The ontology is being developed using the Protégé ontology editor. It is freely available at this page and at BioPortal.

To ensure the extensibility and interoperability of OntoDM-core with other resources, in particular with biomedical applications, the OntoDM-core ontology follows the Open Bio-Ontologies (OBO) Foundry design principles, such as:

  • the use of an upper-level ontology,
  • the use of formal ontology relations,
  • single inheritance, and
  • the re-use of already existing ontological resources where possible.

The application of these design principles enables cross-domain reasoning, facilitates wide re-usability of the developed ontology, and avoids duplication of ontology development efforts. Consequently, OntoDM-core imports the upper-level classes from the Basic Formal Ontology (BFO) version 1.1, and formal relations from the OBO Relation Ontology (RO) together with an extended set of RO relations.

Following best practices in ontology development, the OntoDM-core ontology reuses appropriate classes from a set of ontologies that act as mid-level ontologies for OntoDM-core. These include OBI and IAO.

For representing the mining of structured data, we import the OntoDT ontology of datatypes. Classes that are referenced and reused in OntoDM-core are imported into the ontology by using the Minimum Information to Reference an External Ontology Term (MIREOT) principle and extracted using the OntoFox web service.

Ontology Structure

For the domain of DM, we propose a horizontal description structure that includes three layers:

  • a specification layer,
  • an implementation layer, and
  • an application layer.

Having all three layers represented separately in the ontology will facilitate different uses of the ontology. For example, the specification layer can be used to reason about data mining algorithms; the implementation layer can be used for search over implementations of data mining algorithms and to compare various implementations; and the application layer can be used for searching through executions of data mining algorithms.

This description structure is based on the use of the upper-level ontology BFO and the extensive reuse of classes from the mid-level ontologies OBI and IAO. The proposed three-layer description structure is orthogonal to the vertical ontology architecture, which comprises:

  • an upper level,
  • a mid level, and
  • a domain level.

This means that each vertical level contains all three description layers.

Layers

The specification layer contains BFO: generically dependent continuants at the upper-level, and IAO: information content entities at the mid-level. In the domain of data mining, example classes are data mining task and data mining algorithm.

The implementation layer describes BFO: specifically dependent continuants, such as BFO: realizable entities (entities that are executable in a process). At the domain level, this layer contains classes that describe the implementations of algorithms.

The application layer contains classes that represent processes, i.e., extensions of BFO: processual entity. Examples of (planned) process entities in the domain of data mining are the execution of a data mining algorithm and the application of a generalization to new data, among others.

Relations between layers

The entities in each layer are connected using general relations, which are layer independent, and layer-specific relations. Examples of general relations are is-a and part-of: they are available in every layer, but can only relate entities from the same description layer. For example, an information entity (a member of the specification layer) cannot have as parts processual entities (members of the application layer). Layer-specific relations can be used only with entities from a specific layer. For example, the relation precedes is only used to relate two processual entities. The description layers are connected using cross-layer relations. An entity from the specification layer is-concretized-as an entity from the implementation layer. Next, an implementation entity is-realized-by an application entity. Finally, an application entity, e.g., a planned process, achieves-planned-objective an objective specification, which is a specification entity.
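The relation rules in the paragraph above can be made concrete with a small checker. This is a minimal sketch, not part of the ontology: OntoDM-core itself is expressed in OWL-DL, and the names `Layer`, `Entity`, and `relation_allowed` are hypothetical and used only to illustrate which pairs of layers each kind of relation may connect.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    SPECIFICATION = "specification"
    IMPLEMENTATION = "implementation"
    APPLICATION = "application"


@dataclass(frozen=True)
class Entity:
    name: str
    layer: Layer


# General relations: usable in any layer, but both ends must share the layer.
GENERAL = {"is-a", "part-of"}
# Layer-specific relations: relation name -> the only layer it applies to.
LAYER_SPECIFIC = {"precedes": Layer.APPLICATION}
# Cross-layer relations: relation name -> (layer of subject, layer of object).
CROSS_LAYER = {
    "is-concretized-as": (Layer.SPECIFICATION, Layer.IMPLEMENTATION),
    "is-realized-by": (Layer.IMPLEMENTATION, Layer.APPLICATION),
    "achieves-planned-objective": (Layer.APPLICATION, Layer.SPECIFICATION),
}


def relation_allowed(subject: Entity, relation: str, obj: Entity) -> bool:
    """Return True if `relation` may link `subject` to `obj` under the layer rules."""
    if relation in GENERAL:
        return subject.layer == obj.layer
    if relation in LAYER_SPECIFIC:
        required = LAYER_SPECIFIC[relation]
        return subject.layer == required and obj.layer == required
    if relation in CROSS_LAYER:
        return (subject.layer, obj.layer) == CROSS_LAYER[relation]
    raise ValueError(f"unknown relation: {relation}")


algorithm = Entity("data mining algorithm", Layer.SPECIFICATION)
implementation = Entity("algorithm implementation", Layer.IMPLEMENTATION)
execution = Entity("algorithm execution", Layer.APPLICATION)
task = Entity("data mining task", Layer.SPECIFICATION)

print(relation_allowed(algorithm, "is-concretized-as", implementation))
print(relation_allowed(execution, "part-of", algorithm))
```

The first call follows the specification-to-implementation direction and is accepted; the second tries to make a process part of an information entity and is rejected, mirroring the example in the text.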

Key OntoDM-core classes

Constraints and constraint-based data mining tasks and algorithms

Constraints play a central role in data mining, and constraint-based data mining (CBDM) is growing in importance. A general statement of the problem involves the specification of a language of generalizations and a set of constraints that a generalization needs to satisfy. In CBDM, constraints are propositions or statements about generalizations. They can be classified along three dimensions:

  1. primitive and composite constraints;
  2. language and evaluation constraints; and
  3. hard (Boolean) constraints, soft constraints and optimization constraints.

Taxonomy of constraints

A constraint specification is defined in OntoDM-core as a sub-class of OBI data representational model and is the top-level class of the taxonomy of constraints that we propose. At the first level of the taxonomy, we have primitive and complex constraints. Primitive constraints are based on atomic propositions, and complex constraints on non-atomic propositions. Complex constraints have as parts primitive constraints and a combination function specification that defines how the primitive constraints are combined to form a complex constraint.
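The part structure of a complex constraint (primitive constraints plus a combination function) can be illustrated as follows. This is a sketch under stated assumptions: itemsets stand in for generalizations, and the classes `PrimitiveConstraint` and `ComplexConstraint` are hypothetical Python stand-ins, not the ontology's OWL classes.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Tuple

Itemset = FrozenSet[str]  # stand-in for a generalization (an assumption of this sketch)


@dataclass(frozen=True)
class PrimitiveConstraint:
    """A constraint stated as a single proposition about a generalization."""
    description: str
    predicate: Callable[[Itemset], bool]

    def holds(self, generalization: Itemset) -> bool:
        return self.predicate(generalization)


@dataclass(frozen=True)
class ComplexConstraint:
    """Primitive constraints plus a combination function over their outcomes."""
    parts: Tuple[PrimitiveConstraint, ...]
    combine: Callable[[Iterable[bool]], bool]  # e.g. all (conjunction) or any (disjunction)

    def holds(self, generalization: Itemset) -> bool:
        return self.combine(part.holds(generalization) for part in self.parts)


contains_bread = PrimitiveConstraint("contains the item 'bread'", lambda s: "bread" in s)
at_most_three = PrimitiveConstraint("has at most three items", lambda s: len(s) <= 3)

both = ComplexConstraint((contains_bread, at_most_three), all)    # conjunction
either = ComplexConstraint((contains_bread, at_most_three), any)  # disjunction

print(both.holds(frozenset({"bread", "milk"})))
print(either.holds(frozenset({"milk"})))
```

Passing the built-in `all` or `any` as the combination function keeps the combination rule a separate, replaceable part of the complex constraint, as in the taxonomy.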

At the second level, focusing on the primitive constraints, we have primitive language constraints and primitive evaluation constraints. Language constraints concern the representation of a generalization and only refer to its form. Commonly used types of language constraints are subsumption constraints (e.g., all itemsets must contain the item 'bread') and language cost constraints (e.g., itemsets should contain at most three items). Evaluation constraints concern the semantics of a generalization when applied to a dataset. They usually include evaluation functions, which measure the validity of a generalization on a given dataset (e.g., classification accuracy).

At the last level, the primitive language cost-function constraint is extended with three sub-classes: primitive hard language cost-function constraint, primitive soft language cost-function constraint, and primitive optimization language cost-function constraint. Hard constraints represent Boolean functions on generalizations: the constraint is either satisfied or not satisfied. Soft constraints do not dismiss a generalization that violates a constraint, but rather penalize it for the violation. Optimization constraints ask for a fixed-size set of generalizations that have extreme values for a given cost or evaluation function. In a similar way, we define the sub-classes of the primitive evaluation constraint class.
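The difference between hard, soft, and optimization constraints can be shown on a single cost function. The sketch below assumes itemsets as generalizations and itemset size as the language cost; the function names and the particular penalty (the amount by which the size limit is exceeded) are illustrative choices, not definitions taken from OntoDM-core.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]  # stand-in for a generalization (an assumption of this sketch)


def size_cost(itemset: Itemset) -> int:
    """Language cost function: the number of items in the itemset."""
    return len(itemset)


def hard_constraint(itemset: Itemset, max_size: int = 3) -> bool:
    """Hard (Boolean) constraint: satisfied or not satisfied."""
    return size_cost(itemset) <= max_size


def soft_constraint(itemset: Itemset, max_size: int = 3) -> float:
    """Soft constraint: a penalty instead of rejection (0.0 means no violation)."""
    return float(max(0, size_cost(itemset) - max_size))


def optimization_constraint(candidates: Sequence[Itemset], k: int = 2) -> List[Itemset]:
    """Optimization constraint: the k candidates with the lowest cost."""
    return sorted(candidates, key=size_cost)[:k]


candidates = [
    frozenset({"bread", "milk", "butter", "jam"}),
    frozenset({"bread"}),
    frozenset({"bread", "milk"}),
]

for itemset in candidates:
    print(sorted(itemset), hard_constraint(itemset), soft_constraint(itemset))
print([sorted(itemset) for itemset in optimization_constraint(candidates)])
```

The hard variant filters, the soft variant scores, and the optimization variant ranks and truncates; all three are built on the same cost function, which is why the taxonomy places them as siblings under one cost-function constraint class.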

Constraint-based data mining task

The task of CBDM is to find a set of generalizations that satisfy a set of constraints, given a dataset that consists of examples of a specific datatype, a data mining task, a generalization specification, and a specification of the set of constraints. In the OntoDM-core ontology, we represent a CBDM task as a sub-class of the objective specification class (reused from IAO). It has as parts a data mining task and a set of constraint specifications. We further define a CBDM algorithm as an algorithm that solves a CBDM task. Finally, this structure allows us to form a taxonomy of CBDM tasks: the first level of the taxonomy contains the basic CBDM task classes, which are aligned with the fundamental data mining tasks, and the classes at the next levels depend on the data specification and the type of constraints.
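As a worked illustration of a CBDM task, the sketch below enumerates itemsets over a small transaction dataset and keeps those that satisfy two language constraints (a required item and a size limit) and one evaluation constraint (minimum support). It is a brute-force toy under these assumptions, not an algorithm described in OntoDM-core; the names `support` and `solve_cbdm_task` and the example data are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def support(itemset: Itemset, dataset: Sequence[Itemset]) -> float:
    """Evaluation function: fraction of transactions that contain the itemset."""
    if not dataset:
        return 0.0
    return sum(itemset <= transaction for transaction in dataset) / len(dataset)


def solve_cbdm_task(
    dataset: Sequence[Itemset],
    required_item: str,
    max_size: int,
    min_support: float,
) -> List[Itemset]:
    """Return all itemsets that satisfy every constraint (exhaustive search)."""
    items = sorted(set().union(*dataset))
    solutions: List[Itemset] = []
    for size in range(1, max_size + 1):            # language cost constraint
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if required_item not in candidate:     # subsumption constraint
                continue
            if support(candidate, dataset) >= min_support:   # evaluation constraint
                solutions.append(candidate)
    return solutions


transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

result = solve_cbdm_task(transactions, required_item="bread", max_size=2, min_support=0.5)
print([sorted(itemset) for itemset in result])
```

Here the data mining task is pattern discovery over itemsets, the dataset is the list of transactions, and the three arguments after it play the role of the constraint specifications that, together with the task, make up the CBDM task.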

Versions and Download

Release version 1

Papers

Panov P., Soldatova L., Džeroski S. Ontology of core data mining entities. Data Mining and Knowledge Discovery 28(5-6):1222-1265, 2014. DOI: 10.1007/s10618-014-0363-0

