====== OntoDM-core: Ontology of Core Data Mining Entities ======
For representing the mining of structured data, we import the [[ontodt|OntoDT ontology of datatypes]]. Classes that are referenced and reused in OntoDM-core are imported into the ontology by using the [[http://obi-ontology.org/page/MIREOT|Minimum Information to Reference an External Ontology Term (MIREOT) principle]] and extracted using the [[http://ontofox.hegroup.org|OntoFox]] web service.
=====Ontology Structure=====
For the domain of DM, we propose a [[layers|horizontal description structure that includes three layers]]:
  * a specification layer,
  * an implementation layer, and
  * an application layer.
Having all three layers represented separately in the ontology will facilitate different uses of the ontology. For example, the specification layer can be used to reason about data mining algorithms; the implementation layer can be used for search over implementations of data mining algorithms and to compare various implementations; and the application layer can be used for searching through executions of data mining algorithms.
  
This description structure is based on the use of the upper-level ontology [[http://www.ifomis.org/bfo/|BFO]] and the extensive reuse of classes from the mid-level ontologies [[http://obi-ontology.org/page/Main_Page|OBI]] and [[https://code.google.com/p/information-artifact-ontology/|IAO]]. The proposed three-layer description structure is orthogonal to the vertical ontology architecture, which comprises:
  * an upper level,
  * a mid-level, and
  * a domain level.
This means that each vertical level contains all three description layers.
==== Layers ====
{{ ::fig1-page1.png?400|}}
The specification layer contains //BFO: generically dependent continuants// at the upper level, and //IAO: information content entities// at the mid-level. In the domain of data mining, example classes are //data mining task// and //data mining algorithm//.

The implementation layer describes //BFO: specifically dependent continuants//, such as //BFO: realizable entities// (entities that are executable in a process). At the domain level, this layer contains classes that describe the implementations of algorithms.

The application layer contains classes that aim at representing processes, e.g., extensions of //BFO: processual entity//. Examples of (planned) process entities in the domain of data mining are the execution of a data mining algorithm and the application of a generalization to new data, among others.
==== Relations between layers ====

The entities in each layer are connected using general relations, which are layer independent, and layer-specific relations. Examples of general relations are //is-a// and //part-of//: they can only be used to relate entities from the same description layer. For example, an information entity (a member of the specification layer) cannot have processual entities (members of the application layer) as parts. Layer-specific relations can be used only with entities from a specific layer. For example, the relation //precedes// is only used to relate two processual entities. The description layers are connected using cross-layer relations. An entity from the specification layer //is-concretized-as// an entity from the implementation layer. Next, an implementation entity //is-realized-by// an application entity. Finally, an application entity (e.g., a planned process) //achieves-planned-objective//, where the objective is a specification entity.
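To make the layered structure concrete, the cross-layer relations above can be sketched as a small Python model. This is purely illustrative and not part of the ontology: the relation and class names follow the text, while the data structures and function are our own.

```python
# Illustrative sketch: entities assigned to the three description layers,
# and the cross-layer relations that connect them (names follow the text).
LAYER = {
    "data mining algorithm": "specification",        # information content entity
    "DM algorithm implementation": "implementation", # realizable entity
    "DM algorithm execution": "application",         # planned process
    "data mining task": "specification",             # objective specification
}

# Each cross-layer relation links a source layer to a target layer.
CROSS_LAYER = {
    "is-concretized-as": ("specification", "implementation"),
    "is-realized-by": ("implementation", "application"),
    "achieves-planned-objective": ("application", "specification"),
}

def relation_is_valid(source, relation, target):
    """Check that a cross-layer relation connects the right layers."""
    src_layer, tgt_layer = CROSS_LAYER[relation]
    return LAYER[source] == src_layer and LAYER[target] == tgt_layer

# A specification is concretized as an implementation, which is realized
# by an application-layer process that achieves the planned objective.
assert relation_is_valid("data mining algorithm", "is-concretized-as",
                         "DM algorithm implementation")
assert relation_is_valid("DM algorithm execution", "achieves-planned-objective",
                         "data mining task")
```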
=====Key OntoDM-core classes=====
The ontology includes the representation of the following entities:
{{ :ontodm-coreentities.png?600|}}
  * [[data|data]],
  * [[data mining task|data mining task]],
  * [[generalization|generalization]],
  * [[data mining algorithm|data mining algorithm]],
  * [[constraints|constraints and constraint-based data mining tasks and algorithms]], and
  * [[data mining scenario|data mining scenario]].

==== Data ====
The main ingredient in the process of data mining is the data. In OntoDM-core, we model the data with a //data specification// entity that describes the datatype of the underlying data. For this purpose, we import the mechanism for representing arbitrarily complex datatypes from the [[ontodt|OntoDT ontology]].
=== Descriptive and output data specification ===

In OntoDM-core, we distinguish between a //descriptive data specification//, which specifies the data used for descriptive purposes (e.g., in clustering and pattern discovery), and an //output data specification//, which specifies the data used for output purposes (e.g., classes/targets in predictive modeling). A tuple of primitives or a graph with boolean edges and discrete nodes are examples of data specified only by a descriptive specification. Feature-based data with primitive output and feature-based data with structured output are examples of data specified by both descriptive and output specifications.
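The descriptive/output distinction can be sketched in a few lines of Python. This is an illustrative model under assumed names, not the OWL classes themselves: a data specification pairs a descriptive datatype with an optional output datatype.

```python
# Illustrative sketch (assumed names, not OntoDM-core OWL classes):
# a data specification pairs a descriptive datatype with an optional
# output datatype, mirroring the distinction described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataSpecification:
    descriptive: str              # datatype of the descriptive part
    output: Optional[str] = None  # datatype of the output part, if any

# Data used only for descriptive purposes (e.g., pattern discovery):
itemsets = DataSpecification(descriptive="tuple of booleans")

# Feature-based data with a primitive output (e.g., classification):
labeled = DataSpecification(descriptive="tuple of primitives",
                            output="discrete")

assert itemsets.output is None and labeled.output == "discrete"
```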

=== Dataset ===

OntoDM-core imports the IAO class //dataset// (defined as "a data item that is an aggregate of other data items of the same type that have something in common") and extends it by further specifying that a //DM dataset// has //data examples// as parts.

OntoDM-core also defines the class //dataset specification// to enable reasoning about data and datasets. It specifies the type of a dataset based on the type of data it contains. Using data specifications and the taxonomy of datatypes from the [[ontodt|OntoDT ontology]], in OntoDM-core we build a taxonomy of datasets.
==== Data mining task ====

The task of data mining is to produce a generalization from given data. In OntoDM-core, we use the term generalization to denote the outcome of a data mining task. A //data mining task// is defined as a sub-class of the IAO class //objective specification//. It is an objective specification that specifies the objective a data mining algorithm needs to achieve when executed on a dataset to produce a generalization as output.

=== Taxonomy of data mining tasks ===

The definition of a data mining task depends directly on the data specification, and indirectly on the datatype of the data at hand. This allows us to form a taxonomy of data mining tasks based on the type of data. Dzeroski (2006) proposes four basic classes of data mining tasks, based on the generalizations that are produced as output: //clustering//, //pattern discovery//, //probability distribution estimation//, and //predictive modeling//. These classes of tasks are included as the first level of the OntoDM-core data mining task taxonomy. They are fundamental and can be defined on an arbitrary type of data. An exception is the predictive modeling task, which is defined on a pair of datatypes (for the descriptive and output data separately). At the next levels, the taxonomy of data mining tasks depends on the datatype of the descriptive data (and, in the case of predictive modeling, also on the datatype of the output data).

=== Taxonomy of predictive modeling tasks ===

If we focus only on the predictive modeling task and use the output data specification as a criterion, we distinguish between the //primitive output prediction task// and the //structured output prediction task//. In the first case, the output datatype is primitive (e.g., discrete, boolean or real); in the second case, it is some structured datatype (such as a tuple, set, sequence or graph).

{{ ::fig3-page1.png?600 |}}

== Primitive output prediction tasks ==
//Primitive output prediction tasks// can be feature-based or structure-based, depending on the datatype of the descriptive part. The //feature-based primitive output prediction tasks// have a tuple of primitives (a set of primitive features) on the description side and a primitive datatype on the output side. This is the most exploited data mining task in traditional single-table data mining, described in all major data mining textbooks. If we specify the output datatype in more detail, we have the //binary classification task//, the //multi-class classification task// and the //regression task//, where the output datatype is boolean, discrete or real, respectively. //Structure-based primitive output prediction tasks// operate on data that have some structured datatype (other than a tuple of primitives) on the description side and a primitive datatype on the output side.

== Structured output prediction tasks ==
In a similar way, //structured output prediction tasks// can be feature-based or structure-based. //Feature-based structured output prediction tasks// operate on data that have a tuple of primitives on the description side and a structured datatype on the output side. //Structure-based structured output prediction tasks// operate on data that have structured datatypes on both the description side and the output side.

If we focus just on feature-based structured output tasks and further specify a structured output datatype, we can represent a variety of structured output prediction tasks. For example, we can represent the following tasks: //multi-target prediction// (which has as output datatype //tuple of primitives//), //multi-label classification// (having as output datatype //set of discrete//), //time-series prediction// (having as output datatype //sequence of real//) and //hierarchical classification// (having as output datatype //labeled graph with boolean edges and discrete nodes//). //Multi-target prediction// can be further divided into //multi-target binary classification//, //multi-target multi-class classification//, and //multi-target regression//.
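The datatype-driven branching of this taxonomy can be sketched as a simple lookup. This is an illustration of the classification logic described above, not OntoDM-core code; the dictionary entries follow the text.

```python
# Illustrative sketch: identifying a predictive modeling task from the
# output datatype, following the taxonomy described above.
PRIMITIVE = {"boolean": "binary classification",
             "discrete": "multi-class classification",
             "real": "regression"}

STRUCTURED = {"tuple of primitives": "multi-target prediction",
              "set of discrete": "multi-label classification",
              "sequence of real": "time-series prediction"}

def prediction_task(output_datatype):
    if output_datatype in PRIMITIVE:
        return PRIMITIVE[output_datatype]    # primitive output prediction
    if output_datatype in STRUCTURED:
        return STRUCTURED[output_datatype]   # structured output prediction
    return "structured output prediction (other)"

assert prediction_task("real") == "regression"
assert prediction_task("set of discrete") == "multi-label classification"
```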

==== Generalization ====
We take generalization to denote the outcome of a data mining task. In OntoDM-core, we consider and model three different aspects of generalizations, each aligned with a different description layer:
  * the specification of a generalization,
  * a generalization as a realizable entity, and
  * the process of executing a generalization.

Many different types of generalizations have been considered in the data mining literature. The most fundamental types of generalizations, as proposed by Dzeroski (2006), are in line with the data mining tasks. These include clusterings, patterns, probability distributions, and predictive models.

=== Generalization specification ===

In OntoDM-core, the //generalization specification// class is a sub-class of the OBI class //data representational model//. It specifies the type of the generalization and includes as parts the //data specification//, for the data used to produce the generalization, and the //generalization language//, for the language in which the generalization is expressed. Examples of generalization language formalisms for the case of a //predictive model// include the languages of trees, rules, Bayesian networks, graphical models, neural networks, etc.

As in the case of datasets and data mining tasks, we can construct a taxonomy of generalizations. In OntoDM-core, at the first level, we distinguish between a //single generalization specification// and an //ensemble specification//. Ensembles of generalizations have single generalizations as parts. We can further extend this taxonomy by taking into account the data mining task and the generalization language.

=== Dual nature of generalizations ===

Generalizations have a dual nature. On the one hand, they can be treated as data structures and as such represented, stored and manipulated. On the other hand, they act as functions and are executed, taking data examples as input and giving as output the result of applying the function to a data example. In OntoDM-core, we define a generalization as a sub-class of the BFO class //realizable entity//. It is an output of a //data mining algorithm execution//.

The dual nature of generalizations in OntoDM-core is represented with two classes that belong to two different description layers: //generalization representation//, which is a sub-class of //information content entity// and belongs to the specification layer, and //generalization execution//, which is a sub-class of //planned process// and belongs to the application layer.

A //generalization representation// is a sub-class of the IAO class //information content entity//. It represents a formalized description of the generalization, for instance in the form of a formula or text. For example, the output of a decision tree algorithm execution in any data mining software usually includes a textual representation of the generated decision tree. A //generalization execution// is a sub-class of the OBI class //planned process// that has as input a //dataset// and has as output another //dataset//. The output dataset is the result of applying the //generalization// to the examples from the input dataset.
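The dual nature of a generalization, a representation that can be stored and a function that can be executed, can be sketched with a toy model. The decision stump below is purely illustrative; only the representation/execution distinction comes from the text.

```python
# Illustrative sketch: a generalization has a dual nature. It carries a
# representation (an information entity, specification layer) and it can
# be executed on data (a planned process, application layer).
class DecisionStump:
    """A minimal predictive model used purely for illustration."""

    def __init__(self, feature, threshold):
        self.feature = feature
        self.threshold = threshold

    def representation(self):
        # The generalization as a data structure: a textual form that
        # can be stored, displayed and manipulated.
        return f"IF {self.feature} > {self.threshold} THEN positive ELSE negative"

    def execute(self, dataset):
        # The generalization as a function: applying it to an input
        # dataset yields an output dataset (here, a list of predictions).
        return ["positive" if ex[self.feature] > self.threshold else "negative"
                for ex in dataset]

stump = DecisionStump("age", 30)
predictions = stump.execute([{"age": 45}, {"age": 20}])
assert predictions == ["positive", "negative"]
```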

==== Data mining algorithm ====
A //data mining algorithm// is an algorithm (implemented in a computer program) designed to solve a data mining task. It takes as input a dataset of examples of a given datatype and produces as output a generalization (from a given class) on the given datatype. A specific data mining algorithm can typically handle examples of a limited set of datatypes: for example, a rule learning algorithm might handle only tuples of boolean attributes and a boolean class.

In the OntoDM-core ontological framework, we consider three aspects of the DM algorithm entity:
  * a DM algorithm (as a specification),
  * a DM algorithm implementation, and
  * a DM algorithm execution.

=== Data mining algorithm as a specification ===

A //data mining algorithm// as a specification is a sub-class of the IAO class //plan specification//, having as parts a //data mining task//, an //action specification// (reused from IAO), a //generalization specification//, and a //document// (reused from IAO). The //data mining task// defines the objective that the realized plan should fulfill at the end, giving a generalization as output, while the //action specification// describes the actions of the data mining algorithm realized in the process of execution. The //generalization specification// denotes the type of generalization produced by executing the algorithm. Finally, having a //document// class as a part allows us to connect the algorithm to annotations of documents (journal articles, workshop articles, technical reports) that publish knowledge about the algorithm.

In analogy with the taxonomies of datasets, data mining tasks and generalizations, in OntoDM-core we also construct a taxonomy of data mining algorithms. As criteria, we use the data mining task and the generalization produced as the output of the execution of the algorithm.

=== Data mining algorithm implementation ===

A data mining algorithm implementation is defined as a sub-class of the BFO class //realizable entity//. It is a concretization of a //data mining algorithm//, in the form of a runnable computer program, and has //parameters// as qualities. The parameters of the algorithm affect its behavior when the algorithm implementation is used as an operator. A parameter itself is specified by a //parameter specification// that includes its name and description.
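The relationship between an implementation and its parameter specifications can be sketched as follows. All names here are assumed for illustration; only the concretization and parameter-specification structure comes from the text.

```python
# Illustrative sketch (assumed names): an algorithm implementation is a
# concretization of an algorithm specification and carries parameters,
# each described by a parameter specification (name and description).
from dataclasses import dataclass, field

@dataclass
class ParameterSpecification:
    name: str
    description: str
    default: object = None

@dataclass
class AlgorithmImplementation:
    concretizes: str              # name of the algorithm specification
    parameters: list = field(default_factory=list)

impl = AlgorithmImplementation(
    concretizes="decision tree induction algorithm",
    parameters=[ParameterSpecification(
        "max_depth", "maximum depth of the induced tree", 5)])

assert impl.parameters[0].name == "max_depth"
```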

=== Data mining software ===

In OntoDM-core, we define //data mining software// as a sub-class of //directive information entity// (reused from IAO). It represents a specification of a //data mining algorithm implementation//. It has as parts all the meta-information entities about the software implementation, such as //source code//, //software version specification//, //programming language//, //software compiler specification//, //software manufacturer//, the //data mining software toolkit// it belongs to, etc. Finally, a //data mining software toolkit// is a specification entity that contains //data mining software// entities as parts.

=== Data mining operator ===

A //data mining operator// is defined as a sub-class of the BFO class //role//. In that context, it is a role of a //data mining algorithm implementation// that is realized (executed) by a //data mining algorithm execution// process.
A //data mining operator// has information about the specific //parameter setting// of the algorithm, in the context of the realization of the operator in the process of execution. The //parameter setting// is a sub-class of //data item// (reused from IAO), which is a quality specification of a //parameter//.
  
=== Data mining algorithm execution ===

In OntoDM-core, we define //data mining algorithm execution// as a sub-class of //planned process// (reused from the OBI ontology). A //data mining algorithm execution// realizes (executes) a //data mining operator//, has as input a //dataset//, has as output a //generalization//, has as agent a //computer//, and achieves as a planned objective a //data mining task//.
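A toy end-to-end example ties these pieces together: executing an algorithm on a dataset yields a generalization that can itself be executed. The majority-class learner below is illustrative only, not an OntoDM-core entity.

```python
# Illustrative sketch: a toy "algorithm execution". A majority-class
# learner (standing in for an algorithm implementation) is executed on a
# dataset and outputs a generalization (a constant model, as a function).
from collections import Counter

def majority_class_execution(dataset):
    """Execute the algorithm on (features, label) examples; the output
    generalization predicts the most frequent label in the dataset."""
    majority = Counter(label for _, label in dataset).most_common(1)[0][0]
    return lambda features: majority   # the generalization, as a function

model = majority_class_execution([((1, 2), "yes"), ((3, 4), "yes"),
                                  ((5, 6), "no")])
assert model((7, 8)) == "yes"
```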
  
==== Data mining scenario ====
A scenario is [[http://oxforddictionaries.com/definition/scenario|"a postulated sequence or development of events"]]. Therefore, a data mining scenario comprises a logical sequence of actions to infer some type of generalization from a dataset, a sequence of actions for applying a generalization to a new dataset, and a sequence of actions for evaluating the obtained generalizations. OntoDM-core represents a data mining scenario in three different description layers of the ontology:
  * data mining scenario (as a specification),
  * data mining workflow (as an implementation), and
  * data mining workflow execution (as an application).

In OntoDM-core, a //data mining scenario// is an extension of the OBI class //protocol//. It includes as parts other information entities, such as //title of scenario//, //scenario description//, //author of scenario//, and //document//. From the protocol class it also inherits //objective specification// and //action specification// as parts. A //data mining workflow// is a concretization of a data mining scenario, and extends the //plan// entity (defined by OBI). Finally, a data mining workflow is realized (executed) through a //data mining workflow execution// process.
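The workflow-execution idea, an ordered sequence of actions realized one after another, can be sketched minimally. The steps and state below are invented for illustration; only the "ordered sequence of actions" notion comes from the text.

```python
# Illustrative sketch: a data mining workflow concretizes a scenario as
# an ordered sequence of steps; executing the workflow realizes the
# steps one after another on a shared state.
def run_workflow(steps, state):
    """Execute each workflow step in order; each step maps state -> state."""
    for step in steps:
        state = step(state)
    return state

# A toy scenario: infer a "generalization", then apply it to new data.
steps = [lambda s: {**s, "model": max(s["train"])},        # "induce" a model
         lambda s: {**s, "score": s["model"] >= s["new"]}] # "apply" the model
result = run_workflow(steps, {"train": [1, 3, 2], "new": 2})
assert result["model"] == 3 and result["score"] is True
```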
  
OntoDM-core does not represent scenarios and workflows that belong to other phases of the knowledge discovery process, such as application understanding, data understanding, data preprocessing, data mining process evaluation, and deployment. These are the subjects of representation in the [[ontodm-kdd|OntoDM-KDD ontology]]. Because both OntoDM-core and [[ontodm-kdd|OntoDM-KDD]] are built using the same design principles, the same upper-level ontology, and the same types of relations, they can be used together to represent the complete knowledge discovery process.
  
==== Constraints and constraint-based data mining tasks and algorithms ====
Constraints play a central role in data mining, and constraint-based data mining (CBDM) is now growing in importance. A general statement of the problem involves the specification of a language of generalizations and a set of constraints that a generalization needs to satisfy. In CBDM, constraints are propositions or statements about generalizations. They can be classified along three dimensions:
  - primitive and composite constraints;
  - language and evaluation constraints; and
  - hard (boolean) constraints, soft constraints and optimization constraints.

=== Taxonomy of constraints ===

A //constraint specification// is defined in OntoDM-core as a sub-class of the OBI class //data representational model// and is the top-level class of the taxonomy of constraints that we propose. At the first level of the taxonomy, we have //primitive// and //complex constraints//. Primitive constraints are based on atomic propositions, and complex constraints on non-atomic propositions. //Complex constraints// have as parts //primitive constraints// and a //combination function specification// that defines how the primitive constraints are combined to form a complex constraint.

At the second level, if we focus on the primitive constraints, we have //primitive language constraints// and //primitive evaluation constraints//. //Language constraints// concern the representation of a generalization and only refer to its form. Commonly used types of language constraints are //subsumption constraints// (e.g., all itemsets must contain the item 'bread') and language cost constraints (e.g., itemsets should contain at most three items). //Evaluation constraints// concern the semantics of a generalization when applied to a dataset. They usually include evaluation functions, which measure the validity of a generalization on a given dataset (e.g., classification accuracy).

At the last level, the //primitive language cost-function constraint// is extended with three sub-classes: //primitive hard language cost-function constraint//, //primitive soft language cost-function constraint//, and //primitive optimization language cost-function constraint//. //Hard constraints// represent boolean functions on generalizations: the constraint can be either satisfied or not satisfied. //Soft constraints// do not dismiss a generalization that violates a constraint, but rather penalize it for the violation. //Optimization constraints// ask for a fixed-size set of generalizations that have extreme values for a given cost or evaluation function. In a similar way, we define the sub-classes of the //primitive evaluation constraint// class.

=== Constraint-based data mining task ===

The task of CBDM is to find a set of generalizations that satisfy a set of constraints, given a dataset that consists of examples of a specific datatype, a data mining task, a generalization specification and a specification of the set of constraints. In the OntoDM-core ontology, we represent a CBDM task as a sub-class of the //objective specification// class (reused from IAO). It has as parts a //data mining task// and a set of //constraint specifications//. We further define a //CBDM algorithm// as an algorithm that solves a CBDM task. Finally, this structure allows us to form a taxonomy of CBDM tasks, where the first level of the taxonomy contains the basic CBDM task classes, aligned with the fundamental data mining tasks, and the next levels depend on the data specification and the type of constraints.
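The three kinds of constraints can be sketched on toy itemset patterns. The candidates, predicates and cost function below are invented for illustration; only the hard/soft/optimization distinction comes from the text.

```python
# Illustrative sketch: hard, soft and optimization constraints applied
# to candidate itemset patterns (all concrete values are illustrative).
candidates = [{"bread"}, {"bread", "milk"}, {"milk", "eggs", "jam", "tea"}]

hard = lambda itemset: "bread" in itemset                # boolean: satisfied or not
soft_penalty = lambda itemset: max(0, len(itemset) - 3)  # penalize size above 3

# Hard constraint: keep only the patterns that satisfy the condition.
satisfying = [c for c in candidates if hard(c)]

# Optimization constraint: ask for the single best pattern under a cost
# function (here, itemset size minus the soft penalty).
best = max(candidates, key=lambda c: len(c) - soft_penalty(c))

assert satisfying == [{"bread"}, {"bread", "milk"}]
assert best == {"milk", "eggs", "jam", "tea"}
```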
  
  
===== Ontology evaluation =====
We assess the quality of OntoDM-core from three different evaluation aspects:
  * [[ontology metrics|we analyze a set of ontology metrics]];
  * [[design criteria assessment|we assess how well the ontology meets a set of predefined design criteria and ontology best practices]]; and
  * [[competency questions assessment|we assess the ontology against a set of competency questions]].

=====Release version 1=====
{{ :file_structure.jpg?direct&300|}}
  * All files in one zip archive: {{:ontodm_v_1_r.zip|OntoDM-coreV1.zip}}
  * OntoDM-core main file [[http://ontodm.com/ontodm-core/OntoDM.owl|OntoDM-core.owl]]
  * File that OntoDM-core imports directly and contains external classes [[http://ontodm.com/ontodm-core/external.owl|external.owl]]
  * File that the external file imports and contains OBI classes [[http://ontodm.com/ontodm-core/external-OBI.owl|external-OBI.owl]]
  * OntoDT ontology of datatypes [[http://ontodm.com/ontodm-core/OntoDT.owl|OntoDT.owl]]
  * {{:clus_instances.owl}}
  * {{:clus_inferred.owl}}
