Cluster Method for Describing the Data Model of Information Systems

The article presents a cluster method for describing the multidimensional data cubes used in analytical information systems. The author addresses the problem of high cube sparseness that arises when a large number of analysis aspects are used. The method is based on semantic analysis of the data domain, identification of groups of interrelated elements, and formation of clusters of combinations, which makes it possible to organize data storage and analysis in multidimensional systems efficiently.

Contents

Cluster Method of Description of Information System Data Model Based on Multidimensional Approach
1 Introduction
2 Structure of sparse multidimensional data cube
3 Cluster approach to the description of the analytical space

Cluster Method of Description of Information System Data Model Based on Multidimensional Approach

Maxim Fomin
Department of Information Technologies, RUDN University
Miklukho-Maklaya st. 6, Moscow, 117198, Russia
mfomin@sci.pfu.edu.ru

Abstract. The multidimensional data cube is the data model of information systems based on the multidimensional approach. If a large set of aspects is used to analyze the data domain, the data cubes are characterized by substantial sparseness, which complicates the organization of data storage. The proposed cluster method of describing a multidimensional data cube is based on an investigation of the data domain semantics. The dimensions of the multidimensional cube correspond to the aspects of analysis. The basis of the cluster method is the construction of groups of members that are semantically related to groups of other members. Building associations between groups of members from different dimensions reveals the clusters in the data cube: sets of cells with similar properties that can be described in the same way. Clusters are used as the main element of the information system data model.

Keywords: multidimensional information system, multidimensional data model, sparse data cube, set of possible member combinations, cluster of member combinations.

1 Introduction

Multidimensional information systems based on the principles of OLAP are used for the operational analysis of large datasets. The analytical space in a system of this type is a multidimensional data cube. The dimensions of the cube correspond to the various aspects of the observed phenomenon that the system is designed to describe. If a large amount of semantically heterogeneous data is used to describe the observed phenomenon, the multidimensional cube is characterized by high sparseness and irregular filling [1–8]. As a result, there is a need for an adequate way to describe the structure of the analytical space, one that makes it possible to organize the data analysis process effectively [9–18]. Such a description should take into account the semantics of the observed phenomenon.

2 Structure of sparse multidimensional data cube

The structure of the analytical space of a multidimensional information system should reflect the characteristics of those aspects of the observed phenomenon that are used in the data analysis process. Each aspect corresponds to one dimension of a multidimensional cube $H$. The full set of dimensions forms the set $D(H) = \{D_1, D_2, \dots, D_n\}$, where $D_i$ is the $i$-th dimension and $n = \dim(H)$ is the dimensionality of the multidimensional cube. Each dimension is characterized by a set of members $D_i = \{d_{i1}, d_{i2}, \dots, d_{ik_i}\}$, where $i$ is the number of the dimension and $k_i$ is the quantity of members. The members of $D_i$ are drawn from the set of positions of the basic classifier that corresponds to the aspect of the observed phenomenon associated with $D_i$ [19–24].

The multidimensional data cube is a structured set of cells. Each cell $c$ is defined by a combination of members $c = \{d_{1i_1}, d_{2i_2}, \dots, d_{ni_n}\}$ that includes one member for each dimension. If the observed phenomenon is analyzed using a large set of diverse aspects, not all member combinations define possible cells of the multidimensional cube, i.e. cells that correspond to a certain fact. This effect arises because some members from different dimensions are semantically inconsistent with each other, and it produces sparseness in the cube.

The complex structure of member compatibility may lead to a situation where a certain dimension becomes semantically undefined when combined with a set of members from other dimensions. In this situation, when describing a possible cell of the multidimensional cube, we use the special value “Not in use” as the member of the semantically undefined dimension [25].

Thus, the structure of the analytical space of a multidimensional information system defines a set of possible member combinations corresponding to the set of possible cells of the multidimensional cube. We denote this set by the abbreviation “SPMC”. The members used to form SPMC combinations are taken from the classifiers that match the dimensions, together with the special member “Not in use”. The set of possible member combinations should meet the following requirements:

if SPMC contains a combination in which the special member “Not in use” is set for one or more dimensions together with a certain set of other members, no other combination with the same set of other members can exist in SPMC. In other words, a dimension is either used or not used in combination with a certain set of other members;
there should be no combination in SPMC in which all dimensions are set to the special member “Not in use”.
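
The two requirements can be checked mechanically. The following is a minimal sketch, not the author's implementation: it assumes that SPMC is given as a list of tuples with one member per dimension, and it reads the first requirement as "no other combination may agree with this one on every dimension that is actually used". The function name `spmc_violations` and the sample combinations are hypothetical.

```python
NOT_IN_USE = "Not in use"


def spmc_violations(spmc):
    """Return human-readable violations of the two SPMC requirements."""
    combos = [tuple(c) for c in spmc]
    problems = []
    for c in combos:
        unused = [i for i, member in enumerate(c) if member == NOT_IN_USE]
        # Requirement 2: a combination may not leave every dimension unused.
        if len(unused) == len(c):
            problems.append(f"all dimensions unused: {c}")
            continue
        if not unused:
            continue
        # Requirement 1 (as interpreted here): no other combination may share
        # the same members on the dimensions that this combination uses.
        used = [i for i in range(len(c)) if i not in unused]
        for other in combos:
            if other != c and all(other[i] == c[i] for i in used):
                problems.append(f"{c} conflicts with {other}")
    return problems


# Hypothetical sample: (debtor type, debtor gender, occupation).
sample = [
    ("Legal entity", NOT_IN_USE, "Trade"),
    ("Natural person", "Male", "Trade"),
    ("Natural person", "Female", "Trade"),
]
print(spmc_violations(sample))  # []
```

Adding `("Legal entity", "Male", "Trade")` to the sample would make the check report a conflict, because the gender dimension would then be both used and unused for the same set of other members.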

The observed phenomenon is characterized by the values of measures specified in the possible cells of the multidimensional cube. The full set of measures forms the set $V(H) = \{v_1, v_2, \dots, v_m\}$, where $v_j$ is the $j$-th measure and $m$ is the quantity of measures in the hypercube. Not all measures from $V(H)$ may be set in a possible cell; this happens when there is a semantic mismatch between the cell-defining members and some measures. Describing the analytical space therefore requires specifying, for each possible cell, its own set $V(c) = \{v_1, v_2, \dots, v_{m_c}\}$ of the measures specified in this cell, with $m_c \le m$. To describe the measures in cell $c$ outside the set $V(c)$ we introduce the special value “Not in use”. The following rule must hold: the set of measures $V(c)$ defined in a possible cell $c$ cannot be empty. Describing measures in cells whose member combinations are not included in SPMC does not make sense.
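
The rules for measures can be illustrated in the same spirit. This is a sketch under stated assumptions: each cell's measures are stored as a dictionary from measure name to value, a missing entry is treated like “Not in use”, and the measure names “Loan amount” and “Number of loans” are invented for illustration and do not come from the paper.

```python
NOT_IN_USE = "Not in use"


def specified_measures(all_measures, values):
    """Return V(c): the measures of V(H) that carry a real value in a cell."""
    return [v for v in all_measures if values.get(v, NOT_IN_USE) != NOT_IN_USE]


def measure_violations(all_measures, spmc, cell_values):
    """Check that V(c) is non-empty for possible cells and absent elsewhere."""
    problems = []
    possible = set(spmc)
    # Measures may only be described for combinations included in SPMC.
    for cell in cell_values:
        if cell not in possible:
            problems.append(f"measures described outside SPMC: {cell}")
    # Every possible cell needs at least one specified measure.
    for cell in spmc:
        if not specified_measures(all_measures, cell_values.get(cell, {})):
            problems.append(f"empty V(c): {cell}")
    return problems


# Hypothetical measures and cells: (debtor type, debtor gender).
measures = ["Loan amount", "Number of loans"]
cells = [("Natural person", "Male"), ("Legal entity", NOT_IN_USE)]
values = {
    ("Natural person", "Male"): {"Loan amount": 1000.0, "Number of loans": 3},
    ("Legal entity", NOT_IN_USE): {"Loan amount": 5000.0,
                                   "Number of loans": NOT_IN_USE},
}
print(measure_violations(measures, cells, values))  # []
```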

The challenge is to develop a formal approach to describing SPMC that presents the metadata of a multidimensional information system in a compact form reflecting the semantics of the observed phenomenon.

3 Cluster approach to the description of the analytical space

To describe the structure of an analytical space properly, one should perform a semantic analysis of the compatibility of members. There may be regularities in the compatibility of two, three or more members that define the structure of SPMC, but in most cases the compatibility rules of SPMC are specified by pairwise associations between dimensions. We limit ourselves to this case. As an illustrative example we consider the structure of the analytical space of an information system that describes the observed phenomenon “Granting of loans”. The measure data of the system are represented in six aspects corresponding to the following dimensions: “Time of loan granting”, “Place of loan granting”, “Debtor type”, “Debtor gender”, “Occupation” and “Type of loan”.

The first dimension is based on calendar data specified in the time range which is used in the analysis. The second dimension is based on the reference book of the territorial administrative division. The remaining dimensions are defined with the following members:

Debtor type = {“Legal entity”, “Natural person”};
Debtor gender = {“Male”, “Female”};
Occupation = {“Construction engineering”, “Trade”, “Banking”};
Type of loan = {“Operating”, “Interbank”, “Mortgage”, “Consumer”}.
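
To give a sense of scale, the four enumerated dimensions can be written down as a plain mapping and their unrestricted combinations counted. This is only an illustrative sketch: the dictionary layout and the variable names are assumptions, and the time and place dimensions are left out because their members are not listed here.

```python
from itertools import product
from math import prod

DIMENSIONS = {
    "Debtor type": ["Legal entity", "Natural person"],
    "Debtor gender": ["Male", "Female"],
    "Occupation": ["Construction engineering", "Trade", "Banking"],
    "Type of loan": ["Operating", "Interbank", "Mortgage", "Consumer"],
}

# Number of raw member combinations before any compatibility rule is applied.
raw_count = prod(len(members) for members in DIMENSIONS.values())

# The same combinations as tuples, one member per dimension.
raw_combinations = list(product(*DIMENSIONS.values()))

print(raw_count)  # 48
```

Only the combinations that survive the compatibility rules belong to SPMC, so these 48 raw combinations are an upper bound for the four dimensions before “Not in use” is taken into account.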

The source of information about the semantic relationships between the dimensions is the normative documentation relating to the observed phenomenon. The analyst should formalize this information as compatibility rules from which SPMC can be built. When pairwise associations are analyzed, the rules should determine which pairs of members can occur in SPMC combinations and which members of one dimension are incompatible with all members of the other dimension. This approach makes it possible to identify groups of members within a set of members. A group of members is a set of one or more members that combine with the members of some other dimension within SPMC in a similar way.

The method based on identifying groups in a set of members makes it possible to describe the pairwise relations between dimensions. A pairwise relation is specified by establishing a correspondence either between two groups of members from different dimensions that show the same compatibility, or between a group in one dimension and the “Not in use” member in the other. For pairwise relations the following conditions must hold:

If a member of the first dimension is included in a group that corresponds to a group in the second dimension, it cannot be included in a group that corresponds to the “Not in use” member;
If the “Not in use” member of the second dimension corresponds to a certain group of members of the first dimension, the members of this group can be present in SPMC only in combination with the “Not in use” member of the second dimension;
If a member of the first dimension is included in a group that corresponds to a group in the second dimension, then in any SPMC combination that includes this member the second dimension must either take a member from that second group or be set to the “Not in use” member.
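
The three conditions can be sketched as checks over a simple representation of a pairwise relation. Everything here is an assumption made for illustration: a relation is modelled as a list of pairs (group of the first dimension, group of the second dimension or “Not in use”), every member of the first dimension is expected to belong to at least one group, and the rule that a legal entity has no gender is a plausible guess rather than a rule stated in the text.

```python
NOT_IN_USE = "Not in use"


def overlapping_members(relation):
    """Condition 1: members tied both to a real group and to "Not in use"."""
    with_group, without_group = set(), set()
    for first_group, second_group in relation:
        if second_group == NOT_IN_USE:
            without_group |= set(first_group)
        else:
            with_group |= set(first_group)
    return with_group & without_group


def allowed_partners(relation, member):
    """Conditions 2 and 3: second-dimension members that may accompany `member`."""
    allowed = set()
    for first_group, second_group in relation:
        if member not in first_group:
            continue
        if second_group == NOT_IN_USE:
            allowed.add(NOT_IN_USE)
        else:
            # A member of the associated group, or "Not in use", is acceptable.
            allowed |= set(second_group) | {NOT_IN_USE}
    return allowed  # empty if the member belongs to no group


def pair_violations(relation, spmc, first_index, second_index):
    """Return the SPMC combinations that break the pairwise relation."""
    return [c for c in spmc
            if c[second_index] not in allowed_partners(relation, c[first_index])]


# Hypothetical relation "Debtor type" -> "Debtor gender".
relation = [
    (frozenset({"Legal entity"}), NOT_IN_USE),
    (frozenset({"Natural person"}), frozenset({"Male", "Female"})),
]
candidate = [
    ("Legal entity", NOT_IN_USE),
    ("Natural person", "Female"),
    ("Legal entity", "Male"),
]
print(overlapping_members(relation))               # set()
print(pair_violations(relation, candidate, 0, 1))  # [('Legal entity', 'Male')]
```

Keeping the relation as explicit group pairs mirrors the description above: the analyst states compatibility once per group rather than once per member pair.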
