Abstract: Dataspace support systems remain far from realization, even as individuals and enterprises face enormous data management challenges. Critical to their development is a model that represents the relationships between the entities collaborating in a dataspace. A dataspace is a new abstraction and target architecture for data management that does not require up-front semantic data integration. This paper models a dataspace using set theory with entity mappings. A technique for identity resolution and pay-as-you-go data integration is explained. To provide a strong degree of assurance, the authors apply the model to real-world entities that might form part of a global dataspace.
Key words: Dataspaces, entity collaboration, integration, geo data, data management.
1. Introduction
Large volumes of data are continuously being stored in data repositories around the world [1], and the need for effective and efficient data management techniques is growing accordingly. Data appear in a myriad of forms: some reside in structured sources, e.g., Database Management Systems (DBMS), and some do not. There is a demand to provide coherence between these sources, which are becoming part of a dataspace. Ref. [2] describes the dataspace as a new abstraction for data management.
In contrast to traditional (enterprise) databases with a given schema, the goal is to manage a rich collection of structured, semi-structured, and unstructured data spread across multiple enterprise repositories and the Web. Managing such a dataspace does not amount to yet another data integration approach. Rather, data in a dataspace coexist; semantic integration is not required for parts of the system to operate. Fig. 1, adopted from Ref. [2], categorizes current data management solutions along two dimensions. Administrative proximity indicates how close the various data sources are in terms of administrative control; "near" means that the sources are under the same, or at least coordinated, control. Semantic integration measures how closely the schemas of the different data sources match [3].
A complete dataspace should be a plug-and-play architecture that is customizable (i.e., can be modeled) to the domain of interest. The concept of a domain is often ignored, though it is very important for both efficiency and clarity.
A dataspace should contain all of the information relevant to a particular organization regardless of its format and location, and model a rich collection of relationships between data repositories. Hence, the authors model a dataspace as a set of participants and relationships [3].
The participants in a dataspace are the individual data sources: relational databases, XML repositories, text databases, web services, and software packages. Their data can be stored or streamed (managed locally by data stream systems), or can even come from sensor deployments [3].
Some participants may support expressive query languages, while others are opaque and offer only limited interfaces for posing queries (e.g., structured files, web services, or other software packages). Participants vary from being very structured (e.g., relational databases) to semi-structured (XML, code collections) to completely unstructured. Some sources will support traditional updates, while others may be append-only (for archiving purposes), and still others may be immutable [4].
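The description of a dataspace as a set of participants and relationships can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the authors' model: the class names (`Participant`, `Relationship`, `Dataspace`), the attribute names, and the sample sources (`crm_db`, `product_feed`, `archive_docs`) are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Structure(Enum):
    STRUCTURED = "structured"            # e.g., relational databases
    SEMI_STRUCTURED = "semi-structured"  # e.g., XML, code collections
    UNSTRUCTURED = "unstructured"


class UpdateMode(Enum):
    UPDATABLE = "updatable"      # supports traditional updates
    APPEND_ONLY = "append-only"  # e.g., archives
    IMMUTABLE = "immutable"


@dataclass(frozen=True)
class Participant:
    """One data source in the dataspace (hypothetical attribute names)."""
    name: str
    structure: Structure
    update_mode: UpdateMode
    supports_queries: bool  # False for opaque sources with limited interfaces


@dataclass(frozen=True)
class Relationship:
    """A labeled link between two participants, identified by name."""
    source: str
    target: str
    kind: str


@dataclass
class Dataspace:
    """A dataspace as two sets: participants and relationships."""
    participants: Set[Participant] = field(default_factory=set)
    relationships: Set[Relationship] = field(default_factory=set)

    def add_participant(self, participant: Participant) -> None:
        self.participants.add(participant)

    def relate(self, source: str, target: str, kind: str) -> None:
        # Only known participants may take part in a relationship.
        names = {p.name for p in self.participants}
        if source not in names or target not in names:
            raise ValueError("both endpoints must be participants")
        self.relationships.add(Relationship(source, target, kind))

    def related_to(self, name: str) -> Set[str]:
        # Names of all participants linked to `name`, in either direction.
        outgoing = {r.target for r in self.relationships if r.source == name}
        incoming = {r.source for r in self.relationships if r.target == name}
        return outgoing | incoming


# Illustrative usage with made-up sources.
ds = Dataspace()
ds.add_participant(Participant("crm_db", Structure.STRUCTURED, UpdateMode.UPDATABLE, True))
ds.add_participant(Participant("product_feed", Structure.SEMI_STRUCTURED, UpdateMode.APPEND_ONLY, False))
ds.add_participant(Participant("archive_docs", Structure.UNSTRUCTURED, UpdateMode.IMMUTABLE, False))
ds.relate("crm_db", "archive_docs", "references")
ds.relate("product_feed", "crm_db", "same-entity")
```

Using sets mirrors the set-based wording of the text: adding the same participant or relationship twice has no effect, and new sources can be added without touching existing ones.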
Two of the main services that a Dataspace Support Platform (DSSP) will support are search and query. While DBMSs have excelled at providing support for querying, search has emerged as a primary mechanism for end users to deal with large collections of unfamiliar data. Search is more forgiving than query: it is based on similarity, returns ranked results, and supports interactive refinement, so that users can explore a data set and incrementally improve their results. A DSSP should enable a user to specify a search query and iteratively refine it, when appropriate, into a database-style query. A key tenet of the dataspaces approach is that search should be applicable to all of the contents of a dataspace, regardless of their formats [4].
Universal search and query should extend to metadata as well as data. Users should be able to discover relevant data sources and inquire about their completeness, correctness, and freshness. In fact, a DSSP should also be aware of gaps in its coverage of the domain [3]. The paper is organized as follows: Section 2 reviews related literature, particularly on sets and dataspaces; Section 3 introduces dataspace entity relationships; Section 4 presents results and discussions; Section 5 gives conclusions.
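The difference between forgiving, ranked search and exact, database-style query can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sample records, the function names `search` and `query`, and the use of Jaccard token overlap as the similarity measure, which the source does not prescribe.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(items, keywords, limit=10):
    """Similarity-based search: rank items of any shape by token overlap."""
    wanted = tokens(keywords)
    scored = []
    for item in items:
        # Treat every field value as searchable text, whatever the format.
        text = " ".join(str(value) for value in item.values())
        score = jaccard(wanted, tokens(text))
        if score > 0:
            scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:limit]]


def query(items, **conditions):
    """Database-style query: keep only items matching every condition exactly."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in conditions.items())
    ]


# Made-up records from three heterogeneous sources.
items = [
    {"source": "crm_db", "type": "customer", "name": "Nairobi Water Company", "city": "Nairobi"},
    {"source": "reports", "type": "document", "title": "Water quality survey Nairobi 2011"},
    {"source": "sensors", "type": "reading", "station": "Mombasa harbour", "value": 7.2},
]

hits = search(items, "nairobi water")    # ranked, approximate matches
customers = query(hits, type="customer")  # refined to an exact condition
```

In this sketch the keyword search returns both the customer record and the report, ranked by overlap, and the follow-up call narrows those hits with an exact predicate, which mirrors the refinement from search to query described above.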
References
[1] B. Shibwabo, I. Ateya, Repository integration: The disconnect and way forward through repository virtualization supporting business intelligence, International Journal of Current Research 3 (4) (2011) 015-020.
[2] M. Franklin, A. Halevy, D. Maier, From databases to dataspaces: A new abstraction for information management, ACM SIGMOD Record 34 (4) (2005) 27-33.
[3] J. Pokorny, Databases in the 3rd millennium: Trends and research directions, Journal of Systems Integration 1 (2010) 3-15.
[4] M. Franklin, A. Halevy, D. Maier, Principles of dataspace systems, in: Proc. of 25th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2006), ACM Press, pp. 1-9.
[5] F. Diego, K. Jónsdóttir, D. Maier, Associative operations on a three-element set, The Montana Mathematics Enthusiast 5 (2&3) (2008) 257-268.
[6] Available online at: http://www.sgi.com/tech/stl/set.html.
[7] P. Ziegler, K. Dittrich, Data integration—problems, approaches, and perspectives, in: J. Krogstie, A.L. Opdahl, S. Brinkkemper (Eds.), Conceptual Modelling in Information Systems Engineering, Springer, Berlin Heidelberg, 2007, pp. 39-58.
[8] S. Stefanov, V. Dragieva, Evolution of sets systems and homotopy groups of spheres, in: Proceedings of the 41st Spring Conference of the Union of Bulgarian Mathematicians, 2012, pp. 202-206.