Intelligence Community Massive Digital Data Systems Initiative
Below is some information about the Intelligence Community Massive Digital Data Systems Initiative.

Summary:
- new data: 2-5 terabytes (10^12 bytes) per day
- total size: about 20 petabytes (20 * 10^15 bytes)
- 300 terabytes on-line, the rest accessible in a few minutes
- funding (for the research initiative, not for the final system): an estimated 3-5 million USD per year

Now, how much is 2-5 terabytes per day?
- 20-50 million JPEG images (at 100 kB per relatively high-quality image) per day
- 20-50 million minutes of GSM-quality phone intercepts per day
- 1-2.5 million minutes of compressed (256 kbit/s) video per day
- 1-3 billion e-mail messages per day
- you can continue the list; most available data sets turn out to be much smaller

How much is 20 petabytes? Assuming you want to collect information about 100,000,000 people worldwide, this makes 200 megabytes per person (on average, for each of those 100 million people). 200 megabytes per person on average is quite a lot, since for many of those people you probably don't have all that much data. Maybe 90% of the data for 10% of the people? (Of course, a database like this might also hold a lot of data such as aerial imagery, satellite imagery, economic information, etc., so it is a little exaggerated to talk about all of it being on individual people.)

The full text is below. Crypto relevance? Makes you think about whether you should protect your data.

Tatu

From: dbowner@cs.wisc.edu (Dbowner)
To: bal@mitre.org, mike@nobozo.CS.Berkeley.EDU, shosani@csr.lbl.gov, gray@sfbay.enet.dec.com, livny@cs.wisc.edu, ragrawal@almaden.ibm.com, manola@gte.com, heiler@gte.com, dayal@hplabs.hpl.hp.com, shan@hplabs.hpl.hp.com, toby@almaden.ibm.com, reiner@ksr.com, jag@allegra.att.com, randy@allspice.berkeley.edu, mcleod@vaxa.isi.edu, nick@MIMSY.CS.UMD.EDU, ake@purdue.edu, laney@ccr-p.ida.org, darema@watson.ibm.com, grossman@math.uic.edu, dbusa@cs.wisc.edu, metadata@llnl.gov, jmaitan@mosaic.uncc.edu, whm@thumper.bellcore.com
Cc: susan@mitre.org, connie@mitre.org
Subject: Call For Papers MDDS
Date: Thu, 18 Nov 93 11:08:03 EST
Resent-To: dbworld-people@cs.wisc.edu
Comments: IF YOU REPLY TO THIS MESSAGE, BE SURE TO EDIT THE to: AND cc: LISTS. The dbworld alias reaches many people, and should only be used for messages of general interest to the database community. Mail sent to dbedu goes to the subset of addresses with a .edu suffix; mail sent to dbusa goes to the subset of US addresses. Please use the smaller lists when appropriate. Requests to get on or off dbworld should go to dbworld-request@cs.wisc.edu.
Reply-To: (Susan L. Hanlon) <susan@linus.mitre.org>
Resent-Reply-To: (Susan L. Hanlon) <susan@linus.mitre.org>

3 November 1993

Dear Colleague:

Subject: Call for Abstracts for Massive Digital Data Systems

Future intelligence systems must effectively manage massive amounts of digital data (i.e., multi-terabytes or greater). Issues such as scalability, design, and integration need to be addressed to realize a wide spectrum of intelligence systems, ranging from centralized terabyte and petabyte systems comprising many large objects (e.g., images) to distributed heterogeneous databases that contain many small and large objects (e.g., text).
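[An aside on the back-of-the-envelope figures in the summary at the top, not part of the forwarded text: the conversions can be checked with a few lines of Python. This is only an illustrative sketch; the per-item sizes (JPEG size, GSM bit rate, e-mail size) are rough assumptions consistent with the summary, not figures from the announcement itself.]

    # Rough check of the per-day conversions and the per-person figure above.
    # Per-item sizes are assumptions consistent with the summary, not stated facts.
    TB, PB = 10**12, 10**15
    day_low, day_high = 2 * TB, 5 * TB

    sizes = {
        "JPEG images (~100 kB each)":            100e3,
        "minutes of GSM voice (~13 kbit/s)":     13e3 / 8 * 60,
        "minutes of 256 kbit/s video":           256e3 / 8 * 60,
        "e-mail messages (~2 kB each, assumed)": 2e3,
    }
    for label, size in sizes.items():
        print("%.1f - %.1f million %s per day"
              % (day_low / size / 1e6, day_high / size / 1e6, label))

    # 20 PB spread over 100 million people:
    print(20 * PB / 100e6 / 1e6, "MB per person on average")   # 200.0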
The Community Management Staff's Massive Digital Data Systems (MDDS) Working Group, on behalf of the intelligence community, is sponsoring a two-day, invitation-only, unclassified workshop on the data management of massive digital data systems with government, industry, and academia. The workshop will be held on the 1st and 2nd of February 1994 in Reston, Virginia. The objective of the workshop is to make industry and academia aware of intelligence community needs, stimulate discussion of the technical issues and possible solutions, and identify potential research efforts that warrant further investigation for possible government funding. The amount of funding estimated for investments is three to five million dollars per year over the next 2-3 years.

Last July, a one-day, classified, government-only workshop was held to characterize the magnitude of the problem and identify the major challenges. The needs, issues, and in some cases lessons learned were presented for different data types including Imagery, Text, Voice, Video, and Multi-media. Enclosure 1, "Massive Digital Data Systems Issues", is an unclassified description of the consolidated challenges.

The Massive Digital Data Systems Working Group is soliciting one-page abstracts related to the issues of the data management of massive digital systems, including (but not limited to) scalability, architecture and data models, and database management functions. The focus of the abstract should be on potential solutions for the longer-term research challenges (i.e., 5-10 years out) that must be addressed today in order to effectively manage data of massive proportions in the future. The solutions need not be limited to approaches proven today but can foster new approaches and paradigms. Issues relating to the storage media and analysis tools, while important to the intelligence community, are not within the scope of the workshop. Selection for attendance will be based upon technical relevance, clarity, and quality of the proposed solution.

Each one-page abstract should follow the abstract format enclosed (Enclosure 2). All submissions must be UNCLASSIFIED. To allow enough time for proper evaluation of each abstract, the deadline for submission is 01 December 1993. You will be notified of acceptance to attend by 17 December 1993. Abstracts should be forwarded to one of the following:

Jackie Booth, P.O. Box 9146, Rosslyn Station, Arlington, VA 22219
Jackie Booth, ORD/SETA, fax number (703) 351-2629
boothj@mcl.saic.com (Internet)

Please pass this call for abstracts on to other colleagues who are working on solutions in this area.

Sincerely,

Dr. David Charvonia
Director, Advanced Technology Office
Community Management Staff

Enclosures:
1. Massive Digital Data Systems Issues
2. Abstract Format

Enclosure 2

ABSTRACT FORMAT

Title:
Author(s):
Organization/Affiliation:
Address:
Phone:
FAX:
Description:
Status: (Research, Prototype, Operational)
Scope: (Size of effort in terms of dollars and/or staff months; size of system in terms of amount of data, number of databases, nodes, users, etc.)
Customer: (if applicable)
Operational Use: (if applicable)

********************************************************************
Forward to one of the following:
Jackie Booth, P.O. Box 9146, Rosslyn Station, Arlington, VA 22219
Jackie Booth, ORD/SETA, fax number (703) 351-2629
boothj@mcl.saic.com (Internet)

-------------------------------------------------------------------

MASSIVE DIGITAL DATA SYSTEMS ISSUES

EXECUTIVE SUMMARY

Future intelligence systems must effectively manage massive amounts of digital data (i.e., multi-terabytes or greater). Issues such as scalability, design, and integration need to be addressed to realize a wide spectrum of intelligence systems, ranging from centralized terabyte and petabyte systems comprising many large objects (e.g., images) to distributed heterogeneous databases that contain many small and large objects (e.g., text). Consequently, Massive Digital Data Systems (MDDS) are needed to store, retrieve, and manage this data for the intelligence community (IC). While several advances have been made in database management technology, the complexity and size of the databases, as well as the unique needs of the IC, require the development of novel approaches.

This paper identifies a set of data management issues for MDDS. In particular, scalability issues, architectural and data modeling issues, and functional issues are discussed. The architectures for MDDS could be centralized, distributed, parallel, or federated. The functions of MDDS include query processing, browsing, transaction management, metadata management, multimedia data processing, integrity maintenance, and realtime data processing. Representing complex data structures, developing appropriate architectures, indexing multimedia data, optimizing queries, maintaining caches, minimizing secondary storage access and communications costs, enforcing integrity constraints, meeting realtime constraints, enforcing concurrency control, recovery, and backup mechanisms, and integrating heterogeneous schemas are some of the complex tasks for massive database management. The issues identified in this paper will provide the basis for stimulating efforts in massive database management for the IC.

1.0 INTRODUCTION

1.1 The Challenge

The IC is challenged to store, retrieve, and manage massive amounts of digital information. Massive Digital Data Systems (MDDS), which range from centralized terabyte and petabyte systems containing many large objects (e.g., images) to distributed heterogeneous databases that contain many small and large objects (e.g., open source), are needed to manage this information. Although technologies for storage, processing, and transmission are rapidly advancing to support centralized and distributed database applications, more research is still needed to handle massive databases efficiently. This paper describes data management issues for MDDS, including scalability, architecture, data models, and database management functions. Issues related to storage media, analysis tools, and security, while important to the IC, are not within the scope of this paper.

The key set of data management issues for MDDS includes:
- Developing architectures for managing massive databases
- Utilizing data models for representing the complex data structures
- Formulating and optimizing queries
- Developing techniques for concurrency control and recovery
- Integrating heterogeneous schemas
- Meeting timing constraints for queries and transactions
- Indexing multimedia data
- Maintaining caches and minimizing secondary storage access and communications costs
- Enforcing integrity constraints
1.2 Background

The IC provides analysis on current intelligence priorities for policy makers based upon new and historical data collected from intelligence sources and open sources (e.g., news wire services, magazines). Not only are activities becoming more complex, but changing demands require that the IC process different types as well as larger volumes of data. Factors contributing to the increase in volume include continuing improvements in collection capabilities, more worldwide information, and open sources. At the same time, the IC is faced with decreasing resources, less time to respond, shifting priorities, and a wider variety of interests. Consequently, the IC is taking a proactive role in stimulating research in the efficient management of massive databases and in ensuring that IC requirements can be incorporated or adapted into commercial products. Because the challenges are not unique to any one agency, the Community Management Staff (CMS) has commissioned a Massive Digital Data Systems Working Group to address the needs and to identify and evaluate possible solutions.

1.3 Assumptions and Project Requirements

Future intelligence systems must provide a full suite of services for gathering, storing, processing, integrating, retrieving, distributing, manipulating, sharing, and presenting intelligence data. The information to be shared is massive, including multimedia data such as documents, graphics, video, and audio. It is desired that the systems be adaptable to handle new data types. The goal is to retain the data for potential future analysis in a cost-effective manner. The more relevant data would remain on-line, say for 5 years, organized with the most relevant data accessible in the least amount of time. It is expected that 2 to 5 terabytes of new data will have to be processed each day. Thus, the total size of the database (both on-line and off-line) could be as large as 20 petabytes, with about 300 terabytes of data stored on-line. It is assumed that storage devices (primary, secondary, and even tertiary) for the large multimedia databases, as well as data pathways with the required capacity, will exist. The access times are about 5 seconds for data less than a week old, about 30 seconds for data under two months old, and on the order of minutes for data up to 10 years old.

2.0 SCALABILITY ISSUES

A particular data management approach can be scaled to manage larger and larger databases only up to a point. That is, a database can often sustain a certain amount of growth before it becomes too large for a particular approach. For example, more memory, storage, and processors could be added, a new hardware platform or operating system could be adopted, or a different microprocessor could be used (e.g., a 32-bit microprocessor instead of a 16-bit one). Once the size of the database has reached the limit of a particular approach, a new approach is required. This new approach could be a new architecture, a new data model, new algorithms implementing one or more of the functions of the database management system (DBMS), or a combination of these. These three aspects are discussed below.

Architectures: The type of architecture impacts the size and response time of the DBMS. Centralized approaches are being migrated to distributed and parallel approaches to handle large databases. Some architectures, such as the shared-nothing parallel architecture, are scalable to thousands of processors but face multiprocessor communication issues.
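[An illustrative aside, not part of the original paper: the volume figures stated in section 1.3 imply some useful rules of thumb. The minimal sketch below uses only the stated numbers (2-5 TB/day of new data, 300 TB on-line, 20 PB total) and simple division.]

    # Back-of-the-envelope check of the figures in section 1.3.
    TB = 10**12
    PB = 10**15

    daily_low, daily_high = 2 * TB, 5 * TB   # new data per day (stated)
    online = 300 * TB                        # on-line capacity (stated)
    total = 20 * PB                          # total database size (stated)

    # How many days of the newest data can be held fully on-line?
    print(online // daily_high, "-", online // daily_low, "days on-line")   # 60 - 150 days

    # How many years of ingest does the 20 PB total correspond to?
    print(round(total / (daily_high * 365)), "-",
          round(total / (daily_low * 365)), "years of data")                # 11 - 27 years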
Current approaches need to be assessed to determine their scalability limits. New approaches may be required for handling massive databases.

Data Models: Data models that support a rich set of constructs are desired for next-generation database applications. However, the search and access time of the DBMS depends on the data model used. For example, DBMSs that support complex data structures use large caches, access data through pointers, and generally work well with large main memories, while DBMSs based on simpler data models maintain index files and provide associative access to secondary storage. The limits of these models within the context of massive databases need to be understood. New or modified approaches may be required.

DBMS Functions: The techniques used to implement the DBMS functions have to be modified to handle massive databases. For example, as the size of the database increases, new approaches for query optimization, concurrency control, recovery and backup, access methods and indexing, and metadata management will be required.

The architectural, data modeling, and functional issues that need to be addressed for MDDS are elaborated in sections 3 and 4.

3.0 ARCHITECTURAL AND DATA MODELING ISSUES

3.1 Architectural Issues

This section describes some of the architectural issues that need to be addressed for an MDDS. In the case of the centralized approach, a major issue is managing the data transfer between main memory and secondary storage. One could expect data that is less than a week old to be cached in main memory, data that is less than two months old to be in secondary storage, and data that is a few years old to be in tertiary storage. In designing the data management techniques (such as those for querying, updating, and transaction processing), data transfer between the main and secondary memories needs to be minimized. There is also a need to reflect patterns of use (e.g., in migrating items to lower or higher levels of the storage hierarchy). Another issue is the relationship between the size of the cache and the size of the database.

When one migrates to distributed and parallel architectures, a goal is to maintain a larger number of smaller databases. It is assumed that processors and storage devices are available. A major issue is the communication between the processors. In designing the data management mechanisms, an objective would be to minimize the communication between the different processors. For example, in the case of a join operation between several relations in a relational DBMS, each fragmented across multiple sites, an issue is whether to merge all of the fragments of a relation and then perform the join operation, or whether to perform several join operations between the fragments and then merge the results to form the final result. Different configurations of the distributed and parallel architectures also need to be examined. For example, there could be point-to-point communication between every pair of processors, or the processors could be arranged in clusters, with communication between clusters carried out by designated processors. Another issue in migrating to a distributed architecture is handling data distribution. For example, if the data model is relational, then how could one fragment the various relations across the different sites? If the relations are to be replicated for availability, then how could consistency of the replicated copies be maintained?
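[An illustrative aside, not part of the original paper: the fragmented-join question above can be made concrete with a minimal sketch. The relation names, data, and fragmentation scheme are hypothetical. When both relations are fragmented on the join key by the same rule, joining co-located fragments and merging the partial results gives the same answer as first merging all fragments, while keeping more of the work local to each site.]

    # Sketch of the join-over-fragments question in section 3.1.
    # Relations R(key, a) and S(key, b) are fragmented on "key" across two sites.
    N_SITES = 2

    def site_of(key):
        return key % N_SITES          # simple hash: key modulo number of sites

    R = [(1, "r1"), (2, "r2"), (3, "r3")]
    S = [(1, "s1"), (3, "s3"), (4, "s4")]

    R_frag = [[t for t in R if site_of(t[0]) == i] for i in range(N_SITES)]
    S_frag = [[t for t in S if site_of(t[0]) == i] for i in range(N_SITES)]

    def join(r, s):
        lookup = dict(s)              # key -> b
        return [(k, a, lookup[k]) for k, a in r if k in lookup]

    # Strategy 1: ship every fragment to one site, merge, then join there.
    merge_then_join = join(sum(R_frag, []), sum(S_frag, []))

    # Strategy 2: join the co-located fragments at each site, then merge results.
    join_then_merge = [row for i in range(N_SITES) for row in join(R_frag[i], S_frag[i])]

    # Both relations are fragmented on the join key, so the answers agree;
    # strategy 2 ships only the (smaller) partial results between sites.
    assert sorted(merge_then_join) == sorted(join_then_merge)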
Another issue is what data could be cached within the distributed system, how it could be cached, and for how long the cache could be maintained.

While distributed and parallel architectures are being investigated for managing massive databases, federated architectures are needed to integrate the existing different and disparate databases. The existing databases could be massive centralized databases or they could be distributed databases. Furthermore, they could be relational, object-oriented, or even legacy systems. An issue in heterogeneous database integration is developing standard uniform interfaces which can be accessed via an integration backplane. If the environment is a federated one, where the nodes have some autonomy, then a major issue is the ability to share each other's data while maintaining the autonomy of the individual DBMSs. This is hard because cooperation and autonomy are conflicting goals. The techniques to implement the DBMS functions for data retrieval, updates, and integrity maintenance have to be adapted, or new approaches have to be developed, for federated architectures.

Extensible architectures are also being investigated for massive databases. With such architectures, DBMSs are extended with inferencing modules which make deductions from data already in the database. This way, one need not store all of the data in the database explicitly. Instead, appropriate inference rules are used to make deductions and derive new data, which reduces the size of the database. The issues include determining what data is to be stored in the database and what is to be stored in the knowledge base manipulated by the inferencing module, effective management of the knowledge base, and adapting the functions of the DBMS to handle extensible architectures.

3.2 Data Modeling Issues

In selecting an appropriate data model for massive databases, several issues must be considered. The data model must be powerful enough to support the representation of complex data. For example, with a multimedia document, one may need to devise a scheme to represent the entire document in a way that facilitates browsing and updating. Since the age of a document could be used to move it between different storage media, it is desirable for the data model to support the representation of temporal constructs. The representation of different types of multimedia devices and the grouping of documents are also important considerations in selecting a data model. The data model chosen has an impact on the techniques used to implement the functions of a DBMS. For example, DBMSs based on some models use associative access, while those based on other models use pointer traversal.

In migrating to a distributed/parallel architecture, if it is assumed that the data model is the same for all databases, then a major issue is whether it is feasible to provide a conceptual view of the entire massive database to the user. However, in the case of a federated architecture, since it is generally assumed that the individual data models are different, several additional issues need to be considered. For example, could the users have a global view of the massive database, or could they have their own individual views? In either case, it would be desirable for the users to access the distributed databases in a transparent manner. If a global view is enforced, the query processor could transform queries on the global view into queries on the views of the individual databases.
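[An illustrative aside, not part of the original paper: a minimal sketch of the global-view transformation just described. The local schemas, field names, and mapping functions are hypothetical. A mediator accepts a request against one global schema, rewrites it against two local databases with different representations, and merges the results.]

    # Sketch of a federated, global-view mediator (section 3.2).
    # Two local "databases" store records under different schemas; the mediator
    # maps both into one global schema {name, city}. All names are hypothetical.
    db_a = [{"name": "Smith", "house_no": "12", "street": "Elm St", "city": "Reston"}]
    db_b = [{"name": "Jones", "city_state": "Arlington, VA"}]

    def from_a(rec):                    # local schema A -> global schema
        return {"name": rec["name"], "city": rec["city"]}

    def from_b(rec):                    # local schema B -> global schema
        return {"name": rec["name"], "city": rec["city_state"].split(",")[0]}

    def global_query(city):
        """Answer 'who is located in <city>?' against the global view."""
        results = [from_a(r) for r in db_a] + [from_b(r) for r in db_b]
        return [r["name"] for r in results if r["city"] == city]

    print(global_query("Arlington"))    # ['Jones']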
If each user has his own view, then the query processor could transform the user's view into the views of the individual databases. Other issues for a federated architecture include the representation of the individual schemas (which describe the data in the databases), determining which schemas are to be exported to the federation, filtering appropriate information from the schemas at different echelons, integrating the schemas to provide a global view, and generating the external schemas for the users. In integrating the different schemas, the semantic and syntactic inconsistencies between the different representations need to be resolved. For example, the address in database A could include the house number and the street name, while in database B it could be just the city and the state.

4.0 FUNCTIONAL ISSUES

The techniques used to implement the functions of MDDS will be impacted by the architectures and data models as well as by requirements such as integrity and multimedia data processing. Therefore, some of the functional issues have already been addressed in section 3. This section provides a more detailed overview of the functional issues. First, the basic functional issues for MDDS (such as query processing and transaction management) are discussed; then the impact of integrity maintenance, realtime processing, and multimedia data processing is considered.

4.1 Querying, Browsing, and Filtering

The query operation is a means by which users can retrieve data from the database. Closely related is the browsing operation, where users traverse various links and subsequently scan multiple documents either sequentially or concurrently. To determine whether new information warrants viewing by the analyst and/or to enforce access control, automatic filtering of the data is needed. Some issues in query management for massive databases are choosing an appropriate language for specifying queries and developing optimization techniques for the various operations involved in a query. The goals are to make it easier for users to formulate queries and to minimize data transfer between primary and secondary storage.

Query management in a federated environment must provide the means for formulating and processing queries seamlessly and efficiently. This involves designing an interface for formulating queries over multiple sources. Query optimization is needed to prevent performance degradation in the distributed system. In addition to determining the execution strategy for a query, query optimization techniques could also determine which portion of the query processing is to remain under direct and unshared control at the analyst's workstation. Methods need to be developed for browsing the integrated information space and for displaying results obtained from multiple sources. Finally, data from local databases has to be filtered according to the various constraints (such as security constraints), which must be enforced before the data is sent to remote sites.

Query processing algorithms in an extensible architecture need to incorporate inferencing techniques. The usefulness of inferencing techniques for intelligence applications can best be illustrated with a simple example. Suppose parts A, B, C, and D are needed to build a nuclear weapon, and also suppose that the following constraint is enforced: "if three of the four parts are shipped to country X, then the fourth part should not be shipped to X."
Therefore, if parts A, B, and C have already been shipped to X and there is a request from X for part D, then the inferencing module will determine that this part cannot be shipped. An issue in developing an inference module is determining the deduction strategies to be implemented. These strategies could be just logical deduction or could include more sophisticated techniques such as reasoning under uncertainty and inductive inference. With most inference strategies one runs into the problem of infinite loops; therefore, appropriate time limits must be enforced to control the computation.

In general, the issues to be addressed in query management will include:
- Query optimization
- Handling data distribution
- Making intelligent deductions
- Uniform vs. user-tailored query languages

4.2 Update Transaction Processing

Multi-user updates are generally supported to improve performance. The goal is for multiple users to be able to update the database concurrently. A major issue here is ensuring that the consistency of the database is maintained. The techniques that ensure consistency are concurrency control techniques. Often, update requests are issued as part of transactions. A transaction is a program unit that must be executed in its entirety or not executed at all. Therefore, if the transaction aborts due to some error, such as a system failure, the database is recovered to a consistent state. Several concurrency control algorithms have been designed and developed for different environments. Some algorithms are suitable for short transactions in business processing applications, and others are suitable for long transactions, which often involve multimedia data. To handle long transactions efficiently, weaker forms of consistency conditions have been formulated. Several recovery techniques have also been developed to maintain the consistency of the database. If the transaction is long, then the log files that record the actions of the transaction may be quite large, and efficient management of the log files becomes an issue. As the size of the database increases, a transaction takes longer to execute. Adapting the concurrency control and recovery algorithms, or developing new algorithms, to work with massive databases becomes an issue.

Update transaction processing gets more complicated in distributed and federated environments. For example, if replicated copies are to be maintained, then making them consistent will have an impact on performance. Therefore, an issue here is whether to maintain strict consistency, or to select a subset of the copies and make them consistent immediately so that the remaining copies can be updated at a later time. One of the problems with a federated environment is the different concurrency control and recovery algorithms used by the individual DBMSs. In such a situation, synchronizing the different techniques becomes a major issue.

4.3 Access Methods and Index Strategies

To enhance the performance of query and update algorithms, efficient access methods and index strategies have to be employed. That is, in generating strategies for executing query and update requests, the access methods and index strategies that are used need to be taken into consideration. The access methods used to access the database depend on the indexing methods. Therefore, creating and maintaining appropriate index files is a major issue in a DBMS. Usually, the size of the index file grows with the size of the database.
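[An illustrative aside, not part of the original paper: a minimal sketch of why index size is a concern. The keys and record layout are hypothetical. An index with one entry per record grows in lockstep with the database, while an index with one entry per block of sorted records stays much smaller at the cost of a short scan per lookup.]

    # Sketch contrasting per-record and per-block indexing (section 4.3).
    records = sorted((k, "payload-%d" % k) for k in range(1, 1001))

    # Dense index: one entry for every record in the database.
    dense = {key: pos for pos, (key, _) in enumerate(records)}

    # Sparse index: one entry per block of 100 sorted records; a lookup finds
    # the nearest preceding indexed key, then scans within that block.
    BLOCK = 100
    sparse = {records[pos][0]: pos for pos in range(0, len(records), BLOCK)}

    def sparse_lookup(key):
        start = max((p for k, p in sparse.items() if k <= key), default=0)
        for pos in range(start, min(start + BLOCK, len(records))):
            if records[pos][0] == key:
                return records[pos]
        return None

    print(len(dense), len(sparse))    # 1000 entries vs. 10 entries
    print(sparse_lookup(437))         # (437, 'payload-437')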
In some cases, the index file could be larger than the database itself. Some of the issues include determining what types of indexes are to be maintained for massive databases. Is it feasible to have dense indexing, where there is an entry in the index file for every entry in the database? If so, the index file could have as many entries as the database itself. Is it better to have sparse indexing, so that the size of the index file can be reduced? If so, is there a strategy for determining which entries in the database are to be indexed?

For multimedia data, indexing could be done not only by content but also by type, language, context (i.e., where, how, and when it was collected), author (for documents), and speaker (for voice). The challenge is how to index and how to provide improved mechanisms for extracting the information used for indexing. For example, the ability to automatically index voice is desired. Additionally, the ability to index voice and video (with associated voice) with their transcriptions (i.e., time alignment) is necessary. Various storage structures have been proposed, including B-Trees and Parent-Child links. The question is whether these methods are suitable for massive databases. Voice and video data require segmentation into logical units for storage and access. Additionally, the ability to automatically segment embedded drawings and figures within documents, and to interpret them (via seamless integration with image handling tools), is needed.

Other challenges include:
- providing user-transparent hierarchical storage management (i.e., storing the most relevant or most recent information on the fastest media) and the ability to reposition data in the storage hierarchy based upon changing importance;
- migration mechanisms for transferring information to newer storage media or a new architecture (failure to do so can lead to exorbitant costs to maintain discontinued storage media drives, or to inaccessible data);
- archival technology and policies for older or less important information; and
- synchronization of information distributed across multiple repositories.

Compression can decrease the costs of storage and transmission, especially for larger objects such as vector and raster spatial data types, voice, imagery, and video. Real-time conversion of heterogeneous voice and video compression and file formats in network broadcasts/multicasts is an issue. For imagery, a capability such as pyramidal decomposition for providing reduced-resolution images is needed for browsing purposes.

4.4 Managing the Metadata

The metadata includes a description of the data in the database (also referred to as the schemas), the index strategies and access methods used, the integrity mechanisms enforced, and other information for administrative purposes. Metadata management functions include representing, querying, and updating the metadata. In massive databases, if the metadatabase is much smaller than the database, then traditional techniques can be applied to manage the metadata. If the metadatabase itself becomes massive, then new techniques need to be developed. An issue here is whether the techniques for massive databases can also be applied to massive metadatabases. Support for schema evolution is desired in many new-generation applications. For example, the structures of the entities in the database could change with time: an entity could acquire new attributes, or existing attributes could be deleted.
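[An illustrative aside, not part of the original paper: one common way to accommodate this kind of evolution, sketched below under hypothetical schema versions and field names, is to tag each record with the schema version it was written under, so attributes can be added or dropped without breaking older records.]

    # Sketch of schema evolution via versioned records (section 4.4).
    schemas = {
        1: ["name", "country"],
        2: ["name", "country", "language"],   # attribute added in version 2
    }

    records = [
        {"_v": 1, "name": "report-17", "country": "X"},
        {"_v": 2, "name": "report-42", "country": "Y", "language": "ru"},
    ]

    def get(record, attribute, default=None):
        """Read an attribute under the schema version the record was written with."""
        if attribute in schemas[record["_v"]]:
            return record.get(attribute, default)
        return default                        # attribute did not exist in that version

    for r in records:
        print(get(r, "name"), get(r, "language", "unknown"))
    # report-17 unknown
    # report-42 ru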
The metadata needs to be represented in a manner that facilitates schema evolution; that is, appropriate models for representing the metadata are desired. Since the metadata has to be accessed for all of the functions of a DBMS, the module responsible for accessing the metadata needs to communicate with all the other modules. Efficient implementation of this module is necessary to avoid performance bottlenecks. Certain types of metadata, such as the schemas, are usually accessible to external users. An issue here is whether to provide a view to the users that is different from the system's view of the metadata. For example, a different representation of the metadata could be used for the users. Also, if the metadatabase is massive, then subsets of it could be presented to the users.

4.5 Integrity

The concurrency control and recovery issues discussed in section 4.2 are some of the issues that need to be dealt with in order to maintain the integrity (i.e., consistency) of the database. Other types of integrity include maintaining the referential integrity of entities and enforcing application-dependent integrity constraints. Referential integrity mechanisms must ensure that the entities referenced exist. The question is, how can the references to an entity be deleted when the entity itself is deleted? If the databases are massive, then there will probably be many more references to the deleted entity, and deleting all of these references in a timely manner is an issue. Application-specific integrity constraints could trigger a series of updates when one or more items in the database are updated. Again, as the size of the database increases, the number of updates that are triggered could also increase. The issue here is ensuring that the updates are carried out in a timely manner.

4.6 Realtime / Near Realtime Processing

Within a massive digital data system, the challenges of realtime or near realtime processing will be compounded. For realtime or near realtime applications, timing constraints may be enforced on the transactions and/or the queries. In the case of a hard realtime environment, meeting the timing constraints may cause the integrity of the data to suffer. In the case of soft realtime constraints (also referred to as near realtime), there is greater flexibility in meeting the deadlines. The issues for realtime processing include:
- If a transaction misses its deadline, what actions could be taken?
- Could a value function be associated with a transaction and used to determine whether the transaction should continue after it misses its deadline? Could the transaction be aborted if the value of the data approaches zero?
- What is the impact on the scheduling algorithms when timing constraints are present?
- How can the techniques be extended for a distributed/federated architecture? In the case of realtime updates in a distributed, replicated environment, is it possible to maintain the consistency of the replicated copies and still meet the timing constraints?
- What is the impact on the techniques for multimedia data processing?

4.7 Multimedia Data Processing

By nature, multimedia data management has to deal with many of the requirements for indexing, browsing, retrieving, and updating the individual media types. Implementing multimedia data types will require new paradigms for representing, storing, processing, accessing, manipulating, visualizing, and displaying data from various sources in different media.
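[An illustrative aside, not part of the original paper, returning to the value-function question in section 4.6 above: a minimal sketch in which the value of a result decays after its deadline, and the transaction is aborted once the value reaches zero. The value function, grace period, and timings are hypothetical.]

    # Sketch of a value function for realtime transactions (section 4.6).
    def value(completion_time, deadline, grace=5.0):
        """Full value before the deadline, decaying linearly to zero over 'grace' seconds."""
        if completion_time <= deadline:
            return 1.0
        late = completion_time - deadline
        return max(0.0, 1.0 - late / grace)

    def decide(expected_completion, deadline):
        v = value(expected_completion, deadline)
        return "continue" if v > 0.0 else "abort"

    print(decide(expected_completion=12.0, deadline=10.0))  # continue (some value remains)
    print(decide(expected_completion=30.0, deadline=10.0))  # abort (value has reached zero)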
One of the major issues here is synchronizing the display of different media types such as voice and video. Other issues include selecting or developing appropriate data models for representing the multimedia data and developing appropriate indexing techniques, such as maintaining indexes on textual, voice, and video patterns. For example, the ability to index voice and video simultaneously may be desired. In addition to the manipulation of multimedia data, frameworks for the integration of multimedia objects, as well as the handling of different granularities of multimedia objects (e.g., a one-hour video clip versus a spreadsheet cell), need to be considered. A flexible environment has to be provided so that linked and embedded distributed multimedia objects can accommodate geographic and network changes. Finally, the data manipulation techniques as well as the frameworks need to be extensible to support new and diverse data types.

4.8 Backup and Recovery

On-line backup procedures are being used for massive databases, because off-line procedures would consume too much time. Even when the backup procedures are carried out on-line, the system could be slowed down, and the performance of other data management functions would therefore suffer. The issue here is to develop improved backup techniques that do not impact functions such as querying, browsing, and updating. Recovery issues for transaction management were discussed in section 4.2. Other recovery issues include whether to maintain multiple copies of the database (and, if so, how many), and whether the checkpointing, roll-back, and recovery procedures proposed for traditional databases can be used for massive databases or whether special mechanisms need to be developed.

5.0 SUMMARY

Massive digital data systems will require effective management, retrieval, and integration of databases which are possibly heterogeneous in nature. Achieving this concept of massive intelligence information systems will require new technologies and novel approaches for data management. While hardware is rapidly advancing to provide massive data storage, processing, and transmission, the software necessary for the retrieval, integration, and management of data remains an enormous challenge. This paper has identified a set of issues for managing the data in massive digital data systems, with a focus on intelligence applications. First, an overview of current data management approaches and their scalability was given. Then, architectural and data modeling issues were discussed. Finally, the issues for the various functions of MDDS were examined. The set of issues identified is by no means a complete list. As research, prototyping, and deployment progress, new or hidden challenges will arise.