CN103412917B - An extensible database system and management method for coordinated management of multi-type domain data - Google Patents

Info

Publication number
CN103412917B
Authority
CN
China
Prior art keywords
data
database
domain
module
hierarchical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310343157.XA
Other languages
Chinese (zh)
Other versions
CN103412917A (en
Inventor
陈宁江
肖中正
董世龙
胡丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanning Super Cube Science And Technology Co Ltd
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University filed Critical Guangxi University
Priority to CN201310343157.XA priority Critical patent/CN103412917B/en
Publication of CN103412917A publication Critical patent/CN103412917A/en
Application granted granted Critical
Publication of CN103412917B publication Critical patent/CN103412917B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An extensible database system and management method for the coordinated management of multi-type domain data. The system comprises a data resource ontology library module, a hierarchical domain database module, a network domain database module, and a domain data evolution module; the data resource ontology library module, databases of multiple types, the hierarchical domain database module, and the network domain database module together form the database set. The invention can build a large number of business-domain-oriented data repositories, construct an extensible data resource ontology library system on top of them, rapidly extend it into hierarchical and network domain databases of different types, and extract new data objects from unstructured raw text data to construct new domain data.

Description

An extensible database system and management method for coordinated management of multi-type domain data

Technical field

The invention relates to an extensible database system and management method for the coordinated management of multi-type domain data, and belongs to the fields of databases and artificial intelligence.

Background

A database is a repository that organizes, stores, and manages data according to a data structure; it is a general-purpose data processing system for an organization or an application domain. As information technology and the market have developed, data management is no longer only about storing and managing data; it has become whatever form of data management users require. Databases come in many types, from the simplest tables holding assorted data to large database systems capable of storing massive volumes of data, and all are widely used. With accelerating informatization and the arrival of the "big data" era, enterprise data is becoming increasingly massive, unstructured, and complex. The combination of artificial intelligence and database technology has promoted the development of intelligent databases. An ordinary application encodes problem-solving knowledge implicitly in the program, whereas a system based on an intelligent database expresses the problem-solving elements of the application domain explicitly and organizes them into a relatively independent program entity.

As informatization accelerates, enterprises pay increasing attention to managing massive, complex data, but in managing these resources they often encounter the following problems: massive enterprise data is hard to store and manage; searching is slow and inefficient; version management of domain data is chaotic; data security is not guaranteed; and the databases of different domains cannot collaborate and share effectively. Managing massive, complex, unstructured data therefore requires an intelligent database that is extensible, evolvable, and supports collaborative management across multiple types of domains, in order to store, process, and analyze the data.

Summary of the invention

The technical problem solved by the invention is the efficient storage organization, indexing, and querying of unstructured data with multiple complex relationships; to that end it provides an extensible database system and management method for the coordinated management of multi-type domain data.

The technical solution of the invention is an extensible database system for the coordinated management of multi-type domain data, comprising a data resource ontology library module, a network domain database module, a hierarchical domain database module, and a domain data evolution module, wherein:

The data resource ontology library module defines the top-level data resource model, implements the logical view design and storage structure design of the basic data units, provides the basic supporting capabilities for data storage and access, and establishes a database containing a large number of business data objects, relationships, and concepts. It provides the top-level data abstraction rules and data access rules for the network domain database module and the hierarchical domain database module;

The network domain database module builds, on the basis of the data resource ontology library, a database based on network-type data attributes according to the attributes, relationship networks, and other special attributes of data objects. It implements the data structure design, storage design, and index design of network-type data objects, forms a relationship network containing a large number of network-type data objects, and exposes a network database access interface externally. The network domain database module inherits from the data resource ontology library and instantiates it for network-type domain data; it provides users, other modules, and external systems with a query interface over network-type domain data;

The hierarchical domain database module builds a database dedicated to representing data objects and their hierarchical membership information, according to the characteristics of the relationships among hierarchical data objects, such as subordination, adjacency, intersection, and same level, and exposes an access interface to that database externally. The hierarchical domain database module is a further evolution of the network domain database module: it stores and organizes domain data that has a purely hierarchical structure in the form of a tree, implements hierarchical semantics, and provides users, other modules, and external systems with a query interface over hierarchical domain data;

The domain data evolution module tracks and controls changes made during use to the domain data in the data resource ontology library, the network domain database, and the hierarchical domain database, and establishes a version history for data objects. It analyzes the raw data sets provided by users in combination with existing data to obtain new domain data, which is filtered and entered into the domain databases. The module provides record-based data version control for the three libraries, automatically discovers new domain data from the raw data entered by users, and uses the libraries' interfaces to perform the corresponding evolution management.

The data resource ontology library module comprises a data persistence module, a base lexicon building module, a relationship definition module, a data index module, and an interface module;

The data persistence module defines an interface-oriented implementation, so that the persistence implementation can be configured flexibly according to the hardware environment, the context, and other requirements. Based on object serialization, it defines serialization and deserialization protocols for domain-data objects. When data is persisted, the binary stream obtained by serializing an object is written to a file, a database, or a network location through the file organization protocol. When an object that is not in the object buffer pool must be loaded, the corresponding data stream is read according to the logical address information sent with the upper-layer request, and the object is reconstructed through its deserialization protocol. Data in a file is logically organized as blocks, and the blocks are managed with a heap structure. The data persistence module of the data resource ontology library is also the persistence abstraction for the network domain database and the hierarchical domain database: the data storage functions of those two modules customize and extend this persistence module under different persistence protocols, forming persistence libraries for specific data types;

The base lexicon building module stores data objects that have no extended attributes or relationships, and establishes the basic domain data objects, serialization protocol, deserialization protocol, and storage manager. The single data form of the base lexicon provides the data basis for defining network-type and hierarchical domain data. The network domain database module and the hierarchical domain database module implement the serialization and deserialization interfaces defined here;

The relationship definition module, building on the base lexicon, uses new files to establish synonym, antonym, and subordination relationships among the entries of the base data object database. Its highly abstract, generalized definition, organization, storage, and management of relationships allow the network domain database to be extended flexibly on this basis;

The data index module first defines a digest for each domain data object, then uses a fast double-encoding algorithm to map the digest to the logical storage information of the object, for fast retrieval and access control. The network domain database module and the hierarchical domain database module both contain an index part, in which each key is a pair of long integers obtained by the double-encoding computation;

The interface module is implemented on the EJB 3.0 specification and is published as EJB interfaces and Web Service interfaces to provide cross-platform services. The network domain database and the hierarchical domain database implement customized interface publication by inheriting the interface module of the data resource ontology library.
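The serialization protocol and the double-encoded index key described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class DataObject, the length-prefixed byte layout, and the function double_encode are hypothetical, and two differently salted BLAKE2b hashes stand in for the double-encoding algorithm, which the text does not specify.

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass
class DataObject:
    """Hypothetical base-lexicon entry: a name with no extended attributes."""
    name: str

    def serialize(self) -> bytes:
        # Assumed layout: 4-byte big-endian length, then the UTF-8 payload.
        payload = self.name.encode("utf-8")
        return struct.pack(">I", len(payload)) + payload

    @classmethod
    def deserialize(cls, stream: bytes) -> "DataObject":
        # Read the length prefix, then rebuild the object from the payload.
        (length,) = struct.unpack_from(">I", stream, 0)
        return cls(stream[4:4 + length].decode("utf-8"))


def double_encode(digest: str) -> tuple[int, int]:
    """Map a digest string to a pair of signed 64-bit ("long") integers.

    Assumption: two salted BLAKE2b hashes replace the unspecified
    double-encoding algorithm of the text.
    """
    data = digest.encode("utf-8")
    pair = []
    for salt in (b"first", b"second"):
        raw = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        pair.append(int.from_bytes(raw, "big", signed=True))
    return pair[0], pair[1]


# The index maps each key pair to logical storage information,
# here a hypothetical (block number, offset) tuple.
index = {double_encode("city"): (0, 0)}
```

A lookup computes double_encode on the digest of the requested object and reads the stored logical address from the index; the object is then rebuilt with DataObject.deserialize from the bytes found at that address.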

The network domain database module is implemented as follows:

(1) For storage design, define the domain data objects and persistence protocol for network-type domain data on the basis of the storage management layer of the data resource ontology library;

(2) Build the storage layer on the defined network-type domain data objects and define the basic structure and procedures of the attribute part. Attributes are divided into two parts: those that exist when the database is designed, called basic attributes, and those defined by users, called extended attributes;

(3) Build the data index on top of the storage structure of the network-type domain data objects, using a B-tree with a fast buffer together with a Bloom filter. The B-tree is generated dynamically as network-type data objects are inserted, and its maximum depth is not limited. To handle network-type data objects that share a name, their attribute blocks are chained with pointers into a linked list of attribute blocks, so that a query keyed on the name quickly returns the list of network-type data objects with that name;

(4) Once the data index is in place, implement checkpoints and log files for updates to data records, to ensure efficient access and high fault tolerance.
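As a rough illustration of step (3), the sketch below places a Bloom filter in front of a name index and chains the attribute blocks of same-named objects. It is not the patented implementation: the names BloomFilter and NameIndex are hypothetical, a Python dict stands in for the B-tree, a Python list stands in for the pointer-linked chain of attribute blocks, and buffering, checkpoints, and logging are omitted.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


class NameIndex:
    """Name -> chain of attribute blocks (dict used in place of a B-tree)."""

    def __init__(self):
        self.filter = BloomFilter()
        self.chains = {}

    def insert(self, name: str, basic: dict, extended: dict = None) -> None:
        # Each attribute block keeps basic and user-defined attributes apart.
        block = {"basic": basic, "extended": extended or {}}
        self.filter.add(name)
        self.chains.setdefault(name, []).append(block)

    def lookup(self, name: str) -> list:
        if not self.filter.might_contain(name):
            return []  # definitely absent, so the tree search is skipped
        return list(self.chains.get(name, []))
```

The filter only saves work for names that were never inserted; a false positive still ends in an ordinary (empty) index lookup, so results stay correct either way.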

The hierarchical domain database module is implemented as follows:

(1) For storage design, define the domain data objects and persistence protocol for the hierarchical domain database on the basis of the storage management layer of the data resource ontology library;

(2) Design the storage of the hierarchical structure on the basis of the hierarchical domain data objects and the persistence protocol extended for the hierarchical domain data structure;

(3) Once the storage layer is complete, organize the relationship structure among hierarchical data objects with a restricted binary tree. Each key in the index file is the double-encoded number pair of a data object; the pair is taken to be unique, so collisions are not considered. In the attribute file, parent and child data objects are also stored as number pairs. To retrieve a data object, compute its number pair, match the same pair in the index file to obtain the pointer to the corresponding attribute record, and read the record; if there are multiple child data objects, all of them can be found by following each attribute record's pointer to the next one;

(4) Once the index is complete, build a simplified log file suited to the hierarchical structure on the basis of the checkpoint and log file functions of the network domain database module.
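One way to read step (3) is as a first-child/next-sibling layout, in which every attribute record holds at most two links and children are found by following "next" pointers. The sketch below assumes that reading; the class HierarchyStore and its field names are hypothetical, and the number pairs are passed in directly instead of being computed by the double-encoding algorithm.

```python
class HierarchyStore:
    """Attribute records linked as first child / next sibling."""

    def __init__(self):
        self.index = {}    # number pair -> position of the attribute record
        self.records = []  # attribute "file", one dict per data object

    def add(self, key, name, parent=None):
        record = {"key": key, "name": name, "parent": parent,
                  "first_child": None, "next_sibling": None}
        self.index[key] = len(self.records)
        self.records.append(record)
        if parent is not None:
            parent_rec = self.records[self.index[parent]]
            # Prepend the new object to its parent's chain of children.
            record["next_sibling"] = parent_rec["first_child"]
            parent_rec["first_child"] = key

    def children(self, key):
        # Match the pair in the index, then follow the sibling pointers.
        names = []
        child = self.records[self.index[key]]["first_child"]
        while child is not None:
            rec = self.records[self.index[child]]
            names.append(rec["name"])
            child = rec["next_sibling"]
        return names


store = HierarchyStore()
store.add((1, 1), "Guangxi")
store.add((1, 2), "Nanning", parent=(1, 1))
store.add((1, 3), "Guilin", parent=(1, 1))
```

Because new children are prepended, store.children((1, 1)) lists the most recently added child first; appending at the tail would preserve insertion order at the cost of walking the chain on every insert.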

The domain data evolution module is implemented as follows:

(1) First, collect the user activity records of each database and monitor changes in the activity level of domain data objects;

(2) Analyze the collected activity-change data and place domain data objects whose activity falls below the system threshold into the alert reserve library;

(3) Analyze the user activity records further and create a version change record for each domain data object whose core attributes have changed;

(4) The system also analyzes text data provided by users or obtained from the Internet to build a large data object analysis library. When new data enters the analysis library, the version information of the associated data objects is read; the relationship between the data object and the versions of its associated data objects is then analyzed to compute the probability that the current data object is new domain data. The new domain data is then revised, automatically or manually by the user, and added to the corresponding domain database.
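Steps (1) to (3) can be pictured with a small tracker that counts activity per data object, lists objects below a threshold, and snapshots an object whenever a core attribute changes. This is a minimal sketch under assumptions: the class EvolutionTracker, the notion of activity as a simple access count, and the choice of which attributes are "core" are all hypothetical, and the probability analysis of step (4) is not modeled.

```python
from collections import Counter


class EvolutionTracker:
    def __init__(self, core_attributes, activity_threshold):
        self.core = set(core_attributes)
        self.threshold = activity_threshold
        self.activity = Counter()  # object id -> number of recorded uses
        self.current = {}          # object id -> current attributes
        self.history = {}          # object id -> list of earlier versions

    def register(self, obj_id, attributes):
        self.current[obj_id] = dict(attributes)
        self.history[obj_id] = []
        self.activity[obj_id] = 0

    def record_access(self, obj_id):
        self.activity[obj_id] += 1

    def record_update(self, obj_id, changes):
        self.activity[obj_id] += 1
        old = self.current[obj_id]
        core_changed = any(k in self.core and old.get(k) != v
                           for k, v in changes.items())
        if core_changed:
            # Keep a snapshot of the version that is being replaced.
            self.history[obj_id].append(dict(old))
        old.update(changes)

    def watch_list(self):
        # Objects whose activity is below the threshold (alert reserve).
        return sorted(o for o in self.current
                      if self.activity[o] < self.threshold)
```

For example, with "name" as the only core attribute, changing an object's "note" leaves its version history untouched, while changing its "name" appends the previous attribute set to the history.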

An extensible database management method for the coordinated management of multi-type domain data is implemented in the following steps:

(1) Preprocess the text data files provided by the user, removing non-core data such as stop words, modal particles, and punctuation, to obtain preprocessed text data;

(2) Feed the preprocessed data output by step (1) into the LDA probability model and match it against the established data model to obtain the domain-related data objects it contains;

(3) Build a suffix tree from the domain-related data objects output by step (2) and merge it with the existing suffix tree; traverse the merged suffix tree step by step to obtain high-frequency strings, and then initialize a domain-related data object;

(4) Feed the domain-related data object obtained in step (3) into the data resource ontology library for type and relationship judgment and matching, obtaining the type of the data object (hierarchical, network, or user-defined) and the other domain data objects associated with it;

(5) Enter the domain-related data object output by step (4), together with its associated data, into the domain database of the corresponding type, create a data change log record, and feed the data object into the double-encoding algorithm to obtain its index number pair;

(6) Combine the number pair obtained in step (5) with the data object and its related domain data objects according to the business, and finally output a domain data object carrying domain relevance, multiple relationships, and multiple attributes.
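Steps (1) and (3) can be approximated in a few lines. The sketch below is a simplification under stated assumptions: the stop-word list is illustrative, the LDA matching of step (2) is skipped entirely, and plain n-gram counting with a Counter replaces the suffix-tree construction and traversal, which yields the same kind of high-frequency strings for small inputs but is not the technique named in the text. The function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real list would depend on the language.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Lower-case, strip punctuation, and drop stop words (step 1)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def frequent_phrases(documents, max_len=3, min_count=2):
    """Count word n-grams up to max_len and keep the frequent ones.

    Stand-in for traversing a merged suffix tree (step 3).
    """
    counts = Counter()
    for doc in documents:
        tokens = preprocess(doc)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


docs = ["The domain database stores data.",
        "A domain database is extensible!"]
candidates = frequent_phrases(docs)
```

Each surviving phrase would then be a candidate for initializing a domain-related data object, which steps (4) to (6) classify, store, and index.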

Compared with the prior art, the invention has the following advantages:

(1) The invention is based on block storage, heap management, and multi-log-group storage techniques, which keep the underlying storage efficient and secure;

(2) Based on an extensible design, domain databases for multiple sub-domains can be extended on top of the data resource ontology library by defining custom data relationships;

(3) The invention automatically detects, analyzes, and extracts from large volumes of text data, and evolves the acquired data objects into new domain data on the basis of the latest version of the data resource ontology library.

Description of the drawings

Fig. 1 is a block diagram of the system of the invention;

Fig. 2 is a schematic diagram of the concept relationships in the data resource ontology library module;

Fig. 3 is a schematic diagram of the basic triple data model of the data resource ontology library module;

Fig. 4 is a schematic diagram of the index of the data resource ontology library;

Fig. 5 is a schematic diagram of the access process of the data resource ontology library module;

Fig. 6 is a flow diagram of defining a new field name in the data resource ontology library;

Fig. 7 is a flow diagram of inserting a new relationship into the data resource ontology library;

Fig. 8 is a schematic diagram of retrieving a data object from the data resource ontology library;

Fig. 9 is a schematic diagram of the basic structure and relationships of the network domain database;

Fig. 10 is a schematic diagram of the basic structure and relationships of the hierarchical domain database;

Fig. 11 is a flow diagram of querying the hierarchical domain database;

Fig. 12 is a flow diagram of creating a new domain data object in the hierarchical domain database;

Fig. 13 is a representation of the LDA model of the domain data evolution module;

Fig. 14 is a schematic diagram of the version control structure for domain data evolution.

Detailed description

Through an efficient method of organizing data storage, the invention stores and queries data objects and relationships effectively, and can discover new data objects and domain data from the large volumes of data provided by users. It implements the data resource ontology library, the network domain database, and the hierarchical domain database, and allows the domain databases to be extended.

The invention comprises a data resource ontology library module, a hierarchical domain database module, a network domain database module, and a domain data evolution module. The data resource ontology library module, the hierarchical domain database module, and the network domain database module together form the database set, which supports information extraction and data object retrieval, establishes the related data object libraries required by the domain database system, and provides the basic data storage and access capabilities for the coordinated management of multi-type domain data. The domain data evolution module tracks and controls changes in domain data, implementing the evolution of the domain databases themselves and the version management of domain data.

1. Data resource ontology library module

Definition of attribute: the properties and relationships of a thing are called its attributes. The characteristic attributes of a class of things are abstracted from concrete things; for example, speech and thought are characteristic attributes of humans. The accidental attributes of a class of things are attributes that some members of the class have but not all; for example, a person's skin color and ethnicity are accidental attributes.

Definition of concept: a concept is a form of thought that reflects the characteristic attributes (inherent or essential attributes) of things. Concepts are abstract and general. Concepts may be true or false; a true concept is one that correctly reflects the characteristic attributes of things. The intension of a concept is the characteristic attributes of the things it reflects. The extension of a concept is the things that have those characteristic attributes. An "is-a" relationship is generally used to indicate that one conceptual model is an extension of another. A relationship is an extension of a concept, and so is a role. A concept must be made clear, that is, clarified in terms of both its intension and its extension. Depending on whether its extension is one thing or many, a concept is either an individual concept or a general concept. The extension of an individual concept is a unique thing, such as a specific time or a specific place. The extension of a general concept can contain many things, as with "city" and "commodity": "city" covers many specific cities, and "commodity" is likewise the conceptual collection of a large number of physical commodities. "is" and "is-a" are each unique concepts. Concepts can also be divided into collective and non-collective concepts: a collective concept reflects a collective, and a non-collective concept does not. Concepts can further be divided into positive and negative concepts: a positive concept reflects things that have a certain attribute, and a negative concept reflects things that lack it. Finally, concepts are divided into relative and absolute concepts: a relative concept reflects things that stand in a certain relationship, and an absolute concept reflects things that have a certain property.

Concepts stand in the following four basic relationships, as shown in Fig. 2:

(1) Identity: if every a is b and every b is a, then a and b are identical. Two concepts in the identity relationship have the same extension.

(2) Subordination: if every b is a but some a is not b, then a is superordinate to b and b is subordinate to a.

(3) Intersection: if some a is b, some a is not b, and some b is not a, then a and b intersect.

(4) Disjointness: if no a is b, then a and b are disjoint.
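If the extension of each concept is modeled as a finite set, the four relationships reduce to ordinary set comparisons. The following sketch shows this; the function name concept_relation and the returned labels are hypothetical, and treating extensions as Python sets is an assumption made for illustration.

```python
def concept_relation(a: set, b: set) -> str:
    """Classify the relationship between two concept extensions."""
    if a == b:
        return "identity"              # every a is b and every b is a
    if b < a:
        return "a superordinate to b"  # every b is a, some a is not b
    if a < b:
        return "a subordinate to b"    # the mirror case
    if a & b:
        return "intersection"          # partial overlap in both directions
    return "disjoint"                  # no a is b


cities = {"Nanning", "Guilin", "Beijing"}
guangxi_cities = {"Nanning", "Guilin"}
capitals = {"Beijing", "Paris"}
rivers = {"Yangtze"}
```

With these sample extensions, cities is superordinate to guangxi_cities, cities and capitals intersect, and cities and rivers are disjoint.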

The triple "Concept A-Relation-Concept B" is the basic data model of the domain database; it is both the basic logical structure and the basic physical storage structure (Fig. 3). The "concept" is the basic element of the domain database: it expresses people's cognition of things and is a form of thought that reflects the characteristic attributes of things. A concept may be a real object or a product of human imagination or design; it may be an attribute or an activity. The triple states that "Relation" holds between "Concept A" and "Concept B"; within the triple, "Concept A" is the owner and "Concept B" is the participant. "Concept A" and "Concept B" may also own or participate in other relationships, so that relationships build a very complex network structure over n concepts, whose complexity is determined by the user or the actual operating environment. By level of abstraction, a "concept" is either a "class" (Class) or an "individual" (Individual). A "relationship" establishes a connection between concepts and is one of "class-to-class", "class-to-individual", or "individual-to-individual". In the data model design, every "relationship" is itself treated as a "concept"; everything in the domain database is therefore a "concept", and each "concept" has a unique identifier. All domain data can be expressed in this concept-relation-concept form. If the data of a domain is complex, the resulting structure is likely to be a very complex network, but its organization remains clear. Much of the world's domain data is directly or indirectly connected, and this representation makes related data easy to find. Unlike the relational data structure, the concept-relation structure can accommodate almost any relationship and data, and relationships and data can be modified and created freely; in a relational database, by contrast, the relationships among data are no longer changed once the database design is complete. In this respect, the invention regards the concept-relation organization as a marked advantage for model search and data mining.

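The triple model described above can be illustrated with a minimal in-memory sketch. The class name `ConceptStore`, its method names, and the example concepts are assumptions made for illustration; the patent does not prescribe this interface, and its storage is disk-based rather than in-memory.

```python
from collections import defaultdict
from itertools import count


class ConceptStore:
    """Toy in-memory store of Concept A - Relation - Concept B triples."""

    def __init__(self):
        self._next_id = count(1)
        self.ids = {}                      # concept name -> unique identifier
        self.triples = set()               # (owner_id, relation_id, participant_id)
        self.by_owner = defaultdict(set)   # owner_id -> {(relation_id, participant_id)}

    def concept(self, name):
        """Return the identifier of a concept, creating it on first use."""
        if name not in self.ids:
            self.ids[name] = next(self._next_id)
        return self.ids[name]

    def relate(self, owner, relation, participant):
        """Record that `owner` holds `relation` with `participant`."""
        # The relation is registered as a concept too, as the model requires.
        triple = (self.concept(owner), self.concept(relation), self.concept(participant))
        self.triples.add(triple)
        self.by_owner[triple[0]].add(triple[1:])

    def related(self, owner, relation):
        """Sorted names of all participants that `owner` holds `relation` with."""
        names = {ident: name for name, ident in self.ids.items()}
        rel_id = self.ids.get(relation)
        pairs = self.by_owner.get(self.ids.get(owner), ())
        return sorted(names[p] for r, p in pairs if r == rel_id)
```

Because relations receive identifiers from the same counter as other concepts, a relation can itself appear as the owner or participant of another triple, which is what allows the network to grow without a schema change.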
1.1 Structure of the data resource ontology library

The ontology data stored in the data resource ontology library is conceptually precise, concise in word form, and unambiguous. Each ontology entry expresses exactly one concept, mainly as a noun or noun phrase. Some data objects are linked by specific relations, such as synonymy, antonymy, and affiliation. First, an underlying database is designed that holds data objects without extended attributes or relations; it provides the services on which the synonym, antonym, and affiliation relations are built. Another important component is the in-memory index, whose entries consist of an object hash value and a disk block address. Each object maps to a single numeric hash value. If two data objects collide, that is, their hash values are equal, they are stored in the same disk block, so one block can hold multiple records.

Disk file: disk space is divided into blocks of a specified size. A block holds multiple records; if the records exceed the block's capacity, a pointer links to the next block. A program maps the source data object file into a binary data file. The data file is organized in data blocks, and a data block stores the data objects that share the same hash value, that is, the objects that collide. For each data object the block stores several attribute fields. Further fields can be added if needed, for example an address field for an object's synonyms in the synonym database. To look up a data object, its hash value is computed with the hash function; the mapping gives the object's relative physical block number directly, which is then passed to the database manager and the object manager to read and unpack the data. A database cache sits in between, so large numbers of data objects need not be loaded into memory, which saves space and keeps access efficient.
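The block layout above, in which colliding objects share a block and a full block points to an overflow block, can be sketched as follows. This is a simplified in-memory model under stated assumptions: the constants `BLOCK_CAPACITY` and `NUM_BUCKETS`, the use of `zlib.crc32` as the hash function, and the class names are all hypothetical, and a Python list stands in for the binary data file.

```python
import zlib

BLOCK_CAPACITY = 2   # records per block (hypothetical, kept tiny for illustration)
NUM_BUCKETS = 8      # size of the hash space (hypothetical)


class Block:
    def __init__(self):
        self.records = []       # data object records that share a hash value
        self.next_block = None  # block number of the overflow block, if any


class BlockFile:
    def __init__(self):
        self.blocks = []   # stands in for the binary data file
        self.index = {}    # in-memory index: hash value -> first block number

    def _hash(self, name):
        return zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS

    def _new_block(self):
        self.blocks.append(Block())
        return len(self.blocks) - 1

    def insert(self, name, fields):
        h = self._hash(name)
        if h not in self.index:
            self.index[h] = self._new_block()
        block_no = self.index[h]
        # Walk the chain to its last block.
        while self.blocks[block_no].next_block is not None:
            block_no = self.blocks[block_no].next_block
        # Chain a fresh overflow block when the last one is full.
        if len(self.blocks[block_no].records) >= BLOCK_CAPACITY:
            new_no = self._new_block()
            self.blocks[block_no].next_block = new_no
            block_no = new_no
        self.blocks[block_no].records.append((name, fields))

    def lookup(self, name):
        block_no = self.index.get(self._hash(name))
        while block_no is not None:
            for rec_name, fields in self.blocks[block_no].records:
                if rec_name == name:
                    return fields
            block_no = self.blocks[block_no].next_block
        return None
```

A lookup touches only the blocks on one hash chain, which is why the text can claim that most objects never need to be brought into memory.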

Independent data objects: an independent data object has no relation to any other data object, so a query returns only the object itself and shows no relations. This part uses the underlying database directly, which does not implement relations between data objects. The logical storage structure of domain data (shown in Figure 4) comprises:

(1) Index: a value computed from the data object by the hash function, used to locate the data object's address.

(2) Block address: the relative physical address of the block in which the data object is stored.

(3) Data object record: the data object itself.

Data objects with relations: the relations are synonymy, antonymy, and affiliation. They are implemented on top of the underlying database, using the data object addresses that it provides.

Relations between data objects are implemented as follows. On top of the underlying database, an index table file and a record set file are created. Physically, each file is divided into contiguous blocks of a fixed size. Here a block is simply called a record, and each record has a unique number (the first record is number 0, the second is number 1, and so on). In the index table file, an index record contains the positive association group number, the inverse association group number, the subordinate data object group number, and the relative physical address of the superior data object (the group numbers are record numbers in the record set file). In the record set file, each record carries two fields holding the numbers of its previous and next records, and can store a fixed number of relative physical addresses of data objects. When a record is full, a successor record is allocated as needed.

To add a positive (or inverse, or subordinate) association to a data object, the object is first processed with the hash function to obtain its unique relative physical address, and an index record is allocated for it in the index table file. The index record number is written at a specific position in the data block management layer, located by the object's address. Next, a record is allocated in the record set file to hold the relative physical addresses of the object's positively associated (or inversely associated, or subordinate) data objects, and the number of the newly allocated record is written into the object's index record. The relative physical address of the associated data object is then added to that record. When a subordinate data object is added, the relative physical address of its superior must also be set in the subordinate object's index record. After a positive, inverse, or subordinate association is deleted, the affected record set records and index records are checked. Any that have become empty have their numbers recorded and are unlinked from the data object, so the freed disk space can be reused and disk utilization improves.
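The record set file described above behaves like a doubly linked chain of fixed-size records with a free list. The following is a minimal sketch under stated assumptions: `SLOTS`, the dictionary layout of a record, and the function names are hypothetical, and Python lists stand in for the on-disk files.

```python
SLOTS = 2  # relative physical addresses per record (hypothetical)


class RecordSetFile:
    def __init__(self):
        self.records = []  # each record: {"prev", "next", "addrs"}
        self.free = []     # numbers of emptied records, available for reuse

    def allocate(self, prev=None):
        """Allocate a record, reusing a freed record number when one exists."""
        rec = {"prev": prev, "next": None, "addrs": []}
        if self.free:
            no = self.free.pop()
            self.records[no] = rec
        else:
            self.records.append(rec)
            no = len(self.records) - 1
        return no

    def add(self, head, addr):
        """Append `addr` to the chain that starts at record `head`."""
        no = head
        while len(self.records[no]["addrs"]) >= SLOTS:
            nxt = self.records[no]["next"]
            if nxt is None:
                nxt = self.allocate(prev=no)
                self.records[no]["next"] = nxt
            no = nxt
        self.records[no]["addrs"].append(addr)

    def members(self, head):
        """All addresses stored along the chain that starts at `head`."""
        out, no = [], head
        while no is not None:
            out.extend(self.records[no]["addrs"])
            no = self.records[no]["next"]
        return out

    def release(self, head):
        """Return every record of a chain to the free list."""
        no = head
        while no is not None:
            nxt = self.records[no]["next"]
            self.free.append(no)
            no = nxt


def add_association(index_record, rsf, kind, addr):
    """kind is 'positive', 'inverse' or 'subordinate' (hypothetical key names)."""
    if index_record[kind] is None:
        index_record[kind] = rsf.allocate()
    rsf.add(index_record[kind], addr)
```

Keeping freed record numbers in a list is one simple way to realize the reuse of disk space that the text describes; the patent does not specify how freed numbers are tracked.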

1.2 Defining the database

The main purpose of this module is to ease management and extension of the data object library. A data object in the library can be given a field and field value only after that field name has been defined. If a field name is modified, every data object that has the field automatically takes the new field name, with the field value unchanged. If a defined field name is deleted, every data object that has the field automatically loses the field and its value. Likewise, data objects can establish a relation only after the relation name has been defined. If a relation name is modified, the relation between the data objects that hold it is automatically changed to the new relation. If a defined relation name is deleted, the relation is automatically dissolved between all data objects that hold it. The default field names and relation names predefined by the database cannot be modified or deleted; user-defined field names and relation names can be added, modified, and deleted. Figure 5 shows the abstract architecture of the data resource ontology library and its access process.

Taking the definition of a field name as an example, Figure 6 shows the execution flow. The steps are as follows:

(1) The client interface service component sends a request to define a field name to the server-side service component.

(2) The server calls the domain database manager's method for defining a field name. The database manager calls the database access object to check whether the field name is already defined, and the database access object returns the result. If the field name is already defined, the database manager returns the integer 0 to the server, the server returns the operation result 0 to the client, and the process ends.

(3) The database manager asks the database access object to read the old checkpoint in the record log (the position of the end of the log's valid segment).

(4) The database access object returns the old checkpoint, which is assigned to startPos. The database manager asks the abstract data access object to read the version number of the current valid database backup. The database access object returns the result, and the database manager assigns it to the variable version. The database registry is then copied; the copy is called the new registry.

(5) If the new registry contains no field names, the field name is added to it directly with code 1. Otherwise, the current maximum field name code is read from the new registry. If the new registry has a free field name code i smaller than the current maximum, that code is assigned to the new field name; otherwise the current maximum is incremented by 1 and assigned to the new field name, and the current maximum field name code is updated.

(6) Set the current maximum field name code in the registry to 1 (the case in which the registry held no field names). Create a log record object and assign its fields, including the data it carries and the action to be performed, then write the log record to the record log file.

(7) Obtain the new valid checkpoint of the log file (the new checkpoint) and write it to the log file header. Write the newly logged record contents from the log file into the data file (the commit).

(8) Return the commit result. If the commit raises an error or fails, the database manager returns the operation result -1 to the server and the process ends. If the commit succeeds, the size of the log file is checked and the log file is reset if it exceeds a set size. The database manager returns the operation result 1 to the server, and the server returns the operation result 1 to the client: the field name has been defined.
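The flow above can be condensed into a short sketch: check for an existing definition, allocate a code on a copy of the registry, log the change, then commit. The function names, the `commit` callback, and the dictionary form of the registry are assumptions made for illustration. The source says only that a free code below the current maximum is reused; choosing the lowest such code is also an assumption.

```python
def allocate_field_code(registry):
    """Pick a code for a new field name.

    1 if the registry is empty, else the lowest free code below the current
    maximum (assumed policy), else the current maximum plus 1.
    """
    if not registry:
        return 1
    used = set(registry.values())
    max_code = max(used)
    for code in range(1, max_code):
        if code not in used:
            return code
    return max_code + 1


def define_field_name(registry, log, name, commit=lambda record: True):
    """Return 0 if already defined, 1 on success, -1 if the commit fails."""
    if name in registry:
        return 0
    new_registry = dict(registry)                 # work on a copy of the registry
    new_registry[name] = allocate_field_code(new_registry)
    record = {"action": "define_field", "registry": new_registry}
    log.append(record)                            # log first, then commit
    if not commit(record):
        return -1
    registry.clear()
    registry.update(new_registry)                 # the copy becomes the registry
    return 1
```

Working on a copy of the registry means a failed commit leaves the live registry untouched, which matches the purpose of the checkpoint and log steps in the flow.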

1.3 Managing data object information

This module manages data object information and provides functions for adding, deleting, and modifying it. When a data object is destroyed, all of its information (including its fields and its relations with other data objects) is removed completely from the database. A data object can be added to the database in three ways. First, it can be added directly, without any accompanying information. Second, when a field is added for a data object that does not yet exist in the database, the database adds the data object automatically and then adds the new field information. Third, when a relation is established between two data objects and one or both of them do not exist in the database, the system first adds the missing data objects and then establishes the relation. When a field is added or modified for a data object, the field name must already be defined in the database; likewise, when a relation is established between two data objects, the relation name must already be defined. This module can also set the frequency of a data object and can import and export database files. The following uses the insertion of a data object field as an example of the operation flow. Figure 7 is the sequence diagram for inserting a data object field, and its steps are as follows:

(1) The user service interface sends the server a request to add a field to a keyword, with the field value content.

(2) The server calls the database manager's method for adding a field and its content to a data object.

(3) The database manager calls the dual encoder to compute the dual code of the keyword.

(4) The dual encoder returns the keyword's dual-code object key (the index key). If the result is empty, go to step 5; otherwise go to step 7.

(5) Computing the dual code of the keyword failed; the data object manager returns the operation result -1 to the server.

(6) The server returns the operation result -1 to the client, declaring that the requested operation failed, and goes to step 40.

(7) The data object manager asks the abstract data access object for the code of the field name in the registry.

(8) The database access object returns the code of the field name. If the returned value is not empty (the field name is defined), go to step 11.

(9) The database manager returns the operation result 0 to the server, declaring that the field name fieldName is not defined in the database, so the field and its content (field value) cannot be inserted for the keyword.

(10) The server returns the operation result 0 to the client, declaring that the field name fieldName is not defined in the database and the requested operation failed.

(11) The database manager asks the database access object for the index value of the index key key in the index table.

(12) The database access object returns the index value value mapped to the index key key. If value is empty, the keyword does not exist in the database: go to step 13. Otherwise go to step 16.

(13) Add the keyword to the database and return the result (an integer). If the addition fails, go to step 6; otherwise go to step 14.

(14) Ask the database access object again for the index value of the index key key in the index table.

(15) The database access object returns the index value value of the index key key.

(16) The database manager asks the database access object to read the old checkpoint in the record log (the position of the end of the log's valid segment).

(17) The database access object returns the old checkpoint, and the database manager assigns it to the variable startPos.

(18) The database manager asks the abstract data access object to read the version number of the current valid database backup.

(19) The database access object returns the result, and the database manager assigns it to the variable version.

(20) Copy the database registry; the copy is called the new registry.

(21) The database manager asks the database access object to read the keyword's byte data.

(22) The database access object returns a carrier (called the data cart) holding the keyword's byte data and its set of disk addresses.

(23) The database manager asks the data processing factory to process the keyword's byte data and convert it into a readable information object.

(24) The data processing factory returns the keyword content carrier to the database manager, which inspects it. If the field name to be added and its field value already exist, go to step 25; otherwise go to step 27.

(25) The database manager returns the operation result 2 to the server, indicating that the content to be added already exists.

(26) The server returns the operation result 2 to the client, indicating that the content to be added already exists.

(27) Add fieldName and content to the keyword content carrier.

(28) The database manager asks the data processing factory to process the keyword's data and convert it into byte data.

(29) The data processing factory returns the byte data of the keyword content to the database manager, which loads it into the data cart.

(30) Using the data in the data cart and the information in the new registry, reallocate the addresses of the required disk blocks and update the new registry.

(31) Create a new log record object and load the data cart, the new registry, and the action to be performed into it.

(32) The database manager asks the database access object to write the new log record object to the log file.

(33) The database access object returns the log file's new valid checkpoint (the new checkpoint) to the database manager.

(34) The database manager asks the database access object to write the new checkpoint to the log file header, and the database access object does so.

(35) The database manager asks the database access object to write the newly logged record contents from the log file into the data file (the commit).

(36) The database access object returns the commit result. If the commit succeeded, go to step 38; otherwise go to step 37.

(37) If the commit failed, or any of the steps above threw an exception, go to step 6.

(38) The database manager returns the operation result 1 to the server.

(39) The server returns the operation result 1 to the client: the field was inserted.

(40) End.
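The forty steps above reduce to four outcomes, identified by the result codes -1, 0, 1, and 2. The sketch below shows only that decision structure. The function signature, the `commit` callback, the use of an empty keyword to stand in for a failed dual-code computation, and the plain dictionaries standing in for the index, registry, and data file are all assumptions made for illustration.

```python
def insert_field(db, registry, keyword, field_name, content, commit=lambda: True):
    """Return 1 on success, 0 if field_name is undefined, 2 if the content
    already exists, and -1 on an encoding or commit failure."""
    if not keyword:                          # stands in for a failed dual-code computation
        return -1
    if field_name not in registry:           # steps 7-10: field name must be defined
        return 0
    if keyword not in db:                    # steps 11-15: add the keyword if missing
        db[keyword] = {}
    record = db[keyword]
    if record.get(field_name) == content:    # steps 24-26: nothing new to add
        return 2
    updated = {**record, field_name: content}  # step 27, applied to a copy
    if not commit():                         # steps 31-37: log and commit
        return -1
    db[keyword] = updated                    # steps 38-39
    return 1
```

A caller can treat 1 and 2 as "the field now has this content" and 0 and -1 as failures that leave the stored fields unchanged.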

1.4 Retrieving data object information

The retrieval functions are as follows:

(1) Check data object existence: check whether a given data object exists in the database.

(2) Retrieve a data object's data packet: retrieve all readable information of the data object (fields, field values, relations, related data objects) and encapsulate it in a data packet for transmission over a network or by other means.

(3) Retrieve a field value: retrieve the value (content) of a given field of a data object.

(4) Retrieve related data objects: retrieve all data objects related to a data object through a given relation.

(5) Retrieve by field name: either by a single field or by a combination of two fields. Single-field retrieval returns the data packets of all data objects that have a given field; two-field retrieval returns the data packets of all data objects that have both given fields.

(6) Retrieve by relation name: retrieve the data packets of all data objects that take part in a given relation.

(7) Prefix matching retrieval: retrieve the data packets of all data objects that begin with a given keyword.

(8) Fuzzy matching retrieval: retrieve the data packets of all data objects that contain a given keyword.

(9) Retrieve high-frequency data objects: retrieve a specified number of data objects or data packets in descending order of frequency.

(10) Retrieve low-frequency data objects: retrieve all data objects whose frequency is below a given count.

(11) Retrieve frequency: retrieve the frequency of a given data object (the number of times it has been retrieved).

(12) Retrieve all field names defined in the database.

(13) Retrieve all relation names defined in the database.

Figure 8 shows the sequence diagram for retrieving a data object's data packet. Its steps are as follows:

(1) The client sends the server a request to retrieve the data packet of the data object for a keyword.

(2) The server calls the database retriever's method for retrieving a data object's data packet.

(3) The database manager calls the dual encoder to compute the dual code of the keyword.

(4) The dual encoder returns the keyword's dual-code object key (the index key).

(5) The database retriever asks the database access object for the index value corresponding to key in the index table.

(6) The database access object returns to the database retriever the index value value mapped to key. If value is empty, the keyword does not exist in the database: go to step 7. Otherwise go to step 9.

(7) The database retriever returns the retrieval result null to the server.

(8) The server returns the retrieval result null to the client and goes to step 14.

(9) The keyword's frequency is incremented by 1, and the database retriever asks the database access object to update the keyword's frequency in the database index table; the database access object does so.

(10) The database retriever asks the database access object to update the keyword's frequency in the data file; the database access object does so.

(11) The database retriever calls its own method to retrieve the keyword's data packet, using the keyword's first address on disk.

(12) The database retriever returns the keyword's data packet to the server.

(13) The server returns the keyword's data packet to the client.

(14) End.
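The retrieval flow above, together with the frequency-based retrieval functions listed earlier, can be sketched as follows. The dictionary layouts of `index` and `data_file`, and both function names, are assumptions made for illustration; the real system addresses disk blocks rather than dictionary keys.

```python
def retrieve_packet(index, data_file, keyword):
    """Return the data packet for `keyword`, or None if it is not stored.

    A successful retrieval increments the keyword's frequency in both the
    index table and the data file (steps 9 and 10).
    """
    entry = index.get(keyword)
    if entry is None:                       # steps 6-8: unknown keyword
        return None
    entry["frequency"] += 1                 # step 9: update the index table
    record = data_file[entry["address"]]
    record["frequency"] = entry["frequency"]  # step 10: update the data file
    return {                                # steps 11-13: build the packet
        "keyword": keyword,
        "fields": dict(record["fields"]),
        "relations": dict(record["relations"]),
    }


def high_frequency(index, limit):
    """Keywords in descending order of frequency, at most `limit` of them."""
    ranked = sorted(index, key=lambda k: index[k]["frequency"], reverse=True)
    return ranked[:limit]
```

Since every successful retrieval updates the counter, the frequency reported by the system is the number of times an object has been retrieved, as item (11) of the function list states.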

2. Network-type domain database module:

2.1 Basic principles

The storage of each network-type domain data set is divided into an index part and an attribute part. The index part is stored in the file name.dct and the attribute part in the file attr.dct. The name index of network-type data is a B-tree of unrestricted size, built dynamically as data objects are inserted.

The index file uses pointers: N pointers are defined, one for each of N commonly used Chinese characters in the GB2312-80 encoding. Each pointer points to the root of the B-tree that holds the names beginning with that character, so all names with the same first character are stored in the same B-tree.

Network-type data can be retrieved using the name as the key; the key is computed by a hash function from the GB2312-80 encoding of the name. A search first locates the B-tree and then uses the B-tree search algorithm to find the name. Names in the index file correspond one-to-one with attributes in the attribute file: if a data object's summary information is found in the index file, the attributes of that network-type data object necessarily exist in the attribute file. The index file is searched with the summary information as the key, and the result is the position in the attribute file of the corresponding attributes. Once the summary information is found, the attribute index stored with it allows the attributes to be read directly from the attribute file. Operations on the attribute file are therefore fast, and most of the time is spent searching the index file. In the index file, summary information is stored and searched using a hash algorithm combined with a B-tree algorithm. The search is disk-based and addressing follows pointers directly, so the algorithm is efficient.

2.2 Index storage structure

In the index file, the summary information of network-type data is stored and searched using a hash algorithm and a B-tree algorithm. The first character of the summary information is called the "first character", and the remainder after removing it is called the "suffix". First, a location table with N entries is created in the index file; each entry consists of a single character and its GB2312-80 value. The address of an entry in the location table is obtained by computing the key value of its character with the hash function. Each entry also stores a pointer to the root of a B-tree, which stores the suffixes of the summary information.

To store the name of a network-type data object, the B-tree for the first character of the summary information is located through the location table, and the suffix is inserted into that B-tree. Lookup works the same way: the B-tree for the first character is located through the location table, and the suffix is then searched for in the B-tree.

B-tree structure: the B-tree stores the suffixes of the summary information as its keys. To reduce the number of disk reads, a B-tree of order n is used, with n chosen according to actual needs. Each node of the B-tree contains the following information:

$$(n,\ C_0,\ A_1, K_1, C_1,\ A_2, K_2, C_2,\ \ldots,\ A_n, K_n, C_n,\ \mathit{Father})$$

Here $n$ is the number of keys in the node. $K_i$ ($i = 1, \ldots, n$) are the keys (suffixes of the summary information), with $K_i < K_{i+1}$ ($i = 1, \ldots, n-1$). $C_i$ ($i = 0, \ldots, n$) are pointers to the root nodes of subtrees: every key in the subtree pointed to by $C_{i-1}$ is smaller than $K_i$ ($i = 1, \ldots, n$), and every key in the subtree pointed to by $C_n$ is greater than $K_n$. $A_i$ ($i = 1, \ldots, n$) are pointers into the attribute file: $A_i$ points to the position in the attribute file of the attributes whose summary information has the node's B-tree character as its first character and $K_i$ as its suffix. $\mathit{Father}$ is a pointer to the parent node.
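A search over nodes of this shape can be sketched as follows. The `Node` dataclass and the `search` function are illustrative names, and the node holds object references where the on-disk structure would hold file offsets. Note that the lists are 0-based, so `keys[i]` corresponds to K_(i+1) while `children[i]` corresponds to C_i.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[str] = field(default_factory=list)         # K_1..K_n, sorted ascending
    attrs: List[int] = field(default_factory=list)        # A_1..A_n, attribute-file positions
    children: List["Node"] = field(default_factory=list)  # C_0..C_n, empty for a leaf
    father: Optional["Node"] = None                       # pointer to the parent node


def search(node, suffix):
    """Return the attribute-file position stored with `suffix`, or None."""
    while node is not None:
        # Index of the first key that is >= suffix.
        i = bisect_left(node.keys, suffix)
        if i < len(node.keys) and node.keys[i] == suffix:
            return node.attrs[i]
        # All keys under children[i] are smaller than keys[i], so descend there.
        node = node.children[i] if node.children else None
    return None
```

Each loop iteration reads one node, so the number of node reads is bounded by the height of the tree; this is the reason a higher order reduces disk reads.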

当要查找一个数据对象时,先根据其给定摘要信息的首字通过Hash计算得到该字在区位表中的表项地址,然后读取表项的内容找到该字对应的B树树根地址,然后在B树中查找摘要信息的后缀。查找一次使用的时间为一次Hash计算的时间加上B树查找的时间,因此该检索算法效率比较高。使用内存映射技术,不需要将索引文件读入内存,只需将使用到的节点读入内存即可,大大减少了磁盘读取时间,提高内存利用率。When looking for a data object, first calculate the entry address of the word in the location table through Hash according to the first word of the given summary information, and then read the content of the entry to find the B-tree root address corresponding to the word , and then look up the suffix of the summary information in the B-tree. The time used for a search is the time of Hash calculation plus the time of B-tree search, so the retrieval algorithm is more efficient. Using the memory mapping technology, it is not necessary to read the index file into the memory, but only to read the used nodes into the memory, which greatly reduces the disk reading time and improves the memory utilization.

2.3 Attribute storage structure of network type data

All attributes of a network type data object other than its name are stored in attribute files. The attributes fall into two parts: those that exist when the database is designed, called basic attributes, are stored in the basic attribute file; user-defined attributes, called extended attributes, are stored in the extended attribute file. Basic attributes are stored as attribute blocks: the attributes of one data object are stored in one attribute block, and the basic attribute blocks are written to the basic attribute file in the order in which the data objects are inserted. Extended attributes are stored as a linked list. The storage structure of the attribute files is shown in Figure 9. A basic attribute block consists of the data object's basic attributes, pointer a, pointer b, and so on. Pointer a points to the attribute block of the data object with the same name, and pointer b points to the first address of the data object's extended attributes in the extended attribute file.

To look up the attributes of a network type data object, the name of the data object is first searched for in the index file. The node that stores the name also holds a pointer to the position of the object's attribute block in the attribute file; the basic attributes are read directly from that address, and the extended attributes are then read from the extended attribute file by following the relevant pointer in the attribute block. Retrieval within the attribute files is therefore very efficient.
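As a rough illustration of the layout just described, the sketch below models the two attribute files as in-memory lists and the pointers as list positions. The class and field names (`AttributeStore`, `same_name`, `ext_head`, `next_pos`) are hypothetical; the source only names "pointer a" and "pointer b".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ExtendedAttr:
    name: str
    value: str
    next_pos: Optional[int] = None   # next node of the extended-attribute list


@dataclass
class BasicBlock:
    basic: Dict[str, str]
    same_name: Optional[int] = None  # "pointer a": block of a same-named object
    ext_head: Optional[int] = None   # "pointer b": head of extended attributes


class AttributeStore:
    def __init__(self):
        self.basic_file: List[BasicBlock] = []    # blocks in insertion order
        self.ext_file: List[ExtendedAttr] = []

    def add_object(self, basic, extended=(), same_name=None):
        head = None
        # Build the linked list back to front so that head ends at the first item.
        for name, value in reversed(list(extended)):
            self.ext_file.append(ExtendedAttr(name, value, head))
            head = len(self.ext_file) - 1
        self.basic_file.append(BasicBlock(dict(basic), same_name, head))
        return len(self.basic_file) - 1           # position of the new block

    def read(self, pos):
        block = self.basic_file[pos]
        attrs = dict(block.basic)                 # basic attributes first
        p = block.ext_head
        while p is not None:                      # then walk the extended list
            node = self.ext_file[p]
            attrs[node.name] = node.value
            p = node.next_pos
        return attrs

    def same_name_chain(self, pos):
        chain = []
        while pos is not None:                    # follow "pointer a" links
            chain.append(pos)
            pos = self.basic_file[pos].same_name
        return chain
```

In the source, the position passed to `read` would come from the index file node that stores the object's name.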

3. Hierarchical domain database module:

3.1 Basic principles

Four main kinds of relationship exist among hierarchical data: membership, adjacency, intersection, and co-reference. Membership is the primary relationship. A hierarchical data object may have a superior, a subordinate, or both; it is stipulated here that each hierarchical data object is directly responsible to the level above it or directly leads the level below it. This design reflects the fact that many hierarchical data objects have basic primary attributes that are exactly equal.

Each hierarchical data object can itself serve as the database key that uniquely identifies it. The key is a number pair computed by a hash function based on double encoding, and it serves as the logical address of the data object. In other words, the key of each data object corresponds one-to-one with its storage address on disk: to find a data object, computing its key value with the hash function is equivalent to obtaining its logical address, after which the data access task is handed to the object manager and the log manager. This approach avoids search and matching; the time is spent mainly on computing the hash value, and there is no need to load all data blocks into memory, only the required data objects. It is therefore feasible in terms of both execution efficiency and memory utilization. Because addressing follows pointers directly, retrieval of data objects is very efficient.

3.2 Structure of the index file

The index file structure mainly contains the following fields, whose roles are as follows:

·Key: the key value computed by the hash function, which is unique for each data object;

·Father/Son: the membership relationship. Father holds the Key of the hierarchical data object one level up, and Son holds the Key of the hierarchical data object one level down;

·Neighbour: the adjacency relationship. Neighbour holds the Keys of hierarchical data objects that are adjacent in the relationship structure; this field may hold more than one Key;

·Cross: the intersection relationship. Cross holds the Keys of hierarchical data objects that intersect in the relationship structure; this field may also hold more than one Key;

·Co-ref: the co-reference relationship. Co-ref holds the Keys of hierarchical data objects within the same domain that are semantically understood to refer to the same thing; this field may also hold more than one Key;

·0/1: indicates whether the hierarchical data object has a duplicate name, that is, the same name with two entirely different meanings. A value of 0 means there is no duplicate name and 1 means there is; the Fathers field that immediately follows then records all upper-level data objects that contain this data object;

·Fathers: when there is a duplicate name, records all upper-level data objects that contain this hierarchical data object, so this field may also hold more than one Key. It is NULL if there is no duplicate name.

Each of the above fields occupies four bytes.
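The field list above can be pictured as a record type. The sketch below is an in-memory illustration only: it does not reproduce the four-byte on-disk layout, the double-encoding hash is not modeled (keys are plain integers here), and the names `IndexRecord`, `duplicate`, and `parents_of` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IndexRecord:
    key: int                                   # Key
    father: Optional[int] = None               # Father: Key one level up
    son: Optional[int] = None                  # Son: Key one level down
    neighbour: List[int] = field(default_factory=list)  # Neighbour Keys
    cross: List[int] = field(default_factory=list)      # Cross Keys
    co_ref: List[int] = field(default_factory=list)     # Co-ref Keys
    duplicate: int = 0                         # the 0/1 field
    fathers: Optional[List[int]] = None        # Fathers: None when no duplicate


def parents_of(record):
    """Return every upper-level Key recorded for this data object."""
    if record.duplicate == 1 and record.fathers:
        return list(record.fathers)            # duplicate name: all containers
    return [] if record.father is None else [record.father]
```

For a duplicate-named object, comparing the entries returned by `parents_of` is what lets a caller tell the different meanings apart.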

3.3 Structure of the data file

The data file stores the hierarchical data objects themselves; only in combination with the index file can data objects be accessed. Logically, the data file is a linear list of data objects, and each entry in the list (a data entry) is linked through pointers to the fields of the structure, as shown in Figure 10. Here $W_i$ and $W_j$ are each the entry of one hierarchical data object. The pointers in the Father, Son, Neighbour, Cross, Co-ref, and Fathers fields point, respectively, to the entry's upper-level data object, lower-level data object, adjacent data objects, intersecting data objects, co-referring data objects, and, when basic attributes are exactly equal, all upper-level data objects of the data object. Two such data objects can be told apart by their upper-level data objects.

To look up a data object, it suffices to compute its address by hashing and load it into memory; the organizational structure linking it to other related data objects can then be built easily. No search-and-match process is needed, the retrieval time is spent mainly on the hash computation, the algorithm is highly time-efficient, and the load factor of the hash is above 0.8.

Figures 11 and 12 show the relevant business algorithm processes of the hierarchical domain database.

4. Domain data evolution module:

4.1 LDA-based information extraction

The first step in domain data evolution is to analyze and process a large amount of valuable text. The system uses text clustering based on the LDA (Latent Dirichlet Allocation) probabilistic generative model, which automatically groups similar texts in a text collection into categories and thereby helps discover relevant domain data. Texts are represented with the vector space model; the text representation matrix usually has very high dimensionality, and during clustering the "curse of dimensionality" often makes similarity measures meaningless. The LDA topic model represents text well: it can mine the latent semantic information of the text, obtain a representation of each document in the topic space, and reduce the dimensionality of the document representation. Modeling the text in this way supports feature selection, topic classification, similarity judgment, and so on. The LDA model uses the bag-of-words approach, which treats each text data resource as a term frequency vector and thus converts textual information into numerical information that is easy to model.

The three-layer Bayesian model representation of LDA is shown in Figure 13. $\Phi_k$ denotes the term probability distribution of topic $k$, and $\theta_m$ denotes the topic probability distribution of the $m$-th document; $\theta_m$ and $\Phi_k$ in turn serve as the parameters of the multinomial distributions used to generate topics and words, respectively. $K$ is the number of topics, $M$ the number of documents, $N_m$ the length of the $m$-th document, and $w_{m,n}$ and $z_{m,n}$ the $n$-th word of the $m$-th document and its topic. $\alpha$ and $\beta$ are the parameters of the Dirichlet distributions; they are usually fixed and symmetric, and are therefore written as scalars. Both $\Phi_k$ and $\theta_m$ follow a Dirichlet distribution, whose density is:

$$\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \qquad \text{(Formula 1)}$$

where $0 \le \mu_k \le 1$, $\sum_{k} \mu_k = 1$, $\alpha_0 = \sum_{k=1}^{K} \alpha_k$, and $\Gamma$ is the gamma function. The LDA generative process is as follows.

(a) For each topic $k \in [1, K]$, sample the term distribution $\Phi_k \sim \mathrm{Dir}(\beta)$;

(b) For the $m$-th document in the corpus, $m \in [1, M]$:

(c) Sample the topic probability distribution $\theta_m \sim \mathrm{Dir}(\alpha)$;

(d) Sample the document length $N_m \sim \mathrm{Poiss}(\xi)$;

(e) For the $n$-th word in document $m$, $n \in [1, N_m]$:

(f) Choose the hidden topic $z_{m,n} \sim \mathrm{Mult}(\theta_m)$;

(g) Generate the word $w_{m,n} \sim \mathrm{Mult}(\Phi_{z_{m,n}})$.
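Steps (a) through (g) can be followed in code. The sketch below uses only the Python standard library and is meant as an illustration under stated assumptions: symmetric scalar $\alpha$ and $\beta$, Dirichlet draws obtained by normalizing Gamma variates, Poisson draws obtained with Knuth's multiplication method, and words represented as integer term ids. The function names are hypothetical.

```python
import math
import random


def sample_dirichlet(concentration, size, rng):
    """Symmetric Dirichlet draw obtained by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_corpus(K, M, V, alpha, beta, xi, seed=0):
    rng = random.Random(seed)
    # (a) one term distribution per topic
    phi = [sample_dirichlet(beta, V, rng) for _ in range(K)]
    corpus = []
    for _ in range(M):                                   # (b) each document
        theta = sample_dirichlet(alpha, K, rng)          # (c) topic distribution
        length = sample_poisson(xi, rng)                 # (d) document length
        doc = []
        for _ in range(length):                          # (e) each word slot
            z = rng.choices(range(K), weights=theta)[0]  # (f) hidden topic
            w = rng.choices(range(V), weights=phi[z])[0] # (g) word (term id)
            doc.append((w, z))
        corpus.append(doc)
    return phi, corpus
```

Each document is returned as a list of `(term id, topic)` pairs so that the hidden topic assignments remain visible for inspection.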

For LDA parameter estimation, the conditional probability of the topic sequence given the word sequence is computed first:

$$p(z \mid w) = \frac{p(w, z)}{\sum_{z} p(w, z)} \qquad \text{(Formula 2)}$$

Gibbs sampling is then performed on the topic sequence, using the following sampling formula:

$$p(z_i = k \mid z_{\neg i}, w) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\left[\sum_{v=1}^{V} n_k^{(v)} + \beta_v\right] - 1} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_m^{(z)} + \alpha_z\right] - 1} \qquad \text{(Formula 3)}$$

Here $n_k^{(t)}$ counts the assignments of term $t$ to topic $k$, $n_m^{(k)}$ counts the words of document $m$ assigned to topic $k$, and the subscript $\neg i$ means that the $i$-th word is excluded from the count.

This yields the topic label $z$ of each word $w$, and the final parameters are computed as follows:

$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_m^{(z)} + \alpha_z} \qquad \text{(Formula 4)}$$

Given the trained model $M$ and any new document $\tilde{m}$, the hidden topic of each of its words is sampled as follows:

$$p(\tilde{z}_i = k \mid \tilde{w}_i = t, \tilde{z}_{\neg i}, \tilde{w}_{\neg i}; M) = \frac{n_k^{(t)} + \tilde{n}_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V} n_k^{(v)} + \tilde{n}_{k,\neg i}^{(v)} + \beta_v} \cdot \frac{n_{\tilde{m},\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_{\tilde{m}}^{(z)} + \alpha_z\right] - 1} \qquad \text{(Formula 5)}$$

where $\tilde{z}$ denotes the topic vector corresponding to the new document $\tilde{m}$.

With the Gibbs sampling method above, the topic label of each word is obtained. Formula 6 then gives the document's value on each topic component, so that a document in the term space obtains a representation in the topic space.

$$\theta_{\tilde{m},k} = \frac{n_{\tilde{m}}^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_{\tilde{m}}^{(z)} + \alpha_z} \qquad \text{(Formula 6)}$$
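The training-time sampling step and the document-topic estimate can be sketched as a small collapsed Gibbs sampler. This is an illustrative simplification, not the patent's implementation: it assumes symmetric scalar $\alpha$ and $\beta$, represents documents as lists of integer term ids, and drops the per-document denominator of the sampling formula because it is the same for every candidate topic. The "$-1$" terms are handled by removing the current word's counts before sampling. The function name `gibbs_lda` is hypothetical.

```python
import random


def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = random.Random(seed)
    n_kt = [[0] * V for _ in range(K)]     # topic-term counts
    n_k = [0] * K                          # total words per topic
    n_mk = [[0] * K for _ in docs]         # document-topic counts
    z = []
    for m, doc in enumerate(docs):         # random initial topic assignment
        labels = []
        for t in doc:
            k = rng.randrange(K)
            labels.append(k)
            n_kt[k][t] += 1
            n_k[k] += 1
            n_mk[m][k] += 1
        z.append(labels)
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Exclude word i from the counts (the "not i" subscript).
                n_kt[k][t] -= 1
                n_k[k] -= 1
                n_mk[m][k] -= 1
                weights = [
                    (n_kt[j][t] + beta) / (n_k[j] + V * beta) * (n_mk[m][j] + alpha)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k
                n_kt[k][t] += 1
                n_k[k] += 1
                n_mk[m][k] += 1
    # Document-topic estimate from the final counts.
    theta = [
        [(n_mk[m][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for m, doc in enumerate(docs)
    ]
    return z, theta
```

Each row of `theta` is the topic-space representation of one document; the number of sweeps `iters` is a tuning choice, and no convergence check is included.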

After the steps above, clustering can be carried out. Once LDA has selected a certain proportion of features, the K-means algorithm is used to cluster the texts. The text clustering procedure is as follows:

(1) Preprocess the original text, including word segmentation and stop-word removal;

(2) Use the LDA model for feature selection;

(3) For the selected features, compute the weight of each feature in each text. The weight $W(d, w)$ of a feature in a text is computed as:

$$W(d, w) = \frac{\log(tf(d, w) + 1) \times \log\big((M + 1) / (df(w) + 0.5)\big)}{\sum_{w'} \log(tf(d, w') + 1) \times \log\big((M + 1) / (df(w') + 0.5)\big)} \qquad \text{(Formula 7)}$$

where $M$ is the total number of texts, $tf(d, w)$ is the number of times term $w$ occurs in text $d$, and $df(w)$ is the document frequency of term $w$. Once the representation of each text has been obtained, a vector space model can be generated.

(4) Randomly select the initial points and use the K-means algorithm to obtain the final clustering result. K-means needs a measure of the distance between texts, for which cosine similarity is used. For two texts $d$ and $d'$, the similarity is computed as:

$$sim(d, d') = \frac{\sum_{w \in d, d'} W(d, w) \times W(d', w)}{\lVert d \rVert \times \lVert d' \rVert} \qquad \text{(Formula 8)}$$
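The weighting and similarity formulas used in steps (3) and (4) translate directly into code. The sketch below assumes that texts are already tokenized into lists of terms, that $df(w)$ is the number of texts containing $w$, and that $\lVert d \rVert$ is the Euclidean norm of the weight vector of $d$; the function names are hypothetical, and the K-means loop itself is not shown.

```python
import math
from collections import Counter


def feature_weights(docs):
    """Return one {term: W(d, term)} dictionary per tokenized text."""
    M = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # number of texts containing a term
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {
            w: math.log(tf[w] + 1) * math.log((M + 1) / (df[w] + 0.5))
            for w in tf
        }
        total = sum(raw.values())           # normalizing sum over terms of d
        weighted.append({w: v / total for w, v in raw.items()} if total else {})
    return weighted


def cosine_similarity(a, b):
    """Cosine similarity of two sparse weight vectors stored as dictionaries."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Storing each weight vector as a dictionary keeps only the terms that occur in a text, which matters when the vocabulary is large and most weights are zero.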

4.2 Domain database version control

The domain database version control module applies the theory and methods of version control to manage and control the evolution of the database. Each evolution state of the data can be regarded as a version, and the module provides version creation, version restoration, version deletion, and related functions. Because a domain database is modified and evolves over time, the module records this series of evolution steps. On the one hand, it records the evolution history of specific domain data, so that users can view it at any time and restore given domain data to an earlier version. On the other hand, users can mark the database in a given state as a version, so that the whole database can be restored to that version at some point in the future. When needed, users can delete a non-critical version. The structure is shown in Figure 14.
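The three operations named above (version creation, restoration, and deletion of non-critical versions) can be sketched with a snapshot-based store. The source does not say how versions are stored, so full deep-copied snapshots, integer version ids, and the `critical` flag are all assumptions made for illustration; the class name `VersionStore` is hypothetical.

```python
import copy


class VersionStore:
    """Hypothetical snapshot store: one full copy of the data per version."""

    def __init__(self):
        self._versions = {}      # version id -> (snapshot, critical flag)
        self._next_id = 1

    def create(self, data, critical=False):
        version_id = self._next_id
        self._next_id += 1
        # Deep copy so later changes to the live data do not alter the version.
        self._versions[version_id] = (copy.deepcopy(data), critical)
        return version_id

    def restore(self, version_id):
        snapshot, _ = self._versions[version_id]
        return copy.deepcopy(snapshot)

    def delete(self, version_id):
        _, critical = self._versions[version_id]
        if critical:
            raise ValueError("critical versions cannot be deleted")
        del self._versions[version_id]

    def history(self):
        return sorted(self._versions)
```

Full snapshots are the simplest choice for a sketch; a production system would more likely store differences between versions to save space.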

4.3 New data object discovery

Discovering new data objects requires the user to supply a large amount of basic text data. The system analyzes this data with the LDA-based analysis model described above and builds a large data object analysis library. When new data is input into the data object analysis library, reading of the version information of the associated domain data is triggered; the relationship between the data object and the versions of the associated domain data is then analyzed to compute the likelihood that the current data object is new domain data, and the new domain data is modified, either automatically or manually by the user, and added to the domain database. The core structure of the data object analysis library is a suffix tree. Another important component is the directory monitoring module, which lets the system automatically detect the arrival of new data and then automatically carry out evolution processing, as follows:

(1) On system startup, the configuration file is checked to obtain the data source directory for domain data evolution.

(2) The directory monitor (AutoDectector) is started to watch for state changes in the data source directory. When a new file is added, the monitor detects the change and checks the file format; if it is a text file, PDF file, HTML file, or Word document, the file is read and analyzed.

(3) Different file analyzers are implemented for different input file types: TxtAnalyzer, PdfAnalyzer, HtmlAnalyzer, and WordAnalyzer. PdfAnalyzer and WordAnalyzer are implemented with the open-source tool Apache POI. The file analyzer yields text or a text stream (a text stream is returned when the data volume is very large).

(4) The obtained text or text stream is input into the analysis library, that is, inserted into the current latest suffix tree. A change in the suffix tree triggers a check of the affected terms: once the frequency of a term reaches the threshold t, the domain data related to that term is queried from the data resource ontology library, and the query result determines whether new basic domain data is constructed.

(5) After a file has been analyzed, it is renamed to end with ".analyzed" to distinguish it from unanalyzed files. The metadata file in the data source directory is then checked, and if the number or total size of analyzed files has reached the upper limit, some of the earliest analyzed files are deleted.
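A single pass of the directory-monitoring procedure in steps (2) through (5) might look like the sketch below. Several points are assumptions rather than the source's design: the directory is polled instead of watched through operating-system notifications, file types are recognized by extension, the analyzed-file count comes from a directory listing rather than a metadata file, and the oldest files are chosen by modification time as a stand-in for "earliest analyzed". The names `scan_once`, `SUPPORTED`, and `max_analyzed` are hypothetical, and `analyze` is a caller-supplied callback standing in for the file analyzers.

```python
import os

# Extensions assumed to correspond to text, PDF, HTML, and Word files.
SUPPORTED = {".txt", ".pdf", ".html", ".htm", ".doc", ".docx"}


def scan_once(source_dir, analyze, max_analyzed=100):
    """Analyze new supported files, mark them, and prune old analyzed files."""
    processed = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        extension = os.path.splitext(name)[1].lower()
        if not os.path.isfile(path) or extension not in SUPPORTED:
            continue                       # skips ".analyzed" files as well
        analyze(path)                      # hand the file to an analyzer
        os.rename(path, path + ".analyzed")
        processed.append(name)
    analyzed = [
        os.path.join(source_dir, name)
        for name in os.listdir(source_dir)
        if name.endswith(".analyzed")
    ]
    analyzed.sort(key=os.path.getmtime)    # oldest modification time first
    excess = max(0, len(analyzed) - max_analyzed)
    for old_path in analyzed[:excess]:
        os.remove(old_path)
    return processed
```

Calling `scan_once` periodically approximates the monitor; because analyzed files are renamed, a second pass over an unchanged directory processes nothing.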

Content not described in detail in this specification belongs to the prior art known to those skilled in the art.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make further improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (6)

1. An extensible database system for multi-type domain data coordination management, comprising: a data resource ontology library module, a network type domain database module, a hierarchical domain database module, and a domain data evolution module, wherein:
the data resource ontology library module is used for defining a top-level data resource model, realizing logic view design and storage structure design of basic data units, providing data storage and access basic support capacity and establishing a database containing a large number of business data objects, relations and concepts; the data resource ontology library module provides top-level data abstraction rules and data access rules for the network type domain database module and the hierarchical type domain database module;
the network type domain database module is used for constructing, on the basis of the data resource ontology library, a database based on network type data attributes according to the attributes of the data objects, their relational network, and other special attributes, realizing the data structure design, storage design, and index design of network type data objects, forming a relational network containing a large number of network type data objects, and providing a network type database access interface to the outside; the network type domain database module inherits the data resource ontology library and instantiates it on network type domain data, and provides a query interface based on network type domain data for users, other modules, and external systems;
the hierarchical domain database module is used for constructing, according to the membership, adjacency, intersection, and co-reference relationships among hierarchical data objects, a database dedicated to representing data objects and information on their hierarchical membership, and providing an access interface to the data objects and their hierarchical membership database to the outside; the hierarchical domain database module is a further evolution of the network type domain database module: it stores and organizes data that has only a hierarchical structure in tree form, realizes hierarchical semantics, and provides a query interface based on hierarchical domain data for users, other modules, and external systems;
the domain data evolution module tracks and controls the change of the domain data in the data resource body base, the network type domain database and the hierarchical domain database in the using process of the user, establishes data version history, analyzes an original data set provided by the user in combination with the existing data to obtain new domain data and inputs the new domain data into the domain database through screening, provides record-based data version control for the data resource body base module, the network type domain database module and the hierarchical domain database module, automatically discovers the new domain data from the original data input by the user, and uses an interface thereof to carry out corresponding evolution management.
2. The extensible database system for multi-type domain data coordination management according to claim 1, wherein: the data resource ontology library module comprises a data persistence module, a bottom database establishing module, a relation defining module, a data index module, and an interface module;
the data persistence module defines an interface-oriented implementation method and flexibly configures data persistence according to the hardware environment, the context environment, and other requirements; based on object serialization, it defines serialization and deserialization protocols for domain-data-related objects, and during data persistence it outputs the binary stream obtained by serializing an object to a file, a database, or a network location through a file organization protocol; when an object that is not in the object buffer pool needs to be loaded, the corresponding data stream is read according to the logical address information sent with the upper-layer request, and the object is reconstructed through its deserialization protocol; data in the file is logically organized in blocks, and the blocks are managed with a heap structure; the data persistence module of the data resource ontology library is also the data persistence abstraction for the network type domain database and the hierarchical domain database, whose data storage functions are customized and extended from this persistence module according to different persistence protocols to form a persistence library for a specific data type;
the bottom database establishing module is used for storing data objects without extended attributes or relations, and for establishing the basic domain data object, the serialization protocol, the deserialization protocol, and the storage manager; the single data form of the bottom database provides the data basis for defining network type domain data and hierarchical domain data; the network type domain database module and the hierarchical domain database module implement the defined serialization and deserialization interfaces;
the relation definition module is used for establishing a synonymy relation, an antisense relation and a membership relation for the entries of the bottom database by using a new file on the basis of the realization of the bottom database; the highly abstract and generalized relation definition, organization, storage and management enable the network type domain database to realize flexible extension on the basis;
the data index module is used for defining a summary of each domain data object and mapping that summary to the logical storage information of the domain data object through a fast double-encoding algorithm, so as to achieve fast retrieval and access control; the network type domain database module and the hierarchical domain database module both contain an index part, in which each key is a pair of long integers obtained by the double-encoding calculation;
the interface module is realized based on EJB3.0 standard, and is published in the form of EJB interface and Web Service interface to realize cross-platform Service, and the network type domain database and the hierarchical type domain database realize the customized interface publishing function by inheriting the data resource ontology interface module.
3. The extensible database system for multi-type domain data coordination management according to claim 1, wherein: the network type domain database module is realized by the following steps:
(1) defining a domain data object and a persistence protocol related to the network type domain data on the basis of a storage management layer of the data resource ontology library in storage design;
(2) performing storage design on the basis of a defined network type field data object, and firstly defining a basic structure and a process of an attribute part; dividing an attribute part into two parts, wherein one part is an existing attribute during database design and is called a basic attribute; the other part is a user-defined attribute which is called an extended attribute;
(3) establishing a data index on a network type domain data object storage structure, dynamically generating a B tree when inserting a network type data object based on the B tree with a fast buffer and a Bloom Filter, and not limiting the maximum layer number of the B tree, connecting attribute blocks together by using pointers to form an attribute block linked list aiming at the condition that the network type data object has the same keyword, and quickly obtaining a network type data object list with the same keyword when inquiring the network type data object by using the keyword;
(4) after the data index is realized, the data record is updated according to the updating of the data record, and the data update is recorded in the finest granularity through the check point and the log file, so that the high-efficiency access and the high fault tolerance of the system are guaranteed.
4. The extensible database system for multi-type domain data coordination management according to claim 1, wherein: the hierarchical domain database module is realized by the following steps:
(1) defining a domain data object and a persistence protocol related to a hierarchical domain database on the basis of a storage management layer of the data resource ontology base in storage design;
(2) on the basis of the extension of the hierarchical domain data object and the persistent protocol based on the hierarchical domain data structure, the storage establishment of the hierarchical structure is carried out;
(3) after the storage of the hierarchical structure has been established, organizing the relationship structure between hierarchical data objects through a defined binary tree; the keys in the index file are the unique double-encoding number pairs of the data objects, so collisions do not need to be considered; in the attribute file, the hierarchical data of the parent and child data objects is stored using number pairs; during retrieval, the number pair of the data object is calculated and matched against the index file to obtain the pointer to the corresponding attributes, which are then read out, and if there are several child data objects, all lower-level data objects can be found by following each attribute's pointer to the next attribute;
(4) after the index is finished, a simplified log file suitable for a hierarchical structure is constructed on the basis of the functions of the check point and the log file of the network type domain database module.
5. The scalable multi-domain data orchestration management database system according to claim 1, wherein the domain data evolution module is implemented by the following steps:
(1) first collecting the user activity records of the various domain databases and monitoring changes in the activity level of the domain data objects;
(2) analyzing the collected activity-change data of the data objects, and placing domain data objects whose activity is below a system threshold into a guard backup library;
(3) further analyzing the user activity records, and establishing a version change record of the data object for each domain data object whose core attributes have changed;
(4) the system analyzes text data provided by users or found on the Internet to construct a large data analysis library; when new data is entered into the data analysis library, reading of the version information of the associated data objects is triggered; the probability that the current data object is new domain data is then calculated by analyzing the relationship between the data object and the versions of its associated data objects; the new domain data is added to the corresponding domain database automatically or after manual modification by a user.
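Steps (1) to (3) can be pictured with the minimal sketch below. It assumes that activity is a simple access counter and that a version record is a snapshot of the attributes taken before a core attribute changes; the claim does not fix either choice. The probability calculation of step (4) is not modeled, and the names DomainObject, EvolutionMonitor, record_access, update and sweep are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DomainObject:
    name: str
    attributes: dict
    activity: int = 0                             # assumed metric: access count
    versions: list = field(default_factory=list)  # snapshots of older versions


class EvolutionMonitor:
    def __init__(self, core_attributes, activity_threshold):
        self.core_attributes = set(core_attributes)
        self.activity_threshold = activity_threshold
        self.active = {}  # objects in the live domain database
        self.backup = {}  # objects moved to the backup library

    def register(self, obj):
        self.active[obj.name] = obj

    def record_access(self, name):
        # Step (1): collect user activity for a domain data object.
        self.active[name].activity += 1

    def update(self, name, changes):
        # Step (3): keep a version record only when a core attribute changes.
        obj = self.active[name]
        core_changed = any(
            key in self.core_attributes and obj.attributes.get(key) != value
            for key, value in changes.items()
        )
        if core_changed:
            obj.versions.append(dict(obj.attributes))
        obj.attributes.update(changes)
        return core_changed

    def sweep(self):
        # Step (2): move objects below the activity threshold to the backup.
        moved = [
            name for name, obj in self.active.items()
            if obj.activity < self.activity_threshold
        ]
        for name in moved:
            self.backup[name] = self.active.pop(name)
        return moved
```

A real system would presumably measure activity over a time window rather than as a lifetime counter; the counter keeps the sketch short.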
6. An extensible database management method for coordinated management of multi-type domain data, characterized by comprising the following implementation steps:
(1) preprocessing text data provided by a user by removing non-core domain data, including stop words, modal particles and punctuation marks, to obtain preprocessed text data;
(2) inputting the preprocessed data output by step (1) into an LDA (latent Dirichlet allocation) probability model, and matching it against the established data model to obtain domain-related data objects;
(3) constructing a suffix tree over the domain-related data objects output by step (2), merging it with the existing suffix tree, and traversing the merged suffix tree step by step to obtain high-frequency character strings and initialize a domain-related data object;
(4) inputting the domain-related data object obtained in step (3) into the data resource ontology base for type and relationship judgment and matching, to obtain the type of the domain-related data object (hierarchical, network, or user-defined) and the other domain data objects related to it;
(5) inputting the domain-related data object and the associated data output by step (4) into the domain database of the corresponding type, establishing a data change log record, and inputting the domain-related data object into the double-coding algorithm to obtain the corresponding index pair;
(6) performing service composition on the pair obtained in step (5), the data object and the related domain data objects, and finally outputting a domain data object that carries domain relevance, multiple relationships and multiple attributes.
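The text-processing front end of steps (1) and (3) can be approximated as below. This is a simplified stand-in, not the claimed method: stop-word filtering is followed by plain n-gram counting instead of building and merging suffix trees, and the LDA matching of step (2) is skipped entirely. The stop-word list and the names preprocess and frequent_phrases are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be domain and language specific.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "oh", "well"}


def preprocess(text):
    """Lowercase, drop punctuation, and remove stop words and filler words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def frequent_phrases(documents, max_len=3, min_count=2):
    """Count word n-grams across documents and keep the frequent ones.

    Stand-in for traversing a merged suffix tree to find high-frequency strings.
    """
    counts = Counter()
    for document in documents:
        tokens = preprocess(document)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}
```

Applied to the two sentences "The water quality of the river is poor." and "Well, water quality in the lake is good.", the function keeps "water", "quality" and "water quality", each with a count of 2. A suffix tree gives the same kind of answer without enumerating every n-gram, which matters once the text collection grows.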
CN201310343157.XA 2013-08-08 2013-08-08 The Database Systems of a kind of extendible polymorphic type FIELD Data coordinated management and management method Active CN103412917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310343157.XA CN103412917B (en) 2013-08-08 2013-08-08 The Database Systems of a kind of extendible polymorphic type FIELD Data coordinated management and management method

Publications (2)

Publication Number Publication Date
CN103412917A CN103412917A (en) 2013-11-27
CN103412917B CN103412917B (en) 2016-08-10

Family

ID=49605929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310343157.XA Active CN103412917B (en) 2013-08-08 2013-08-08 The Database Systems of a kind of extendible polymorphic type FIELD Data coordinated management and management method

Country Status (1)

Country Link
CN (1) CN103412917B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183735B (en) * 2014-06-18 2019-02-19 阿里巴巴集团控股有限公司 The querying method and inquiry unit of data
CN105354266A (en) * 2015-10-23 2016-02-24 北京航空航天大学 Rich graph model RichGraph based graph data management method
US9952931B2 (en) * 2016-01-19 2018-04-24 Microsoft Technology Licensing, Llc Versioned records management using restart era
CN106326457B (en) * 2016-08-29 2019-04-30 山大地纬软件股份有限公司 The construction method and system of people society personnel file pouch database based on big data
CN106569941B (en) * 2016-11-04 2019-01-01 金蝶软件(中国)有限公司 The method and apparatus for recording data course
CN106682173B (en) * 2016-12-28 2019-10-18 华南理工大学 A social security big data OLAP preprocessing method and online analysis query method
CN107133283A (en) * 2017-04-17 2017-09-05 北京科技大学 A kind of Legal ontology knowledge base method for auto constructing
CN109254962B (en) * 2017-07-06 2020-10-16 中国移动通信集团浙江有限公司 A T-tree-based index optimization method, device and storage medium
CN110019474B (en) * 2017-12-19 2022-03-04 北京金山云网络技术有限公司 Automatic synonymy data association method and device in heterogeneous database and electronic equipment
CN108182265B (en) * 2018-01-09 2021-06-29 清华大学 Multi-layer iterative screening method and device for relational network
CN109446175A (en) * 2018-11-12 2019-03-08 郑州云海信息技术有限公司 A kind of method and apparatus for the log object constructing key operation
CN110569327A (en) * 2019-07-08 2019-12-13 电子科技大学 A Multi-Keyword Ciphertext Retrieval Method Supporting Dynamic Update
CN110851848B (en) * 2019-11-12 2022-03-25 广西师范大学 Privacy protection method for symmetric searchable encryption
CN111192176B (en) * 2019-12-30 2023-04-28 华中师范大学 An online data acquisition method and device supporting educational informatization evaluation
CN111897824B (en) * 2020-03-25 2024-09-17 上海云砺信息科技有限公司 Data operation method, device, equipment and storage medium
CN111767332B (en) * 2020-06-12 2021-07-30 上海森亿医疗科技有限公司 Data integration method, system and terminal for heterogeneous data sources
CN111858607B (en) * 2020-07-24 2024-10-25 北京金山云网络技术有限公司 Data processing method, device, electronic equipment and computer readable medium
CN112597348A (en) * 2020-12-15 2021-04-02 电子科技大学中山学院 Method and device for optimizing big data storage
CN112990601B (en) * 2021-04-09 2023-10-31 重庆大学 Worm gear machining accuracy self-healing system and method based on data mining
KR102392880B1 (en) * 2021-09-06 2022-05-02 (주) 바우디움 Method for managing hierarchical documents and apparatus using the same
CN113963794A (en) * 2021-10-27 2022-01-21 西安交通大学医学院第二附属医院 A hierarchical medical data processing and exchange system and method based on resource object modeling
CN115048344B (en) * 2022-08-16 2022-11-04 安格利(成都)仪器设备有限公司 Storage method for three-dimensional contour and image data of inner wall and outer wall of pipeline or container
CN115543960B (en) * 2022-09-16 2024-01-05 北京神舟航天软件技术股份有限公司 Dynamic modeling method and system for business object

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5724577A (en) * 1995-06-07 1998-03-03 Lockheed Martin Corporation Method for operating a computer which searches a relational database organizer using a hierarchical database outline
CN102110165A (en) * 2011-02-28 2011-06-29 深圳市五巨科技有限公司 Method and system for scheduling interior of browser of mobile terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
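Middleware semantic cache replacement strategy driven by user access characteristics; Chen Ningjiang et al.; Journal of Guangxi University (Natural Science Edition); 20101031; Vol. 35, No. 5; pp. 787-792 *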

Also Published As

Publication number Publication date
CN103412917A (en) 2013-11-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191017

Address after: No. 1089, building a, No. 19, Guokai Avenue, Nanning City, 530000 Guangxi Zhuang Autonomous Region

Patentee after: Nanning Super Cube Science And Technology Co Ltd

Address before: 530004 No. 100, University Road, the Guangxi Zhuang Autonomous Region, Nanning

Patentee before: Guangxi University
