CN102169507B - Implementation method of distributed real-time search engine - Google Patents
- Publication number
- CN102169507B
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of search engines, and specifically to a distributed real-time search engine. The method for constructing and operating the search engine system comprises at least the following steps: A, designing the functional structure of the system; B, designing the data index structure of the system; C, creating an index; D, updating the index; and E, searching the index. The distributed real-time search engine builds an update index and a merge index simultaneously in system memory and accesses both when searching. When the number of documents in the update index reaches a threshold, the update index is committed to the disk index and becomes the merge index, and the original merge index becomes the new update index. Updated data can therefore be searched, which improves the timeliness of the data retrieved by the search engine.
Description
Technical field
The present invention relates to the field of search engine technology, and in particular to an implementation method of a distributed real-time search engine.
Background art
With the arrival of the knowledge economy era, information on the Internet is growing explosively. What people face today is not a lack of information but a flood of it, with no way to filter it. Obtaining the needed information accurately, quickly, and in a timely manner is therefore the problem that search engines need to solve.
A search engine is a system that uses specific computer programs to collect information from a particular network, such as the Internet, according to a certain strategy; organizes and processes that information; provides a retrieval service to users; and presents the information relevant to a user's query to that user.
Traditional search engines, such as Google, Baidu, and Yahoo, process huge volumes of data, reaching the TB level. Their data, however, comes mainly from conventional websites such as portals, forums, and e-government sites. These sites are updated infrequently and each update is small, so their information processing places no high real-time demands on the search engine.
With the rise of social media such as microblogs and social networking sites, the "micro-information" created by Internet users has emerged in large quantities, producing massive real-time data. In addition, with the rapid development of enterprise mobile applications, such as mobile CRM systems and handheld terminals, users place higher demands on query speed and on the timeliness of information, and traditional search engines cannot meet the demands of processing massive real-time data and of real-time search. Massive real-time data is characterized by a high update frequency, a large volume per update, and a large accumulated volume, which usually reaches hundreds of GB and may reach the TB or PB level. A real-time search engine places very high demands on the timeliness of massive data processing and of query response. When the data volume reaches the TB level, there is a sharp conflict between the data update frequency and the query response speed: when both the accumulated data and the updated data are large, building and maintaining the index takes a long time, so real-time behavior cannot be guaranteed. That is, when an existing search engine scheme uses such an incremental indexing mechanism, index construction and retrieval are carried out separately. The index construction logic submits a new segment to the index shard, for use by the index retrieval logic, only after the number of documents accumulated in the new segment reaches a threshold (such as 10000) or the elapsed interval reaches a threshold (such as 5 minutes). Consequently, there is a delay between the submission of a document and the time at which it can be retrieved, usually ranging from a few minutes to tens of minutes, and in real-time retrieval such a long delay is intolerable.
Summary of the invention
To address the deficiencies of the prior art, the present invention proposes a distributed real-time search engine that overcomes the conflict between the incremental indexing mechanism and index timeliness, and that is realized through the cooperation of an in-memory update index, an in-memory merge index, and a disk index.
The technical solution used in the present invention is as follows:
An implementation method of a distributed real-time search engine, whose system construction and operation comprise at least the following steps:
A. Designing the functional structure of the system. The functional structure is created in a cluster system based on Master/Slave and comprises the following functional nodes: a central control node, index data storage nodes, and external service nodes. The central control node is created in the Master system, and the index data storage nodes and external service nodes are created in the Slave system. The central control node stores and maintains the attribute information of the indexes in the data index structure and the attribute information of the index data storage nodes. The index data storage nodes create, update, and retrieve the index shards in the data index structure. The external service nodes receive index creation, update, and retrieval requests and forward them to the central control node for processing;
B. Designing the data index structure of the system. The index structure is a tree hierarchy consisting, from top to bottom, of: index, index shard, segment, document, and field. A system may contain multiple indexes. An index shard is a data block obtained by partitioning an index, and each index shard belonging to the same index is stored on an index data storage node. An index shard consists of one or more segments, and a segment consists of one or more documents; the documents in a segment may be of different data object types. Each document has a key that uniquely identifies it across the whole system, and the structure of a document comprises fields that describe the document type;
C. Creating an index, comprising the following steps:
C1. After an external service node receives an index creation request, it forwards the request to the central control node. The central control node parses the request, extracts the attribute information of the index to be created, and verifies whether this attribute information is complete and valid. If it is complete and valid, processing continues with step C2; if it is incomplete or invalid, a failure reply is sent to the external service node;
C2. The central control node divides the index to be created into several shards according to the shard count in the attribute information obtained in step C1. At the same time, based on the attribute information of the index data storage nodes held in the central control node, it judges the state and load of each index data storage node, decides on that basis on which index data storage node each index shard is to be created and stored, and then sends the attribute information of the index to be created to each corresponding index data storage node. Each index data storage node, using the attribute information it receives, builds locally the index shards of the index to be created that the central control node assigned to it. If an index data storage node fails to create an index shard, the central control node assigns the shard to another index data storage node that is in good condition and relatively lightly loaded. This continues until all index shards of the index to be created have been created on index data storage nodes, or creation has failed, and processing then continues with step C3;
C3. If all index shards of the index to be created were created on index data storage nodes in step C2, the central control node updates the index data storage node attribute information that it stores and sends a response indicating successful index shard creation to the external service node. If creation of the index shards failed in step C2, it sends a reply indicating index creation failure to the external service node;
D. Updating the index, comprising the following steps:
D1. After an external service node receives an index update request, it forwards the request to the central control node. Using the index attribute information and the index data storage node attribute information that it stores, the central control node sends the request to the index data storage node that holds the relevant index shard of the index;
D2. On receiving the index update request, the index data storage node stores the updated document in a new segment of the index shard of the index to be updated. If the updated document is stored successfully, the old document corresponding to the updated document is marked as deleted in the new segment, and an update success message is returned to the central control node. If storing the updated document fails, an update failure message is returned to the central control node. Finally, the central control node sends the update success or failure message to the external service node;
The index update of step D also comprises a document deletion step: when the index update request contains only a delete-document command, the document is marked as deleted in a new segment on the shard, held by the index data storage node, where the document to be deleted resides;
The index update of step D also comprises a step of building a real-time index: an update index and a merge index are built simultaneously in system memory, and index retrieval accesses both the update index and the merge index. During an index update, the index being updated is the update index. When the number of documents in the update index reaches a threshold, or the update time of the update index reaches a threshold, the system commits the update index to the disk index; the update index then becomes the merge index and, at the same time, the former merge index becomes the update index;
E. Retrieving from the index, comprising the following steps:
E1. After an external service node receives an index retrieval request, it sends the request to the central control node. The central control node parses the request and determines the target index. Then, using the index data storage node attribute information and the attribute information of the target index, it looks up all index shards of the target index and dispatches the retrieval request to the index data storage nodes that store each shard;
E2. Each index data storage node retrieves the relevant documents from the corresponding index shards it stores, according to the retrieval request received, sorts the retrieval results, and sends them to the external service node;
E3. The external service node integrates and sorts the retrieval results received from the index data storage nodes and sends them to the client.
Further, the functional structure of the system in step A also comprises a standby central control node. The central control node synchronously backs up the data it stores to the standby central control node in real time. When the central control node fails, the standby central control node becomes the central control node; when the former central control node recovers from the failure, it becomes the new standby central control node.
Further, the index data storage nodes and external service nodes periodically send heartbeat signals describing their status to the central control node. If the central control node does not receive a heartbeat signal within a preset time, it marks the index data storage node or external service node as dead. For every index shard stored on an index data storage node marked as dead, the central control node then copies a replica of that shard from another index data storage node that stores one to some other index data storage node that does not yet store a replica of that shard, so that the replica count of the index shard remains unchanged and the index shard is available at all times.
Further, the heartbeat signal that an index data storage node sends to the central control node contains the load information of that node. During index creation, the central control node assigns index shards, as far as possible, to lightly loaded index data storage nodes. Likewise, during index retrieval, the central control node submits the retrieval request, as far as possible, to the lightly loaded index data storage node that holds the index shard or a replica of it.
Further, the index data storage node attribute information comprises: the node ID, node name, node type, node state, node load, and node location. The index attribute information comprises: the index name, the structure definition of the documents in the index, the shard count of the index, the replica count of the index shards, and the IDs of the storage nodes of the index shards and index shard replicas.
Further, in the data index structure of the system in step B, each index shard also has multiple index shard replicas. An index shard replica is created during the index creation of step C, is updated asynchronously after the original index shard is updated during the index update of step D, and is stored on a different index data storage node from the original index shard. The index data storage node holding the original index shard is responsible for processing update requests for that shard; once the original index shard has been updated, that node asynchronously sends the update request to the index data storage nodes holding the corresponding replicas, which then update the replicas. Index shard replicas and the corresponding original index shards all support index retrieval. Based on the load of the index data storage nodes holding the original index shard and its replicas, the central control node submits an index retrieval request to the lightly loaded index data storage node among them.
Further, the central control node regularly checks the number of index shard replicas of every index. When the number of replicas of an index shard falls below the preset number, the system automatically copies a replica of that shard to another data node. When the index data storage node storing an original index shard fails, the system selects one of the corresponding index shard replicas to take over the index update work of the original index shard; this replica becomes the new original index shard, and a new index shard replica is then generated on another index data storage node so that the replica count of the shard remains unchanged. When an index data storage node storing an index shard replica fails, the system generates a replica identical to the original index shard on another index data storage node, so that the replica count of the shard remains unchanged.
Further, the index shards and index shard replicas of the same index are created and stored on index data storage nodes according to the following policy: using the node load information in the attribute information of the index data storage nodes, the central control node assigns the index shards and index shard replicas to the most lightly loaded index data storage nodes. When the number of available index data storage nodes is less than the number of index shards, the central control node assigns multiple index shards to the same index data storage node and does not allocate replicas of the index shards. When the number of available index data storage nodes is greater than the number of index shards, the central control node assigns replicas of some or all of the index shards to the remaining index data storage nodes.
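The allocation policy above can be illustrated with a minimal sketch. It is an assumption-laden simplification rather than the patented procedure: the function name `allocate`, the round-robin assignment over nodes sorted by load, and the limit of one replica per remaining node are all hypothetical choices made for the example.

```python
from typing import Dict, Tuple


def allocate(shard_count: int, node_loads: Dict[str, int]) -> Tuple[Dict[int, str], Dict[int, str]]:
    """Return (primaries, replicas), each mapping shard number -> node ID.

    node_loads maps node ID -> current load (hypothetical unit).
    """
    # Lightest nodes first; ties are broken by node ID for determinism.
    nodes = sorted(node_loads, key=lambda n: (node_loads[n], n))
    # With fewer nodes than shards, round-robin puts several shards on one node.
    primaries = {s: nodes[s % len(nodes)] for s in range(shard_count)}
    # Only nodes left over after every shard has a home receive replicas,
    # so no replicas are allocated when nodes are scarce.
    remaining = nodes[shard_count:]
    replicas = dict(zip(range(shard_count), remaining))
    return primaries, replicas
```

With three shards and two nodes, the sketch places two shards on the lighter node and allocates no replicas; with two shards and three nodes, the third node receives a replica of shard 0.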
Further, the index update of step D also comprises a segment merging step: when the number of segments in an index shard of the index being updated reaches a threshold, or the time since the last index merge reaches a threshold, the index data storage node holding the index shard reads the documents in several smaller segments, stores them in one new segment, and then physically deletes those smaller segments.
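A minimal sketch of the count-triggered case of segment merging follows. Segments are modelled as plain lists of documents, and the names `merge_small_segments`, `segment_threshold`, and `merge_count` are hypothetical; the time-based trigger and on-disk deletion are omitted.

```python
from typing import List


def merge_small_segments(segments: List[list], segment_threshold: int = 4,
                         merge_count: int = 3) -> List[list]:
    """Merge the `merge_count` smallest segments once the shard holds
    at least `segment_threshold` segments; otherwise return them unchanged."""
    if len(segments) < segment_threshold:
        return segments
    # Indices of the smallest segments (stable sort keeps original order on ties).
    by_size = sorted(range(len(segments)), key=lambda i: len(segments[i]))
    small = set(by_size[:merge_count])
    # Read the documents of the small segments into one new segment.
    merged = [doc for i in sorted(small) for doc in segments[i]]
    # Dropping the small segments from the list stands in for physical deletion.
    kept = [seg for i, seg in enumerate(segments) if i not in small]
    return kept + [merged]
```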
Further, in step D the updated document is stored on an index shard as follows: the hash value of the key of the updated document is computed, this hash value is taken modulo the shard count of the index to which the document belongs, and the document is then assigned to, and stored on, the index shard whose number equals the result.
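The hash-and-modulo routing can be sketched in a few lines. The text does not name a hash function, so the use of SHA-256 here is an assumption, as is the function name `shard_for_key`.

```python
import hashlib


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a document key to a shard number in [0, shard_count)."""
    # SHA-256 is an assumed choice; any stable hash of the key would do.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count
```

Because the mapping depends only on the key and the shard count, an update to an existing document is routed to the same shard as the old version, which is what allows the old document to be marked as deleted locally.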
Further, the different data object types of the documents in step B comprise: text data objects, image data objects, audio data objects, video data objects, and executable program data objects. The attribute information of each data object type is stored in the field structure of the document.
By adopting the above technical solution, the present invention has the following beneficial effects:
1. An update index and a merge index are built simultaneously in system memory, and both are accessed during index retrieval. When the number of documents in the update index reaches a threshold, the update index is committed to the disk index and becomes the merge index, and the original merge index becomes the new update index. This ensures that updated data can also be retrieved and improves the timeliness of the data retrievable by the search engine;
2. The central control node, standby central control node, external service nodes, and index data storage nodes of the system are created in a cluster system based on Master/Slave, which is highly fault tolerant, suitable for deployment on inexpensive machines, and able to provide high-throughput data access;
3. Creating index shard replicas for the index shards stored on the index data storage nodes strengthens the fault tolerance of the system.
Description of drawings
Fig. 1 is a schematic diagram of the functional structure of one embodiment of the present invention.
Fig. 2 is a schematic diagram of the data index structure of the present invention.
Fig. 3 is a schematic diagram of an embodiment of the storage policy for index shards and index shard replicas of the present invention.
Embodiment
The present invention is now further described with reference to the accompanying drawings and embodiments.
An implementation method of a distributed real-time search engine, whose system construction and operation consist of the following steps:
Step A: Designing the functional structure of the system, as shown in Fig. 1. The functional structure is created in a cluster system based on Master/Slave and comprises the following functional nodes: a central control node, index data storage nodes, and external service nodes. The central control node is created in the Master system, and the index data storage nodes and external service nodes are created in the Slave system. The central control node is the master node of the system; it stores and maintains the attribute information of the indexes in the data index structure and the attribute information of the index data storage nodes. The index data storage nodes are the data nodes of the system; they create, update, and retrieve at the index shard layer of the data index structure. The external service nodes are the client nodes of the system; they receive index creation, update, and retrieval requests and forward them to the central control node for processing;
Step B: Designing the data index structure of the system, as shown in Fig. 2. The index structure is a tree hierarchy consisting, from top to bottom, of: index, index shard, segment, document, and field. A system may contain multiple indexes. An index shard is a data block obtained by partitioning an index, and each index shard belonging to the same index is stored on an index data storage node. An index shard consists of one or more segments, and a segment consists of one or more documents; the documents in a segment may be of different data object types. Each document has a key that uniquely identifies it across the whole system, and the structure of a document comprises fields that describe the different attributes of the document. An index is a collection of data objects of several kinds for which retrieval is supported, and the index shards are distributed across the index data storage nodes of the system, which improves the retrieval efficiency of the system;
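The index, index shard, segment, document, and field hierarchy of step B can be pictured with a minimal data-model sketch. The class and attribute names are hypothetical, and fields are simplified to string key-value pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    key: str                 # unique across the whole system
    fields: Dict[str, str]   # field name -> value, one entry per attribute
    deleted: bool = False    # set when the document is marked as deleted


@dataclass
class Segment:
    documents: List[Document] = field(default_factory=list)


@dataclass
class IndexShard:
    shard_id: int
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Index:
    name: str
    shards: List[IndexShard] = field(default_factory=list)


def live_document_count(index: Index) -> int:
    """Count documents not marked as deleted, walking the tree top-down."""
    return sum(
        1
        for shard in index.shards
        for segment in shard.segments
        for doc in segment.documents
        if not doc.deleted
    )
```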
Step C: Creating an index, consisting of the following steps:
C1. After an external service node receives an index creation request, it forwards the request to the central control node. The central control node parses the request, extracts the attribute information of the index to be created, and verifies whether this attribute information is complete and valid. If it is complete and valid, processing continues with step C2; if it is incomplete or invalid, a failure reply is sent to the external service node;
C2. The central control node divides the index to be created into several shards according to the shard count in the attribute information obtained in step C1. At the same time, based on the attribute information of the index data storage nodes held in the central control node, it judges the state and load of each index data storage node, decides on that basis on which index data storage node each index shard is to be created and stored, and then sends the attribute information of the index to be created to each corresponding index data storage node. Each index data storage node, using the attribute information it receives, builds locally the index shards of the index to be created that the central control node assigned to it. If an index data storage node fails to create an index shard, the central control node assigns the shard to another index data storage node that is in good condition and relatively lightly loaded. This continues until all index shards of the index to be created have been created on index data storage nodes, or creation has failed, and processing then continues with step C3;
C3. If all index shards of the index to be created were created on index data storage nodes in step C2, the central control node updates the index data storage node attribute information that it stores and sends a response indicating successful index shard creation to the external service node. If creation of the index shards failed in step C2, it sends a reply indicating index creation failure to the external service node;
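Steps C1 to C3 can be sketched as a single function on the central control node. This is only an illustration under stated assumptions: the attribute names in `REQUIRED_ATTRS`, the node record layout, and the `try_create` callback that stands in for the remote shard-creation call are all hypothetical.

```python
from typing import Callable, Dict, Optional

REQUIRED_ATTRS = ("name", "schema", "shard_count")  # hypothetical attribute names


def is_valid(attrs: dict) -> bool:
    """C1: check that the attribute information is complete and valid."""
    if not all(k in attrs for k in REQUIRED_ATTRS):
        return False
    return isinstance(attrs["shard_count"], int) and attrs["shard_count"] > 0


def create_index(attrs: dict, nodes: Dict[str, dict],
                 try_create: Callable[[str, int], bool]) -> Optional[Dict[int, str]]:
    """Return shard number -> node ID on success, or None on failure.

    nodes maps node ID -> {"alive": bool, "load": int}.
    try_create(node_id, shard_id) reports whether the node built the shard.
    """
    if not is_valid(attrs):
        return None  # C1: reply with failure
    # C2: consider only nodes in good condition, lightest load first.
    loads = {n: info["load"] for n, info in nodes.items() if info["alive"]}
    placement: Dict[int, str] = {}
    for shard_id in range(attrs["shard_count"]):
        for node_id in sorted(loads, key=lambda n: (loads[n], n)):
            if try_create(node_id, shard_id):
                placement[shard_id] = node_id
                loads[node_id] += 1  # account for the new shard
                break
        else:
            return None  # C3: no node could create this shard
    return placement  # C3: caller updates metadata and replies with success
```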
Step D: Updating the index, consisting of the following steps:
D1. After an external service node receives an index update request, it forwards the request to the central control node. Using the index attribute information and the index data storage node attribute information that it stores, the central control node sends the request to the index data storage node that holds the relevant index shard of the index;
D2. On receiving the index update request, the index data storage node stores the updated document in a new segment of the index shard of the index to be updated. If the updated document is stored successfully, the old document corresponding to the updated document is marked as deleted in the new segment, and an update success message is returned to the central control node. If storing the updated document fails, an update failure message is returned to the central control node. Finally, the central control node sends the update success or failure message to the external service node;
The index update of step D also comprises a document deletion step: when the index update request contains only a delete-document command, the document is marked as deleted in a new segment on the shard, held by the index data storage node, where the document to be deleted resides;
The index update of step D also comprises a step of building a real-time index: an update index and a merge index are built simultaneously in system memory, and index retrieval accesses both the update index and the merge index. During an index update, the index being updated is the update index. When the number of documents in the update index reaches a threshold, or the update time of the update index reaches a threshold, the system commits the update index to the disk index; the update index then becomes the merge index and, at the same time, the former merge index becomes the update index;
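The role swap between the two in-memory indexes can be illustrated with a minimal single-threaded sketch. Several points are assumptions made for the example: dictionaries stand in for real index structures, the commit to disk is synchronous, the former merge index is emptied before reuse because its contents are already on disk, and the time threshold is measured from the last commit. The class name `RealTimeShard` is hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class RealTimeShard:
    def __init__(self, doc_threshold: int = 1000, time_threshold_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.update_index: Dict[str, str] = {}  # receives new writes
        self.merge_index: Dict[str, str] = {}   # last committed batch, still searchable
        self.disk_index: Dict[str, str] = {}    # stands in for the on-disk index
        self.doc_threshold = doc_threshold
        self.time_threshold_s = time_threshold_s
        self.clock = clock
        self.last_commit = clock()

    def add(self, key: str, text: str) -> None:
        self.update_index[key] = text
        too_many = len(self.update_index) >= self.doc_threshold
        too_old = self.clock() - self.last_commit >= self.time_threshold_s
        if too_many or too_old:
            self.commit()

    def commit(self) -> None:
        # Submit the update index to the disk index.
        self.disk_index.update(self.update_index)
        # Assumption: the old merge index is already on disk, so it can be emptied.
        self.merge_index.clear()
        # Swap roles: update index becomes merge index and vice versa.
        self.update_index, self.merge_index = self.merge_index, self.update_index
        self.last_commit = self.clock()

    def get(self, key: str) -> Optional[str]:
        # Retrieval consults both in-memory indexes before the disk index.
        for layer in (self.update_index, self.merge_index, self.disk_index):
            if key in layer:
                return layer[key]
        return None
```

The point of the sketch is that a document is visible to `get` immediately after `add`, before any commit to the disk index has taken place.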
Step E: Retrieving from the index, consisting of the following steps:
E1. After an external service node receives an index retrieval request, it sends the request to the central control node. The central control node parses the request and determines the target index. Then, using the index data storage node attribute information and the attribute information of the target index, it looks up all index shards of the target index and dispatches the retrieval request to the index data storage nodes that store each shard;
E2. Each index data storage node retrieves the relevant documents from the corresponding index shards it stores, according to the retrieval request received, sorts the retrieval results, and sends them to the external service node;
E3. The external service node integrates and sorts the retrieval results received from the index data storage nodes and sends them to the client.
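Steps E1 to E3 follow a scatter-gather pattern, sketched below in one process. The scoring by term frequency is an assumption made only so that there is something to sort on; the text does not specify a ranking function. Shards are modelled as dictionaries from document key to text, and the function names are hypothetical.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple


def search_shard(shard_docs: Dict[str, str], term: str) -> List[Tuple[int, str]]:
    """E2: return (score, key) hits for one shard, best first.

    Score is the term frequency in the document text (assumed ranking).
    """
    hits = [(text.lower().split().count(term), key) for key, text in shard_docs.items()]
    return sorted((h for h in hits if h[0] > 0), key=lambda h: (-h[0], h[1]))


def search(shards: List[Dict[str, str]], term: str, top_k: int = 10) -> List[str]:
    # E1: dispatch the query to every shard of the target index.
    per_shard = [search_shard(docs, term) for docs in shards]
    # E3: merge the already sorted per-shard lists into one ranked list.
    merged = heapq.merge(*per_shard, key=lambda h: (-h[0], h[1]))
    return [key for _, key in islice(merged, top_k)]
```

Because each shard returns its hits already sorted, the external service node only needs a k-way merge rather than a full re-sort.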
As a preferred embodiment, the functional structure of the system in step A also comprises a standby central control node. The central control node synchronously backs up the data it stores to the standby central control node in real time. When the central control node fails, the standby central control node becomes the central control node; when the former central control node recovers from the failure, it becomes the new standby central control node. Because the central control node is the master node of the system, a failure of this node would paralyze the whole system. Adding a standby central control node therefore allows failover of the central control node and improves the fault tolerance of the system.
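The role exchange between the central control node and its standby can be sketched as follows. This is a simplified illustration: failure detection is left out, the metadata is a plain dictionary, and the full resynchronization of a recovered node is an assumption, since the text does not say how a recovered node catches up. The class names are hypothetical.

```python
import copy
from typing import Optional


class ControlNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.metadata: dict = {}
        self.up = True


class ControlCluster:
    def __init__(self, active: ControlNode, standby: ControlNode) -> None:
        self.active = active
        self.standby: Optional[ControlNode] = standby

    def put(self, key: str, value: str) -> None:
        # Every write is backed up to the standby as part of the same operation.
        self.active.metadata[key] = value
        if self.standby is not None and self.standby.up:
            self.standby.metadata[key] = value

    def on_active_failure(self) -> ControlNode:
        # The standby takes over; there is no standby until the old node recovers.
        failed = self.active
        failed.up = False
        self.active, self.standby = self.standby, None
        return failed

    def on_recovery(self, node: ControlNode) -> None:
        # Assumed behaviour: the recovered node resynchronizes, then serves as standby.
        node.metadata = copy.deepcopy(self.active.metadata)
        node.up = True
        self.standby = node
```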
As a preferred embodiment, the index data storage nodes and external service nodes periodically send heartbeat signals describing their status to the central control node. If the central control node does not receive a heartbeat signal within a preset time, it marks the index data storage node or external service node as dead. For every index shard stored on an index data storage node marked as dead, the central control node then copies a replica of that shard from another index data storage node that stores one to some other index data storage node that does not yet store a replica of that shard, so that the replica count of the index shard remains unchanged and the index shard is available at all times.
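Heartbeat-based failure detection and the resulting re-replication plan can be sketched as two small functions. The timestamps, the timeout, and the shape of the `placement` map are hypothetical, and the sketch only plans the copies; it does not move any data.

```python
from typing import Dict, List, Set, Tuple


def find_dead_nodes(last_heartbeat: Dict[str, float], now: float, timeout: float) -> Set[str]:
    """Nodes whose last heartbeat is older than the preset timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rereplication(placement: Dict[int, Set[str]], dead: Set[str],
                       all_nodes: List[str]) -> List[Tuple[int, str, str]]:
    """Return (shard, source node, target node) copy tasks.

    placement maps shard number -> set of nodes holding a copy of it.
    """
    plan = []
    live = sorted(set(all_nodes) - dead)
    for shard, holders in sorted(placement.items()):
        survivors = sorted(holders - dead)
        lost = len(holders & dead)
        # Targets must be alive and must not already hold a copy of this shard.
        targets = [n for n in live if n not in holders]
        for target in targets[:lost]:
            if survivors:
                plan.append((shard, survivors[0], target))
    return plan
```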
As a preferred embodiment, the heartbeat signal that an index data storage node sends to the central control node contains the load information of that node. During index creation, the central control node assigns index shards, as far as possible, to lightly loaded index data storage nodes. Likewise, during index retrieval, the central control node submits the retrieval request, as far as possible, to the lightly loaded index data storage node that holds the index shard or a replica of it.
In a preferred embodiment, the index data storage node attribute information comprises: the ID of the node, the name of the node, the type of the node, the state of the node, the load of the node and the location of the node. The index attribute information comprises: the name of the index, the structure definition of the documents in the index, the number of shards of the index, the number of replicas of each index shard, and the IDs of the storage nodes holding the index shards and shard replicas. The index data storage node attribute information and the index attribute information constitute the metadata of the system. This metadata is stored on the central control node, and the central control node, the index data storage nodes and the external service nodes can all derive from it the location of each index shard in the cluster.
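The metadata listed above could be modelled along the following lines. This is a sketch under assumptions: the field names, the string values used for node state, and the `locate_shard` helper are hypothetical and chosen only to mirror the attributes named in the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeInfo:
    """Attribute information of one index data storage node."""
    node_id: str
    name: str
    node_type: str
    state: str       # assumed values: "alive" or "dead"
    load: float
    location: str


@dataclass
class IndexInfo:
    """Attribute information of one index."""
    name: str
    document_schema: Dict[str, str]     # field name -> field type
    shard_count: int
    replica_count: int
    shard_nodes: Dict[int, List[str]]   # shard number -> node IDs, primary first


def locate_shard(index: IndexInfo, nodes: Dict[str, NodeInfo],
                 shard_no: int) -> List[NodeInfo]:
    """Derive where a shard currently lives, skipping nodes not alive."""
    return [nodes[i] for i in index.shard_nodes[shard_no]
            if nodes[i].state == "alive"]
```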
In a preferred embodiment, in the data index structure of the system described in step B, each index shard also has multiple shard replicas. A shard replica is created during the index creation of step C and, during the index update of step D, is updated asynchronously after the primary shard has been updated; it is stored on a different index data storage node from the primary shard. The index data storage node holding the primary shard handles the update requests for that shard. Once the primary shard has been updated, that node asynchronously sends the update request to the index data storage nodes holding the corresponding shard replicas, which then update the replicas. Both a shard replica and its primary shard support index search: according to the load of the index data storage nodes that hold the primary shard and its replicas, the central control node submits an index search request to whichever of those nodes is lightly loaded.
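The primary-first, replicas-later update order can be sketched as below. To keep the example deterministic, asynchrony is modelled with an explicit pending queue that is drained by a separate call rather than with threads or network messages; `Shard`, `PrimaryShard` and `flush_to_replicas` are hypothetical names.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class Shard:
    """A shard copy holding documents by key (hypothetical model)."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def apply(self, key: str, doc: str) -> None:
        self.docs[key] = doc


class PrimaryShard(Shard):
    """Primary copy that forwards updates to its replicas afterwards."""

    def __init__(self, replicas: List[Shard]) -> None:
        super().__init__()
        self.replicas = replicas
        self.pending: Deque[Tuple[str, str]] = deque()

    def update(self, key: str, doc: str) -> None:
        self.apply(key, doc)              # the primary is updated first
        self.pending.append((key, doc))   # replicas are updated later

    def flush_to_replicas(self) -> None:
        # Stands in for the asynchronous forwarding of update requests.
        while self.pending:
            key, doc = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, doc)
```

Between `update` and `flush_to_replicas` a replica may return stale results, which is the usual trade-off of asynchronous replication.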
Further, the central control node periodically checks the number of shard replicas of every index. When the number of replicas of an index shard falls below the preset number, the system automatically copies that shard to other index data storage nodes. When the index data storage node storing a primary shard fails, the system selects one of the corresponding shard replicas to take over the primary shard's index update work; this replica becomes the new primary shard, and a new shard replica is then generated on another index data storage node so that the number of replicas of the shard remains unchanged. When an index data storage node storing a shard replica fails, the system generates a replica identical to the primary shard on another index data storage node, again keeping the number of replicas unchanged.
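A sketch of the promotion and refill logic just described, assuming a placement is recorded as one primary node ID plus a list of replica node IDs. `ShardPlacement` and `handle_node_failure` are hypothetical names, and promoting the first surviving replica is an assumption; the text only says that one replica is selected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ShardPlacement:
    primary: Optional[str]   # node ID of the primary shard, None if lost
    replicas: List[str]      # node IDs of the shard replicas


def handle_node_failure(p: ShardPlacement, failed: str, nodes: List[str],
                        replica_target: int) -> ShardPlacement:
    """Promote a replica if the primary was lost, then refill replicas."""
    replicas = [r for r in p.replicas if r != failed]
    primary = p.primary
    if primary == failed:
        # One surviving replica takes over the update work of the primary.
        primary = replicas.pop(0) if replicas else None
    used = set(replicas) | {primary}
    for node in nodes:
        if primary is None or len(replicas) >= replica_target:
            break
        if node != failed and node not in used:
            replicas.append(node)   # new replica on a node without a copy
            used.add(node)
    return ShardPlacement(primary, replicas)
```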
Further, the shards and shard replicas of the same index are created and stored on the index data storage nodes according to the following strategy. Based on the load information in the attribute information of the index data storage nodes, the central control node assigns the index shards and shard replicas to the most lightly loaded index data storage nodes. When the number of available index data storage nodes is smaller than the number of index shards, the central control node assigns multiple index shards to the same index data storage node and does not allocate shard replicas. When the number of available index data storage nodes is greater than the number of index shards, the central control node assigns the shard replicas of some or all of the index shards to the remaining index data storage nodes. Figure 3 illustrates this strategy for an index with 2 index shards and 1 replica per shard. When the system has 1 index data storage node, index shard 1 and index shard 2 are both stored on index data storage node 1, and neither shard has a replica, because a replica contributes to the availability and reliability of the system only if it is stored on a different node from its primary shard. When the system has 2 index data storage nodes, index shard 1 and index shard 2, stored on index data storage node 1, have shard replicas 1' and 2' stored on index data storage node 2; index data storage node 2 can provide the same service as index data storage node 1, so adding index data storage nodes scales the service capacity of the system. When the system has 4 index data storage nodes, index shard 1, index shard 2, shard replica 1' and shard replica 2' are each stored on a separate one of the 4 index data storage nodes.
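The allocation strategy can be approximated with the sketch below. It is a simplification under assumptions: "load" is taken to be the number of copies already assigned to a node, the function name `allocate` and the label format are hypothetical, and the only hard constraint enforced is that a replica never shares a node with another copy of the same shard. For the 2-node case the resulting layout interleaves shards and replicas, so it differs from the layout drawn in Figure 3 while still keeping every replica away from its primary.

```python
from typing import Dict, List, Set


def allocate(shard_count: int, replica_count: int,
             nodes: List[str]) -> Dict[str, List[str]]:
    """Return node -> labels, where "1" is shard 1 and "1'" a replica of it."""
    assigned: Dict[str, List[str]] = {n: [] for n in nodes}
    holders: Dict[int, Set[str]] = {s: set() for s in range(1, shard_count + 1)}

    def place(shard: int, label: str) -> None:
        # Eligible nodes are those without any copy of this shard.
        candidates = [n for n in nodes if n not in holders[shard]]
        if not candidates:
            return  # no eligible node: the copy is not allocated
        target = min(candidates, key=lambda n: len(assigned[n]))
        assigned[target].append(label)
        holders[shard].add(target)

    for shard in range(1, shard_count + 1):
        place(shard, str(shard))
    if len(nodes) < shard_count:
        return assigned  # fewer nodes than shards: no replicas are allocated
    for _ in range(replica_count):
        for shard in range(1, shard_count + 1):
            place(shard, f"{shard}'")
    return assigned
```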
In a preferred embodiment, the index update of step D further comprises a segment merging step: when the number of segments in an index shard of the index being updated reaches a threshold, or the time since the last index merge reaches a threshold, the index data storage node holding that index shard reads the documents in several smaller segments, stores them in a new segment, and then physically deletes those smaller segments. New segments are produced continuously while an index is built, and when an index shard contains too many segments the efficiency of the index search logic suffers. This step therefore merges multiple small segments into one large segment and discards the data marked as deleted, which optimizes the storage space of the index and reduces the number of index segments that the search logic must handle at once, thereby improving search efficiency.
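A minimal sketch of the merge trigger and the merge itself. It assumes, for simplicity, that each segment records which of its own documents are marked as deleted and that a key is live in at most one segment; `Segment`, `should_merge` and `merge_smallest` are hypothetical names, and the thresholds are parameters rather than values taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Segment:
    docs: Dict[str, str]                             # document key -> content
    deleted: Set[str] = field(default_factory=set)   # keys marked as deleted


def should_merge(segment_count: int, now: float, last_merge: float,
                 max_segments: int, max_interval: float) -> bool:
    # Merge when there are too many segments or the last merge is too old.
    return segment_count >= max_segments or now - last_merge >= max_interval


def merge_smallest(segments: List[Segment], how_many: int) -> List[Segment]:
    """Merge the smallest segments into one, dropping deleted documents."""
    by_size = sorted(segments, key=lambda s: len(s.docs))
    small, rest = by_size[:how_many], by_size[how_many:]
    merged = Segment(docs={})
    for seg in small:
        for key, doc in seg.docs.items():
            if key not in seg.deleted:
                merged.docs[key] = doc
    # The small segments are discarded; only the merged one is kept.
    return rest + [merged]
```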
In a preferred embodiment, the index shard on which an updated document of step D is stored is determined by computing the hash value of the document's key, taking that hash value modulo the number of shards of the index to which the document belongs, and assigning the document to the index shard whose number corresponds to the result.
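The routing rule can be written as a short function. The patent does not name a hash function; MD5 from Python's standard `hashlib` is used here purely as an assumption because it gives the same value in every process, whereas Python's built-in `hash()` for strings is randomized per process by default. The function name `shard_for_key` is hypothetical.

```python
import hashlib


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a document key to a shard number in [0, shard_count)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the digest as an integer and take it modulo the shard count.
    return int.from_bytes(digest, "big") % shard_count
```

Because the result depends on `shard_count`, changing the number of shards of an existing index would move most keys to different shards, so under this scheme the shard count is effectively fixed when the index is created.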
In a preferred embodiment, the different data object types of the documents described in step B are: text data objects, image data objects, audio data objects, video data objects and executable program data objects. The attribute information of each data object type is stored in the field structure of the document; that is, the fields of a document store the document's attribute information. For example, a text document may include the following information: file name, keywords, author, file size, category, file description, and so on; an audio document may include the following information: file name, bit rate (bps), file size, duration, author or artist name, song title, genre, album name, and so on.
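As a small illustration of per-type fields, the sketch below lists the example fields named above for text and audio documents and reports which of them a document lacks. The identifier spellings, the dictionary layout of a document, and the idea of checking for missing fields are all assumptions; the text only says that a document may include these fields.

```python
from typing import Dict, List

# Example fields per data object type, taken from the lists in the text.
EXAMPLE_FIELDS: Dict[str, List[str]] = {
    "text": ["file_name", "keywords", "author", "file_size",
             "category", "description"],
    "audio": ["file_name", "bit_rate_bps", "file_size", "duration",
              "artist", "song_title", "genre", "album"],
}


def missing_fields(doc: dict) -> List[str]:
    """List the example fields of the document's type that it does not set."""
    expected = EXAMPLE_FIELDS.get(doc["type"], [])
    return [name for name in expected if name not in doc["fields"]]
```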
Although the present invention has been specifically shown and described in connection with preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made to the present invention without departing from the spirit and scope of the invention as defined by the appended claims, and that all such changes fall within the scope of protection of the present invention.
Claims (10)
1. An implementation method of a distributed real-time search engine, wherein the construction and operation of the system comprise at least the following steps:
A. designing the functional structure of the system, the functional structure being created in a Master/Slave cluster system and comprising the following functional nodes: a central control node, index data storage nodes and external service nodes, wherein the central control node is created in the Master system, and the index data storage nodes and the external service nodes are created in the Slave system; the central control node is used for storing and maintaining the attribute information of the indexes in the data index structure and for storing and maintaining the attribute information of the index data storage nodes; the index data storage nodes are used for creating, updating and searching the index shards in the data index structure; and the external service nodes are used for receiving index creation, update and search requests and forwarding these requests to the central control node for processing;
B. designing the data index structure of the system, the index structure being a tree hierarchy consisting, from top to bottom, of: index, index shard, segment, document and field, wherein a system may contain multiple indexes; an index shard is a data block obtained by dividing an index, and the index shards belonging to the same index are stored on the index data storage nodes; an index shard consists of one or more segments; a segment consists of one or more documents, and the documents contained in a segment may be of different data object types; each document has a key that uniquely identifies it across the whole system; and the structure of a document comprises fields for describing the document type;
C. creating an index, comprising the following steps:
C1. after receiving an index creation request, an external service node forwards the request to the central control node; the central control node parses the index creation request, extracts from it the attribute information of the index to be created, and verifies whether this attribute information is complete and valid; if the attribute information is complete and valid, step C2 is carried out; if the attribute information is incomplete or invalid, a failure response is sent to the external service node;
C2. the central control node divides the index to be created into a number of shards according to the shard count in the attribute information obtained in step C1; at the same time, according to the attribute information of the index data storage nodes stored in the central control node, it judges the state and load of each index data storage node, determines accordingly on which index data storage node each index shard is to be created and stored, and then sends the attribute information of the index to be created to each corresponding index data storage node; each index data storage node, according to the attribute information it receives, builds on itself the index shard of the index to be created that the central control node assigned to it; if an index data storage node fails to create its index shard, the central control node reassigns the creation of that shard to another index data storage node that is in good condition and relatively lightly loaded; once all index shards of the index to be created have been created on index data storage nodes, or creation has failed, step C3 is carried out;
C3. if in step C2 all index shards of the index to be created were created on the index data storage nodes, the central control node updates the index data storage node attribute information stored in it and sends a response indicating that the index shards were created successfully to the external service node; if in step C2 the creation of the index shards of the index to be created failed, a response indicating that index creation failed is sent to the external service node;
D. updating an index, comprising the following steps:
D1. after receiving an index update request, an external service node forwards the request to the central control node; according to the index attribute information and the index data storage node attribute information stored in it, the central control node sends the index update request to the index data storage node that holds the relevant index shard of the index;
D2. according to the index update request it receives, the index data storage node stores the updated document in a new segment of the index shard of the index to be updated that it holds; if the updated document is stored successfully, the old document corresponding to the updated document is marked as deleted in the new segment, and an index update success message is returned to the central control node; if storing the updated document fails, an index update failure message is returned to the central control node; finally, the central control node sends the index update success or failure message to the external service node;
the index update of step D further comprises a document deletion step: when the index update request is only a command to delete a document, the document is marked as deleted in a new segment on the index shard of the index data storage node that holds the document to be deleted;
the index update of step D further comprises a step of building a real-time index: an update-time index and a merge-time index are built simultaneously in the memory of the system, and index search is performed by accessing both the update-time index and the merge-time index; during an index update, the index being updated is the update-time index; when the number of documents in the update-time index reaches a threshold, or the update time of the update-time index reaches a threshold, the system commits the update-time index to the disk index, after which the update-time index becomes the merge-time index and, at the same time, the previous merge-time index becomes the update-time index;
E. searching an index, comprising the following steps:
E1. after receiving an index search request, an external service node sends it to the central control node; the central control node parses the search request and determines the target index, then, according to the index data storage node attribute information and the attribute information of the target index, looks up all index shards of the target index and dispatches the search request to the index data storage nodes that store each shard;
E2. according to the search request it receives, each index data storage node searches the corresponding index shard it stores for relevant documents, sorts the search results, and then sends them to the external service node;
E3. the external service node integrates and sorts the search results received from the index data storage nodes and then sends them to the client.
2. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the functional structure of the system described in step A further comprises a standby central control node; the central control node synchronizes the data it stores to the standby central control node in real time; when the central control node fails, the standby central control node becomes the central control node; and when the former central control node recovers from the fault, the former central control node becomes the new standby central control node.
3. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the index data storage nodes and the external service nodes periodically send heartbeat signals describing their status to the central control node; if the central control node does not receive a heartbeat signal within a preset time, it marks the index data storage node or external service node concerned as dead; at the same time, for every index shard stored on an index data storage node marked as dead, the central control node copies the shard, from another index data storage node that holds a copy of it, to some other index data storage node that does not yet store a copy of it, so that the number of copies of the index shard remains unchanged and every index shard is available at all times; the heartbeat signal that an index data storage node sends to the central control node includes the load information of that index data storage node; during index creation, the central control node assigns index shards, as far as possible, to lightly loaded index data storage nodes; likewise, during index search, the central control node submits each search request, as far as possible, to the lightly loaded index data storage node that holds the index shard or one of its replicas.
4. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the index data storage node attribute information comprises: the ID of the node, the name of the node, the type of the node, the state of the node, the load of the node and the location of the node; and the index attribute information comprises: the name of the index, the structure definition of the documents in the index, the number of shards of the index, the number of replicas of each index shard, and the IDs of the storage nodes holding the index shards and shard replicas.
5. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: in the data index structure of the system described in step B, each index shard also has multiple shard replicas; a shard replica is created during the index creation of step C and, during the index update of step D, is updated asynchronously after the primary shard has been updated; it is stored on a different index data storage node from the primary shard; the index data storage node holding the primary shard handles the update requests for that shard, and once the primary shard has been updated, that node asynchronously sends the update request to the index data storage nodes holding the corresponding shard replicas, which update the replicas; both a shard replica and its primary shard support index search, and the central control node, according to the load of the index data storage nodes that hold the primary shard and its replicas, submits an index search request to whichever of those nodes is lightly loaded.
6. The implementation method of a distributed real-time search engine as claimed in claim 5, characterized in that: the central control node periodically checks the number of shard replicas of every index; when the number of replicas of an index shard falls below the preset number, the system automatically copies that shard to other index data storage nodes; when the index data storage node storing a primary shard fails, the system selects one of the corresponding shard replicas to take over the primary shard's index update work, this replica becomes the new primary shard, and a new shard replica is then generated on another index data storage node so that the number of replicas of the shard remains unchanged; when an index data storage node storing a shard replica fails, the system generates a replica identical to the primary shard on another index data storage node so that the number of replicas of the shard remains unchanged.
7. The implementation method of a distributed real-time search engine as claimed in claim 5, characterized in that: the shards and shard replicas of the same index are created and stored on the index data storage nodes according to the following strategy: based on the load information in the attribute information of the index data storage nodes, the central control node assigns the index shards and shard replicas to the most lightly loaded index data storage nodes; when the number of available index data storage nodes is smaller than the number of index shards, the central control node assigns multiple index shards to the same index data storage node and does not allocate shard replicas; when the number of available index data storage nodes is greater than the number of index shards, the central control node assigns the shard replicas of some or all of the index shards to the remaining index data storage nodes.
8. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the index update of step D further comprises a segment merging step: when the number of segments in an index shard of the index being updated reaches a threshold, or the time since the last index merge reaches a threshold, the index data storage node holding that index shard reads the documents in several smaller segments, stores them in a new segment, and then physically deletes those smaller segments.
9. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the index shard on which an updated document of step D is stored is determined by computing the hash value of the document's key, taking that hash value modulo the number of shards of the index to which the document belongs, and assigning the document to the index shard whose number corresponds to the result.
10. The implementation method of a distributed real-time search engine as claimed in claim 1, characterized in that: the different data object types of the documents described in step B comprise: text data objects, image data objects, audio data objects, video data objects and executable program data objects, and the attribute information of each data object type is stored in the field structure of the document.
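The search flow of step E in claim 1 (each index data storage node sorts its own hits, and the external service node integrates and sorts them for the client) can be illustrated with the sketch below. It is not part of the claims. The scoring rule (number of occurrences of the query term), the function names `search_shard` and `merge_results`, and the in-memory dictionaries standing in for shards are all assumptions made for the example.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document key)


def _rank(hit: Hit) -> Tuple[float, str]:
    # Higher scores first; ties are broken by document key.
    return (-hit[0], hit[1])


def search_shard(docs: Dict[str, str], term: str) -> List[Hit]:
    """Step E2: search one shard and return its hits already sorted."""
    hits = []
    for key, text in docs.items():
        count = text.split().count(term)
        if count:
            hits.append((float(count), key))
    return sorted(hits, key=_rank)


def merge_results(per_shard: List[List[Hit]], top_k: int) -> List[Hit]:
    """Step E3: integrate the sorted per-shard lists into one ranked list."""
    merged = heapq.merge(*per_shard, key=_rank)
    return list(islice(merged, top_k))
```

Because every shard returns its hits in the same order, `heapq.merge` can combine them lazily without re-sorting the full result set.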
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110137785 CN102169507B (en) | 2011-05-26 | 2011-05-26 | Implementation method of distributed real-time search engine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102169507A CN102169507A (en) | 2011-08-31 |
CN102169507B true CN102169507B (en) | 2013-03-20 |
Family
ID=44490669
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110137785 Active CN102169507B (en) | 2011-05-26 | 2011-05-26 | Implementation method of distributed real-time search engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102169507B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649804A (en) * | 2016-12-29 | 2017-05-10 | 深圳市优必选科技有限公司 | Data processing method and device of data query server and data processing system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101677328A (en) * | 2008-09-19 | 2010-03-24 | 中兴通讯股份有限公司 | Content-fragment based multimedia distributing system and content-fragment based multimedia distributing method |
CN101727460A (en) * | 2008-10-31 | 2010-06-09 | 中兴通讯股份有限公司 | Method and system for positioning content fragment |
CN101853283A (en) * | 2010-05-21 | 2010-10-06 | 南京邮电大学 | Construction method for multidimensional data-oriented semantic indexing peer-to-peer network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8903810B2 (en) * | 2005-12-05 | 2014-12-02 | Collarity, Inc. | Techniques for ranking search results |
Also Published As
Publication number | Publication date |
---|---|
CN102169507A (en) | 2011-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102169507B (en) | Implementation method of distributed real-time search engine | |
JP6778795B2 (en) | Methods, devices and systems for storing data | |
JP7410181B2 (en) | Hybrid indexing methods, systems, and programs | |
EP3596619B1 (en) | Methods, devices and systems for maintaining consistency of metadata and data across data centers | |
CN107003935B (en) | Apparatus, method and computer medium for optimizing database deduplication | |
CN104714755B (en) | Snapshot management method and device | |
US8423733B1 (en) | Single-copy implicit sharing among clones | |
CN102937980B (en) | Cluster database data query method | |
AU2013210018B2 (en) | Location independent files | |
US7769792B1 (en) | Low overhead thread synchronization system and method for garbage collecting stale data in a document repository without interrupting concurrent querying | |
CN105183400B (en) | Content-addressed object storage method and system | |
CN104679898A (en) | Big data access method | |
CN104301360A (en) | Method, log server and system for recording log data | |
CN114116613A (en) | Metadata query method, equipment and storage medium based on distributed file system | |
CN109376121B (en) | File indexing system and method based on elastic search full-text retrieval | |
CN104778270A (en) | Storage method for multiple files | |
EP3103025A2 (en) | Content based organization of file systems | |
JP2022500727A (en) | Systems and methods for early removal of tombstone records in databases | |
US9002906B1 (en) | System and method for handling large transactions in a storage virtualization system | |
CN104881466A (en) | Method and device for processing data fragments and deleting garbage files | |
CN103049574B (en) | Realize key assignments file system and the method for file dynamic copies | |
CN104424219A (en) | Method and equipment of managing data documents | |
CN112334891B (en) | Centralized storage for search servers | |
CN103778219A (en) | HBase-based method for updating incremental indexes | |
CN110413617B (en) | Method for dynamically adjusting hash table group according to size of data volume |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | Address after: 361008 Xiamen Torch High tech Zone Software Park Innovation Building C Area 303-E, Xiamen, Fujian Province; Patentee after: Xiamen Yaxun Zhilian Technology Co.,Ltd.; Country or region after: China; Address before: No. 46 Guanri Road, Software Park Phase II, Xiamen City, Fujian Province 361008; Patentee before: XIAMEN YAXON NETWORK Co.,Ltd.; Country or region before: China |
CP03 | Change of name, title or address |