WO2015058331A1 - Extract data from xml stream - Google Patents
- Publication number
- WO2015058331A1 (application PCT/CN2013/085580)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- nodes
- data
- static tree
- static
- xml
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
- G06F16/835—Query processing
- G06F16/8373—Query execution
Definitions
- Extensible Markup Language can be utilized for information representation, exchange, and retrieval, e.g., via an XML document.
- XML Extensible Markup Language
- XML can be used to create content and mark the content up with delimiting tags, making each word or phrase into identifiable, sortable information.
- Figure 1 is a diagram of an example of an environment wherein examples of the present disclosure may be utilized.
- Figure 2A illustrates an example of a system according to the present disclosure.
- Figure 2B illustrates an example of a system according to the present disclosure.
- Figure 3A illustrates an example of a static tree of nodes according to the present disclosure.
- Figure 3B illustrates an example of a static tree of nodes according to the present disclosure.
- Extensible Markup Language can be used to create documents and data records that are portable and platform independent.
- One previous approach to accessing XML data from XML documents includes a Document Object Model (DOM) parser.
- a DOM parser uses a tree-based parsing approach that builds a parse object tree in memory.
- a DOM parser is capable of supporting XPath, which can be utilized for selecting and retrieving data from XML documents.
- XPath allows for retrieval of XML data based not only on its content, but also on the XML document structure.
- DOM approaches face performance issues. For example, building the DOM tree structure in memory is computationally expensive. For instance, some instances of DOM trees have required 0 times the memory of the original document. DOM parsers do not perform or scale well when processing large XML documents because of their high memory cost. Further, for each XPath utilized, the DOM tree structure must be built in memory, such that a DOM tree structure is traversed for each of the multiple XPaths, resulting in significant processing times.
- Another previous approach to accessing XML data from XML documents includes streaming, e.g., such that data can be processed in a continuous stream.
- a number of streaming based XML processing techniques exist, such as SAX (Simple application programming interface (API) for XML) and StAX (Streaming API for XML).
- SAX Simple application programming interface
- StAX Streaming API for XML
- These streaming based XML processing techniques utilize less memory, as compared to DOM.
- these streaming based XML processing techniques do not maintain the hierarchical structure of XML documents and cannot support XPath based XML data retrieval.
- Figure 1 is a diagram of an example of an environment 100 wherein examples of the present disclosure may be utilized.
- the environment 100 can include a server 102.
- the server 102 can be utilized for data extraction, e.g., XPath based data extraction, as disclosed herein.
- While Figure 1 illustrates a single server 102, examples of the present disclosure are not so limited.
- the server 102 can be communicatively coupled to a communication link 104.
- the communication link 104, e.g., a network, represents a cable, wireless, fiber optic, or remote connection via a telecommunication link, an infrared link, a radio frequency link, and/or other connectors or systems that provide electronic communication.
- the communication link 104 can, for example, include a link to an intranet, the Internet, or a combination thereof, among other communication interfaces.
- the communication link 104 can include intermediate proxies, for example, an intermediate proxy server, routers, switches, load balancers, and the like.
- the environment 100 can include an XML data provider 106.
- the XML data provider 106 can provide XML data, e.g., an XML document, to the server 102.
- the XML data is an XML stream, e.g., streaming XML data.
- Various types of XML data can be provided by XML data provider 106 to the server 102.
- XML data provider 106 can provide XML data related to a global positioning system or an emergency service, e.g., an earthquake early warning system, among others. While Figure 1 illustrates a single XML data provider 106, examples of the present disclosure are not so limited.
- XML data from an XML data provider 106 can be processed by the server 102.
- a plurality of XPath expressions can be combined to construct a static tree of nodes in memory.
- an XML stream, e.g., the XML data provided by XML data provider 106, can be analyzed by referring to the static tree of nodes to extract data from the XML stream.
- Some examples of the present disclosure provide that the particular portions of the XML stream are analyzed only one time, e.g., the analyzation is a single scan of the XML data.
- Data extracted from the XML stream may be processed, e.g., to format the extracted data, among other processes.
- data extracted from the XML stream can be provided to an extracted XML data receiving device 108, 110, 112.
- The XML data receiving device 108 can be a server, e.g., that provides extracted XML data to a network and/or another computing device.
- the XML data receiving device 110 can be a mobile device, e.g., a mobile phone or a laptop computer, among other mobile devices.
- the XML data receiving device 112 can be a database, e.g., that provides extracted XML data to a network and/or another computing device.
- While Figure 1 illustrates three XML data receiving devices 108, 110, 112, examples of the present disclosure are not so limited to a particular number and/or type of XML data receiving devices.
- Figures 2A-2B illustrate examples of systems 260, 280 according to the present disclosure.
- the system 260 can include a data store 262, processing system 264, and/or a number of engines 266, 267, 268.
- the processing system 264 can be in communication with the data store 262 via a communication link, and can include the number of engines, e.g., configuration engine 266, abstractor engine 267, flush engine 268, etc.
- the processing system 264 can include additional or fewer engines than illustrated to perform the various functions described herein.
- The number of engines can include a combination of hardware and programming that is configured to perform a number of functions described herein, e.g., combine a plurality of XPath expressions to construct a static tree of nodes, parse an XML stream by referring to the static tree of nodes and extract data from the XML stream to an output queue, flush extracted data from the output queue to a data store, etc.
- the programming can include program instructions, e.g., software, firmware, etc., stored in a memory resource, e.g., computer readable medium, machine readable medium, etc., as well as hardwired program, e.g., logic.
- the configuration engine 266 can include hardware, programming, and/or a combination of hardware and programming to combine a plurality of XPath expressions to construct a static tree of nodes.
- an XML stream can be analyzed to extract a plurality of data.
- XPath is a query language that can be utilized to locate particular data in an XML stream, e.g., via an XPath expression that identifies the particular data.
- an XPath expression can be used to define which paths in an XML stream are to be traversed to extract a particular data.
- the extraction engine 266 can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
- examples of the present disclosure provide that a plurality of XPath expressions are combined to construct a static tree of nodes.
- Combining the plurality of XPath expressions to construct the static tree of nodes provides that a search of the static tree of nodes corresponds to each of the XPath expressions being searched simultaneously, e.g., because the static tree of nodes is constructed from the combined XPaths.
- a node can correspond to either an element or attribute.
- An element can include a start tag, an end tag, and information, which can be referred to as contents, between the start tag and end tag.
- a tag is a markup construct.
- the tag can be a start tag, e.g., <section>, an end tag, e.g., </section>, or an empty element tag, e.g., <line-break />.
- An attribute is a markup construct.
- the attribute can be a name/value pair that exists within a start tag or empty element tag.
- Some examples of the present disclosure provide for bounding, e.g., while traversing the static tree of nodes, a value of a tag and/or a value of an attribute to the static tree of nodes.
- Some examples of the present disclosure provide that the tags and attributes associated with the static tree of nodes have a structure that links an extraction condition of the tag and its descendants. Therefore, when processing an end tag, data used to check the extraction condition of the tag will already be bound to the static tree of nodes and condition checking can be performed efficiently utilizing the condition link structure.
- static tree of nodes can be traversed downward while bounding related data.
- static tree of nodes can be traversed upward to check the extraction condition linked to the tag.
- Figure 3A illustrates an example of a static tree of nodes 340 according to the present disclosure.
- While Figure 3A illustrates the static tree of nodes 340 having a particular architecture, examples of the present disclosure are not so limited and may have various architectures. Examples of the present disclosure provide that the static tree of nodes 340 can be constructed by reading, e.g., sequentially processing, a plurality of XPath expressions.
- each unique tag specified in the plurality of XPath expressions is represented by a unique node 341, 342, 343, 344, 345, 346, 347, 348, 349 in the static tree of nodes 340.
- each unique attribute and unique text specified in the plurality of XPath expressions is represented by a unique item object 350, 351.
- a unique tag refers to a tag that occurs only once in the plurality of XPath expressions or to a first occurrence of a tag that occurs a plurality of times in the plurality of XPath expressions, whereby the occurrences of the tag subsequent to the first occurrence are not represented by additional nodes in the static tree of nodes 340.
- a tag subsequent to the first occurrence may be referred to as a duplicate tag.
- duplicate tags in the plurality of XPath expressions are eliminated.
- An attribute or text subsequent to the first occurrence may be referred to as a duplicate attribute or a duplicate text.
- a reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction may be realized by eliminating the duplicate tags in the plurality of XPath expressions when constructing the static tree of nodes 340.
- The static tree of nodes 340 can be constructed in memory, e.g., in memory of a server, in preparing to extract data from an XML stream.
- the static tree of nodes 340 can be constructed prior to runtime processing.
- static refers to a one time construction of the static tree of nodes 340.
- Because a plurality of XPath expressions are combined to construct the static tree of nodes 340 such that a search of the static tree of nodes 340 corresponds to each of the XPath expressions being searched simultaneously, the static tree of nodes 340 is only constructed once in memory, e.g., the tree is static for a search corresponding to the plurality of XPaths.
- In contrast to the static tree of nodes 340, other XPath based extractions create a separate node tree in memory for each XPath that is searched.
- Utilizing the static tree of nodes 340 can provide a reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction.
- the static tree of nodes 340 can include various examples of nodes.
- the static tree of nodes 340 can include a node 341.
- the node 341 is a root node; some examples of the present disclosure provide that the node 341 is a dummy root node, e.g., if there is no corresponding tag in XML.
- the static tree of nodes 340 can include a flush node 343. While Figure 3A illustrates a single flush node 343, examples of the present disclosure are not so limited.
- the static tree of nodes 340 can include a plurality of flush nodes.
- the data in the output queue can be flushed, e.g., output.
- the extracted data that has been flushed can be processed, e.g., to format the extracted data, among other processes.
- the abstractor engine 267 can include hardware, programming, and/or a combination of hardware and programming to analyze an XML stream by referring to the static tree of nodes and extract data from the XML stream to an output queue.
- the abstractor engine 267 can refer to unique nodes and unique item objects in the static tree of nodes to identify data that is to be extracted from an XML stream.
- The abstractor engine 267 can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
- The system 260 can include a flush engine 268 that can include hardware, programming, and/or a combination of hardware and programming to flush extracted data from the output queue to a data store.
- The flush engine 268 can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
- The system 260 can include a filter engine.
- the filter engine can include hardware, programming, and/or a combination of hardware and programming to filter elements of the XML stream that do not have a corresponding node in the static tree of nodes.
- the filter engine can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
- Figure 3B illustrates an example of a static tree of nodes 352 according to the present disclosure.
- the static tree of nodes 352 includes root node 341, nodes 353, 354, 355, 356, and item objects 357, 358, 359.
- the filter engine filters elements of the XML stream that do not have a corresponding node in the static tree of nodes.
- Figure 2B illustrates a diagram of an example of a system 280 according to the present disclosure.
- the system 280 can utilize software, hardware, firmware, and/or logic to perform a number of functions described herein.
- the system 280 can be any combination of hardware and program instructions configured to share information.
- the hardware, for example, can include a processing resource 282 and/or a memory resource 284, e.g., computer-readable medium (CRM), machine readable medium (MRM), database, etc.
- a processing resource 282 can include any number of processors capable of executing instructions stored by a memory resource 284.
- Processing resource 282 may be integrated in a single device or distributed across multiple devices.
- the program instructions can include instructions stored on the memory resource 284 and executable by the processing resource 282 to implement a desired function, e.g., combine a plurality of XPath expressions to construct a static tree of nodes, analyze an XML stream by referring to the static tree of nodes and extract data from the XML stream to an output queue, flush extracted data from the output queue to a data store, etc.
- the memory resource 284 can be in communication with a processing resource 282.
- a memory resource 284, as used herein, can include any number of memory components capable of storing instructions that can be executed by processing resource 282.
- Such memory resource 284 can be a non-transitory CRM or MRM.
- Memory resource 284 may be integrated in a single device or distributed across multiple devices.
- Further, memory resource 284 may be fully or partially integrated in the same device as processing resource 282 or it may be separate but accessible to that device and processing resource 282.
- system 280 may be implemented on a participant device, on a server device, on a collection of server devices, and/or a combination of the user device and the server device.
- the memory resource 284 can be in communication with the processing resource 282 via a communication link, e.g., a path, 286.
- the communication link 286 can be local or remote to a machine, e.g., a computing device, associated with the processing resource 282.
- Examples of a local communication link 286 can include an electronic bus internal to a machine, e.g., a computing device, where the memory resource 284 is one of volatile, non-volatile, fixed, and/or removable storage medium in communication with the processing resource 282 via the electronic bus.
- a number of modules can include CRI that when executed by the processing resource 282 can perform a number of functions, e.g., functions described herein.
- the number of modules can be sub-modules of other modules.
- configuration module 270, abstractor module 271, and flush module 272 can be sub-modules and/or contained within the same computing device.
- In another example, the number of modules can comprise individual modules at separate and distinct locations, e.g., CRM, etc.
- Each of the number of modules can include instructions that when executed by the processing resource 282 can function as a corresponding engine as described herein.
- configuration module 270 can include instructions that when executed by the processing resource 282 can function as the configuration engine 266.
- the abstractor module 271 can include instructions that when executed by the processing resource 282 can function as the abstractor engine 267 and the flush module 272 can include instructions that when executed by the processing resource 282 can function as the flush engine 268.
- Figure 4 illustrates a flow chart of an example of a method 490 according to the present disclosure.
- the method 490 can include constructing a static tree of nodes from a plurality of XPath expressions.
- the static tree of nodes can be constructed in memory, e.g., memory of a server, prior to runtime processing.
- the method 490 can include analyzing an XML stream by referring to the static tree of nodes. Analyzing the XML stream can provide a location of data in the XML stream and/or value of data that is to be extracted from the XML stream.
- the method 490 can include extracting data from the XML stream to an output queue.
- Some examples of the present disclosure provide that extracting data from the XML stream to an output queue can include sharing the data between the static tree of nodes and the output queue.
- Sharing the data between the static tree of nodes and the output queue may help provide a reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction.
- the method 490 can include flushing extracted data from the output queue to a data store.
- Some examples of the present disclosure provide that flushing can occur via a predefined flush node. Flushing extracted data from the output queue to a data store can export data from an XML stream.
- the method 490 can include bounding a value of a tag to the static tree of nodes.
- the method 490 can include bounding a value of an attribute to the static tree of nodes. Bounding the value of the tag and/or the value of the attribute to the static tree of nodes can help increase a speed of a condition check and/or data retrieval.
- Some examples of the present disclosure provide that the value of the tag and/or the value of the attribute can be bound to the static tree of nodes while the tree is being analyzed, e.g., while the tree is being traversed.
- The examples herein provide a description of the applications and use of the systems and methods of the present disclosure. Since many examples can be made without departing from the spirit and scope of the system and method of the present disclosure, this specification sets forth some of the many possible example configurations and implementations.
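The steps of method 490 described above (construct the tree, scan the stream once, extract to an output queue, flush to a data store) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: it uses Python's `xml.sax`, supports only simple absolute paths without predicates, and the names `build_tree`, `Extractor`, `flush_tag`, and the `feed`/`event` sample document are all hypothetical. Elements with no corresponding node in the tree are filtered, and the queue is flushed when the end tag of the hypothetical flush node is seen.

```python
import xml.sax
from collections import deque


def build_tree(xpaths):
    """Merge simple absolute paths into one nested dict (shared prefixes stored once)."""
    tree = {}
    for xp in xpaths:
        node = tree
        for step in xp.strip("/").split("/"):
            node = node.setdefault(step, {})
        node["#extract"] = xp  # marker key: the text of this element is wanted
    return tree


class Extractor(xml.sax.ContentHandler):
    """Single scan of the stream, guided by the prebuilt tree (illustrative only)."""

    def __init__(self, tree, flush_tag, store):
        super().__init__()
        self.path = [tree]      # tree node per open element; None means filtered
        self.flush_tag = flush_tag
        self.store = store      # stand-in for a data store
        self.queue = deque()    # output queue
        self.text = []

    def startElement(self, name, attrs):
        parent = self.path[-1]
        # Elements without a corresponding tree node are filtered (None).
        self.path.append(parent.get(name) if parent is not None else None)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        node = self.path.pop()
        if node is not None and "#extract" in node:
            self.queue.append((node["#extract"], "".join(self.text).strip()))
        if node is not None and name == self.flush_tag:
            while self.queue:   # flush the output queue to the data store
                self.store.append(self.queue.popleft())
        self.text = []


XPATHS = ["/feed/event/id", "/feed/event/magnitude"]
DOC = (b"<feed><event><id>e1</id><magnitude>5.1</magnitude>"
       b"<comment>skip</comment></event>"
       b"<event><id>e2</id><magnitude>6.3</magnitude></event></feed>")

store = []
extractor = Extractor(build_tree(XPATHS), flush_tag="event", store=store)
xml.sax.parseString(DOC, extractor)
```

After the parse, `store` holds the four extracted `(path, value)` pairs in document order and the queue is empty; the `comment` element never reaches the queue because it has no node in the tree.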
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A plurality of XPath expressions can be combined to construct a static tree of nodes. An XML stream can be analyzed by referring to the static tree of nodes, where the analyzation is a single scan of the XML stream, and data can be extracted from the XML stream.
Description
EXTRACT DATA FROM XML STREAM
Background
[0001] Extensible Markup Language (XML) can be utilized for information representation, exchange, and retrieval, e.g., via an XML document. For instance, XML can be used to create content and mark the content up with delimiting tags, making each word or phrase into identifiable, sortable information.
Brief Description of the Drawings
[0002] Figure 1 is a diagram of an example of an environment wherein examples of the present disclosure may be utilized.
[0003] Figure 2A illustrates an example of a system according to the present disclosure.
[0004] Figure 2B illustrates an example of a system according to the present disclosure.
[0005] Figure 3A illustrates an example of a static tree of nodes according to the present disclosure.
[0006] Figure 3B illustrates an example of a static tree of nodes according to the present disclosure.
[0007] Figure 4 illustrates a flow chart of an example of a method according to the present disclosure.
Detailed Description
[0008] Extensible Markup Language (XML) can be used to create documents and data records that are portable and platform independent. Various systems and methods have been utilized to access XML data from XML documents. Efficiently accessing XML data from XML documents is becoming more important as the size and volume of XML documents increases.
[0009] One previous approach to accessing XML data from XML documents includes a Document Object Model (DOM) parser. A DOM parser uses a tree-based parsing approach that builds a parse object tree in memory. A DOM parser is capable of supporting XPath, which can be utilized for selecting and retrieving data from XML documents. XPath allows for retrieval of XML data based not only on its content, but also on the XML document structure.
[0010] However, as XML documents become increasingly larger, DOM approaches face performance issues. For example, building the DOM tree structure in memory is computationally expensive. For instance, some instances of DOM trees have required 0 times the memory of the original document. DOM parsers do not perform or scale well when processing large XML documents because of their high memory cost. Further, for each XPath utilized, the DOM tree structure must be built in memory, such that a DOM tree structure is traversed for each of the multiple XPaths, resulting in significant processing times.
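As an illustration of the per-expression cost described above, the following minimal sketch uses Python's `xml.etree.ElementTree` as a stand-in for a DOM-style parser (an assumption; the disclosure does not name a specific parser or language). The sample document, tag names, and expressions are hypothetical. The whole document is materialized in memory first, and each expression is then evaluated by its own traversal of that tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = (
    "<catalog>"
    "<book id='b1'><title>XML Basics</title><price>30</price></book>"
    "<book id='b2'><title>Streams</title><price>45</price></book>"
    "</catalog>"
)

# The entire document is parsed into an in-memory tree before any query runs.
root = ET.fromstring(DOC)

# Each expression triggers a separate traversal of the in-memory tree.
xpaths = ["./book/title", "./book/price", "./book[@id='b2']/title"]
results = {xp: [el.text for el in root.findall(xp)] for xp in xpaths}
```

With a large document, both the tree itself and the repeated traversals become the dominant cost, which is the issue the paragraph above describes.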
[0011] Another previous approach to accessing XML data from XML documents includes streaming, e.g., such that data can be processed in a continuous stream. A number of streaming based XML processing techniques exist, such as SAX (Simple application programming interface (API) for XML) and StAX (Streaming API for XML). These streaming based XML processing techniques utilize less memory, as compared to DOM. However, these streaming based XML processing techniques do not maintain the hierarchical structure of XML documents and cannot support XPath based XML data retrieval.
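The limitation noted above can be seen in a minimal SAX sketch, shown here with Python's `xml.sax` module. The handler receives only a flat sequence of start and end events; any notion of the current position in the document, here the hypothetical `stack` attribute, has to be maintained by the caller. The class name `PathRecorder` and the sample input are illustrative assumptions.

```python
import xml.sax


class PathRecorder(xml.sax.ContentHandler):
    """Records the path of every element; SAX itself keeps no hierarchy."""

    def __init__(self):
        super().__init__()
        self.stack = []   # caller-maintained list of open elements
        self.paths = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/" + "/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()


handler = PathRecorder()
xml.sax.parseString(b"<a><b><c/></b><d/></a>", handler)
```

After the parse, `handler.paths` holds `/a`, `/a/b`, `/a/b/c`, and `/a/d`, reconstructed only because the handler tracked the stack itself.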
[0012] Examples of the present disclosure provide systems and methods that can increase the efficiency of data extraction, e.g., XPath based data extraction, from an XML document, as compared to other systems and methods. This increase in the efficiency of data extraction may be realized in a reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction.
[0013] In the following detailed description of the present disclosure, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration how examples of the disclosure may be practiced. These examples are described in sufficient detail to enable those of ordinary skill in the art to practice the examples of this disclosure, and it is to be understood that other examples may be used and that process, electrical, and/or structural changes may be made without departing from the scope of the present disclosure.
[0014] The figures herein follow a numbering convention in which the first digit or digits correspond to the drawing figure number and the remaining digits identify an element or component in the drawing. Elements shown in the various examples herein can be added, exchanged, and/or eliminated so as to provide a number of additional examples of the present disclosure.
[0015] In addition, the proportion and the relative scale of the elements provided in the figures are intended to illustrate the examples of the present disclosure, and should not be taken in a limiting sense. As used herein, "a number of" an entity, an element, and/or feature can refer to one or more of such entities, elements, and/or features.
[0016] Figure 1 is a diagram of an example of an environment 100 wherein examples of the present disclosure may be utilized. As illustrated in Figure 1, the environment 100 can include a server 102. The server 102 can be utilized for data extraction, e.g., XPath based data extraction, as disclosed herein. While Figure 1 illustrates a single server 102, examples of the present disclosure are not so limited.
[0017] The server 102 can be communicatively coupled to a communication link 104. The communication link 104, e.g., a network, represents a cable, wireless, fiber optic, or remote connection via a telecommunication link, an infrared link, a radio frequency link, and/or other connectors or systems that provide electronic communication. That is, the communication link 104 can, for example, include a link to an intranet, the Internet, or a combination thereof, among other communication interfaces. The communication link 104 can include intermediate proxies, for example, an intermediate proxy server, routers, switches, load balancers, and the like.
[0018] The environment 100 can include an XML data provider 106. The XML data provider 106 can provide XML data, e.g., an XML document, to the server 102. Some examples of the present disclosure provide that the XML data is an XML stream, e.g., streaming XML data. Various types of XML data can be provided by XML data provider 106 to the server 102. For instance, XML data provider 106 can provide XML data related to a global positioning system or an emergency service, e.g., an earthquake early warning system, among others. While Figure 1 illustrates a single XML data provider 106, examples of the present disclosure are not so limited.
[0019] As discussed further herein, XML data from an XML data provider 106 can be processed by the server 102. For instance, a plurality of XPath expressions can be combined to construct a static tree of nodes in memory. Then an XML stream, e.g., the XML data provided by XML data provider 106, can be analyzed by referring to the static tree of nodes to extract data from the XML stream. Some examples of the present disclosure provide that the particular portions of the XML stream are analyzed only one time, e.g., the analyzation is a single scan of the XML data. Data extracted from the XML stream may be processed, e.g., to format the extracted data, among other processes.
[0020] As illustrated in Figure 1, data extracted from the XML stream can be provided to an extracted XML data receiving device 108, 110, 112. The XML data receiving device 108 can be a server, e.g., that provides extracted XML data to a network and/or another computing device. The XML data receiving device 110 can be a mobile device, e.g., a mobile phone or a laptop computer, among other mobile devices. The XML data receiving device 112 can be a database, e.g., that provides extracted XML data to a network and/or another computing device. While Figure 1 illustrates three XML data receiving devices 108, 110, 112, examples of the present disclosure are not so limited to a particular number and/or type of XML data receiving devices.
[0021] Figures 2A-2B illustrate examples of systems 260, 280 according to the present disclosure. The system 260 can include a data store 262, processing system 264, and/or a number of engines 266, 267, 268. The processing system 264 can be in communication with the data store 262 via a communication link, and can include the number of engines, e.g., configuration engine 266, abstractor engine 267, flush engine 268, etc. The processing system 264 can include additional or fewer engines than illustrated to perform the various functions described herein.
[0022] The number of engines can include a combination of hardware and programming that is configured to perform a number of functions described herein, e.g., combine a plurality of XPath expressions to construct a static tree of nodes, parse an XML stream by referring to the static tree of nodes and extract data from the XML stream to an output queue, flush extracted data from the output queue to a data store, etc. The programming can include program instructions, e.g., software, firmware, etc., stored in a memory resource, e.g., computer readable medium, machine readable medium, etc., as well as hardwired program, e.g., logic.
[0023] The configuration engine 266 can include hardware, programming, and/or a combination of hardware and programming to combine a plurality of XPath expressions to construct a static tree of nodes. For some applications, an XML stream can be analyzed to extract a plurality of data. XPath is a query language that can be utilized to locate particular data in an XML stream, e.g., via an XPath expression that identifies the particular data. For instance, an XPath expression can be used to define which paths in an XML stream are to be traversed to extract a particular data. The extraction engine 266 can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
[0024] As mentioned, examples of the present disclosure provide that a plurality of XPath expressions are combined to construct a static tree of nodes. Combining the plurality of XPath expressions to construct the static tree of nodes provides that a search of the static tree of nodes corresponds to each of the XPath expressions being searched simultaneously, e.g., because the static tree of nodes is constructed from the combined XPaths.
[0025] A node can correspond to either an element or an attribute. An element can include a start tag, an end tag, and information, which can be referred to as contents, between the start tag and end tag. A tag is a markup construct. The tag can be a start tag, e.g., <section>, an end tag, e.g., </section>, or an empty element tag, e.g., <line-break />. An attribute is a markup construct. The attribute can be a name/value pair that exists within a start tag or empty element tag.
[0026] Some examples of the present disclosure provide for bounding, e.g., while traversing the static tree of nodes, a value of a tag and/or a value of an attribute to the static tree of nodes. Some examples of the present disclosure provide that the tags and attributes associated with the static tree of nodes have a structure that links an extraction condition of the tag and its descendants. Therefore, when processing an end tag, data used to check the extraction condition of the tag will already be bound to the static tree of nodes and condition checking can be performed efficiently utilizing the condition link structure. In other words, after processing a start tag the static tree of nodes can be traversed downward while bounding related data. When an end tag is processed the static tree of nodes can be traversed upward to check the extraction condition linked to the tag.
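The bind-then-check pattern described in this paragraph can be sketched as follows. This is a minimal Python illustration using the standard library's xml.sax, not the implementation of the disclosure: the class name BindingHandler, the frame layout, and the sample condition (extract the text of C only when the enclosing B carries a="123") are assumptions chosen for the example.

```python
import xml.sax


class BindingHandler(xml.sax.ContentHandler):
    """Binds values on the way down; checks the extraction condition on the end tag."""

    def __init__(self, condition_tag, condition, wanted_child):
        super().__init__()
        self.condition_tag = condition_tag  # hypothetical: tag the condition is linked to
        self.condition = condition          # hypothetical predicate over bound attributes
        self.wanted_child = wanted_child    # hypothetical: child whose text is extracted
        self.stack = []                     # one frame of bound data per open element
        self.extracted = []

    def startElement(self, name, attrs):
        # Start tag: move downward and bind the attribute values to a new frame.
        self.stack.append({"attrs": dict(attrs.items()), "text": [], "children": {}})

    def characters(self, content):
        if self.stack:
            self.stack[-1]["text"].append(content)

    def endElement(self, name):
        # End tag: move upward. Bind this element's text to its parent first,
        frame = self.stack.pop()
        if self.stack:
            self.stack[-1]["children"][name] = "".join(frame["text"]).strip()
        # then check the condition linked to this tag against data already bound.
        if name == self.condition_tag and self.condition(frame["attrs"]):
            if self.wanted_child in frame["children"]:
                self.extracted.append(frame["children"][self.wanted_child])


def extract(xml_text):
    """Extract the text of C for every B whose attribute a equals "123"."""
    handler = BindingHandler("B", lambda attrs: attrs.get("a") == "123", "C")
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.extracted
```

Because the attribute value and the child text are bound before the end tag of B arrives, the condition check at the end tag is a dictionary lookup and needs no second pass over the stream.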
[0027] Figure 3A illustrates an example of a static tree of nodes 340 according to the present disclosure. While Figure 3A illustrates the static tree of nodes 340 having a particular architecture, examples of the present disclosure are not so limited and may have various architectures. Examples of the present disclosure provide that the static tree of nodes 340 can be constructed by reading, e.g., sequentially processing, a plurality of XPath expressions. For instance, each unique tag specified in the plurality of XPath expressions is represented by a unique node 341, 342, 343, 344, 345, 346, 347, 348, 349 in the static tree of nodes 340. Also, each unique attribute and unique text specified in the plurality of XPath expressions is represented by a unique item object 350, 351. Herein, a unique tag refers to a tag that occurs only once in the plurality of XPath expressions or, alternatively, to a first occurrence of a tag that occurs a plurality of times in the plurality of XPath expressions, whereby the occurrences of the tag subsequent to the first occurrence are not represented by additional nodes in the static tree of nodes 340. A tag subsequent to the first occurrence may be referred to as a duplicate tag. In other words, duplicate tags in the plurality of XPath expressions are eliminated. Similarly, a unique attribute and unique text refer to an attribute or text, respectively, that occurs only once in the plurality of XPath expressions or, alternatively, to a first occurrence of an attribute or text that occurs a plurality of times in the plurality of XPath expressions, whereby the occurrences of the attribute and text subsequent to the first occurrence are not represented by additional item objects in the static tree of nodes 340. An attribute or text subsequent to the first occurrence may be referred to as a duplicate attribute or a duplicate text. A reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction may be realized by eliminating the duplicate tags in the plurality of XPath expressions when constructing the static tree of nodes 340.
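The construction described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the disclosed implementation: it handles only simple absolute paths made of child steps, an optional trailing @attribute, and text(); the names Node, build_static_tree, and count are hypothetical; and a tag is treated as a duplicate when it repeats at the same position in the path.

```python
class Node:
    """A unique tag in the static tree."""

    def __init__(self, tag):
        self.tag = tag
        self.children = {}  # tag name -> Node, one entry per unique child tag
        self.items = {}     # "@name" or "text()" -> list of XPaths that requested it


def build_static_tree(xpaths):
    """Sequentially process simple absolute XPaths into one shared tree."""
    root = Node(None)  # dummy root with no corresponding XML tag
    for xpath in xpaths:
        node = root
        for step in xpath.strip("/").split("/"):
            if step.startswith("@") or step == "text()":
                # Attribute or text step: reuse the item object if it already exists.
                node.items.setdefault(step, []).append(xpath)
            else:
                # Tag step: a duplicate tag reuses the node of its first occurrence.
                if step not in node.children:
                    node.children[step] = Node(step)
                node = node.children[step]
    return root


def count(node):
    """Return (number of tag nodes, number of item objects) below `node`."""
    nodes, items = 0, len(node.items)
    for child in node.children.values():
        child_nodes, child_items = count(child)
        nodes += 1 + child_nodes
        items += child_items
    return nodes, items
```

For the expressions /A/B/@a, /A/B/C/text(), and /A/D/text(), the shared prefixes /A and /A/B are stored once, so the sketch produces four tag nodes and three item objects rather than one tree per expression.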
[0028] The static tree of nodes 340 can be constructed in memory, e.g., in memory of a server in preparing to extract data from an XML stream. For instance, the static tree of nodes 340 can be constructed prior to runtime processing. Herein, static refers to a one-time construction of the static tree of nodes 340. Because a plurality of XPath expressions are combined to construct the static tree of nodes 340 such that a search of the static tree of nodes 340 corresponds to each of the XPath expressions being searched simultaneously, the static tree of nodes 340 is only constructed once in memory, e.g., the tree is static for a search corresponding to the plurality of XPaths. In contrast to the static tree of nodes 340, other XPath-based extractions create a separate node tree in memory for each XPath that is searched, i.e., the trees of nodes are dynamic for the plurality of XPaths to be searched. Utilizing the static tree of nodes 340 can provide a reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction.
[0029] Some examples of the present disclosure provide that the static tree of nodes 340 can include various types of nodes. For instance, the static tree of nodes 340 can include a node 341. Some examples of the present disclosure provide that the node 341 is a root node; some examples of the present disclosure provide that the node 341 is a dummy root node, e.g., if there is no corresponding tag in XML.
[0030] The static tree of nodes 340 can include a flush node 343. While Figure 3A illustrates a single flush node 343, examples of the present disclosure are not so limited. For instance, the static tree of nodes 340 can include a plurality of flush nodes. Some examples of the present disclosure provide that data extracted from an XML stream, e.g., during runtime processing, can be put into an output queue. Then, when an end tag of a flush node, e.g., flush node 343, is encountered, the data in the output queue can be flushed, e.g., output. The extracted data that has been flushed can be processed, e.g., to format the extracted data, among other processes.
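The output queue and flush node behaviour can be illustrated with a short Python sketch built on the standard library's xml.sax. The class name FlushOnEndTag, the function run, the use of a plain list as the data store, and the order/id/total element names are all assumptions made for this example; the sketch also assumes the wanted elements are leaf elements containing only text.

```python
import xml.sax
from collections import deque


class FlushOnEndTag(xml.sax.ContentHandler):
    """Queues extracted values and flushes them when the flush node's end tag is seen."""

    def __init__(self, wanted_tags, flush_tag, data_store):
        super().__init__()
        self.wanted_tags = wanted_tags  # hypothetical: leaf tags whose text is extracted
        self.flush_tag = flush_tag      # hypothetical: tag acting as the flush node
        self.data_store = data_store    # stand-in for the data store: a plain list
        self.output_queue = deque()
        self._text = None               # buffer for the wanted element currently open

    def startElement(self, name, attrs):
        self._text = [] if name in self.wanted_tags else None

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name in self.wanted_tags and self._text is not None:
            self.output_queue.append((name, "".join(self._text)))
        self._text = None
        if name == self.flush_tag:
            # End tag of the flush node: emit everything queued so far as one record.
            self.data_store.append(list(self.output_queue))
            self.output_queue.clear()


def run(xml_text, wanted_tags, flush_tag):
    data_store = []
    handler = FlushOnEndTag(wanted_tags, flush_tag, data_store)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return data_store
```

Choosing a repeated element as the flush node keeps the queue small: each record leaves the queue as soon as its closing tag arrives, rather than at the end of the stream.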
[0031] Returning to Figure 2A, the abstractor engine 267 can include hardware, programming, and/or a combination of hardware and programming to analyze an XML stream by referring to the static tree of nodes and extract data from the XML stream to an output queue. For instance, the abstractor engine 267 can refer to unique nodes and unique item objects in the static tree of nodes to identify data that is to be extracted from an XML stream. The abstractor engine 267 can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
[0032] The system 260 can include a flush engine 268 that can include hardware, programming, and/or a combination of hardware and programming to flush extracted data from the output queue to a data store. The flush engine 268 can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
[0033] The system 260 can include a filter engine. The filter engine can include hardware, programming, and/or a combination of hardware and programming to filter elements of the XML stream that do not have a corresponding node in the static tree of nodes. The filter engine can include hardware, programming, and/or a combination of hardware and programming to perform other functions described herein.
[0034] Figure 3B illustrates an example of a static tree of nodes 352 according to the present disclosure. The static tree of nodes 352 includes root node 341, nodes 353, 354, 355, 356, and item objects 357, 358, 359. As mentioned, the filter engine filters elements of the XML stream that do not have a corresponding node in the static tree of nodes. With reference to Figure 3B, given a plurality of XPath expressions that are combined to construct the static tree of nodes 352 having a list:
/A/B/@a
/A/B/C/text()
/A/D/text()
only nodes, e.g., elements or attributes, specified in the list of the static tree of nodes 352 will be processed, while others not in the list will be filtered, e.g., not processed. For instance, given an XML stream:
<A>
<B a="123" b="100" c="1">
<C>0001</C>
</B>
<D>ssss</D>
<E>
<C>0002</C>
</E>
<F>tttt</F>
</A>
b="100" c="1", <E>, <C>0002</C>, </E>, and <F>tttt</F> will be filtered because they are not specified in the list of the static tree of nodes 352. However, <A>, <B a="123">, <C>0001</C>, </B>, <D>ssss</D>, and </A> will be processed, e.g., they will not be filtered, because they are specified in the list of the static tree of nodes 352. Filtering, as disclosed herein, can provide a reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction.
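The filtering behaviour in this example can be reproduced with a small Python sketch using the standard library's xml.sax. It is a sketch under stated assumptions rather than the disclosed filter engine: the tree is a nested dictionary, the names build_tree, FilteringHandler, and run_filter are hypothetical, and only simple absolute paths with leaf text are supported.

```python
import xml.sax


def build_tree(xpaths):
    """Combine simple absolute XPaths into one nested-dict tree (hypothetical layout)."""
    tree = {"children": {}, "items": set()}
    for xpath in xpaths:
        node = tree
        for step in xpath.strip("/").split("/"):
            if step.startswith("@") or step == "text()":
                node["items"].add(step)
            else:
                node = node["children"].setdefault(step, {"children": {}, "items": set()})
    return tree


class FilteringHandler(xml.sax.ContentHandler):
    def __init__(self, tree):
        super().__init__()
        self.path = [tree]    # static-tree nodes matching the open elements
        self.names = []       # names of the open, unfiltered elements
        self.skip_depth = 0   # greater than 0 while inside a filtered subtree
        self.text = []
        self.extracted = []   # (xpath, value) pairs
        self.filtered = []    # names of elements that were filtered

    def startElement(self, name, attrs):
        child = None if self.skip_depth else self.path[-1]["children"].get(name)
        if child is None:
            # No corresponding node in the static tree: filter the element.
            self.skip_depth += 1
            self.filtered.append(name)
            return
        self.path.append(child)
        self.names.append(name)
        self.text = []
        prefix = "/" + "/".join(self.names)
        for attr, value in attrs.items():
            if "@" + attr in child["items"]:  # attributes without an item object are ignored
                self.extracted.append((prefix + "/@" + attr, value))

    def characters(self, content):
        if not self.skip_depth:
            self.text.append(content)

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        node = self.path.pop()
        if "text()" in node["items"]:
            value = "".join(self.text).strip()
            self.extracted.append(("/" + "/".join(self.names) + "/text()", value))
        self.names.pop()
        self.text = []


def run_filter(xml_text, xpaths):
    handler = FilteringHandler(build_tree(xpaths))
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.extracted, handler.filtered
```

Once an element has no matching node, its whole subtree is skipped with a depth counter, so the inner C under E is never compared against the node for /A/B/C.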
[0035] Figure 2B illustrates a diagram of an example of a system 280 according to the present disclosure. The system 280 can utilize software, hardware, firmware, and/or logic to perform a number of functions described herein.
[0036] The system 280 can be any combination of hardware and program instructions configured to share information. The hardware, for example, can include a processing resource 282 and/or a memory resource 284, e.g., computer-readable medium (CRM), machine readable medium (MRM), database, etc. A processing resource 282, as used herein, can include any number of processors capable of executing instructions stored by a memory resource 284. Processing resource 282 may be integrated in a single device or distributed across multiple devices. The program instructions, e.g., computer-readable instructions (CRI), can include instructions stored on the memory resource 284 and executable by the processing resource 282 to implement a desired function, e.g., combine a plurality of XPath expressions to construct a static tree of nodes, analyze an XML stream by referring to the static tree of nodes and extract data from the XML stream to an output queue, flush extracted data from the output queue to a data store, etc.
[0037] The memory resource 284 can be in communication with a processing resource 282. A memory resource 284, as used herein, can include any number of memory components capable of storing instructions that can be executed by processing resource 282. Such memory resource 284 can be a non-transitory CRM or MRM. Memory resource 284 may be integrated in a single device or distributed across multiple devices. Further, memory resource 284 may be fully or partially integrated in the same device as processing resource 282 or it may be separate but accessible to that device and processing resource 282. Thus, it is noted that the system 280 may be implemented on a participant device, on a server device, on a collection of server devices, and/or a combination of the user device and the server device.
[0038] The memory resource 284 can be in communication with the processing resource 282 via a communication link, e.g., a path, 286. The communication link 286 can be local or remote to a machine, e.g., a computing device, associated with the processing resource 282. Examples of a local communication link 286 can include an electronic bus internal to a machine, e.g., a computing device, where the memory resource 284 is one of volatile, non-volatile, fixed, and/or removable storage medium in communication with the processing resource 282 via the electronic bus.
[0039] A number of modules, e.g., modules 270, 271, 272, can include CRI that when executed by the processing resource 282 can perform a number of functions, e.g., functions described herein. The number of modules can be sub-modules of other modules. For example, configuration module 270, abstractor module 271, and flush module 272 can be sub-modules and/or contained within the same computing device. In another example, the number of modules can comprise individual modules at separate and distinct locations, e.g., CRM, etc.
[0040] Each of the number of modules can include instructions that when executed by the processing resource 282 can function as a corresponding engine as described herein. For example, configuration module 270 can include instructions that when executed by the processing resource 282 can function as the configuration engine 266. As another example, the abstractor module 271 can include instructions that when executed by the processing resource 282 can function as the abstractor engine 267 and the flush module 272 can include instructions that when executed by the processing resource 282 can function as the flush engine 268.
[0041] Figure 4 illustrates a flow chart of an example of a method 490 according to the present disclosure. At 491, the method 490 can include constructing a static tree of nodes from a plurality of XPath expressions. As discussed, the static tree of nodes can be constructed in memory, e.g., memory of a server, prior to runtime processing.
[0042] At 492, the method 490 can include analyzing an XML stream by referring to the static tree of nodes. Analyzing the XML stream can provide a location of data in the XML stream and/or value of data that is to be extracted from the XML stream.
[0043] At 493, the method 490 can include extracting data from the XML stream to an output queue. Some examples of the present disclosure provide that extracting data from the XML stream to an output queue can include sharing the data between the static tree of nodes and the output queue. Sharing the data between the static tree of nodes and the output queue may help provide a reduction of memory utilized for the data extraction and/or a reduction of processing time spent for the data extraction.
[0044] At 494, the method 490 can include flushing extracted data from the output queue to a data store. Some examples of the present disclosure provide that flushing can occur via a predefined flush node. Flushing extracted data from the output queue to a data store can export data from an XML stream.
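Steps 491 through 494 can be strung together in one short Python sketch. Everything named here is an assumption for illustration: construct_static_tree uses a flat mapping from element path to wanted items as a stand-in for the static tree, Extractor and extract_stream are hypothetical names, the data store is a plain list, and only simple absolute paths with leaf text are handled. The stream is fed to the standard library's incremental SAX parser in arbitrary chunks to show that a single forward scan is enough.

```python
import xml.sax
from collections import deque


def construct_static_tree(xpaths):  # step 491
    """Map each element path to the items ("@name" or "text()") wanted there."""
    tree = {}
    for xpath in xpaths:
        *tags, item = xpath.strip("/").split("/")
        tree.setdefault(tuple(tags), set()).add(item)
    return tree


class Extractor(xml.sax.ContentHandler):  # steps 492 and 493
    def __init__(self, tree, flush_path, data_store):
        super().__init__()
        self.tree = tree
        self.flush_path = tuple(flush_path)  # hypothetical: path of the flush node
        self.data_store = data_store
        self.path, self.text, self.output_queue = [], [], deque()

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        prefix = "/" + "/".join(self.path)
        for item in self.tree.get(tuple(self.path), ()):
            if item.startswith("@") and item[1:] in attrs:
                self.output_queue.append((prefix + "/" + item, attrs[item[1:]]))

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if "text()" in self.tree.get(tuple(self.path), ()):
            value = "".join(self.text).strip()
            self.output_queue.append(("/" + "/".join(self.path) + "/text()", value))
        if tuple(self.path) == self.flush_path:  # step 494
            self.data_store.append(list(self.output_queue))
            self.output_queue.clear()
        self.path.pop()
        self.text = []


def extract_stream(chunks, xpaths, flush_path):
    data_store = []
    parser = xml.sax.make_parser()
    handler = Extractor(construct_static_tree(xpaths), flush_path, data_store)
    parser.setContentHandler(handler)
    for chunk in chunks:  # one forward scan; chunks may split tags at any byte
        parser.feed(chunk)
    parser.close()
    return data_store
```

The tree is built once before any data arrives, and the handler never looks back at earlier input, which mirrors the ordering of the four steps in the flow chart.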
[0045] The method 490 can include bounding a value of a tag to the static tree of nodes. The method 490 can include bounding a value of an attribute to the static tree of nodes. Bounding the value of the tag and/or the value of the attribute to the static tree of nodes can help increase a speed of a condition check and/or data retrieval. Some examples of the present disclosure provide that the value of the tag and/or the value of the attribute can be bound to the static tree of nodes while the tree is being analyzed, e.g., while the tree is being traversed.
[0046] The examples herein provide a description of the applications and use of the systems and methods of the present disclosure. Since many examples can be made without departing from the spirit and scope of the system and method of the present disclosure, this specification sets forth some of the many possible example configurations and implementations.
Claims
What is claimed:
1. A non-transitory computer-readable medium storing instructions executable by a processing resource to:
combine a plurality of XPath expressions to construct a static tree of nodes;
analyze an XML stream by referring to the static tree of nodes, wherein the analyzation is a single scan of the XML stream; and
extract data from the XML stream.
2. The non-transitory computer-readable medium of claim 1, wherein constructing the static tree of nodes includes sequentially processing the plurality of XPath expressions.
3. The non-transitory computer-readable medium of claim 1, wherein each unique attribute of the plurality of XPath expressions is represented by a unique item object in the static tree of nodes.
4. The non-transitory computer-readable medium of claim 1, wherein the static tree of nodes is constructed in memory of a server.
5. The non-transitory computer-readable medium of claim 1, wherein each unique tag of the plurality of XPath expressions is represented by a unique node of the static tree of nodes.
6. A system, comprising a processing resource in communication with a non-transitory computer-readable medium having instructions executable by the processing resource to implement:
a configuration engine to combine a plurality of XPath expressions to construct a static tree of nodes;
an abstractor engine to analyze an XML stream by referring to the static tree of nodes and extract data from the XML stream to an output queue; and
a flush engine to flush extracted data from the output queue to a data store.
7. The system of claim 6, including a filter engine to filter elements of the XML stream that do not have a corresponding node in the static tree of nodes.
8. The system of claim 6, including a filter engine to filter attributes of the XML stream that do not have a corresponding node in the static tree of nodes.
9. A method for data extraction, the method comprising:
constructing a static tree of nodes from a plurality of XPath expressions;
analyzing an XML stream by referring to the static tree of nodes;
extracting data from the XML stream to an output queue; and
flushing extracted data from the output queue to a data store.
10. The method of claim 9, wherein extracting data from the XML stream includes simultaneously extracting data for the plurality of XPath expressions.
11. The method of claim 9, wherein constructing the static tree of nodes from the plurality of XPath expressions eliminates a duplicate XPath tag.
12. The method of claim 9, wherein constructing the static tree of nodes from the plurality of XPath expressions eliminates a duplicate XPath attribute.
13. The method of claim 9, further comprising bounding a value of a tag to the static tree of nodes.
14. The method of claim 13, further comprising processing an end tag of the static tree of nodes, wherein processing the end tag includes checking an extraction condition of the tag bound to the static tree of nodes.
15. The method of claim 9, further comprising formatting the extracted data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2013/085580 WO2015058331A1 (en) | 2013-10-21 | 2013-10-21 | Extract data from xml stream |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2013/085580 WO2015058331A1 (en) | 2013-10-21 | 2013-10-21 | Extract data from xml stream |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015058331A1 (en) | 2015-04-30 |
Family
ID=52992107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2013/085580 WO2015058331A1 (en) | 2013-10-21 | 2013-10-21 | Extract data from xml stream |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2015058331A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1497474A (en) * | 2002-10-03 | 2004-05-19 | International Business Machines Corporation | Method of stream XPATH processing with forward and backward |
CN101599089A (en) * | 2009-07-17 | 2009-12-09 | 中国科学技术大学 | The automatic search of update information on content of video service website and extraction system and method |
CN102447585A (en) * | 2012-01-04 | 2012-05-09 | 迈普通信技术股份有限公司 | Method and device for converting network configuration protocol response message into command line |
US20120143919A1 (en) * | 2010-01-20 | 2012-06-07 | Oracle International Corporation | Hybrid Binary XML Storage Model For Efficient XML Processing |
- 2013-10-21 WO PCT/CN2013/085580 patent/WO2015058331A1/en active Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 13896027; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: PCT application non-entry in European phase | Ref document number: 13896027; Country of ref document: EP; Kind code of ref document: A1 |