1. Introduction
In recent years, large-scale sensor arrays and the vast datasets produced worldwide have been utilized, shared, and published by a rising number of data providers. With open standards defining data schemas and web service interfaces, such as the Open Geospatial Consortium (OGC) Sensor Web Enablement (SWE) standards, sensors, and their data can be integrated in an interoperable manner, which has become the main idea behind the Worldwide Sensor Web [
1,
2,
3]. This Worldwide Sensor is capable of monitoring the physical world with spatial and temporal scales that were previously impossible.
The powerful monitoring capabilities of the sensor web technologies are increasingly attracting the interest of researchers for a wide range of applications, including large-scale environmental monitoring [
4,
5,
6], civil structures [
7], roadways [
8,
9], and animal habitats [
10,
11]. Some of the applications are time-critical and require efficient data processing for timely decision-making and notification, such as emergency response systems [
12]. These time-critical applications may be used to support decision-making when handling time-critical events, such as the 2007 Minneapolis Interstate-35W bridge collapse and the 2011 Japan earthquake and tsunami.
As the worldwide sensor web’s vision is to connect various types of sensors that are located worldwide and perform high-frequency observations, it could have the ability to capture these time-critical events and provide up-to-date information to support decision-making. We believe that by efficiently converting sensor web data streams into information and providing timely notifications, we can lower or even prevent the damage caused by time-critical disasters.
Apart from the disaster management system, many applications would also require efficient sensor data stream processing and timely notifications, such as house and human security, illegal activity detection, early warning system, location-based services, health monitoring, etc. For example, potential queries are (1) “send me a notification if the temperature in my kitchen is higher than 50 degree Celsius”; (2) “send me a notification if my child’s GPS location leaves the school area”; (3) “send me a notification if any water level sensor within 1 km radius of my house reports a reading higher than 1 meter” and (4) “send customers notifications about discounts when they are within 1 km radius of stores”. These queries all require efficient processing to notify users for taking action accordingly and in time.
However, the traditional request/response communication model is not suitable for processing continuous data streams as it is based on point-to-point pulling interaction and not designed for rapid and continuous data streams [
13]. In a system following the request/response model (i.e., pull-based system), the response is evaluated with a snapshot of the system. However, no predictions of when an event will happen can be made (e.g., starting of a fire, collapse of a bridge, or simply any observation). A communication cannot be scheduled ahead of time. By the time a user sends requests, events may already be out of date.
An alternative communication model is known as the publish/subscribe model (or continuous query processing model). The publish/subscribe model allows users to first register queries in a system (i.e., predefined queries). Whenever the system receives new data, it executes the predefined queries and sends the results to users as notifications. In this way, the publish/subscribe model can provide notifications more promptly than the request/response model.
In a system following the publish/subscribe model (i.e., push-based systems), a typical approach to performing a continuous query is to first create a query execution plan [
13], which consists of operators and queues. All new data will traverse through the query plan. While some approaches have been proposed to optimize the overall structure of query plans [
14,
15], there was not much discussion of how to improve the efficiency of computational-intensive geospatial operators in a push-based system.
While many general-purpose operators are simple and efficient enough to be used directly in query execution plans (e.g., elementary arithmetic, average, count, sum), some geospatial operators are complex and time-consuming. As sensor web data are geospatial in nature, geospatial operators are common in many monitoring applications, such as the aforementioned example queries. For this research, we focus on the topological operators, which are used to determine the geospatial relationships defined in the OGC Simple Feature Access specification [
16].
Although the answers to the aforementioned example queries could be obtained by database topological queries, topological operators are usually time-consuming and could create a performance bottleneck when processing a large number of geometries [
17]. In the context of a sensor web publish/subscribe system, we face the major sensor web data challenges and opportunities [
3]: the large volume and large generation velocity of sensor data make it difficult to manage, but the sensor web’s powerful monitoring capability can be extremely valuable in various applications. Therefore, we expect that processing topological queries with big sensor web data and a large number of subscriptions will create a performance bottleneck.
Hence, the aim of this paper is to propose a new approach to efficiently process topological operators in a sensor web publish/subscribe system. It presents the Aggregated Hierarchical Spatial Model (AHS model) for processing topological queries based on the nature of continuous query processing. The AHS model encompasses two key ideas. Firstly, as queries are predefined and continuous in a publish/subscribe system, the AHS model pre-generates the necessary indices for geometries of subscriptions and reuses them whenever new data arrive. Secondly, by indexing geometries of subscriptions with the same indexing structure, we can aggregate the subscription indices. In this case, we not only can reduce the space for storing the pre-generated indices, but can also intersect new data with all subscriptions in a single process.
In addition, we are also proposing a distributed computing version of the AHS model (i.e., a distributed AHS model). With the advancements in distributed computing [
18] and cloud computing techniques (e.g., Amazon Elastic Compute Cloud), we designed the distributed AHS model to scale horizontally for more computation and storage resources.
In summary, this paper presents the following:
The AHS model—a model that can efficiently determine the topological relationship between the geometries in a sensor web publish/subscribe system. As a sensor web publish/subscribe system aims to detect and manage time-critical events in a timely manner, the AHS model can be a critical component in such a system.
The incorporation of a predefined hierarchical spatial framework to index and aggregate subscription geometries, which allows the AHS model to intersect a publication with all subscriptions in a single process. In addition, all the necessary subscription indices are pre-generated and reused to reduce the runtime processing cost.
A distributed AHS model that can scale horizontally to store the pre-generated indices and improve query performance. By connecting more machines, the distributed AHS model is able to distribute the index storage and also process queries in parallel.
Evaluation of the AHS model in terms of its scalability, indexing performance, matching performance, and end-to-end query latency. The evaluation results show that the AHS model is more scalable than PostGIS, and the distributed AHS model can attain satisfying end-to-end performance with a relatively realistic dataset.
Before presenting the details, we first define the subscriptions and publications in a sensor web publish/subscribe system. In general, subscriptions are continuous queries registered by users, and publications are the sensor data produced by sensors. As sensor data are geospatial in nature, both subscriptions and publications have geospatial components. A subscription (SUB) can have different predicates as query criteria/filters, among which the spatial predicate in a subscription (SUBSP) has two parameters: a base geometry (SUBSP_GEO) and a topological operator (SUBSP_OPER). Users set these two parameters to select publications with a geometry (PUBGEO) that matches the topological relationship (i.e., SUBSP_OPER) with SUBSP_GEO. For example, with relation to the sensor web, a PUBGEO could be a sensor’s location or the geometry of a feature that the sensor observed (e.g., the coverage of a river or a road intersection).
The relationship between PUBGEO, SUBSP_OPER, and SUBSP_GEO follows the subject–verb–object structure, in which PUBGEO, SUBSP_OPER, and SUBSP_GEO are the subject, verb, and object, respectively. For example, if “point_1” is PUBGEO, “WITHIN” is SUBSP_OPER, and “polygon_1” is SUBSP_GEO, the spatial predicate (i.e., SUBSP) evaluates whether the relationship “point_1 WITHIN polygon_1” is true or not. More specifically, in the example “send me a notification if any water level sensor within 1 km radius of my house reports a reading higher than 1 meter”, the SUBSP_GEO is the “1 km radius of my house”, the SUBSP_OPER is the “within”, and the PUBGEO is the location of any water level sensors.
This paper is organized as follows:
Section 2 reviews the literature related to this research and defines the topological operators.
Section 3 and
Section 4 introduce the details of the proposed AHS model and distributed AHS model, respectively.
Section 5 explains the evaluation results. Lastly,
Section 6 presents the conclusions and future work.
2. Related Work
The AHS model is proposed to efficiently process topological operators in a geospatial sensor web publish/subscribe system. There have been different types of systems applying the publish/subscribe model to manage and process continuous data streams, such as publish/subscribe systems [
19], simple event processing systems [
20], data stream management systems (DSMS) [
13,
21,
22], and complex event processing systems (CEP) [
23]. In addition, there are some new open-source platforms that allow users to construct a publish/subscribe system, such as the Apache Spark and RethinkDB.
However, geospatial publish/subscribe systems are rarely discussed as compared to general-purpose systems. Although there has been some work discussing the topic of supporting geospatial operators in a publish/subscribe system, most of them applied spatial database join operations to prove the concept, such as [
12,
15,
24]. None of these studies discussed improving the efficiency of geospatial algorithms for publish/subscribe systems. Therefore, we are putting forward the argument that geospatial algorithms can be improved based on the nature of the continuous query processing model.
More specifically, one of the important continuous query optimization approaches is sharing operators across similar query plans [
14,
25,
26,
27], which allows every necessary operator to be executed only once and avoids processing redundancy. However, existing topological operators do not take into consideration this sharing execution concept. Even with using the minimum bounding boxes of geometries to limit the number of topological operator executions [
17], handling a vast number of geometries would still require a large number of topological operators and become a performance bottleneck. Hence, the objective of this research is to discover the shareable processes in the topological operators and propose a new approach to share them across queries.
As mentioned earlier, this research focuses on the eight topological relationships defined in the OGC Simple Feature Access Specification [
16]:
EQUALS,
DISJOINT,
INTERSECTS,
TOUCHES,
OVERLAPS,
CROSSES,
WITHIN, and
CONTAINS. This specification has been widely adopted in many spatial databases, such as PostGIS, Oracle, and Microsoft SQL Server. The typical approach to determine topological relationships is the Dimensionally Extend 9 Intersection Model (DE-9IM) [
28].
DE-9IM has three main steps. Firstly, DE-9IM generates the interior, boundary, and exterior regions of two geometries. The second step of DE-9IM is to intersect these interior, boundary, and exterior regions of two geometries and construct a three-by-three intersection matrix (Equation (1)). Finally, if the intersection matrix matches the predefined matrices (i.e.,
Table 1), the topological relationships between the two geometries can be determined accordingly.
Table 1 shows the topological relationships, the definition of relationships, and the corresponding intersection matrices, in which the wildcard symbol (
) means “any value would work” [
16].
where the
function returns the maximum dimension (i.e., 0 for points, 1 for lines, and 2 for polygons) of the intersection (
) of interior (
), boundary (
), and exterior (
) of geometries
and
. The geometry
and
are respectively called the primary and secondary geometry. If an intersection is an empty set (
), the
function returns a value of –1. If an intersection is not an empty set, the
function returns 0, 1, or 2. One way to simplify the matrix is to store only True (if the
function returns 0, 1, or 2) and False (if the
function returns–1) in the matrix, which is also represented in
Table 1.
Note that for the
CROSSES relationship, the OGC definition shown in
Table 1 may not be the most commonly used definition. Instead, the
CROSSES relationship is usually defined as “
”. Hence, the DE-9IM matrix is defined as
for a line/line relationship. In this research, the proposed solution follows the common definition.
However, DE-9IM is a time-consuming process and could create a performance bottleneck when processing a large number of geometries [
17]. In order to address this issue, a common solution is to reduce the number of unnecessary DE-9IM processes. For example, a typical approach consists of two stages: filter and refinement [
17]. The filter stage finds candidate geometries with approximated rectangles of geometries (e.g., minimum bounding rectangles (MBRs)); then the refinement stage performs the actual DE-9IM process on the candidates found in the filter stage. While this approach has been widely applied in many DBMS systems (e.g., PostGIS), we argue that it can be further improved for a publish/subscribe system to handle a larger number of geometries.
Although existing research discussed the support of spatial operators in a publish/subscribe system, most of them only applied the same filter-and-refinement spatial join approach to prove the concept. We believe that time-consuming geospatial algorithms can be improved by sharing executions. As a result, this research proposes the AHS model as a new approach for improving the efficiency of topological operators in a sensor web publish/subscribe system.
3. AHS Model
We propose the AHS model to improve the query efficiency and scalability of topological operators in a sensor web publish/subscribe system. The AHS model follows the definition of the eight topological relationships in the OGC Simple Feature Access specification. As such, it follows the same conceptual framework used in traditional approaches such as DE-9IM.
The difference between the AHS model and traditional approaches is that the current AHS model does not take multi-point, multi-line, and multi-polygon into consideration. Extending the algorithm to support these types of geometries would not change the main idea for the AHS model, but is one of the future directions that will be pursued.
Since the targeted objectives of traditional DBMS and publish/subscribe systems are essentially different, algorithms optimized for DBMS may not be suitable for publish/subscribe systems. For example, since the queries in DBMS are atomic and independent, the algorithms are optimized for each individual query. However, since the queries are continuous and pre-defined in publish/subscribe systems, their algorithms should consider the aggregation of multiple queries/subscriptions.
With the nature of continuous queries, we argue that it is acceptable to spend more effort (e.g., create indices) on the start-up preparation stage to execute continuous queries more efficiently. Although this approach may cause some delay in the beginning, it can generate a larger throughput considering the long-running nature of continuous queries. There are two key ideas in AHS model that are based on these concepts. Firstly, the AHS model pre-generates the necessary indices from the geometries of subscriptions and then reuses the indices when needed in the continuous queries. Secondly, by indexing the geometries of subscriptions with the same indexing structure, we can aggregate the indices of all subscriptions to save storage space and intersect PUBGEO with all SUBSP_GEO in a single process.
The AHS-Model consists of three major stages: (1) The preparation stage generates necessary information from the geometries of subscriptions; (2) the intersection stage intersects the geometries of subscriptions with the geometry of publication; and (3) the determination stage determines geospatial relationships. The details of these three stages are introduced in the following sections.
3.1. Preparation Stage: Generate Necessary Information from the Geometries of Subscriptions
Similar to the way DE-9IM works, the AHS model also determines topological relationships by intersecting the interior, boundary, and exterior regions of two geometries. However, instead of the typical two-step (i.e., filter and refinement) approach, the AHS model performs candidate finding and intersections at the same time. This is doable in a sensor web publish/subscribe system as queries are predefined and PUBGEO is usually small (e.g., a sensor’s location, a road intersection, or a football field). The indices of the geometries in subscriptions (i.e., SUBSP_GEO) are first created and reused until SUBSP_GEO is changed. By doing so, we can avoid generating redundant indices, which can consequently speed up the query processing.
However, not every indexing structure is suitable for the AHS model. Firstly, the indexing structure should be space-driven, so that the indices of
SUBSP_GEO can be aggregated. Secondly, space-driven indexing structure should have multiple levels to reduce the storage. Therefore, the AHS model needs a space driven and multi-level hierarchical indexing structure, similar to that in the idea proposed by [
29]. The hierarchical indexing structure allows multiple nodes on a lower level to be aggregated as a node on a higher level. In this research, a quadtree tile system [
30] is used as the hierarchical structure. By defining the lowest level of the quadtree as the finest granularity, any geometry can be indexed into (or approximated as) a list of quadtree nodes (i.e., quadkeys).
Examples can be seen in
Figure 1, where the maximum quadtree level is 4 and interiors, boundaries, and exteriors are represented in dark gray, light gray, and white, respectively. Since lower-level indices can be aggregated into higher-level indices, the number of indices can be reduced if a geometry covers multiple quadkeys. The upper two polygons in
Figure 1 are used as examples. The upper-left polygon is about four times larger than the upper-right polygon. By using a hierarchical structure, we can aggregate the upper-left polygon’s interior indices from 72 forth-level indices to six indices (i.e., five third-level and one second-level indices), which is the same number of interior indices as in the upper-right polygon.
In addition, the determination of a topological relationship does not require all three of the interior, boundary, and exterior of
SUBSP_GEO. Instead, the AHS model only needs to generate the necessary information according to the spatial predicate
SUBSP. As such, we can further reduce the number of indices and speed up the query processing. Based on the definition of topological relationships in OGC Simple Feature Access Specification (
Table 1) and the possible topological relationships between different geometry types, we can analyze and propose the necessary information for determining each geospatial relationship.
However, during a preliminary test of the first version of the AHS model, it was found that the indexing and query performance using exterior indices were poorer than when only using interior and boundary indices. This is because the number of exterior indices is usually larger than the number of interior and boundary indices. As a result, we redesigned the AHS model to avoid exterior indices. Our final analysis is listed below:
EQUALS: Since we can compare the interior and boundary of two geometries to determine the EQUALS relationship, only the interior and boundary are needed.
DISJOINT: Since the DISJOINT relationship can be seen as “no intersection between the interiors and boundaries”, only the interior and boundary are needed to determine DISJOINT relationship.
INTERSECTS: Since the INTERSECTS relationship means that the two geometries have at least one interior or boundary point in common, only the interior and boundary are needed for this relationship.
TOUCHES: The TOUCHES relationship means two geometries are INTERSECTS but their interiors do not intersect with each other. Therefore, similar to the INTERSECTS relationship, only the interior and boundary are required for the TOUCHES relationship.
OVERLAPS: The OVERLAPS relationship means that the interior of both geometries intersects the interior and exterior of the other. If both geometries are line geometries, the intersection of interiors needs to be a line for OVERLAPS relationship. In order to avoid the processing of exterior indices, we follow the idea that “if the intersection of interiors and boundaries are not equal to neither geometries, each geometry interests with the exterior of the other”, that is . Therefore, only the interior and exterior are required for this relationship. In addition, since multi-point is not in the scope, a single point does not OVERLAP with another point.
CROSSES: The CROSSES relationship means that the interior of the primary geometry intersects the interior and exterior of secondary geometry . Similar to the OVERLAPS relationship, we replace the requirement of the exterior with the intersections of interiors and boundaries. In addition, for line geometries, they are CROSSES if, and only if, their interiors intersect at a point. Since we only consider single point geometries, no geometry can cross a single point based on the OGC definition. As a result, the determination of the CROSSES relationship only needs the interior for line geometries and requires both interior and boundary for polygon geometries.
WITHIN: “ WITHIN ” means that the interior and boundary of fully locate in the interior and boundary of . That means that only the interior and boundary are required to determine the WITHIN relationship.
CONTAINS: The CONTAINS relationship is the inverse of the WITHIN relationship, which means “ CONTAINS ” and “ WITHIN ” are equal. Therefore, the interior and boundary are also necessary information for CONTAINS.
After generating the necessary information from SUBSP, the AHS model aggregates the necessary indices from all subscriptions into a single data structure. In the data structure, each quadkey (which is shared by at least one subscription) maintains a list of subscription identifiers (SUBID) and the corresponding region type (TYPESUB) (i.e., interior or boundary). In this case, the AHS model can directly match quadkeys of new data with quadkeys in the data structure to intersect new data with all subscriptions in a single process. The worst-case scenario is that no quadkey is shared by any other subscriptions (i.e., each quadkey only attaches to one SUBID), which means that the aggregation benefits neither the storage nor the query processing. On the other hand, as long as there is more than one subscription sharing the same quadkey, the aggregation can reduce both the storage size and the query latency. For the remainder of this paper, this aggregated quadtree structure is referred to as AHSSUB for clarity purposes.
A more critical contribution is that by aggregating quadkeys from all SUBSP_GEO into AHSSUB, the AHS model can simply match the PUBGEO with the quadkeys in the AHSSUB and link to a set of SUBID. This means that the AHS model decouples quadkeys and SUBID and allows itself to be more scalable in terms of the number of subscriptions.
3.2. Intersection Stage: Intersect with the Geometry of Publication
There are three steps for intersecting the geometry of publication (i.e.,
PUBGEO) with
AHSSUB: (1) index; (2) match; and (3) create matrices. This workflow is shown in
Figure 2. In order to efficiently intersect the
PUBGEO with the
AHSSUB, the AHS model first indexes the interior and boundary of
PUBGEO with the same hierarchical structure (i.e., quadtree tile system). The outcome of this indexing is named as
AHSPUB for clarity purposes.
As covered in the previous section, the AHS model modifies the determination algorithm in order to avoid processing exterior indices. Therefore, only the interior and boundary of the publication geometry need to be generated at this stage.
This matching step benefits from using the same indexing structure. Since both AHSPUB and AHSSUB are indexed by the same quadtree tile system, we can match the prefixes of quadkeys to find the intersection. More specifically, every quadkey q has a corresponding geospatial bounding box bbox. Given two quadkeys qA and qB, bboxA ⊆ bboxB if, and only if, qA starts with qB. The bounding box of quadkey ‘0’ contains all the bounding boxes of quadkeys whose first digit is ‘0’. If a quadkey in AHSPUB intersects with a quadkey in AHSSUB, this intersection is referred to as a match.
Each match contains the following five attributes: (1) the intersected quadkey from
AHSPUB; (2) the intersected quadkey from
AHSSUB; (3) the subscription identifier (
SUBID); (4) the region type of
AHSPUB (
TYPEPUB; i.e., interior or boundary); and (5) the region type of
AHSSUB (
TYPESUB; i.e., interior or boundary). After finding all the matches, they are grouped by
SUBID and a matrix created for each group. This matrix is referred to as the area matrix because it records the size of the intersected area. In a similarly way to the three-by-three intersection matrices of the DE-9IM, the area matrices are for determining the topological relationship between geometries. The area matrix is a two-by-two matrix, and its form is shown in Equation (2):
where the
function returns the sum of the number of area unit (here we define that every lowest level quadkey has one area unit and the quadkey on level
n has 4
(lowest level-n) area units),
returns intersected quadkeys whose
TYPEPUB and
TYPESUB are both interior,
returns intersected quadkeys whose
TYPEPUB is interior and
TYPESUB is boundary,
returns intersected quadkeys whose
TYPEPUB is boundary and
TYPESUB is interior, and
returns intersected quadkeys whose
TYPEPUB and
TYPESUB are both boundary.
3.3. Determination Stage: Determining the Topological Relationship
With the area matrices generated in the previous stage, the topological relationships between
PUBGEO and
SUBSP_GEO can be determined. As with the DE-9IM, each relationship has a specific matrix pattern.
Table 2 lists the topological relationships and their corresponding area matrices (
AMatrix), in which the
function returns the number for the area unit whilst
I and
B functions return the interior and boundary of
SUBSP_GEO or
PUBGEO, respectively. In addition, the
AMatrixII,
AMatrixIB,
AMatrixBI, and
AMatrixBB represent the cell (0, 0), (0, 1), (1, 0), and (1, 1) in an area matrix, respectively. Finally,
Sum(
AMatrix) represents
AMatrixII +
AMatrixIB +
AMatrixBI +
AMatrixBB.
Note that most of the information used during the determination process can be calculated in advance. For instance, , , , , , and can be generated at the time when subscriptions and publications are entered into the system.
An explanation of the concept of each determination:
EQUALS: If PUBGEO EQUALS SUBSP_GEO, their interiors and boundaries should be the same, which means that their interiors completely intersect with each other and so do their boundaries. Consequently, AMatrixII equals the area of both interiors, and AMatrixBB equals the area of both boundaries.
DISJOINTS: If PUBGEO DISJOINTS SUBSP_GEO, both the interior and boundary of PUBGEO are completely located on the exterior of SUBSP_GEO. There is no intersection between the interiors and boundaries of PUBGEO and SUBSP_GEO. Therefore, AMatrixII, AMatrixIB, AMatrixBI, and AMatrixBB are all zero.
INTERSECTS: PUBGEO INTERSECTS SUBSP_GEO if any interiors or boundaries intersect, which means that one of the cells in AMatrix is not zero.
TOUCHES: If PUBGEO INTERSECTS SUBSP_GEO and their interiors do not intersect (i.e., AMatrixII equals to zero), PUBGEO TOUCHES SUBSP_GEO.
OVERLAPS: As mentioned in
Section 3.1, the processing of exterior indices is replaced with the intersection of interiors and boundaries as “
if the intersection of interiors and boundaries are not equal to both geometries, each geometry intersects with the exterior of the other”. Therefore, for non-line/line relationships, the determination algorithm becomes “if
AMatrixII is not zero and
Sum(
AMatrix) equals to neither
Area(
PUBGEO) nor
Area(
SUBSP_GEO),
PUBGEO OVERLAPS SUBSP_GEO”.
In addition, as the OGC specification defines, the OVERLAPS relationship requires the dimension of the intersection region to be equal to the dimensions of both geometries. The intersection of a “line OVERLAPS line” relationship should be a line instead of a point. Therefore, for a line/line relationship, the AHS model also checks if AMatrixII is larger than one area unit in order to make sure that the PUBGEO does not intersect with SUBSP_GEO at a point.
CROSSES: For non-line/line relationships, the determination algorithm for
PUBGEO CROSSES SUBSP_GEO relationship is the same as that for the
OVERLAPS relationship. For line/line relationships, as the AHS model follows the common definition mentioned in
Section 2,
PUBGEO CROSSES SUBSP_GEO if the interiors intersect at a point (i.e.,
one area unit).
WITHIN: The relationship of PUBGEO WITHIN SUBSP_GEO means that AMatrixII, AMatrixIB, AMatrixBI, and AMatrixBB contain the entire area of PUBGEO (i.e., the interior and boundary of PUBGEO).
CONTAINS: PUBGEO CONTAINS SUBSP_GEO if AMatrixII, AMatrixIB, AMatrixBI, and AMatrixBB contain the entire area of SUBSP_GEO (i.e., the interior and boundary of SUBSP_GEO).
4. Distributed AHS Model
As discussed earlier, distributed computing and cloud computing techniques are applied to the AHS model to provide more storage and computation resources. For example, we can split the
AHSSUB into pieces and store them in different machines to effectively address the potential storage issue. We can also use distributing computing techniques to improve query processing performance, especially for the indexing and matching stages. To be more specific, we follow the MapReduce framework [
18] and propose a distributed AHS model processing flow.
Figure 3 shows the high-level architecture and workflow of the distributed AHS model. There are four stages in the distributed AHS model, namely (1) index; (2) match; (3) aggregate area matrices; and (4) determine relationships. As the processes in the (1) index; (2) match; and (4) determine relationships stages are the same as for the standalone AHS model, the (3) aggregate area matrices stage mainly groups and integrates area matrices based on
SUBID.
As shown in
Figure 3, the distributed AHS model has a master node and a set of worker nodes. The master node is responsible for receiving subscriptions and publications as well as forwarding subscriptions and publications to appropriate workers. Each worker is in charge of creating and matching indices according to the set of quadkeys assigned to it. That means that if a worker is in charge of quadkey
qA, all the indices that have
qA as a prefix are created and maintained by this worker. The master node has a lookup table storing the set of quadkeys that each worker is responsible for. For the remainder of this paper, this set of quadkeys will be referred to as
WorkerQ and the lookup table as
LUT for clarity purposes. Hence, a worker
Workeri would have
WorkerQi as the set of quadkeys it is responsible for.
Algorithm 1 shows the worker selection algorithm. When a master node receives a subscription SUB or a publication PUB, it uses the LUT and SUBSP_GEO or PUBGEO to determine the workers that are responsible for processing the SUB or PUB. Workers are returned if their WorkerQ overlaps with SUBSP_GEO or PUBGEO. In order to reduce the computation load on the master node, we used a coarse estimation on SUBSP_GEO and PUBGEO. That is, we first find the lowest level of quadkeys (i.e., LowestLevel) in the LUT with the GetLowestQuadkeyLevel function (line 2 of Algorithm 1), and generate quadkeys of SUBSP_GEO or PUBGEO (i.e., qs) on the LowestLevel (line 3 of Algorithm 1). Then the containing relationship is determined by the prefix matching of quadkeys from qs and WorkerQi (line 6 of Algorithm 1). Finally, if quadkeys from qs and WorkerQi overlap with each other, the algorithm returns the Workeri.
Algorithm 1.
The worker selection algorithm. |
Function SelectWorkers(LUT, Geo): Workers |
1: | Workers ← {} |
2: | LowestLevel ← GetLowestQuadkeyLevel(LUT) |
3: | qs ← GetQuadkeysByLevel(Geo, LowestLevel) |
4: | FOREACH WorkerQi ∈ LUT |
5: | FOREACH q ∈ qs |
6: | IF q is contained by WorkerQi OR q contains WorkerQi THEN |
7: | IF Workers does not contain Workeri |
8: | Workers ← Workers + Workeri |
9: | BREAK |
10: | END IF |
11: | END FOREACH |
12: | END FOREACH |
13: | RETURN Workers |
To match a publication, when the master node receives a publication PUB, it first uses Algorithm 1 to select one or more workers, and then forwards PUB to the selected workers. When a worker receives PUB, it creates indices with the PUBGEO based on the quadkeys it is in charge of, matches AHSPUB with the local AHSSUB, and creates area matrices based on the local matches, before finally returning the created area matrices to the master. Since each worker only has a portion of AHSSUB (based on the quadkeys it is in charge of), the created area matrices only represent a portion of the complete area matrices. Therefore, after the workers send the partial area matrices to the master, the master groups and aggregates them into complete area matrices based on the SUBID. Finally, the master node equally distributes the aggregated area matrices to workers to determine topological relationships in parallel.
Regarding the load balancing in the distributed AHS model, the number of quadkeys in AHSSUB indicates the processing performance. This is because the number of quadkeys in AHSSUB determines the required storage and CPU resources in a worker node. Therefore, a simple load-balancing approach is assigning a threshold θ on the number of quadkeys each worker handles. After every process of registering a subscription, if the number of quadkeys in a worker’s AHSSUB is larger than the threshold θ, the worker splits the original AHSSUB into multiple AHSSUBs based on the quadkeys it is in charge of. These split AHSSUBs are then assigned to other existing or newly created workers. Finally, the master updates its lookup table accordingly.
To sum up, the AHS model is further improved by using distributed computing concepts. The proposed distributed AHS model is able to harness the storage and CPU resources from multiple machines to address potential storage issues and improve indexing and matching performance. The distributed AHS model assigns quadkeys to workers (i.e., WorkerQ) to distribute the processing load. This approach allows the distributed AHS model to retain the ability to match a publication with all subscriptions in a single process. This is because the quadkeys shared by multiple subscriptions remain aggregated in the same AHSSUB.
5. Evaluation Results
We evaluate the proposed AHS model from four perspectives. Firstly, since the major objective of the AHS model is to efficiently process topological operators when handling a large number of geometries, we analyze the scalability of the AHS model in terms of the number of queries/subscriptions by comparing it with PostGIS, which is used to represent traditional topological operators.
The second evaluation is for measuring the indexing latency. Since the AHS model approximates geometries with a quadtree tile system, the indexing for large geometries may be time-consuming. Although we argue that the subscriptions and data in the context of sensor web would not have large geographical coverage, we evaluate the indexing latency of the AHS model by simulating geometries in various sizes for comprehensiveness.
The third evaluation analyzes the matching latency of the AHS model. More specifically, this evaluation measures the latency of matching AHSPUB and AHSSUB. Like the second evaluation, we examine geometries simulated in various sizes to provide a comprehensive evaluation.
Finally, the fourth evaluation is the end-to-end performance analysis for the distributed AHS model. We measure the overhead latencies and carry out each of the following steps: (1) index publication; (2) match AHSPUB with AHSSUB; and (3) determine the relationship. This evaluation examines all possible relationships between two geometries. The testing data for this evaluation are manually simulated city-level subscriptions and publications that we think are relatively realistic and that provide the expected AHS model performance in a real-world application.
5.1. AHS Model Scalability Evaluation
One of the most important functions of the AHS model is to perform topological operators in an aggregated manner. By aggregating the indices from all subscriptions in a single structure (i.e., AHSSUB) and decoupling the indices and subscriptions, the AHS model can match new data with all subscriptions in a single process. In doing so, subscriptions that have quadkeys in common can benefit from this design.
In order to demonstrate this contribution, we evaluate the scalability of the AHS model in this section. The query performance is measured whilst registering different numbers of subscriptions into the AHS model. The point-in-polygon query is chosen as our testing case, as it is one of the most common queries. A subscription with the coverage of a city (i.e., a polygon) was simulated and assigned WITHIN as the topological operator in the spatial predicate. This was followed by the simulation of a publication with a point geometry located in the city. With the simulated subscription and publication, different numbers of subscriptions were registered into the AHS model (with different subscription identifier SUBID) and the query latency measured every 500 additional subscriptions by sending the publication to the AHS model. The same test was performed on an off-the-shelf PostGIS database for comparison.
PostGIS mainly use the GEOS library, which ports Java Topology Suite (JTS) to C++. JTS has been used in many products, including GDAL/OGR, QGIS, MapServer, and GeoTools. For the geometries stored in PostGIS, their minimum bounding rectangles (MBRs) are indexed. In the filter step, the MBRs are intersected with the input geometries, and the intersected geometries are found. Then in the refinement step, PostGIS uses each pair of database geometry and input geometry to construct an intersection matrix, which will be used to compare with the predefined DE-9IM patterns to decide the topological relationships [
31].
Since this evaluation was mainly about the scalability in terms of the number of subscriptions, this evaluation was performed on a single machine with a standalone AHS model to avoid communication overhead and machine heterogeneity. This evaluation was performed on a desktop machine, which runs on an Intel® Core™ i5-650 @ 3.20GHz, 6GB RAM, and Western Digital WD10EARS-22Y5B1.
The query latencies on different numbers of subscriptions are shown in
Figure 4. Based on these experimental results, we observed that the query latency increases with the number of subscriptions for both the PostGIS database and the AHS model. However, as the query latency of the AHS model increases 2.5 times slower than that of PostGIS, it shows that the AHS model is more scalable than the traditional solution in terms of the number of subscriptions.
5.2. Evaluation of AHS Model Indexing Performance
This section measures the latency of generating necessary quadkeys from the geometry of subscriptions. Since the time cost for indexing may differ based on the size of geometry, we randomly generated geometries in different sizes and measured the latencies for indexing them. In addition, as mentioned previously, the lowest level of quadtree tile is needed as the granularity on geometry approximation. The quadtree tile system used in this evaluation had 14 levels.
Furthermore, as the distributed AHS model can execute indexing tasks in parallel, we also measure the indexing latency when using different numbers of workers. However, in order to avoid machine heterogeneity, we used a single machine to simulate each worker in the distributed AHS model by executing workers’ tasks independently. We simulated scenarios of distributed AHS models with one, four, 16, 64, and 256 workers. While the one-worker scenario is basically the standalone AHS model, each worker in the four-, 16-, 64-, and 256-workers scenarios handles a quadkey in the first, second, third, and fourth levels of the quadtree, respectively. For example, for the four-workers scenario, we simulated four workers handling quadkeys ‘0’, ‘1’, ‘2’, and ‘3’. With distributed computing processes, the entire process is considered as finished at the time that the last worker finishes its task. Here we present the maximum (instead of average) indexing latency from workers in each scenario.
The indexing latency of point, line, and polygon geometries are shown in
Figure 5. Since the sizes of point geometries are the same (i.e., one area unit = one quadkey), we take the average for each scenario. From
Figure 5, we can observe that the indexing latency for point geometry is much smaller than that for other types of geometries, and the indexing latency for line geometry is smaller than that for polygon geometry. We believe this is because of the different number of quadkeys being indexed, which is related to the size and location of geometries.
By comparing the indexing latencies based on different numbers of workers, we can observe that performance can be significantly improved by using more workers in the distributed AHS model. In addition, sometimes using four workers results in the same performance as using one worker does. The reason is that workers in the distributed AHS model handle different geospatial regions. If the randomly simulated test data are located only in one region, only one worker will conduct the whole process. Hence, with more workers handling different regions, we will have more opportunities of distributing the processing load. Our evaluation results show that the indexing process of using 256 workers can be 5 to 10 times faster than the standalone indexing process.
Finally, while this evaluation examines simulated geometries in various sizes for comprehensiveness, some of these geometries are too large in the context of the sensor web. Among these simulated geometries, the longest line geometry generated was 7112 kilometers long; the largest polygon geometry covered about 37% of Earth. In reality, a major city highway is usually about 100 kilometers; and a city usually covers less than 1% of Earth.
To sum up, while some of the simulated geometries are not realistic, the AHS model is able to finish the indexing step in a timely manner with the help of distributed processing. For indexing subscriptions, considering the long-running nature of the continuous query, we argue that the measured indexing overheads are acceptable. In addition, as real-world sensor web data usually have much smaller geospatial coverage than the simulated geometries, we believe the indexing for publications would be much faster than that for subscriptions.
5.3. Evaluation of AHS Model Matching Performance
This section measures the latency of matching AHSPUB with AHSSUB. Since the time cost for matching may differ based on the number of quadkeys, we randomly generate geometries in various sizes to provide a comprehensive evaluation. In order to make sure that the quadkeys of these geometries will be processed, we first applied the same geometry to both publication and subscription, and then assigned EQUALS as the topological operator. In this evaluation, the quadtree tile system had 14 levels.
In addition, like the previous evaluation, we also measured the matching latency when applying different numbers of workers. In this evaluation, we used the same machine to simulate each worker handling different quadkeys in the distributed AHS model. Scenarios of distributed AHS models were simulated with one, four, 16, 64, and 256 workers. The maximum (instead of average) matching latency from workers in each scenario was also presented.
The matching latency of point, line, and polygon geometries are shown in
Figure 6. Since the sizes of point geometries are the same (i.e., one area unit), we took the average latency for each scenario. From
Figure 6, we can observe that the matching latency for point geometry is much smaller than that for other types of geometries, and the matching latency for line geometry is smaller than that for polygon geometry. As with the indexing performance, we believe this is because of the different number of quadkeys being processed.
By comparing the indexing latencies based on different numbers of workers, we can observe that the performance can be significantly improved by using more workers in the distributed AHS model. In addition, sometimes using four workers results in the same performance as using one worker does. The reason is that workers in the distributed AHS model handle different geospatial regions. If the randomly simulated test data are located only in one region, only one worker will conduct the entire process. Hence, with more workers handling different regions, we will have greater opportunities for distributing the processing load. Our evaluation results show that the matching process of using 256 workers can be 20 to 300 times faster than the standalone matching process.
Finally, like the previous evaluation, this evaluation examines simulated geometries in various sizes for comprehensiveness. However, some of these geometries may be too large in the context of the sensor web. For example, among these simulated geometries, the longest line geometry we generated was 7778 kilometers long; the largest polygon geometry covered about 20% of Earth. However, while some of the simulated geometries are not realistic, the AHS model is able to match AHSPUB and AHSSUB in a timely manner with the help of distributed processing.
5.4. Evaluation of the End-to-End Query Performance of the AHS Model
In this section, we evaluate the end-to-end query performance of the AHS model. We measure the latency of indexing publication, matching AHSPUB and AHSSUB and determining relationships as well as overheads. In addition, this evaluation is performed on all possible relationships. We simulated one subscription/publication pair for each possible relationship and tried to create geometries with common sizes for city-level applications. For example, we created the point, line, and polygon of subscription based on the ideas of a point in a city (e.g., a city landmark), a road crossing a city (e.g., a major highway), and the city coverage, respectively. Publications that match subscriptions for each possible topological relationship were also manually created (e.g., a sensor located at a road intersection).
In order to test the overheads of distributed computing, this evaluation used two machines located on the same local network. The sets of quadkeys each worker is in charge of are manually configured so that workers handle similar amounts of work. Both machines are desktop machines. One of them runs on an Intel® Core™ i5-650 @ 3.20 GHz, and 6 GB RAM and the other on Intel® Core™ i7-3770 @ 3.40 GHz, and 10 GB RAM. Considering the machine’s heterogeneity and the possibly unequal amount of work assigned, instead of presenting the maximum latencies, this evaluation calculates the average latencies to provide an expected AHS model performance in a real-world application. Each scenario is tested 10 times and the average for each scenario taken.
The end-to-end query performances of using point, line, and polygon geometry as subscriptions are shown in
Figure 7,
Figure 8 and
Figure 9, respectively. Based on these evaluation results, we found that it is difficult to simulate datasets that are equitable enough to be used in the comparison of different topological operators. We argue that the performance differences between topological operators do not have much meaning as those differences may come from the machine’s heterogeneity or the simulated dataset, such as the size of the geometry. However, this evaluation is still valuable as it measures the overheads of distributed computing, presents the latency of each step, and shows the potential AHS model performance in a real-world sensor web application.
Therefore, based on the experimental results, our first observation is that the indexing and overheads take up more than 99% of the end-to-end latency. While the overheads of applying distributed computing process are relatively stable (between 10 to 30 milliseconds), the indexing latency varies widely depending on the size of the publication geometry. Furthermore, in the context of a sensor web publish/subscribe system, since the geometries of the publication are usually small (e.g., sensor locations or observed features), we believe that the indexing latency will remain small as well.
Our second observation is that the latencies for determining relationships are very small as each determination only handles a two-by-two matrix. Lastly, as this evaluation is based on a more realistic dataset, the measured performance could provide us with a possible AHS model performance in a real-world application. As we can see from the evaluation results, most of the tests can be finished in 100 milliseconds, while more than 70% of them can be completed in 50 milliseconds. Therefore, we believe that the AHS model can efficiently process any possible topological operators for sensor web data, which is critical for time-sensitive applications.
6. Conclusions and Future Work
We have presented the AHS model, which can efficiently determine the topological relationships between geometries in a sensor web publish/subscribe system. Due to the potentially large amount of sensor web data, the continuous query processing model is increasingly attracting interest for many time-critical applications. However, time-consuming geospatial operators are not suitable for applications that require timely processing and notification. The AHS model is an example that shows that traditional geospatial operators can be redesigned as efficient continuous query operators in the context of publish/subscribe systems.
Our evaluation results show that the proposed AHS model is 2.5 times faster than PostGIS when processing a large number of geometries, which indicates that the proposed solution is more scalable than the traditional solution. We also evaluated the indexing and matching performances of the distributed AHS model. The evaluation shows that, with the help of distributed processing, the distributed AHS model can significantly improve indexing and matching performances and finish tasks in a timely manner, even for geometries of large sizes.
Finally, we also evaluated the end-to-end query latency with relatively realistic datasets. We observed that indexing and overheads account for more than 99% of the end-to-end latency. While the overhead of applying the distributed computing process is relatively stable (between 10 to 30 milliseconds), the indexing latency varies widely depending on the size of the geometry of publications. As demonstrated earlier, the AHS model can finish most simulated queries in 100 milliseconds, thus we believe that the AHS model is able to efficiently process topological operators in a geospatial sensor web publish/subscribe system.
With regards to future directions, the current load balance approach is a simple and naïve solution. There are other factors that could be considered in the future. For example, in order to improve service availability and avoid the issue of potential machine failure, the distributed AHS model can assign multiple workers to handle the same quadkeys (i.e., replicas). In addition, as the current load balancing approach only considers the geospatial distribution of subscriptions, monitoring the geometries of publications may allow for the adaptive distribution of the processing loads. Lastly, we will also try to improve the AHS model in order to support multi-point, multi-line, and multi-polygon geometries.