1 Introduction
The volume of mobility data being collected has been steadily increasing since the advent of affordable personal location-enabled mobile devices. Examples of mobility data continuously generated and collected in huge volumes include (a) individual sporadic locations obtained from mobile app data and location-based social networks; (b) individual pedestrian, biking, or driving trajectories constrained by underlying sidewalks, biking trails, and road networks, respectively; (c) indoor individual or asset tracking data obtained from RFID and Bluetooth devices; (d) athletes’ movement data in various sports obtained from wearable devices; (e) public transportation, taxi, ride sharing, and delivery logistics trajectories obtained by location-tracking devices and specially designed app services; (f) aircraft and vessel trajectories moving in an unconstrained environment (i.e., no underlying road network) obtained by air and sea traffic monitoring services; and (g) tracking data of animals moving freely in space, obtained from physically tagged and remotely sensed animals. Generally speaking, for each moving object, mobility data is typically available in the form of a sequence of (location, timestamp) pairs. The location attribute could be as simple as a point, represented either by latitude and longitude coordinates or by relative coordinates with respect to the underlying space. The location attribute could also be an area, which can represent the mobility of objects with spatial extents, e.g., flocks or group movement.
The ability to understand and analyze mobility data is crucial for various widely used important sectors and applications. In transportation and traffic management, analyzing traffic data through vehicle mobility helps in predicting accidents [
158], traffic congestion [
258], and better route planning [
51]. In ride sharing and delivery logistics applications, analyzing trip mobility data helps in data-driven eco route planning, which results in huge cost and energy savings [
96]. In location-based services, analyzing people movements around the city significantly helps in trip planning activities [
217], finding popular tourist sites and restaurants [
118], and data-driven routing and querying [
218]. In indoor navigation, understanding how people move indoors helps in understanding the traffic for various stores inside a mall, which is needed in various market research studies [
114]. In urban planning, driving data can significantly help in building highly accurate, reliable, and annotated maps [
159] as well as deciding on good locations for various facilities, e.g., restaurants, retail stores, and clinics [
206]. In social computing, analyzing how people move in cities and regions helps in understanding the demand for infrastructure and energy as a means of reducing inequalities [
200]. In disaster response, analyzing crowd movement helps in preparing for natural disasters through rescuing and evacuation efforts [
105]. In health informatics, connected wearables can monitor and analyze the movement of elderly people, allowing for timely, and potentially life-saving, interventions [
134]. In pandemic prevention, privacy-preserving individual tracking allows for contact tracing, which was deemed to be a cornerstone in limiting pandemic spread [
155,
277].
Despite the common goal of acquiring, managing, and generating insights from mobility data, the mobility data science community is largely fragmented, developing solutions in silos. It stems from a range of disciplines with expertise in moving object data storage and management [
99], geographic information science [
88], spatiotemporal data mining [
210], human mobility modelling [
27], ubiquitous computing, computational geometry, and more. The sheer volumes of mobility data along with the immense need for mobility data analysis in various applications call for employing a complete Data Science pipeline [
190] over mobility data (Figure
1). This includes the whole pipeline of Data Science applications, starting from the data storage and management infrastructure and going through data collection, data cleaning and preprocessing, and data analysis. Unfortunately, this is not straightforward as current Data Science systems, tools, and algorithms are not directly applicable to mobility data. This is mainly due to the fact that these systems, tools, and algorithms are designed in a generic way to support any data type and, hence, they do not lend themselves to the distinguishing characteristics of mobility data. Examples of such characteristics include the spatial and temporal dimensions of the data, the rate of updates, and the privacy requirements. In particular, mobility data is always spatial, in which nearby objects are more related to each other. This is unlike traditional data, in which the concepts of
nearby and
locality are not taken into account. Also, similar to time series data, mobility data is temporal, in which one object may have hundreds of updates to its location and all updates are related to each other (e.g., one trajectory). This is again unlike traditional data, in which temporal updates of a single object are not frequent and older updates would be of less importance. Similar to streaming data, mobility data has a high frequency of updates, which is not supported in typical data science applications. Finally, mobility data is more sensitive to privacy. While privacy preserving in traditional data can be achieved by removing (quasi-)identifier attributes, in mobility data, locations by themselves are considered private information that can reveal not only the users’ identities, but also their behavior, lifestyle, medical conditions, and workplaces.
Motivated by the ubiquity and sheer volume of mobility data, the importance of mobility applications, and the lack of support from current data science pipelines, this article presents a pipeline for Mobility Data Science. We define Mobility Data Science as an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from potentially noisy structured and unstructured mobility data, and to apply that knowledge across a broad range of application domains. Currently, the community of developers, practitioners, and researchers dealing with mobility data uses off-the-shelf data science techniques and systems to collect, clean, manage, and analyze its mobility data; we firmly believe that this leads to sub-par performance. We urge this community to build its own mobility data science pipeline to better serve its own purpose. This article makes the case for a mobility data science pipeline and presents the challenges that need to be addressed to realize it.
3 Mobility Data Cleaning
Until the early 21st century, location data and mobility data available for
geographic information science (GIS) was mainly collected, curated, standardized [
78,
79], and published by authoritative sources such as the
United States Geological Survey (USGS) [
231]. Now, data used for mobility data science is often obtained from sources of
volunteered geographic information (VGI) [
216]. Such data is contributed by millions of individual users (more than 10 million contributors in the case of OpenStreetMap [
170]) and is rarely curated. Mobility data collected from such sources is highly uncertain due to physical limitations of sensing devices, due to obsoleteness of observations, and in many cases is simply incorrect due to deliberate misinformation [
157]. Consequently, our ability to unearth valuable knowledge from large sets of mobility data is often impaired by the uncertainty of the data, to the point that uncertainty has been named the “Achilles heel of GIS” [
89].
Data cleaning and preprocessing is a cornerstone of all data science. In fact, it has been reported that data scientists spend more than 80% of their time in data cleaning and preparation [
162]. As a result, there are huge efforts in the data science community dedicated to developing various data cleaning algorithms [
57] and full-fledged systems [
67]. Mobility data is no exception in terms of its need for data cleaning and preparation procedures. However, for numerous reasons, cleaning and preparing mobility data poses unique challenges. This section discusses current efforts and challenges of mobility data cleaning.
3.1 Efforts in Mobility Data Cleaning
A recent survey [
125] and data quality assessment tool [
91] have discussed various sorts of errors that negatively impact data quality in spatial and mobile environments. Motivated by the inaccuracy of location tracking devices, several efforts were dedicated to addressing (a) the inherent spatial inaccuracy of GPS devices and (b) the uncertainty of moving object whereabouts between two known locations, which is a result of low sampling rates due to bandwidth and battery limitations.
As the spatial inaccuracy indicates erroneous GPS coordinates, the efforts to identify and correct such coordinates have focused on either finding and eliminating outliers or map matching all coordinates to an underlying fixed and trusted infrastructure (e.g., road network map). For the case of map matching, existing efforts aim to match/snap all GPS traces to an underlying road network [
42,
46]. Proposed techniques range from simply snapping each point to its nearest road to applying a Markov chain to identify the most probable road segment that each point should be snapped to. In the case in which there is no underlying road infrastructure (e.g., marine transportation or animal movement), outlier detection techniques are used to identify and remove erroneous points [
224].
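The simplest end of the map-matching spectrum described above, snapping each point to its nearest road, can be sketched as follows. This is only an illustrative sketch under stated assumptions: roads are given as straight segments in planar (projected) coordinates, and the function names `project_onto_segment` and `snap_to_nearest_road` are hypothetical, not taken from any cited system. Probabilistic matching, as in the Markov-based techniques, would additionally reason over the sequence of points rather than treating each point in isolation.

```python
import math

def project_onto_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return a  # degenerate segment: both endpoints coincide
    # Parameter t of the orthogonal projection, clamped to stay on the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def snap_to_nearest_road(p, segments):
    """Snap p to the closest road segment; returns (snapped_point, segment_index)."""
    if not segments:
        raise ValueError("at least one road segment is required")
    best = None
    for i, (a, b) in enumerate(segments):
        q = project_onto_segment(p, a, b)
        d = math.dist(p, q)
        if best is None or d < best[0]:
            best = (d, q, i)
    _, q, i = best
    return q, i

# Two parallel roads (assumed toy data); the GPS fix lies closer to the first one.
roads = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 5.0), (10.0, 5.0))]
snapped, road_id = snap_to_nearest_road((3.0, 1.0), roads)
```

Because each point is matched independently, a noisy point near an intersection can jump to the wrong road, which is one reason sequence-aware techniques exist.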
Irrespective of the collection method and device settings, there is also unavoidable uncertainty in movement data caused by its discreteness. Since time is continuous, the data cannot refer to every possible instant. For any two successive recorded instants, there is a temporal gap in which the whereabouts of the moving objects are unknown. To overcome such location uncertainty, several efforts were dedicated to modeling the uncertainty of mobility data, as surveyed in [
278].
3.2 Challenges in Mobility Data Cleaning
This section delves into some challenges linked to cleaning mobility data that the community needs to tackle.
Challenge 5. Inaccuracy in the Movement Space Infrastructure. A unique challenge in mobility data is that, in many cases, its reference points are the ones that are inaccurate. In particular, mobility data that represent movement on a road network may be more accurate than the road network itself. Road networks, like any other type of data, suffer from all sorts of inaccuracy and may not even be available in many places [
160]. In fact, Microsoft has recently announced that it has found more than 1 million kilometers of roads missing from current maps [
148]. This is why there is a whole area of industrial and academic research about map inference, which aims to infer (all or missing parts of) the road network from either satellite images [
29] or trajectory data [
37]. However, almost all of these techniques focus on making accurate maps in terms of topology. There need to be more efforts to develop map inference algorithms that go beyond inferring the map topology to inferring map metadata (e.g., road speed, traffic lights, number of lanes, and turns), without which mobility data would not be accurate as its road network reference itself is missing important data. A major step towards cleaning mobility data would be to first clean its reference map.
Challenge 6. Filling in Temporal Mobility Gaps. As mentioned earlier, many efforts have been dedicated to modeling the uncertainty of moving objects’ whereabouts between two consecutive time instances. However, uncertainty poses different challenges to downstream functions and applications, including the need to develop new techniques for indexing, query processing, and data analysis for various uncertainty models. One way to overcome this is to try to infer the actual whereabouts of a moving object between any two time instances with known locations. There are already several efforts to insert artificial points between two consecutive trajectory points, with the promise that these points act as if the trajectory were collected at a very high sampling rate. This process has various names, e.g.,
trajectory interpolation [
136,
268],
trajectory completion [
130],
trajectory data cleaning [
261],
trajectory restoration [
124],
trajectory map matching [
42],
trajectory recovery [
243], and
trajectory imputation [
76]. However, the large majority of such work relies on matching the trajectory points on the underlying road network, where the imputation becomes finding the road network’s shortest path between two consecutive trajectory points. Unfortunately, this is not applicable to the case in which the road network is unknown, untrusted, or inaccurate. Hence, more recent attempts try to do data-driven trajectory imputation without relying on the underlying road network [
76,
80]. However, these techniques are either not scalable to city-scale trajectory datasets or require dense historical data that drives their imputation process. There is an immense need to develop scalable, accurate, and fine-grained imputation that closely mimics a continuous data stream of trajectory locations.
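To make the idea of inserting artificial points concrete, the sketch below shows the most basic baseline, straight-line (linear) interpolation between consecutive observations. It is not one of the road-network-based or data-driven techniques cited above; it merely illustrates the input and output of an imputation step. The function name `interpolate_trajectory`, the `(x, y, t)` tuple layout, and the `step` parameter are assumptions made for this illustration.

```python
import math

def interpolate_trajectory(points, step):
    """Densify a trajectory by linear interpolation.

    points: list of (x, y, t) tuples sorted by time t.
    step:   maximum allowed time gap between consecutive output points.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not points:
        return []
    dense = [points[0]]
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        gap = t1 - t0
        # Number of equal sub-intervals needed so that none exceeds `step`.
        n = math.ceil(gap / step) if gap > 0 else 1
        for k in range(1, n):
            f = k / n
            dense.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0), t0 + f * gap))
        dense.append((x1, y1, t1))
    return dense

# Two observations 10 time units apart, densified to at most 5 units per gap.
sparse = [(0.0, 0.0, 0.0), (10.0, 0.0, 10.0)]
dense = interpolate_trajectory(sparse, step=5.0)
```

A straight line ignores roads, obstacles, and speed changes, which is precisely the limitation that motivates the network-based and data-driven imputation work discussed above.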
4 Mobility Data Analytics
Spatial data is special. Unlike non-spatial features, location attributes (e.g., longitude and latitude) rarely exhibit linear or other simple functional relationships to variables of interest. It rarely makes sense to model a variable of interest directly in relation to spatial attributes. Instead, it is distances that matter. According to Tobler’s first law of geography, “everything is related to everything else, but closer things are more related than things that are far apart” [
221]. For mobility data, proximity is further extended with time, i.e., to objects that are close in both space and time. In addition to this concept of spatiotemporal autocorrelation, what makes mobility data even more challenging to handle is that it is often observed from humans, whose behavior can be irrational and difficult to explain. As Nobel Prize laureate Murray Gell-Mann famously said, “Think how hard physics would be if particles could think” [
172]. In mobility data science, unlike in physics, the “particles” of interest are often humans who can think. Data collection sensors have the capability to capture the spatiotemporal locations of moving objects, but not their behavioral aspects. These difficulties require new paradigms, techniques, and algorithms that can analyze and learn from spatiotemporal data and that can explain and predict the associated behavior. This section discusses current efforts and challenges of mobility data analysis.
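As a small illustration of why distances, rather than raw coordinates, are the natural input for analysis, the sketch below computes the great-circle (haversine) distance between two latitude/longitude points and combines it with a time difference to test spatiotemporal proximity. The spherical Earth model, the `(lat, lon, unix_time)` tuple layout, and the function names are assumptions made for this example only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; assumes a spherical Earth

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def is_spatiotemporally_close(a, b, max_km, max_seconds):
    """a, b: (lat, lon, unix_time) observations; True if close in both space and time."""
    return (abs(a[2] - b[2]) <= max_seconds
            and haversine_km(a[0], a[1], b[0], b[1]) <= max_km)

# Two observations roughly 111 m apart and 30 s apart (toy values).
obs_a = (0.0, 0.0, 0)
obs_b = (0.0, 0.001, 30)
close = is_spatiotemporally_close(obs_a, obs_b, max_km=0.2, max_seconds=60)
```

The thresholds `max_km` and `max_seconds` are application-specific; choosing them is itself part of the modeling problem.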
4.1 Efforts in Mobility Data Analytics
Mobility data analytics has already gained momentum in research in recent years. Dedicated workshops have existed in major conferences, including the ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (BigSpatial) since 2011 [
209], the
Big Mobility Data Analytics (BMDA) workshop in EDBT since 2018 [
177], and the ACM SIGSPATIAL International Workshop on Animal Movement Ecology and Human Mobility (HANIMOB) since 2021 [
171]. Surveys on the status of research exist [
20,
198].
Mobility data analytics encompasses various application domains and involves analyzing data from different sources such as urban [
265], maritime [
61], aviation [
59], animal movement [
171], and indoor movement [
114]. Among these different themes, urban mobility stands out with a fairly large body of research, including green routing [
10], traffic anomaly detection [
173], hot spot and hot path analysis [
166], road traffic prediction [
161], and travel time estimation [
240]. Trajectories of moving objects have been used as a means to create and continuously update the road network [
159]. Public transport systems also collect ticketing data in the form of passenger check-ins, sometimes also associated with check-outs. This data has been shown to be very useful to transit planners in understanding passenger demand and movement patterns in daily operations as well as in the strategic long-term planning of the network [
227]. Personal mobility of individuals is also a subject of analysis that includes, e.g., activity recognition [
50,
175], personalized routing [
66], matching with ride-sharing services [
19], and crowd-sourcing [
178].
While a significant portion of research focuses on understanding and analyzing data through analytics, there are also important efforts dedicated to developing generic analysis tools for spatiotemporal data that are agnostic to the application domain. Efforts regarding generic methods for mobility data analysis include, among many others, trajectory clustering [
244], trajectory similarity measures [
224], outlier detection [
101], transportation mode classification [
40], spatiotemporal pattern detection [
199], and trajectory completion [
121]. However, and despite these many research efforts towards analyzing mobility data, there is a lack of common data analysis tools and systems. The scientific software environment for mobility data analysis is rather fragmented. For example, [
117] lists 58 packages in their review of R packages for movement and [
92] reviews Python libraries for movement data analysis and visualization.
Recent years have seen a notable increase in research on deep learning for mobility data analysis [
137,
250]. This brought an increased adoption of various paradigms and (adapted versions of) architectures used in other areas in which deep learning has brought improvements in tasks, e.g., clustering/classification [
149], prediction [
122] and recommendation [
30], information propagation [
274], etc. For example,
Generative Adversarial Network (GAN)–based architectures have been used recently to learn representations of trajectories and generate synthetic trajectories [
84]. Given the introduction of Transformers [
233], Transformer-based approaches have also been used for mobility modelling and trajectory prediction [
254] given the sequential properties of mobility data. Other deep learning approaches, such as contrastive learning [
273], have also been exploited in mobility data settings, along with investigation of the impact/benefits of representation learning [
86].
4.2 Challenges in Mobility Data Analysis
This section highlights open problems related to mobility data analysis that need consideration from the community.
Challenge 7. Machine Learning (ML) for Mobility Data. The state-of-the-art
deep learning (DL) models, such as Transformers [
233], were not initially developed with mobility data science in mind. They were derived from
natural language processing (NLP) and computer vision domains. The community needs to provide best practices for doing ML (and DL) for mobility data.
A major hurdle, and a research opportunity as well, is that existing ML and analytics tools, e.g., TensorFlow and PyTorch, do not support location and mobility as base data types to reason about. Thus, even basic analysis tasks, such as clustering, classification, and similarity, need to be extended when mobility data is involved. These tasks, as well as higher-level analysis, cannot be totally independent. Instead, common basic building blocks could have an impact on all or some of them. For example, exploring the effectiveness of embeddings for mobility data analysis is a basic block that could impact different ML-based analysis tasks. This raises the challenge of building analysis primitives and common building blocks for applications that could shape a framework of ML-based mobility data analysis.
Another major hurdle is the robustness of data-driven mobility models. It is widely known that data-driven models (as in the case of ML and DL) are only as good as the data used to train them. However, given the changes in mobility behaviors, such as during the COVID-19 pandemic and the associated lockdowns, and environmental events and disasters, traditional ML-based, and even recent DL-based, methods are no longer robust. The models’ performance deteriorates in unseen events, especially as new behaviors emerge and then persist. Recent efforts include the incorporation of ‘contextual awareness’ and ‘memory’ in an enhanced event-aware spatiotemporal network [
245] for predicting mobility in multiple modes of transportation, including taxis, cycling, and subways, during unprecedented events such as COVID lockdowns or snowstorms, both as the events emerge and up to 30 days after them. However, more work needs to be done on modelling and understanding mobility behavior that is robust to changes due to societal events.
Challenge 8. Progressing from Next Location Prediction to Movement Behavior Understanding. Due to the wide availability of aggregated check-in and foot-traffic data, many researchers focus on the problem of location prediction, e.g., [
253]. Leveraging predictions such as “User X will visit Coffee Shop A next” or “
\(32\pm 4\) users will visit Coffee Shop A in the next hour” has some direct applications. It could be useful for providing information about parking (“parking at location X appears to be a problem today, so consider...”), for battery-charging opportunities, or for providing information about collective transportation status (“Metro station X that you are expected to visit is closed for repairs, so instead...”). One could provide a new transportation schedule and departure time in response to problems at an anticipated future location of a user, just like airlines at times update itineraries in the case of issues. Earlier work has been based on data mining techniques to detect periodic behavior, e.g., [
36,
75,
116]. Beyond predicting locations, if we understand the underlying behavior at the individual-, group-, or population-scales that leads to these predictions, we could understand
why one coffee shop chain has increasing visitor rates (e.g., due to a movement towards organically grown coffee sold by the coffee shop). Only by inferring such behaviors from the data can we go beyond predicting locations and also prescribe actions (e.g., offering more organic coffee) to improve visitor rates. This understanding of (human) behavior will broadly affect applications using mobility data. Traditional spatiotemporal data science allows for predictive analytics to predict the future. In contrast, mobility data science enables prescriptive analytics by understanding the underlying human behavior to devise actions and policies that aim to achieve desirable targets.
An open problem for understanding mobility behavior data is the lack of labels or human annotation to provide insights on the actual observations. Several workarounds have been proposed, including cross-domain data fusion as well as developing interpretability mechanisms for ML or DL models. When geographical information is fused with contextual features and social behaviors, not only can location prediction be improved, but insights can also be provided about the underlying visitor behavior [
253], even if no human-labelled data are provided about the mobility behaviors.
Therefore, explainability of the AI and ML models that underpin many such predictive behavior models remains an open challenge, especially since DL models are black boxes. One approach for DL-based models is disentangled representation learning, and a recent work [
266] shows that the disentanglement of latent spatiotemporal factors can assist the explainability of how the underlying latent factors learned by DL models are correlated. It can also be used for dimensionality reduction and can assist in few-shot learning cases.
Challenge 9. Visual Analytics. Visualization and exploratory analysis of mobility data has long been a hot topic in visual analytics [
15]. More recently, the trend turned to combining visualization with modeling and simulation to support decision-making [
123]. This kind of research is by necessity application oriented, while much less is done on developing more general ideas and approaches.
One general research problem that has only been slightly touched on in visual analytics but not systematically addressed is human involvement in real-time analysis of big mobility data. Is it possible to define realistic scenarios for involving human intelligence in big data analytics, taking into account the cognitive limitations of human analysts with regard to the amount of information that can be perceived, speed of processing, and time required for analytical reasoning and contributing to the analysis process? Also, how does one combine computational methods of analysis, such as ML, with human expert knowledge and reasoning? Currently, the involvement of human intelligence is limited to thoughtful data preparation, feature selection, parameter setting, and so on. It would be valuable to find ways to make more direct and effective use of human-possessed concepts and, particularly, knowledge of causal relationships. Hence, a grand research challenge for visual mobility analytics is to develop approaches to understanding and modeling mobility behaviors from low-level movement data, such as trajectories of moving entities.
The following research problem is how to analyze behaviors after they have been extracted from elementary movement data and represented by appropriate data structures. A conceptual framework should be developed to enable defining the types of conceivable patterns of movement behavior. This will provide orientation for developing visualization techniques facilitating visual discovery of behavioral patterns as well as algorithmic methods for detection of specified types of patterns. These techniques and methods should be incorporated into systems and workflows for analyzing the contexts in which various patterns take place and developing models for describing and predicting mobility behaviors depending on the context.
5 Mobility Data Management Infrastructure
Classical data management systems have been designed for generic data types, where spatial and temporal data can be supported as new additional types. Yet, the core functionality of the data management engine does not acknowledge the spatial and temporal properties of mobility data. For example, mobility data calls for storing and querying locations of objects that evolve over time. The evolution can be in the location, the extent, and/or the properties of the object. The evolution can happen in discrete steps, e.g., check-ins, or in a continuous form. Thus, it is desired that the data management platform is able to represent the history, the current location, and possibly the near future of the moving object. Another example is classical index structures that are built with the assumption that the read workload is significantly higher than the write workload and, hence, the index structure does not change often. Mobility data exhibits a different workload, in which the write workload (e.g., object location update) is significantly higher than the read workload, which makes all classical index structures simply not applicable to mobility data. A third example is that simple queries of mobility data, e.g., nearest neighbor search, can be supported by classical data management systems only by finding the distance between the user location and all other objects, sorting all objects based on that distance, and getting the closest one. This cumbersome approach is mainly due to the lack of a specialized nearest-neighbor operator. Should we have one, that operator could seamlessly integrate with the query executor and optimizer of a data management engine to efficiently support a very important query in most mobility data applications. A last example is that classical methods for scaling up data management in distributed environments rely on data distribution, mostly based on the data keys.
This does not work well in scaling up mobility data, as it is always desired to distribute mobility data in a way that, spatially and temporally, nearby objects are grouped together in the same cluster or computing node. This section discusses current efforts and challenges of mobility data management.
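The nearest-neighbor example above can be sketched in a few lines. The first function mirrors the generic approach described in the text (compute every distance, sort everything, take the first result); the second is a hypothetical stand-in for a dedicated operator that only keeps the k best candidates instead of sorting the whole dataset. Both still scan all objects; a real engine would typically pair such an operator with a spatial index, which is beyond this sketch. The `(object_id, (x, y))` layout and the function names are assumptions for illustration.

```python
import heapq
import math

def knn_sort_all(query, objects, k=1):
    """Generic approach: distance to every object, full sort, then take the first k."""
    return sorted(objects, key=lambda o: math.dist(query, o[1]))[:k]

def knn_operator(query, objects, k=1):
    """Sketch of a dedicated k-NN operator: selects the k closest without a full sort."""
    return heapq.nsmallest(k, objects, key=lambda o: math.dist(query, o[1]))

# Toy dataset of (object_id, (x, y)) pairs in planar coordinates.
objects = [("a", (0.0, 0.0)), ("b", (5.0, 5.0)), ("c", (1.0, 1.0)), ("d", (9.0, 0.0))]
nearest_two = knn_operator((2.0, 2.0), objects, k=2)
```

Both functions return the same answer; the difference is the work performed, which is the kind of distinction a query optimizer can only exploit if nearest-neighbor search is a first-class operator.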
5.1 Efforts in Mobility Data Management
There has already been extensive research in all layers of mobility data management infrastructure. In terms of data modeling, early models based on the constraint database model aim to support simple moving objects (i.e., points), e.g., [
93]. More complex data types (e.g., moving regions) have been supported by later models based on abstract data types, e.g., [
100], that are still being used in recent systems, e.g., [
276]. More recent efforts have been introduced to capture the semantics of trajectories of moving objects. Other models were also proposed to capture specialized modes of movement, including indoor environments, e.g., [
113], network constrained, e.g., [
98], fuzzy trajectories, e.g., [
225], and detecting periodic moving patterns, e.g., [
33,
36,
75,
116]. In terms of indexing, tens of index structures have been proposed to support efficient indexing, storage, and retrieval for spatiotemporal data as either historical data, current locations, or continuously updated locations, e.g., [
143,
146,
154,
163]. This forms the infrastructure support for various spatiotemporal query processing techniques for various query operators over moving objects, including spatiotemporal range queries [
156], spatiotemporal nearest-neighbor queries, e.g., [
11,
12,
13,
214,
252], reverse nearest neighbor queries [
35], skyline queries [
108], and scalable spatial and spatiotemporal joins, e.g., [
247,
251].
In terms of full-fledged academic systems, the SECONDO system was introduced in the early 2000s as a comprehensive testbed for distributed moving object databases covering all aspects of data modeling, indexing, and querying [
97]. More recently, MobilityDB, implemented on PostGIS, has been introduced as a scalable system with a wider functionality on moving object databases [
228,
276]. In terms of Big Data systems, ST-Hadoop [
8], SUMMIT [
7] and HadoopTrajectory [
22] systems extend the Hadoop system to support spatiotemporal data and trajectories, respectively, while other systems, e.g., [
65,
144,
145], extend the Twitter Storm distributed data streaming system to support streamed location data. TrajSpark [
264], Dita [
207], and TrajMesa [
127] extend the Spark system to support various index structures and query operations over trajectory data. SharkDB [
242] extends in-memory column-oriented storage engines to support trajectories. In the open-source community and in industry, PostGIS [
181] supports very basic trajectory functions and Oracle Spatial supports streaming point data to capture real-time mobility [
169], whereas Microsoft Azure [
25] supports storing trajectory data in Azure table and utilizing Azure Redis for indexing. Distributed-MobilityDB [
23] integrates the trajectory data management of MobilityDB with a distributed PostgreSQL database to provide a distributed moving object database.
5.2 Challenges in Mobility Data Management Infrastructure
Though there is already a lot of work in various components of mobility data management infrastructure, there is an apparent lack of integrated systems that offer comprehensive functionality to end users, encapsulated in full-fledged systems that support mobility data science. Hence, the challenges in this section mainly focus on system building.
Challenge 10. Building Systems with Mobility Data in Mind. Location data has almost always been supported in data systems as an afterthought. Many systems, e.g., Postgres, Storm, Spark, and Hadoop, were not originally designed with location data support in mind. What typically happens is that spatial data types get added to tuple-oriented systems to support the location data type. For example, a restaurant tuple that describes various attributes of a restaurant is augmented with the latitude and longitude of the restaurant to support location services. Spatial indexes are provided to speed up the access to these attributes, and some accompanying spatial operators are provided to operate on the location attributes to provide location services, e.g., range or k-nearest-neighbor searches. While this approach works to some extent, systems coming out of this approach end up with sub-par performance for spatial data and, hence, for mobility data. Given the myriad applications that rely on mobility data, it is important that systems are extended with native support for locations and mobility data. Thus, mobility data types and operations should be integrated in the core of these systems and should not be considered as an afterthought. This applies to all kinds of systems, from database management systems that need to be spatially and temporally aware to support mobility data, to scalable big data and NoSQL systems, where injecting spatial and temporal awareness into their core functionality will allow mobility data science to inherit their scalability.
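The bolt-on pattern described in Challenge 10 (a generic tuple augmented with latitude and longitude, a spatial index over those attributes, and a range operator on top) can be sketched as below. This is a deliberately simple uniform-grid index written for illustration; the class name `GridIndex`, the dictionary-based tuples, and the `cell_size` parameter are assumptions and do not describe how any of the named systems implement spatial support.

```python
from collections import defaultdict

class GridIndex:
    """Uniform grid over (lon, lat); each cell holds the tuples that fall inside it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, lon, lat):
        # Floor division maps a coordinate to its integer cell number.
        return (int(lon // self.cell_size), int(lat // self.cell_size))

    def insert(self, record):
        """record: a dict with at least 'lon' and 'lat' keys plus any other attributes."""
        self.cells[self._cell(record["lon"], record["lat"])].append(record)

    def range_query(self, min_lon, min_lat, max_lon, max_lat):
        """Return all records inside the axis-aligned box (bounds inclusive)."""
        cx0, cy0 = self._cell(min_lon, min_lat)
        cx1, cy1 = self._cell(max_lon, max_lat)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for r in self.cells.get((cx, cy), ()):
                    # Cells may overlap the box only partially, so re-check each record.
                    if min_lon <= r["lon"] <= max_lon and min_lat <= r["lat"] <= max_lat:
                        hits.append(r)
        return hits

# Restaurant tuples augmented with location attributes (toy values).
index = GridIndex(cell_size=1.0)
index.insert({"name": "A", "cuisine": "thai", "lon": 0.5, "lat": 0.5})
index.insert({"name": "B", "cuisine": "pizza", "lon": 2.5, "lat": 2.5})
index.insert({"name": "C", "cuisine": "sushi", "lon": -1.5, "lat": 0.2})
in_box = index.range_query(0.0, 0.0, 3.0, 3.0)
```

Note that the index here knows nothing about time, movement, or update rates; it only accelerates lookups on two extra columns, which illustrates the gap between bolted-on spatial support and native mobility support.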
Challenge 11. Location Data as First-Class Citizens. Having locations as the core of mobility data calls for treating location data as a first-class citizen in a location data system that at the same time can be extended to support other data types [
16]. These location data systems can be presented as Location+X systems, e.g., as in [
16], where the data types “X” can be keywords (e.g., to support spatial keywords and tweets), graphs (e.g., to support road-network data), relational data (e.g., to support descriptions of spatial data objects), click streams (e.g., to support check-in data), document data (e.g., to support points of interest and documents that describe them), or annotated trajectories (e.g., location + time + textual annotations), among others. In many location services, more than one data type X may need to be supported, e.g., a graph data type combined with document or keyword data types, which calls for a multi-model-like data system. This gives rise to an ecosystem where location is at the core of an extensible multi-model data system that supports the multitude of data types “X”. However, current multi-model data system technology is lacking in several respects. First, it does not support data streaming, which is a cornerstone of mobility data due to the online streamed locations of moving objects. Second, adopting existing multi-model technologies as they are risks demoting location from its first-class status. Nevertheless, supporting multiple models in one seamlessly integrated location+X system remains a necessity. In addition to supporting location data via a native location+X engine, an ecosystem for mobility data would also include many important utilities to facilitate a broad spectrum of location service applications. On the input side, to help navigate the vast amounts of available location datasets and discover the right datasets for a given task, a location dataset lake infrastructure and facilities for location dataset discovery, cleaning, and integration are needed. On the presentation side, a comprehensive visualization suite is envisioned to support visualizations for combinations of spatial and temporal data analytics on top of location data.
Challenge 12. Streaming, Batch, and Hybrid Workloads. Motivated by application needs, mobility data management needs to support both batch and real-time data through all system layers, from ingesting the data to analyzing and visualizing it. For example, a common requirement is to visualize the positions of a fleet of vehicles in real time, which only requires access to the most recent positions of the vehicles. Yet, at the same time, there is a need to perform batch analytics on the full trajectories of these vehicles (e.g., to assess whether the trajectories exhibit some unexpected behavior). Generally speaking, the need to have both real-time and historical data has led to the development of the data warehouse domain, where operational databases cover the real-time
Online Transaction Processing (OLTP) whereas data warehouses cover the historical
Online Analytical Processing (OLAP). Since having two different systems for the two kinds of workloads is very costly, a new approach referred to as
Hybrid Transactional and Analytical Processing (HTAP) has recently been proposed. However, mobility data exhibits significantly different workloads from other data: streaming data is dominant, as objects continuously stream their new locations. Historical data is no less important and is continuously appended. While some efforts have been made in the direction of write-optimized indexing for location data, e.g., as in [
211], more research is needed to adapt the concepts behind HTAP systems to the nature of mobility data.
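The fleet example above can be sketched as a single store that serves both workloads: a "latest position" view for real-time visualization and an append-only history for batch analytics. This is a minimal in-memory illustration under simplifying assumptions (one process, no persistence, no indexing); the class `FleetStore` and its method names are hypothetical.

```python
from collections import defaultdict


class FleetStore:
    """Toy hybrid store: latest positions plus append-only history (hypothetical)."""

    def __init__(self):
        self.latest = {}                  # vehicle_id -> (timestamp, x, y)
        self.history = defaultdict(list)  # vehicle_id -> [(timestamp, x, y), ...]

    def ingest(self, vehicle_id, timestamp, x, y):
        # Every update is appended to the history (batch/analytical side).
        self.history[vehicle_id].append((timestamp, x, y))
        # The real-time view keeps only the most recent timestamp, so
        # late-arriving (out-of-order) updates do not overwrite newer positions.
        current = self.latest.get(vehicle_id)
        if current is None or timestamp >= current[0]:
            self.latest[vehicle_id] = (timestamp, x, y)

    def snapshot(self):
        """Real-time query: current position of every vehicle."""
        return dict(self.latest)

    def trajectory(self, vehicle_id, t_start, t_end):
        """Batch query: time-ordered positions of one vehicle in [t_start, t_end]."""
        points = self.history.get(vehicle_id, ())
        return sorted(p for p in points if t_start <= p[0] <= t_end)


store = FleetStore()
store.ingest("v1", 1, 0.0, 0.0)
store.ingest("v1", 3, 2.0, 1.0)
store.ingest("v1", 2, 1.0, 0.5)  # arrives late
print(store.snapshot())
print(store.trajectory("v1", 1, 2))
```

The sketch sidesteps exactly the hard parts that the challenge refers to: sustaining very high update rates while keeping the historical side queryable at scale.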
6 Mobility Data Privacy
As we discussed in Challenge 1, mobility data privacy is a core problem in the mobility data science pipeline. Studies have shown that location data could reveal sensitive personal information, such as home and workplace, and religious and sexual inclinations [
183]. As localization technology advances and extremely fine-grained location tracking is being enabled, it may even reveal products of interest in the stores we have visited, doctors we saw at a hospital, bookshelves of interest in a library we have visited, artifacts we observed in a museum, and generally anything that might publicize our preferences, beliefs, and habits. A recent survey has shown that 78% of smartphone users among 180 participants believe that apps accessing their location pose privacy threats [
47].
While there are many privacy-preserving data collection and data analysis techniques developed for personal data, mobility data introduces unique challenges due to (1) spatiotemporal correlations in the mobility data, which often result in increased privacy cost due to privacy composition for correlated data or in degraded utility for downstream applications; (2) complex location semantics (e.g., corresponding points of interest of locations) and mobility behaviors (e.g., regular vs. one-time visit of a location) that existing privacy definitions may not be able to capture; and (3) diverse and emerging application scenarios, such as contact tracing using mobility data, for which existing privacy algorithms designed for aggregate data analytics are not suitable. In this section, we briefly review existing privacy notions and techniques developed for location and mobility data and discuss several open challenges.
6.1 Efforts in Mobility Data Privacy
We categorize existing techniques in mobility data privacy into two main settings corresponding to our data pipeline: (1) local setting (data collection stage) and (2) central setting (data analysis stage). In the local setting, the mobility service provider that collects mobility data is assumed to be untrusted. Hence, each mobile user or entity can apply privacy-preserving mechanisms before the data is collected by the service provider. In the central or global setting, the mobility service provider is assumed to be trusted and collects the raw mobility data. The provider can apply privacy-preserving mechanisms for statistical analysis and share aggregated data, ML models trained from the data, or synthetic data mimicking the original data with untrusted third parties.
Local Setting. In recent years,
local differential privacy (LDP), the local variant of differential privacy [
63,
94], has become the de facto standard for preserving privacy at the data collection stage. Users can perturb their raw data using an LDP mechanism before uploading it to an untrusted server. Most existing mechanisms are designed to ensure utility for aggregate queries or analytics (e.g., frequency or density estimation), which requires the aggregation of the perturbed values from a large group of users, whereas the individual perturbed value may not provide much utility. Several works applied existing LDP schemes to location data but the utility is poor [
119,
267]. Other works relaxed LDP to personalized LDP [
52]. Recent works developed improved LDP mechanisms for location data with better utility [
239].
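As an illustration of the local setting, the following sketch applies k-ary (generalized) randomized response, a standard LDP mechanism, to users' grid cells and then debiases the aggregated reports. It is not the mechanism of any specific cited work. The discretization of space into `k` cells and the function names are assumptions made for this example.

```python
import math
import random


def grr_perturb(cell, k, epsilon, rng):
    """Report the true cell with probability p, else a uniformly random other cell."""
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return cell
    other = rng.randrange(k - 1)  # pick among the k - 1 cells different from `cell`
    return other if other < cell else other + 1


def grr_estimate(reports, k, epsilon):
    """Unbiased per-cell count estimates from the perturbed reports."""
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + k - 1)  # probability of reporting the true cell
    q = 1 / (e + k - 1)  # probability of reporting one specific other cell
    counts = [0] * k
    for r in reports:
        counts[r] += 1
    return [(c - n * q) / (p - q) for c in counts]


rng = random.Random(7)
true_cells = [2] * 10000 + [0] * 10000  # synthetic data: two populated cells
reports = [grr_perturb(c, 4, 1.0, rng) for c in true_cells]
print([round(v) for v in grr_estimate(reports, 4, 1.0)])
```

Each individual report is close to useless on its own, while the aggregate estimate becomes accurate as the number of users grows. This matches the observation above that such mechanisms mainly serve aggregate queries.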
In addition to supporting aggregate data analytics,
location-based services (LBSs), including range queries, spatial crowdsourcing, and the emerging contact tracing for pandemic control, depend on the precision of the individual perturbed locations themselves.
Geo-indistinguishability (GeoInd) [
14] relaxes LDP for location data by requiring locations to be indistinguishable only within a radius, with the level of indistinguishability scaled by their distance, providing a better privacy-utility trade-off for LBSs. Later works extended GeoInd to account for temporal correlations between consecutive locations of mobile users [
249] and protection of customizable spatiotemporal activities instead of raw locations or trajectories [
43]. Other works applied the GeoInd mechanisms and variants for privacy-enhanced spatial crowdsourcing and contact tracing [
64,
220]. Besides statistical privacy techniques,
Private Information Retrieval (PIR) and secure
multiparty computation (MPC) techniques have also been developed to allow LBS queries such as range queries and contact tracing without revealing individual locations [
6,
56,
87,
186] but are generally more computationally expensive and need to be designed for each different query.
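A common way to achieve GeoInd is the planar Laplace mechanism, which adds two-dimensional noise whose density decays exponentially with distance from the true location. The sketch below assumes planar coordinates in meters (e.g., after a map projection) and a privacy parameter `epsilon` expressed per meter. It draws the noise radius from a Gamma distribution with shape 2 and scale 1/epsilon, which is the radial distribution of the planar Laplace density. The function name is hypothetical.

```python
import math
import random


def planar_laplace(x, y, epsilon, rng):
    """Perturb a planar point with planar Laplace noise (epsilon per distance unit)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)  # direction: uniform on the circle
    r = rng.gammavariate(2.0, 1.0 / epsilon)  # radius density: eps^2 * r * exp(-eps * r)
    return x + r * math.cos(theta), y + r * math.sin(theta)


rng = random.Random(1)
# With epsilon = 0.01 per meter, the expected displacement is 2 / epsilon = 200 m.
print(planar_laplace(1000.0, 2000.0, 0.01, rng))
```

The expected displacement of 2/epsilon gives a direct handle on the utility side of the trade-off for an LBS. The sketch does not address repeated reports from the same user, for which the privacy loss accumulates.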
Global Setting. Many works have applied
differential privacy (DP) for computing and publishing aggregate mobility data. Compared with DP algorithms for tabular data, they typically exploit the hierarchical structure of locations and sequential patterns of trajectories to improve utility [
2,
49,
150,
184,
204]. Some works also utilized the DP aggregates for task assignment in spatial crowdsourcing [
219]. In practice, mobility data providers have started sharing aggregated mobility datasets with DP, especially in response to the pandemic, such as Meta’s population density maps and Movement Range maps, Google’s COVID-19 Community Mobility Reports, and SafeGraph’s Patterns [
24]. Other works have applied DP for training ML models using mobility data, for example, for location prediction [
5]. Another line of work attempts to generate synthetic trajectories or mobility data based on raw trajectories with formal DP guarantees [
103,
241]. From the privacy attack side, recent works demonstrated the possibility of membership inference attacks on aggregate location data and linking attacks, and the defense power of DP against some of these attacks, reinforcing the need for ensuring rigorous privacy even for seemingly anonymous aggregate mobility data and ML models trained from mobility data [
115,
182].
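For the central setting, the simplest building block is the Laplace mechanism applied to a spatial histogram. The sketch below assumes that each user contributes exactly one grid cell, so that adding or removing a user changes the histogram by at most 1 in L1 norm. Real mobility data with many points per user would first need contribution bounding. The function names are hypothetical, and the hierarchical and sequential structures exploited by the cited works are not modeled.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_cell_counts(user_cells, num_cells, epsilon, rng):
    """Noisy per-cell user counts; assumes one cell per user (L1 sensitivity 1)."""
    counts = Counter(user_cells)
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon
    return [counts.get(c, 0) + laplace_noise(scale, rng) for c in range(num_cells)]


rng = random.Random(3)
cells = [0] * 120 + [1] * 40 + [3] * 5  # synthetic data: cell id of each user
print([round(v, 1) for v in dp_cell_counts(cells, 4, 0.5, rng)])
```

Sparsely populated cells, such as cell 3 here, are the ones where the noise dominates the signal, which is one reason the published methods adapt the spatial decomposition to the data.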
6.2 Challenges in Mobility Data Privacy
This section highlights open problems related to mobility data privacy that need consideration from the community.
Challenge 12. Threat Models and Privacy Definitions. The first challenge for mobility data privacy is the need to understand the threat models and adopt or define proper criteria by which to enforce privacy. We first need to define what needs to be protected (i.e., the sensitive information). This may vary for different mobile users and applications. It may be the exact location coordinates of a user at a given time (most existing efforts focus on this). It may also be the association of a user with a sensitive place, the co-location of two users (even when it is acceptable for the users to reveal their exact location coordinates), or the spatiotemporal activities of a user (e.g., a stay at a place, or a trajectory). When defining privacy models and designing subsequent privacy mechanisms, there will almost always be attacks that exploit side-channel information. While privacy notions such as DP typically assume the worst case, which also means sacrificed utility, relaxed versions may be needed given specific threat models to improve the privacy-utility trade-off.
Besides developing rigorous privacy-enhancing mechanisms, it is equally important to understand the privacy risks and the empirical defense power of
privacy-enhancing technology (PET). While there has been some work on privacy attacks on aggregate mobility data [
182], more work is needed to understand what sensitive information may be revealed and reconstructed from mobility data-based models, e.g., whether membership inference attacks or feature reconstruction attacks [
81,
212] can be carried out, and to build benchmark attacks that can be used to audit the privacy risk of mobility data science systems and privacy mechanisms.
Challenge 13. Privacy and Utility Trade-off and Other Factors. When designing privacy mechanisms for mobility data collection and analysis, it is important to consider the utility of the privacy-protected data for the downstream applications. For LBS (as typical in the local setting), the utility needs to be measured by the precision or accuracy of range queries for POI search, or of contact detection for contact tracing, rather than by how close the perturbed location is to the original location, which is the focus of most algorithms following GeoInd. Hybrid methods that combine DP and cryptographic techniques may be needed, especially for critical applications such as contact tracing and public health [
56]. For aggregate data analytics and ML applications using mobility data (in both local and global settings), the utility needs to be measured by the accuracy of the statistics (e.g., frequency or density estimation, on which most existing work focuses), the accuracy of the trained model, or the fidelity of the synthetic data. As a result, the algorithms need to be designed to optimize the corresponding utility, and many of these design problems remain open. For example, existing methods for DP trajectory synthesis are mainly based on statistical models or low-order Markov models and perform well on some utility metrics [
103,
241]. While there are more powerful
generative adversarial network (GAN)-based models or diffusion models for generating more realistic synthetic trajectories [
137,
275], ensuring formal DP for these models would result in deteriorated utility due to the complexity of the models. Designing methods for optimal privacy utility trade-off remains an open challenge.
In addition to the privacy and utility trade-off, privacy-enhancing technology may exacerbate bias in the data or learning algorithms. Mobility data may have inherent bias, as we discussed in Challenge 2. Data analysis algorithms may also have unfair performance for groups that are underrepresented in training data. It has been demonstrated that learning with DP could exacerbate such unfairness, i.e., underrepresented groups suffer from worse privacy/utility trade-offs [
21]. Research is needed to understand this impact on mobility data and design privacy algorithms to optimize the privacy utility trade-off while ensuring fairness.
Challenge 14. Explainability and Societal Education. Another important challenge of mobility data privacy is to improve the explainability of privacy definitions and mechanisms and communicate them to the stakeholders, including mobile users (data contributors), mobility service providers, and data analysts. This is a general challenge for privacy-enhancing technology, but more so for mobility data given the complex semantics of location information and the diverse applications, as we mentioned. DP-compliant algorithms and location privacy models (such as GeoInd), as described earlier, use privacy parameters to control the trade-off between the privacy guarantee and the utility of the private outputs. However, there is a significant gap between the theory and practice of DP: we lack principles and guidelines for choosing privacy parameters when collecting or processing mobility data using DP techniques in the real world. While technology companies have employed DP in releasing mobility datasets, as we discussed earlier, the choice of the privacy parameter and the associated noise and uncertainty are often neither precisely specified nor uniform across companies. This makes it difficult for downstream applications to quantify the uncertainty of the analysis result.
The parameter \(\epsilon\) of DP is mathematically defined but not well aligned with the stakeholders’ interests. Even for the same \(\epsilon\), the privacy guarantees could differ depending on the variant of DP and the algorithm at hand. In addition, \(\epsilon\) is not always linked to a specific privacy risk for the users (such as “the probability that an attacker can correctly infer my data”) or a precise utility level for data analysts (such as “the accuracy of the DP-ML model”). To promote the adoption of mobility data privacy technology, such as that based on DP, we should establish principles and design guidelines, and provide tools for explaining DP’s protections and limitations in terms of stakeholders’ practical interests. For example, we can help data contributors understand the privacy risk (such as membership inference attacks or reconstruction attacks) under different privacy parameters given a concrete DP algorithm. We can also design efficient methods to visualize how data analysts’ utility metrics (such as MSE or model accuracy) may change with different privacy parameters for specific mobility applications.
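One way to translate \(\epsilon\) into a statement a data contributor can interpret is the standard Bayesian reading of pure \(\epsilon\)-DP: an attacker who tries to decide between two neighboring datasets (e.g., with and without a given user) can increase the odds of their prior belief by a factor of at most \(e^{\epsilon}\). The sketch below tabulates the resulting upper bound on the attacker's posterior belief. It assumes pure \(\epsilon\)-DP and this specific two-hypothesis attacker; the function name is hypothetical, and the bound does not transfer unchanged to other DP variants.

```python
import math


def max_posterior(prior, epsilon):
    """Upper bound on the attacker's posterior belief under pure epsilon-DP."""
    boosted = math.exp(epsilon) * prior  # prior odds scaled by at most e^epsilon
    return boosted / (boosted + (1.0 - prior))


for eps in (0.1, 1.0, 5.0):
    bound = max_posterior(0.5, eps)
    print(f"epsilon = {eps}: posterior belief is at most {bound:.3f}")
```

A table of this kind can accompany a data release so that contributors see, for example, that a small \(\epsilon\) leaves an undecided attacker close to a coin flip, whereas a large \(\epsilon\) offers little protection against this attacker.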
7 Mobility Data Science Applications
Mobility data science used to be limited to the domain of transportation. However, recent technological inventions have created an abundance of mobility data, resulting in applications in many other domains of interest for society. Such applications leverage mobility data to understand, explain, and predict where moving entities such as humans, animals, or infectious diseases go, why they go where they go, and where they will go next. This section outlines broad applications of mobility data science to illustrate the recent landscape of mobility data science.
7.1 Traffic
Traffic is a problem of global scale, as recognized by transportation science over a decade ago. Drivers in the United States spend 6.9 billion driving-hours stuck in traffic and waste more than 11 billion liters of fuel per year according to INRIX [
112]. Measured per capita, people in Russia and Thailand spend even more time in traffic, whereas Brazil, South Africa, the United Kingdom, and Germany are only slightly behind the United States. Leveraging mobility data science and understanding the underlying behavior of human participants concomitantly with different transportation modes can enable more effective solutions to multiple problems at the heart of improving traffic management. Two main lines of research focus on (1) traffic monitoring at an aggregate level, e.g., to help city administration; and (2) provision of services to road users. Existing work regarding traffic monitoring includes monitoring congestion [
128], assessing the safety of roads and intersections [
142], traffic prediction [
131], evacuation routing [
263], and optimizing public transportation schedules [
191]. Efforts regarding the services provided to road users include routing queries that balance the traffic across roads [
68], helping drivers to find nearest facilities [
120], personalized routing [
129], eco-routing for minimizing greenhouse emissions [
133], and enabling multi-modal trip planning [
223]. However, many open opportunities and challenges remain in using mobility data to improve traffic conditions. One example is devising accurate models for the dynamic scheduling of public transportation. Another example is the context-aware optimization of traffic signals, e.g., incorporating the impact of additional flux of pedestrians in bus/train stations, to minimize the stop-and-go impacts for vehicles. A further challenge for mobility data science in the transportation domain is the monitoring and reduction of emissions. Being able to quantify emissions (e.g., from transportation) is essential to accountability and to reducing them. Using emissions data collected from in-situ sensors as well as sensed remotely through earth observation (satellite) data will allow us to better understand the effects of e-mobility, better collective transportation, and infrastructure improvements.
7.2 Urban Areas
In 2018, 55% of the world’s population (4.2 billion people) resided in urban areas. This proportion is projected to increase to 68% by 2050 [
230]. Urban areas are a focal point for mobility application as they introduce a variety of mobility modalities such as electric vehicles [
234] and bicycles and scooters with respective sharing programs [
132]. By understanding how, where, and why people move in cities, outer suburban areas, and regional areas, the demand for infrastructure and energy can be better understood [
270]. Improving this understanding helps reduce urban inequalities in cities [
165] such as access to high-quality food [
236] and healthcare [
95]. Mobility data also helps improve urban safety by improving crime prediction [
82] and helping to recommend safe routes [
203].
A specific application of urban mobility data science is data-driven map construction [
3] and updating of existing maps to account for blocked or new road segments [
48], which is paramount in autonomous driving applications [
140].
The real-time monitoring of urban mobility could result in
situational awareness, initially a term coined in defense applications, involving
perception of the environmental states using the surrounding data,
comprehension of the ingested data to understand the emerging situations, and
projection of future states and/or events, which requires predictive analytics. Mobility data provides critical components of and insights into situational awareness in cities. Once achieved, situational awareness helps not only to enable robust critical infrastructures in cities but also to protect them from harm, e.g., forest fires, earthquakes, and terrorist attacks. Many researchers use mobility data as input to enable situational awareness in cities as well as in airports [
208].
7.3 Health Informatics
The spread of infectious diseases is a highly complex spatiotemporal process that is strongly tied to human mobility [
106] and human behavior [
74]. Many recent works have used human mobility data for data-driven epidemic forecasting, as surveyed in [
195]. A specific example of leveraging mobility data for public health is contact tracing, which refers to the process of tracking persons who may have come into spatial contact with an infected person, and subsequently collecting further information about these contacts [
151]. The feature-rich interaction, processing, and localization/communication modalities of smartphone devices have put these devices at the technological forefront of this effort and have helped curb the fast spread of pandemics, such as COVID-19. To date, the community has proposed a wide range of contact tracing approaches, including opportunistic [
185] and participatory approaches [
64] as well as privacy-sensitive [
260], decentralized [
226], proximity-based (e.g.,
Bluetooth Low Energy (BLE), sound) [
187], and location-based approaches (e.g., Wi-Fi, GPS) [
64]. However, a wide range of challenges remain unanswered, including methodologies to improve the penetration and adoption rates, alleviate privacy or expectation skepticism [
32], ubiquitous availability on low-end terminals as well as technological/psychological adoption barriers [
31], achieving cross-country interoperability with standard formations beyond recommendations, scalability/reliability and accuracy verification of engaged spatial technologies as well as lessons about effectiveness from real large-scale deployments.
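For the location-based family of approaches, the core computation is detecting when two users were close in both space and time. The following is a minimal sketch under simplifying assumptions: traces are time-sorted lists of (t, x, y) samples in planar coordinates, and a contact is any pair of samples within `max_dist` and `max_dt` of each other. The function name and thresholds are hypothetical, and the sketch ignores localization error and privacy protection entirely.

```python
from math import hypot


def find_contacts(trace_a, trace_b, max_dist, max_dt):
    """Return (t_a, t_b) pairs where the two traces are close in space and time.

    Both traces are lists of (t, x, y) tuples sorted by time t.
    """
    contacts = []
    start = 0  # first sample of trace_b that can still fall in a time window
    for ta, xa, ya in trace_a:
        # Skip samples of trace_b that are too old for this and all later ta.
        while start < len(trace_b) and trace_b[start][0] < ta - max_dt:
            start += 1
        j = start
        while j < len(trace_b) and trace_b[j][0] <= ta + max_dt:
            tb, xb, yb = trace_b[j]
            if hypot(xa - xb, ya - yb) <= max_dist:
                contacts.append((ta, tb))
            j += 1
    return contacts


alice = [(0, 0.0, 0.0), (10, 5.0, 0.0), (20, 50.0, 0.0)]
bob = [(1, 1.0, 0.0), (11, 30.0, 0.0), (19, 51.0, 0.0)]
print(find_contacts(alice, bob, max_dist=2.0, max_dt=2))
```

Because both traces are time-sorted, the sliding window avoids comparing every sample of one trace with every sample of the other.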
Another specific health application for mobility data is health monitoring of older adults. GPS-enabled smartwatch technology can be used to monitor the movement of older-adult users [
215]. In particular, if the monitored user is showing early signs of dementia, the user’s trajectories could show an abrupt change from the individual’s movement history [
222]. For instance, a user who normally walks in a park and then goes to a restaurant is found to only stay in the park for a substantial amount of time. Indoor sensors installed in a room can also be used to detect whether an older adult or a patient falls from the bed. Trajectory outlier analysis methods, together with gerontology knowledge, can be very useful for this kind of application.
7.4 Indoor Environments
Indoor mobility data management has been described as a new frontier in data management [
114]. However, in addition to data management, large-scale indoor localization data also raises challenges in data collection, data analysis, and data privacy. Indoor data collection is an open research problem due to the non-existence of the indoor equivalent of GPS: a system that can provide the user location in any building worldwide. This is particularly important in applications related to emergency management and infectious disease contact tracing. Systems have been developed over the years to address this problem based on different data sources, including WiFi signal strength and time of arrival [
255], cellular signal [
194], ultra-wideband [
9], ultrasonic [
110], magnetic tracking [
213], and inertial sensors [
102], among others. These novel data sources enable new applications in indoor navigation, contact tracing, indoor analytics, and evacuation management.
Indoor data analytics improves the understanding of indoor behavior, which has multiple benefits and applications, including crowd management [
4], retail and POI recommendation systems [
189], and for optimizing energy use and improving sustainability in the long term [
200]. For example, by utilizing WiFi logs, Ren et al. [
188] find strong correlations between behaviors and user demography (e.g., age, gender, and visitor types), indicating that indoor mobility behavior, in conjunction with online behavior, can be used to predict the underlying demography of the visitors.
Occupancy behaviors are also highly linked with building management systems and controls [
45]. By having a more accurate energy use estimation using indoor spatial and mobility data, in addition to historical energy consumption data, the performance of the buildings can be better optimized, towards achieving more sustainable operations [
71]. The responsible use of mobility behavior analytics, including indoor and outdoor mobility behaviors, strongly points to the increased capacity for improving sustainable operations of buildings [
200], enabling net zero goals to be achieved.
7.5 Marine Transportation
According to UNCTAD, over 80% of the volume of international trade in goods is carried by sea, and the percentage is even higher for most developing countries [
229]. Global shipping activity is estimated to have accounted for
\(3\%\) of global emissions in 2022 [
109]. These significant numbers, as well as the availability of large-scale ship trajectory data obtained from the
automatic identification system (AIS) [
18], motivated a lot of research efforts on mobility data analysis for maritime transportation. The stakeholders who seek the benefit of such analyses include the maritime authorities, environment officers, ship owners, port and canal managers, and the transport and logistic sectors.
One major challenge is to ensure safety at sea, which breaks down into the technical challenges of identifying positional anomalies [
193], locating dark vessels (vessels that switch off their AIS devices) [
147], and cleaning location and identity spoofing [
73]. Additionally, an essential aspect is the detection of fishing activities to ensure sustainable fishing practices [
58]. Since vessels do not have fixed routes in the sea, research has also investigated the density of ship routes [
248].
Multi-criteria routing using multiple optimization criteria, including estimated time of arrival, fuel consumption, safety, and comfort, has been increasingly recognized as an important path planning problem [
104]. Optimizing ship routes could lead to significant reductions in greenhouse gas emissions and contribute to the actions against anthropogenic global warming. The influence of ocean currents, waves, and wind on the course and speed of ships has been known for centuries. Used optimally, ocean currents lead to more efficient paths between two given ports. Ship route computation approaches that exploit the potential of wind, wave, and weather models to minimize fuel consumption have been addressed by the marine science, maritime engineering, and transportation communities [
77].
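A simple way to combine several routing criteria is to scalarize them into one edge cost and run a shortest-path search. The sketch below uses Dijkstra's algorithm on a small waypoint graph with a weighted sum of sailing time and fuel. It is an illustration only: the graph, the per-edge values, and the weights are hypothetical, the effects of currents and weather are assumed to be already folded into the per-edge time and fuel, and a weighted sum is only one of several ways to handle multiple criteria.

```python
import heapq


def weighted_route(graph, source, target, w_time, w_fuel):
    """Dijkstra over a weighted sum of time and fuel (weights must be non-negative).

    graph maps a node to a list of (neighbor, hours, fuel_tonnes) edges.
    Returns (path, cost), or (None, inf) if the target is unreachable.
    """
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path to node was already found
        for nxt, hours, fuel in graph.get(node, ()):
            new_cost = cost + w_time * hours + w_fuel * fuel
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]


sea = {
    "A": [("B", 2.0, 10.0), ("C", 3.0, 4.0)],
    "B": [("D", 2.0, 10.0)],
    "C": [("D", 3.0, 4.0)],
}
print(weighted_route(sea, "A", "D", w_time=1.0, w_fuel=0.0))  # fastest route
print(weighted_route(sea, "A", "D", w_time=0.0, w_fuel=1.0))  # most fuel-efficient
```

Varying the weights traces out different compromises between arrival time and fuel consumption, although a weighted sum cannot reach every Pareto-optimal route.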
Since green mobility is currently gaining a huge amount of attention, carbon dioxide emission–aware ship routing is expected to make an enormous impact on the economy, politics, and society and provides very promising opportunities for the spatial and spatiotemporal database and mobility communities. Marine transportation becomes particularly important in the scope of climate change (e.g., the advent of hydrogen/battery/fossil/atom hybrid vessels) as well as digitization for new infrastructure-free localization technologies on-board.
7.6 Social Connections
Location-based social networks (LBSNs) bridge the gap between the physical world and online social networking services [
269]. LBSN data capture both human mobility (in the form of check-ins to discrete points of interest) and a social network between individual humans. By combining mobility data and social networks, LBSN data finds many applications. An early application in the literature is modeling and describing human mobility patterns (e.g., [
55,
167]), analyzing these patterns (e.g., [
54]), and explaining why individual users choose locations and how social ties affect this choice (e.g., [
237]). Another application is that of location recommendation, which leverages check-ins of users and their ratings in the user-location network to recommend new locations to users [
26]. A closely related application area is location prediction (e.g., [
53]), which predicts the future check-ins of users. Another active research field in LBSN analysis is friend recommendation or social link prediction (e.g., [
201]), which suggests new friends to users based on similar interests at similar locations while also having similar social connections. Other research topics concerning LBSNs include efficient query processing (e.g., [
17]), finding user communities (e.g., [
257]), and estimating the social influence of users (e.g., [
246]).
This plethora of applications and research shows how mobility data in connection with social network data can be used to understand the social fabric that ties us together. A potential future application is using human mobility data to reinforce this social fabric by recommending social events and meetings to groups of people to help them find new friends, collaborators, sports mates, teachers, mentors, and family members.