US20230164219A1 - Access Pattern Driven Data Placement in Cloud Storage - Google Patents
- Publication number
- US20230164219A1, application US18/156,541
- Authority
- US
- United States
- Prior art keywords
- data
- data item
- uploaded
- file
- metadata
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0611—Improving I/O performance in relation to response time
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0647—Migration mechanisms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
- H04L65/401—Support for services or applications wherein the services involve a main real-time session and one or more additional parallel real-time or time sensitive sessions, e.g. white board sharing or spawning of a subconference
- H04L65/4015—Support for services or applications wherein the services involve a main real-time session and one or more additional parallel real-time or time sensitive sessions, e.g. white board sharing or spawning of a subconference where at least one of the additional parallel sessions is real time or time sensitive, e.g. white board sharing, collaboration or spawning of a subconference
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/52—Network services specially adapted for the location of the user terminal
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/535—Tracking the activity of the user
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
- H04L67/5681—Pre-fetching or pre-delivering data based on network characteristics
Definitions
- Global cloud storage services provide accessibility for large amounts of data from anywhere in the world once the data has been stored in the cloud. For example, an image uploaded in Europe may be immediately accessible for download in the United States.
- Global cloud storage services are often divided into various geographical regions in order to manage the large volume of uploaded data.
- a user request to access data is typically routed to a server nearest to the user, particularly a server in the user's geographic region.
- the server looks up the location of the requested data, and then forwards a request for the data to the server where the data is stored, which may be in a different geographic region.
- fetching the requested data may incur a high latency, which may degrade the requesting user's experience of the requested data.
- the long distance fetch also costs precious bandwidth for the service vendor, especially if there is a scarcity of network bandwidth between the user's geographic region and the data's geographic region, such as if not enough optic fiber cables are deployed between the two regions.
- Global cloud storage services commonly store uploaded data in the region from which the data is uploaded. This may be effective in those cases where the uploaded data is primarily downloaded in the same geographic region. However, in many cases, uploaded data is accessed primarily from other geographical regions, which could result in high network bandwidth costs.
- One aspect of the present disclosure is directed to a method for storing data in a distributed network having a plurality of datacenters distributed over a plurality of geographic regions.
- the method may include receiving, by one or more processors, data uploaded to a first datacenter of the distributed network, the uploaded data including metadata, receiving, by the one or more processors, access information about previously uploaded data, prior to the uploaded data being accessed, predicting, by the one or more processors, one or more of the plurality of geographic regions from which the uploaded data will be accessed based on the metadata and the access information, and instructing, by the one or more processors, the uploaded data to be transferred from the first datacenter to one or more second datacenters located at each of the one or more predicted geographic regions.
- the access information may be derived from a predictive model trained on metadata of the previously uploaded data.
- the predictive model may be a decision tree model.
- the metadata may include an identification of a user uploading the uploaded data, and a location from which the uploaded data is uploaded.
- the metadata of the previously uploaded data may include a location of the previously uploaded data, an identification of a user uploading the previously uploaded data and a location from which the previously uploaded data is uploaded.
- the metadata of the previously uploaded data may include an identification of a directory or a file path at which the previously uploaded data is stored.
- the access information may indicate an amount of time between an initial upload of the previously uploaded data and a first download of the previously uploaded data.
- the method may include predicting, by the one or more processors, an amount of time until the uploaded data is downloaded for a first time, and instructing, by the one or more processors, the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions, based on the predicted amount of time.
- instructing the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions may include determining, by the one or more processors, that the uploaded data is broadcast data based on the metadata and the access information, and for each given predicted geographic region, instructing, by the one or more processors, the uploaded data to be transferred to at least one caching server of each datacenter of the given predicted geographic region.
- the method may include instructing, by the one or more processors, the uploaded data to be included in a file including previously uploaded data having a common predicted geographic region, and instructing, by the one or more processors, the file to be transferred to one or more second datacenters located at the common predicted geographic region.
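The claimed method can be summarized as a rough sketch. All names here (`place_uploaded_data`, `predict_regions`, `instruct_transfer`) are illustrative stand-ins, not identifiers from the disclosure:

```python
def place_uploaded_data(upload, access_info, predict_regions, instruct_transfer):
    """Predict the regions from which newly uploaded data will be accessed,
    then instruct its transfer from the first datacenter to datacenters
    located at each predicted region."""
    # Prediction happens before the uploaded data is first accessed.
    regions = predict_regions(upload["metadata"], access_info)
    transfers = [instruct_transfer(upload["object_id"], r) for r in regions]
    return regions, transfers

# Toy stand-ins for the predictor and transfer mechanism.
regions, transfers = place_uploaded_data(
    {"object_id": "obj-1", "metadata": {"uploader": "u1", "upload_region": "eu"}},
    access_info={"u1": ["us", "asia"]},
    predict_regions=lambda md, info: info.get(md["uploader"], [md["upload_region"]]),
    instruct_transfer=lambda oid, region: f"transfer {oid} -> {region}",
)
print(regions)    # ['us', 'asia']
print(transfers)  # ['transfer obj-1 -> us', 'transfer obj-1 -> asia']
```

The key property the sketch preserves is ordering: the prediction is made from metadata and historical access information alone, before any access to the new data occurs.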
- Another aspect of the present disclosure is directed to a system for storing data in a distributed network having a plurality of datacenters distributed over a plurality of geographic regions. The system may include one or more storage devices at a first datacenter of the distributed network, configured to store data uploaded to the first datacenter, the uploaded data including metadata, and one or more processors in communication with the one or more storage devices.
- the one or more processors may be configured to receive access information about previously uploaded data that was previously stored in the plurality of datacenters of the distributed network, prior to the uploaded data being accessed, predict one or more of the plurality of geographic regions from which the uploaded data will be accessed based on the metadata and the access information, and instruct the uploaded data to be transferred from the first datacenter to one or more second datacenters located at each of the one or more predicted geographic regions.
- the access information may be derived from a predictive model trained on metadata of the previously uploaded data.
- the predictive model may be a decision tree model.
- the metadata may include an identification of a user uploading the uploaded data, and a location from which the uploaded data is uploaded.
- the metadata of the previously uploaded data may include a location of the previously uploaded data, an identification of a user uploading the previously uploaded data and a location from which the previously uploaded data is uploaded.
- the metadata of the previously uploaded data may include an identification of a directory or a file path at which the previously uploaded data is stored.
- the access information may indicate an amount of time between an initial upload of the previously uploaded data and a first download of the previously uploaded data.
- the one or more processors may be configured to predict an amount of time until the uploaded data is downloaded for a first time, and instruct the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions, based on the predicted amount of time.
- the one or more processors may be configured to instruct the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions based on a determination that the uploaded data is broadcast data based on the metadata, and for each given predicted geographic region, the one or more processors may be configured to instruct the uploaded data to be transferred to at least one caching server of each datacenter of the given predicted geographic region.
- the one or more processors may be configured to instruct the uploaded data to be included in a file including previously uploaded data having a common predicted geographic region, and instruct the file to be transferred to one or more second datacenters located at the common predicted geographic region.
- the file is initially stored at one or more source servers located at the first datacenter.
- the one or more processors may be configured to instruct data servers of the one or more second datacenters located at the common predicted geographic region to pull the file from the one or more source servers.
- FIG. 1 is a block diagram illustrating an example system according to aspects of the disclosure.
- FIG. 2 is a block diagram illustrating an example computing system according to aspects of the disclosure.
- FIGS. 3 and 4 are block diagrams illustrating an example data distribution scheme of a system according to aspects of the disclosure.
- FIG. 5 is a flow diagram illustrating an example method according to aspects of the disclosure.
- FIG. 6 is a flow diagram illustrating aspects of the flow diagram of FIG. 5 .
- the technology relates generally to a system for efficiently storing uploaded data across a distributed network.
- the system may include a location predictor or prediction program that predicts the location or locations from which an uploaded data file may be accessed in the future.
- the prediction may be based on uploading and downloading patterns, also referred to as “access patterns,” of previously uploaded data. Predicting access patterns of newly uploaded data can improve storage efficiency of the uploaded data, since the data can be strategically stored close to those locations from which it will be accessed in the future.
- the prediction program can be stored in a distributed network having multiple virtual machines across one or more datacenters.
- the uploaded data may begin by being stored in any datacenter of the network. Subsequently, the uploaded data is analyzed by the prediction program, and migrated to one or more other datacenters at which the uploaded data is predicted to be downloaded.
- Predictions may be based on metadata included in each of the uploaded data and the previously uploaded data.
- metadata of previously uploaded data may be used to train a predictive model, whereby the metadata may be related to various predictors of the model.
- Metadata of the uploaded data may be the same or similar to that of the previously uploaded data, whereby these similarities between the uploaded data and certain previously uploaded data may indicate a future access pattern of the uploaded data.
- the uploaded data may be transferred to the datacenters using offline data migration techniques. For example, separate large files may be set up to store uploaded data based on the destination datacenter of the data, whereby each large file has a different destination datacenter. Newly uploaded data may then be appended to one or more large files based on the datacenters at which the uploaded data is predicted to be downloaded. Data migration of the uploaded data may then be performed on a per-file basis, for instance, when a given file reaches a certain size limit, or after a predefined amount of time, such as 12 hours, has elapsed.
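The per-destination batching described above can be sketched as follows. The class name, the 1 GiB size limit, and the 12-hour age limit are illustrative assumptions (the disclosure names 12 hours as one possible time threshold but does not fix a size):

```python
import time

class MigrationFile:
    """One append-only batch file per destination datacenter; its contents
    are migrated when the file reaches a size limit or an age limit."""

    def __init__(self, destination, size_limit, age_limit_s):
        self.destination = destination      # e.g. a destination datacenter id
        self.size_limit = size_limit
        self.age_limit_s = age_limit_s
        self.objects = []
        self.bytes = 0
        self.created = time.time()

    def append(self, object_id, size):
        # Newly uploaded data predicted to be downloaded at `destination`
        # is appended here rather than migrated object-by-object.
        self.objects.append(object_id)
        self.bytes += size

    def should_migrate(self, now=None):
        now = time.time() if now is None else now
        return (self.bytes >= self.size_limit
                or now - self.created >= self.age_limit_s)

f = MigrationFile("region-2/datacenter-a", size_limit=1 << 30, age_limit_s=12 * 3600)
f.append("obj-1", 400 << 20)     # 400 MiB: below both thresholds
assert not f.should_migrate()
f.append("obj-2", 700 << 20)     # total now exceeds 1 GiB
print(f.should_migrate())        # True
```

Migration is then performed per file, amortizing the cross-region transfer cost over many objects bound for the same destination.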
- the metadata may also be used to predict an urgency for migrating the uploaded data to its destination. For example, broadcast data, such as a streamed file broadcast from one user and made immediately accessible to other users worldwide, may be in high demand across multiple regions both immediately and at a later time, and offline data migration may take too long to deliver the uploaded data to its destination datacenter. In such a case, identifying an urgency of migrating the data may be used to initiate a cache injection of the uploaded data, whereby the data is transferred to a caching server at the destination datacenter from which the data may be served to users locally. The cache injection may be performed in addition to, and prior to, the offline data migration.
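A minimal sketch of that urgency decision, assuming a predicted time-to-first-download and a broadcast flag as inputs (the one-hour threshold and all names are illustrative, not from the disclosure):

```python
def plan_placement(predicted_regions, seconds_to_first_download,
                   is_broadcast, urgency_threshold_s=3600):
    """Decide, per predicted region, whether to perform cache injection
    in addition to (and ahead of) offline migration."""
    urgent = is_broadcast or seconds_to_first_download < urgency_threshold_s
    plan = []
    for region in predicted_regions:
        if urgent:
            plan.append(("cache_inject", region))   # serve locally right away
        plan.append(("offline_migrate", region))    # durable placement later
    return plan

plan = plan_placement(["us", "asia"], seconds_to_first_download=120,
                      is_broadcast=True)
print(plan)
# [('cache_inject', 'us'), ('offline_migrate', 'us'),
#  ('cache_inject', 'asia'), ('offline_migrate', 'asia')]
```

Note that cache injection never replaces offline migration here; it only front-runs it for data predicted to be in immediate demand.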
- the above implementations can improve storage service of unstructured data within the distributed network, particularly for distributed networks having multiple datacenters spread out across multiple geographic regions.
- the improved storage service may make data that is uploaded in one part of the world more readily accessible in other parts of the world where the data is commonly accessed. This in turn can result in cost and time savings for users and service providers, since accessing data from a distant location is generally more costly and more time consuming than accessing data from a nearby location.
- FIG. 1 is a block diagram illustrating an example system including a distributed computing environment.
- the system 100 may be a cloud storage service providing users with the ability to upload data 101 to servers distributed across multiple geographic regions 110 , 120 , 130 , 140 of the system 100 .
- Each geographic region may include one or more datacenters.
- FIG. 1 shows datacenters 110 a and 110 b of Region 1 ( 110 ), datacenters 120 a and 120 b of Region 2 ( 120 ), datacenters 130 a and 130 b of Region 3 ( 130 ), and datacenters 140 a and 140 b of Region 4 ( 140 ), although the network may include additional regions, and each region may include additional datacenters.
- Each datacenter may include one or more data servers 145 configured to store the uploaded data.
- the datacenters 110 a , 110 b , 120 a , 120 b , 130 a , 130 b , 140 a , 140 b may be communicatively coupled, for example, over a network (not shown).
- the datacenters may further communicate with one or more client devices (not shown) over the network.
- Such operations may include uploading and accessing data, such as uploaded data 101 .
- Accessing data may include downloading the data, streaming the data, copying data from one folder or directory to another, or any other means by which data is made accessible in response to a user request received at a server of the system 100 .
- the datacenters may further communicate with a controller (not shown); thus, accessing the data may include making the data accessible in response to an instruction from the controller.
- the datacenters 110 a , 110 b , 120 a , 120 b , 130 a , 130 b , 140 a , 140 b may be positioned a considerable distance from one another.
- the datacenters may be positioned in various countries around the world.
- the regions 110 , 120 , 130 , 140 may group datacenters in relative proximity to one another. Further, in some examples the datacenters may be virtualized environments. Further, while only a few datacenters are shown, numerous datacenters may be coupled over the network and/or additional networks.
- each datacenter may include one or more computing devices 210 , such as processors 220 , servers, shards, cells, or the like. It should be understood that each datacenter may include any number of computing devices, that the number of computing devices in one datacenter may differ from a number of computing devices in another datacenter, and that the number of computing devices in a given datacenter may vary over time, for example, as hardware is removed, replaced, upgraded, or expanded.
- Each datacenter may also include a number of storage devices or memory 230 , such as hard drives, random access memory, disks, disk arrays, tape drives, or any other types of storage devices.
- the datacenters may implement any of a number of architectures and technologies, including, but not limited to, direct attached storage (DAS), network attached storage (NAS), storage area networks (SANs), fibre channel (FC), fibre channel over Ethernet (FCoE), mixed architecture networks, or the like.
- the datacenters may include a number of other devices in addition to the storage devices, such as communication devices 260 to enable input and output between the computing devices of the same datacenter or different datacenters, between computing devices of the datacenters and controllers (not shown) of the network system, and between the computing devices of the datacenters and client computing devices (not shown) connected to the network system, such as cabling, routers, etc.
- Memory 230 of each of the computing devices can store information accessible by the one or more processors 220 , including data 240 that is received at or generated by the one or more computing devices 210 , and instructions 250 that can be executed by the one or more processors 220 .
- the data 240 may include stored data 242 such as uploaded data objects, a metadata log 244 tracking metadata of the uploaded data objects 242 , as well as one or more migration files 246 and cached data files 248 at which uploaded data objects 242 may be stored before being transferred from one datacenter to another. Details of the above examples of stored data are discussed in greater detail below.
- the instructions 250 may include a location access prediction program 252 configured to predict the location or locations at which a given data object file is likely to be accessed. Such locations may be one or more regions of the distributed network.
- the instructions 250 may further include a data migration program 254 and a data caching program 256 configured to execute the transfer of data object files stored in the one or more migration files 246 and cached data files 248 , respectively. Details of the above examples of stored programs are also discussed in greater detail below.
- the controller may communicate with the computing devices in the datacenters, and may facilitate the execution of programs. For example, the controller may track the capacity, status, workload, or other information of each computing device, and use such information to assign tasks.
- the controller may include a processor and memory, including data and instructions. In other examples, such operations may be performed by one or more of the computing devices in one of the datacenters, and an independent controller may be omitted from the system.
- the uploaded data 101 uploaded to the datacenters may include metadata, indicating various properties of the uploaded data.
- the metadata may be logged at the datacenter to which the data is uploaded (in the example of FIG. 1 , datacenter 140 b ), at the datacenter where the data is stored, or both.
- a metadata log 150 is provided in datacenter 140 b to store the metadata 155 of the uploaded data.
- the metadata 155 may include an identification of a region, datacenter, or both, to which the data is uploaded. As discussed in greater detail below, because uploaded data is strategically migrated between geographical regions, in some cases the location at which the data is uploaded may differ from the location at which the data is stored. Additionally, in some cases, data may be stored in multiple locations, including or excluding the location to which it is uploaded. In such cases, the metadata may further include an identification of the region, datacenter, or both, at which the data is stored.
- Metadata may include, but is not limited to, a customer identification of the uploading party, an object name, an object type, an object size, a name of a directory or folder to which the data is uploaded (such as a bucket name used to store the object), an object name prefix (such as a file path of the uploaded object if more than one level of directory hierarchy is used to store the object), a time of upload, a time of first download, subsequent downloads and their times, a number of access requests, and so on.
- the metadata may include both properties of the object, as well as a running access log for the object.
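One plausible shape for such a metadata-log entry, combining object properties with a running access log. All field names are assumptions based on the properties listed above, not identifiers from the disclosure:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ObjectMetadata:
    """Illustrative metadata record: static object properties plus
    a running access log for the object."""
    customer_id: str
    object_name: str
    object_type: str
    object_size: int
    bucket_name: str                  # directory/folder the object is uploaded to
    name_prefix: str                  # file path, if a directory hierarchy is used
    upload_region: str
    upload_time: float
    first_download_time: Optional[float] = None
    download_times: list = field(default_factory=list)  # running access log
    access_requests: int = 0

md = ObjectMetadata("cust-7", "video.mp4", "video/mp4", 10_000_000,
                    "bucket-a", "media/2023/", "region-4", upload_time=0.0)
md.first_download_time = 42.0
md.download_times.append(42.0)
md.access_requests += 1
print(md.first_download_time - md.upload_time)  # time to first download: 42.0
```

The upload-to-first-download interval computed at the end is exactly the signal the access information exposes for training the predictor.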
- each location may be associated with different metadata for the same data object.
- the object stored in Region 1 may be downloaded sooner than the same object that is stored in Region 2 .
- the number of access requests for the object may vary from one location to the next.
- metadata from the uploaded object may be separately logged at each location where the object is ultimately stored.
- each of the datacenters 110 a , 110 b , 120 a , 120 b , 130 a , 130 b , 140 a , 140 b may include its own access log (not shown except for datacenter 140 b ), which may store a log of data objects or files that have been uploaded to that datacenter, including the metadata of those objects.
- Metadata for each uploaded object may be tracked across the multiple regions using a metadata aggregator 160 .
- the aggregator 160 may be capable of collecting metadata from the metadata logs of each datacenter on a regular basis, such as according to an aggregation schedule.
- the aggregated metadata may be timestamped to enable changes in the metadata to be tracked over time. For instance, aggregated logs collected by the metadata log aggregator may be categorized according to a duration of time represented by each aggregated log, such as metadata from the previous week, metadata from the previous month, metadata from the previous three months, metadata from a time period longer than the previous three months, and so on. Differences in metadata across the categorized logs may indicate changes in the uploaded data over time, such as an increasing or a decreasing interest in accessing the uploaded data. Additionally, metadata for a given uploaded data object may be tracked as a whole, or per storage location. Thus, the aggregated logs may indicate overall changes in metadata for a given uploaded data object, as well as region-specific or datacenter-specific changes for the uploaded data object.
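The duration-based categorization of aggregated logs can be sketched as simple trailing time windows. The window boundaries and names are illustrative (the disclosure mentions week, month, and three-month horizons as examples):

```python
WEEK = 7 * 24 * 3600  # seconds

def categorize(entries, now):
    """Group (timestamp, metadata) pairs into trailing duration buckets,
    so changes in access interest can be compared across time windows."""
    buckets = {"last_week": [], "last_month": [], "last_3_months": [], "older": []}
    for ts, meta in entries:
        age = now - ts
        if age <= WEEK:
            buckets["last_week"].append(meta)
        elif age <= 4 * WEEK:
            buckets["last_month"].append(meta)
        elif age <= 13 * WEEK:
            buckets["last_3_months"].append(meta)
        else:
            buckets["older"].append(meta)
    return buckets

now = 100 * WEEK
buckets = categorize([(now - 1, "a"), (now - 2 * WEEK, "b"), (now - 20 * WEEK, "c")], now)
print({k: len(v) for k, v in buckets.items()})
# {'last_week': 1, 'last_month': 1, 'last_3_months': 0, 'older': 1}
```

Comparing the same object's counts across buckets then reveals increasing or decreasing interest, per region or overall.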
- the aggregated data may then be fed to a predictive model 170 in order to train the model 170 to predict where future uploaded data is most likely to be accessed.
- the predictive model 170 may be a machine learning algorithm stored in the system 100 , such as at one of the datacenters of the system.
- the predictive model may be a decision tree model, whereby the aggregated data may include information about how often and from where previously uploaded data was accessed, and thus may associate a cost with the placement of the previously uploaded data in the system. Based on this information, the predictive model 170 may determine strategic placements for future uploaded data based on the access patterns of the past uploaded data.
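A full decision tree is beyond a sketch, but a depth-1 tree (a decision stump) over a single categorical metadata feature illustrates the idea. The record shape and feature choice (`uploader`) are assumptions for illustration:

```python
from collections import Counter, defaultdict

def train_stump(records, feature):
    """Train a depth-1 decision tree: for each observed value of `feature`,
    predict the access region most frequently seen with that value;
    unseen values fall back to the global majority region."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for rec in records:
        by_value[rec[feature]][rec["access_region"]] += 1
        overall[rec["access_region"]] += 1
    default = overall.most_common(1)[0][0]
    branch = {v: c.most_common(1)[0][0] for v, c in by_value.items()}

    def predict(rec):
        return branch.get(rec.get(feature), default)
    return predict

history = [
    {"uploader": "u1", "access_region": "us"},
    {"uploader": "u1", "access_region": "us"},
    {"uploader": "u2", "access_region": "eu"},
]
predict = train_stump(history, "uploader")
print(predict({"uploader": "u1"}))  # us
print(predict({"uploader": "u3"}))  # us (global majority fallback)
```

A production model would split on many metadata features (bucket, prefix, object type, upload region) and weigh placement cost, but the branch-on-feature-value structure is the same.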
- the predictive model 170 may be dynamic. For example, the aggregation of metadata and access logs by the aggregator 160 may occur on a consistent, and possibly scheduled, basis, such as once every week, once every two weeks, once every month, or less or more frequently.
- the frequency at which the model is updated may further depend on the nature of data stored and shared on the particular network to which the system 100 is being applied. That is, for some products that change relatively slowly, once a month may be sufficient. However, for other platforms or products where user tendencies are constantly changing and access patterns are constantly shifting, a once-a-month update may be insufficient, and more regular updates of the model, such as once a week, may be preferable.
- the predictive model 170 may be used to predict where current or future uploaded data will most likely be accessed.
- the data is initially stored in the data server 145 as it is processed by the access location predictor 180 .
- the access location predictor 180 may determine whether the data is likely to be accessed in the same region in which it was uploaded, or whether the data is likely to be accessed in another region of the system.
- the access location predictor 180 may further predict an amount of time between the uploaded data 101 being uploaded and it being downloaded.
- the uploaded data 101 may remain stored in the data server 145 , may be stored at one or more migration files 190 , may be stored in the caching server 195 , or any combination thereof. Storage determination operations are discussed in greater detail below in connection with FIG. 5 .
- Storage in the data server 145 is generally permanent, meaning that the data is intended to be stored there and not moved, and thus may remain stored indefinitely or until manually deleted.
- storage in the migration files 190 and caching server 195 is generally temporary, meaning that the data is intended to be transferred to another location and may be deleted automatically at a time after the intended transfer. Data transfer operations are discussed in greater detail below in connection with FIG. 6 .
- While the migration files 190 are shown separately from the data server in FIG. 1 , it should be recognized that the migration files may actually be stored at one or more servers of the datacenter 140 b , and thus may be stored at the data server. Additionally, while the contents of the migration files may be regularly deleted, such as after the migration, the file itself may be permanent. Furthermore, the file may include permanent information, such as header information, indicating a destination to which the contents of the file are to be sent.
- the migration files may be files of a distributed file system of the network. It should also be recognized that data stored in the data servers may also be stored in large files of the distributed file system. Effectively, the large files of the data server may be the files having information indicating that the contents of the file are at their intended destination. For example, a file written to the data server 145 of datacenter 140 b may have a header indicating a destination of Region 4 , whereby it may be determined that the contents of the file do not need to be sent to a different region.
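The migration-file layout described above — a common header naming the destination once, followed by appended object records — can be sketched as below. The serialization format (line-delimited JSON) and field names are illustrative assumptions; the disclosure only requires that the header indicate the destination.

```python
import io
import json

def write_migration_file(destination_region, objects):
    """Serialize a migration file whose header names the destination region
    once, so per-object records need not repeat that common metadata."""
    buf = io.StringIO()
    buf.write(json.dumps({"destination": destination_region}) + "\n")  # common header
    for name, payload in objects:  # each appended object keeps its own metadata
        buf.write(json.dumps({"name": name, "payload": payload}) + "\n")
    return buf.getvalue()

def read_destination(serialized):
    """Read only the header to learn where the file's contents must go."""
    header = json.loads(serialized.splitlines()[0])
    return header["destination"]
```

A file whose header names the local region needs no transfer at all, matching the example of a file written at datacenter 145 with a header indicating Region 4.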
- the caching servers 195 may be used by the datacenter both for predictive injection caching as described herein, and for on-demand caching for actual data requests (as compared to the speculated requests that trigger injection, as described herein).
- the metadata aggregator 160 and the predictive model 170 of FIG. 1 are shown separately from each of the geographical regions 110 , 120 , 130 , 140 and datacenters, and the access location predictor 180 is shown as being included in datacenter 140 b .
- the data stored at, and the instructions executed by the aggregator 160 , the predictive model 170 , and the access location predictor 180 may be anywhere in the system, such as in the regions and datacenters shown, or in other regions or datacenters not shown, or any combination thereof, including distributed across multiple datacenters of a geographical region, or distributed across multiple geographical regions of the system.
- FIG. 3 is a flow diagram illustrating an example routine 300 for storing data in a distributed network.
- the network may include multiple datacenters, such as datacenters 110 a , 110 b , 120 a , 120 b , 130 a , 130 b , 140 a and 140 b of FIG. 1 , distributed over various geographic regions, such as regions 110 , 120 , 130 and 140 of FIG. 1 .
- Some of the operations in the method may be carried out by processors of the datacenters from and to which the data is being transferred, whereas some operations may be carried out by processors of other datacenters, or processors and servers independent of the datacenters or geographical regions.
- data may be uploaded to a datacenter belonging to a first region of the network.
- the uploaded data may include a data object as well as metadata of the uploaded data object, such as a time of upload, a type of data object, a location from which the object is uploaded, and so on.
- access information about previously uploaded data that was previously stored in the network may be received and analyzed.
- the previously uploaded data may have also included metadata at the time of its upload, and may further have additional metadata that was gathered after the upload, such as metadata indicating a time that the previously uploaded data was accessed, locations from which the previously uploaded data was accessed, and so on.
- a prediction as to the geographical regions of the network from which the uploaded data is likely to be accessed may be made. This prediction or determination may be based on the metadata of the currently uploaded data, as well as the access information about the previously uploaded data. Additionally, the prediction may precede the currently uploaded data being accessed, whereby the metadata of the currently uploaded data at a time of upload may be sufficient for the prediction. Patterns recognized in the information of the previously uploaded data may indicate a likely outcome for access of the currently uploaded data, and thus may be used to predict an ideal location to store the currently uploaded data. In many cases, it may be preferable to store the uploaded data at the datacenter to which it is originally uploaded. However, in other cases, it may be preferable to additionally, or alternatively, store the uploaded data in a different datacenter, or even in a different geographical region.
- the uploaded data is directed to be transferred from the originating datacenter to other datacenters at which the data is likely to be accessed.
- the other datacenters may be located in geographical regions other than the first geographical region, thus making access to the uploaded data at those other regions more efficient.
- Efficiency may be a measure of accessing data faster, costing less overall bandwidth, being performed over a connection having more available bandwidth, or any combination of these and other factors.
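One way to combine the factors above into a single comparable figure is a weighted cost function, as sketched below. The particular factors, weights, and function names are illustrative assumptions; the disclosure does not specify a cost formula.

```python
def transfer_cost(latency_ms, bandwidth_cost, congestion, weights=(1.0, 1.0, 1.0)):
    """Combine access latency, bandwidth cost, and link congestion into a
    single cost figure; lower is more efficient."""
    w_lat, w_bw, w_cong = weights
    return w_lat * latency_ms + w_bw * bandwidth_cost + w_cong * congestion

def pick_cheapest(candidates):
    """candidates: {region: (latency_ms, bandwidth_cost, congestion)}.
    Return the region with the lowest combined cost."""
    return min(candidates, key=lambda region: transfer_cost(*candidates[region]))
```

Tuning the weights lets an operator favor, for example, reduced latency over reduced bandwidth spend.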
- FIG. 4 is a flow diagram illustrating an example subroutine 400 of the routine 300 of FIG. 3 .
- the subroutine 400 shows example operations that may be performed to carry out the determination of one or more destination regions for an uploaded data object, as well as distribution of the data object to one or more datacenters of the determined destination regions.
- one or more geographical regions from which the currently uploaded data will be accessed are predicted.
- This prediction may be made by an access prediction program, such as access location predictor 180 shown in FIG. 1 , and may be based on an output from a prediction model that has been trained on access logs of previously uploaded data, such as predictive model 170 shown in FIG. 1 .
- the prediction model may recognize access patterns of the previously uploaded data, and may predict based on those patterns where the currently uploaded data is likely to be accessed.
- the currently uploaded data may be sent, such as copied or moved, to one or more files designated for migration of uploaded data to a destination region other than the first region, such as migration files 190 of FIG. 1 .
- the files to which the currently uploaded data is sent may be based on the prediction of block 410 , whereby the data will be transferred to those regions at which it is expected to be accessed.
- an amount of time until the currently uploaded data will be accessed at the predicted geographical regions is predicted.
- This prediction may also be based on an output from the prediction model.
- access logs of previously uploaded data fed to the prediction model as training data should include information from which a duration between upload and a first access of the previously uploaded data can be determined or derived, such as an upload time, and a log of a first time or every time at which the data is accessed.
- the prediction model may recognize access patterns of the previously uploaded data, and may predict based on those patterns when the currently uploaded data is likely to be accessed.
- the prediction of when the currently uploaded data is likely to be accessed may be compared to a threshold value, such as an amount of time from a current time. If the predicted amount of time until the currently uploaded data is accessed exceeds or is equal to the threshold amount, meaning that the currently uploaded data is not expected to be accessed on a relatively immediate basis, then operations may conclude, and the data may be migrated at a relatively slow pace using the one or more migration files.
- Conversely, if the predicted amount of time is less than the threshold amount, operations may continue at block 450 , whereby the currently uploaded data is sent (copied or moved) to a caching server, such as caching server 195 of FIG. 1 , to be injected into caching servers of a remote datacenter, either in the same geographical region or in one or more different geographical regions.
- the determination of which datacenters, regions, or both to which the data is injected may be based on the determinations of the access location predictor at block 410 .
- the decision to send uploaded data to a migration file is shown as having been made prior to determining an urgency or priority of transferring the data.
- the determination to migrate the data is made regardless of whether the data is needed sooner or later, that is, regardless of whether or not the data is also injected to the destinations.
- the system may determine to inject the uploaded data to caching servers of remote datacenters, but to not add the uploaded data to migration files in order to avoid the data needlessly being stored permanently at the remote datacenters.
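The placement decision of subroutine 400 — migrate via files, inject into caches when access is imminent, or both — can be sketched as below. The function name, threshold default, and dictionary structure are illustrative assumptions.

```python
def place_uploaded_object(predicted_regions, origin_region,
                          predicted_hours_to_access, threshold_hours=24):
    """Decide whether an uploaded object is sent to migration files,
    injected into remote caches, or kept only at the origin region."""
    remote = [r for r in predicted_regions if r != origin_region]
    placement = {"migrate_to": remote, "inject_to": []}
    # If the object is expected to be accessed before the threshold elapses,
    # slow migration alone is insufficient: also inject into remote caches.
    if remote and predicted_hours_to_access < threshold_hours:
        placement["inject_to"] = remote
    return placement
```

Note that, as the text observes, injection could also occur without migration (e.g. demand that is only transient); this sketch follows the flow of FIG. 4, where migration is decided first.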
- FIG. 5 is an operational block diagram 500 showing an example operation of an access location predictor 520 predicting geographical regions of the network from which uploaded data is likely to be accessed, and directing the uploaded data to be transferred to the predicted geographical regions, such as is shown in blocks 330 and 340 of routine 300 of FIG. 3 .
- the access location predictor 520 selectively moves or copies data objects 501 1 - 501 N uploaded to a datacenter of Region 4 to migration files 532 , 534 , 536 , caching servers 550 , or both.
- the access location predictor 520 may determine a placement of each uploaded data object 501 1 - 501 N based on metadata from the object and an output of the predictive model used to predict the location or locations at which the uploaded data is likely to be accessed.
- each of the uploaded data objects 501 1 - 501 N has different metadata.
- the access location predictor 520 determines a different placement strategy for each of the uploaded data objects 501 1 - 501 N .
- In the example of Uploaded object 1 ( 501 1 ), the access location predictor 520 determines that this object is likely to be accessed at Region 2 . Therefore, object 1 is moved from the data server to migration file 534 , which may be a file dedicated for objects that are to be migrated from the datacenter at which object 1 is uploaded to Region 2 .
- each migration file may include common metadata 542 , such as a header to the file, which may indicate a destination of the file.
- The common metadata avoids the need for this metadata to be separately written to the migration file for each appended object 544 , which may save space in the migration file and may further reduce processing requirements for moving the objects from the data server to the migration file.
- In the case of a destination region, there is also no need for this metadata to be rewritten to the object after the migration, since the destination region will remain common metadata of the object and all the other objects stored at the region. Aside from the common metadata, such as the destination region, the remaining metadata of each object may be moved or copied with the object so as to preserve the object metadata during the migration.
- the migration file may have a predetermined capacity, whereby when moving or copying an object to the migration file causes the migration file to meet or exceed the predetermined capacity, the migration file may be transferred to one or more datacenters of the destination region. Additionally, or alternatively, the migration file may be transferred to one or more datacenters of the destination region after a predetermined amount of time has elapsed since creation of the migration file. Operations for transferring the objects are described in greater detail below in connection with FIG. 4 .
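The two transfer triggers described above — reaching a capacity, or aging past a limit — can be sketched as a small class. The class name, units, and thresholds are illustrative assumptions.

```python
import time

class MigrationFile:
    """Accumulates objects bound for one destination region, and flags
    itself ready to transfer when it reaches a byte capacity or a
    maximum age since creation."""

    def __init__(self, destination, capacity_bytes, max_age_seconds, created_at=None):
        self.destination = destination
        self.capacity_bytes = capacity_bytes
        self.max_age_seconds = max_age_seconds
        self.created_at = created_at if created_at is not None else time.time()
        self.size_bytes = 0

    def append(self, object_size):
        self.size_bytes += object_size

    def ready_to_transfer(self, now=None):
        now = now if now is not None else time.time()
        full = self.size_bytes >= self.capacity_bytes
        aged = (now - self.created_at) >= self.max_age_seconds
        return full or aged
```

Either trigger alone suffices, so a sparsely filled file still leaves within the age limit while a busy file leaves as soon as it fills.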
- the object may be moved to a single destination, such as in the example of Object 1 ( 501 1 ).
- the access location predictor 520 may determine to move or copy the object to more than one migration file.
- In the example of Uploaded object 2 ( 501 2 ), the access location predictor 520 determines that this object is likely to be accessed at each of Regions 1 , 3 and 4 . Therefore, object 2 is copied from the data server to each of migration files 532 and 536 , which may be files dedicated for objects that are to be migrated from the datacenter at which object 2 is uploaded to Regions 1 and 3 , respectively. The object may also remain permanently stored at the data server so that it may be accessed at Region 4 .
- migration of objects from one region to another may begin after the file has been filled with several data objects. Since the migration file may be large, it may take time before the migration file is filled. However, in some cases, a data object may be in high demand at a remote region of the network, but only for a time before the data migration occurs. In this case, the slow-paced strategic relocation of data objects using migration files would undermine the ability for users of the other regions to efficiently access the data object while it is in high demand.
- the object file may include metadata from which it may be determined how soon after the data is uploaded it will be accessed.
- the metadata aggregator 160 can collect metadata showing both a time of upload and a time of earliest download for each stored object. This data may then be used to train the predictive model 170 to predict whether a future uploaded data object will be accessed soon after or long after the object is uploaded to the system. In turn, this information may be used by the access location predictor 180 to determine, for any given uploaded data object, an amount of time until the object is likely to be downloaded.
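The upload-to-first-download durations described above can be derived from the aggregated logs as sketched below; these durations would then serve as training labels for the predictive model. The function name and log structures are illustrative assumptions.

```python
def time_to_first_access(upload_log, access_log):
    """Derive training labels: seconds from upload to first download.
    upload_log: {object_id: upload_timestamp}
    access_log: {object_id: [access_timestamp, ...]}"""
    labels = {}
    for obj, uploaded in upload_log.items():
        accesses = access_log.get(obj)
        if accesses:  # objects never accessed yield no label
            labels[obj] = min(accesses) - uploaded
    return labels
```

Given such labels, a model can predict, for a newly uploaded object, whether its first download is likely to fall within the injection threshold.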
- the system may further store a threshold time value, whereby if the predicted amount of time until an object is likely to be downloaded is equal to or less than the threshold amount, then the object may bypass the usual data migration scheme via the migration files, and be moved or copied to a caching server 550 for a relatively faster transfer of the uploaded data object to other regions of the system.
- In the example of Uploaded object 3 ( 501 3 ), the access location predictor 520 determines that this object is likely to be in high demand within a threshold amount of time. Therefore, object 3 is copied from the data server to a caching server 550 , so that the object may be injected to the caching servers of remote datacenters in Region 4 as well as in other regions, including but not limited to Region 1 , Region 2 and Region 3 , based on the determinations of the access location predictor 520 . Object 3 may remain stored at the data server of the datacenter.
- an uploaded object may be accessed from only the region at which it was uploaded. This may be the case for personally stored files that are not shared among users, or for files that are shared among a group of users in close geographic proximity to one another.
- the access location predictor 520 may determine that the uploaded object should not be copied to either a migration file 532 , 534 , 536 or to a caching server 550 . Instead, the object remains stored at a data server of the datacenter where the object was uploaded. In the example of Uploaded object 4 ( 501 4 ), the access location predictor 520 determines that this object is likely to be accessed at only the originating Region 4 . Therefore, object 4 is not copied from the data server.
- In the example of Uploaded object N ( 501 N ), which may be broadcast data such as a streamed video file, the access location predictor 520 determines that this object is likely to be in high demand within a threshold amount of time, as well as at a later time, at both Regions 3 and 4 . Therefore, object N is copied from the data server to both migration file 536 and caching server 550 , so that the object may be injected to the caching servers of Region 3 to address immediate demand, as well as migrated to permanent storage of Region 3 to address long term demand.
- FIG. 6 is a block diagram showing an example operation for distributing uploaded data objects according to the determinations made by the access location predictor, such as access location predictor 180 of FIG. 1 or access location predictor 520 of FIG. 5 .
- the example of FIG. 6 shows an uploaded object that is both migrated (long term) and injected (short term) from Region 4 ( 640 ) to both Region 1 ( 610 ) and Region 2 ( 620 ) of a distributed network system 600 .
- Each region may include one or more datacenters 610 a , 620 a , 640 a , whereby each datacenter may include one or more respective processors 612 , 622 , 642 , data servers 614 , 624 (not shown for Region 4 ), and caching servers 616 , 626 , 646 .
- migration files 644 are shown as being stored at the datacenter 640 a of Region 4 , although migration files may also be stored at the datacenters of other regions to facilitate objects uploaded at those other regions also being transferred throughout the network.
- moving the contents of the migration files 644 of the Region 4 datacenter 640 a may begin with a data migration controller 650 of the system 600 executing a program whereby the datacenter 640 a is queried for files of data stored in its servers.
- the processor 642 may receive the query, and in response may provide information indicating the destination of each file.
- the data migration controller 650 may take no further action.
- the data migration controller 650 may determine to initiate a migration of data from the files to the identified destinations.
- the data migration controller 650 may transmit an instruction to one or more processors of each identified destination, such as processors 612 , 622 . Based on these instructions, the processors may instruct data servers 614 , 624 at their respective locations to perform the data migration, whereby each data server 614 , 624 may access and read the respective migration file 644 at Region 4 . After reading the migration file 644 , the contents of the migration file 644 may be deleted. This process may be repeated for future data objects that are uploaded to the datacenter 640 a in Region 4 ( 640 ).
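The controller-driven cycle described above — read each file's destination header, have the destination pull the contents, then clear the file — can be sketched as follows. The data structures and function name are illustrative assumptions standing in for the servers and RPCs of the actual system.

```python
def run_migration_cycle(migration_files, destination_servers):
    """One pass of a data migration controller.
    migration_files: list of dicts with 'destination' and 'contents'.
    destination_servers: {region: list}, standing in for remote data servers."""
    for mfile in migration_files:
        dest = mfile["destination"]  # destination learned from the file header
        if dest in destination_servers and mfile["contents"]:
            # The destination's data server pulls (reads) the migration file...
            destination_servers[dest].extend(mfile["contents"])
            # ...and the contents are deleted after the transfer completes.
            mfile["contents"].clear()
    return destination_servers
```

Files addressed to unknown or unreachable destinations are simply left intact for a later cycle, matching the controller's option to take no further action.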
- Moving the contents of the caching server 646 of the Region 4 datacenter 640 a may begin with the server 646 prefetching the data to be transferred, and injecting it to caching servers 616 , 626 of remote datacenters according to instructions of the access location predictor.
- the injection may be performed on a relatively immediate scale, meaning that there are no further steps to be executed prior to initiating the data injection. This may make the data available in the other datacenters as fast as possible.
- cache injection may place copies of the uploaded and transferred object in multiple datacenters of any given region to which it is sent, including but not limited to caches of all datacenters of the destination region. This may allow for on-demand data to be accessed by many users in a short period of time. Synchronous cache injection is relatively fast and efficient, compared to synchronous replication.
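The fan-out described above — placing copies in the caches of every datacenter of each destination region — can be sketched as below. The nested dictionary shape and function name are illustrative assumptions.

```python
def inject_to_caches(obj, destination_regions, region_datacenters):
    """Fan an object out to the cache of every datacenter in each
    destination region, so on-demand requests hit a nearby copy.
    region_datacenters: {region: {datacenter_name: cache_list}}."""
    injected = []
    for region in destination_regions:
        for dc_name, cache in region_datacenters.get(region, {}).items():
            cache.append(obj)
            injected.append((region, dc_name))
    return injected
```

Because every datacenter of a destination region receives a copy, many users in that region can access in-demand data quickly without cross-region fetches.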
- The examples described herein generally describe a location as being closer to a user when that location is geographically closer.
- “closeness” of data is not necessarily a measure of geographic distance, but rather a measure of cost to access data.
- the location may be chosen so as to reduce overall costs for accessing the data, such as bandwidth, time, fees for bandwidth use between contracting parties, or any combination of these or other factors.
Abstract
A system and method for storing data in a distributed network having a plurality of datacenters distributed over a plurality of geographic regions. The method may involve receiving data, including metadata, uploaded to a first datacenter of the distributed network, receiving access information about previous data that was previously stored in the plurality of datacenters of the distributed network, predicting one or more of the plurality of geographic regions from which the uploaded data will be accessed based on the metadata and the access information, and instructing the uploaded data to be transferred from the first datacenter to one or more second datacenters located at each of the one or more predicted geographic regions.
Description
- Global cloud storage services provide accessibility for large amounts of data from anywhere in the world once the data has been stored in the cloud. For example, an image uploaded in Europe may be immediately accessible for download in the United States.
- Global cloud storage services are often divided into various geographical regions in order to manage the large volume of uploaded data. As such, a user request to access data is typically routed to a server nearest to the user, and particularly in the user's geographic region. The server then looks up the location of the requested data, and then forwards a request for the data to the server where the data is stored, which may be in a different geographic region.
- When requested data is stored far from the requesting user, fetching the requested data may incur a high latency, which may degrade the requesting user's experience of the requested data. The long distance fetch also costs precious bandwidth for the service vendor, especially if there is a scarcity of network bandwidth between the user's geographic region and the data's geographic region, such as if not enough optic fiber cables are deployed between the two regions.
- Global cloud storage services commonly store uploaded data in the region from which the data is uploaded. This may be effective in those cases where the uploaded data is primarily downloaded in the same geographic region. However, in many cases, uploaded data is accessed primarily from other geographical regions, which could result in high network bandwidth costs.
- One aspect of the present disclosure is directed to a method for storing data in a distributed network having a plurality of datacenters distributed over a plurality of geographic regions. The method may include receiving, by one or more processors, data uploaded to a first datacenter of the distributed network, the uploaded data including metadata, receiving, by the one or more processors, access information about previously uploaded data, prior to the uploaded data being accessed, predicting, by the one or more processors, one or more of the plurality of geographic regions from which the uploaded data will be accessed based on the metadata and the access information, and instructing, by the one or more processors, the uploaded data to be transferred from the first datacenter to one or more second datacenters located at each of the one or more predicted geographic regions.
- In some examples, the access information may be derived from a predictive model trained on metadata of the previously uploaded data.
- In some examples, the predictive model may be a decision tree model.
- In some examples, the metadata may include an identification of a user uploading the uploaded data, and a location from which the uploaded data is uploaded.
- In some examples, the metadata of the previously uploaded data may include a location of the previously uploaded data, an identification of a user uploading the previously uploaded data and a location from which the previously uploaded data is uploaded.
- In some examples, the metadata of the previously uploaded data may include an identification of a directory or a file path at which the previously uploaded data is stored.
- In some examples, the access information may indicate an amount of time between an initial upload of the previously uploaded data and a first download of the previously uploaded data. The method may include predicting, by the one or more processors, an amount of time until the uploaded data is downloaded for a first time, and instructing, by the one or more processors, the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions, based on the predicted amount of time.
- In some examples, instructing the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions may include determining, by the one or more processors, that the uploaded data is broadcast data based on the metadata and the access information, and for each given predicted geographic region, instructing, by the one or more processors, the uploaded data to be transferred to at least one caching server of each datacenter of the given predicted geographic region.
- In some examples, the method may include instructing, by the one or more processors, the uploaded data to be included in a file including previously uploaded data having a common predicted geographic region, and instructing, by the one or more processors, the file to be transferred to one or more second datacenters located at the common predicted geographic region.
- In some examples, the file may be initially stored at one or more source servers located at the first datacenter. Instructing the file to be transferred may include instructing, by the one or more processors, data servers of the one or more second datacenters located at the common predicted geographic region to pull the file from the one or more source servers.
- Another aspect of the present disclosure is directed to a system for storing data in a distributed network having a plurality of datacenters distributed over a plurality of geographic regions. The system may include one or more storage devices at a first datacenter of the distributed network, configured to store data uploaded to the first datacenter, the uploaded data including metadata, and one or more processors in communication with the one or more storage devices. The one or more processors may be configured to receive access information about previously uploaded data that was previously stored in the plurality of datacenters of the distributed network, prior to the uploaded data being accessed, predict one or more of the plurality of geographic regions from which the uploaded data will be accessed based on the metadata and the access information, and instruct the uploaded data to be transferred from the first datacenter to one or more second datacenters located at each of the one or more predicted geographic regions.
- In some examples, the access information may be derived from a predictive model trained on metadata of the previously uploaded data.
- In some examples, the predictive model may be a decision tree model.
- In some examples, the metadata may include an identification of a user uploading the uploaded data, and a location from which the uploaded data is uploaded.
- In some examples, the metadata of the previously uploaded data may include a location of the previously uploaded data, an identification of a user uploading the previously uploaded data and a location from which the previously uploaded data is uploaded.
- In some examples, the metadata of the previously uploaded data may include an identification of a directory or a file path at which the previously uploaded data is stored.
- In some examples, the access information may indicate an amount of time between an initial upload of the previously uploaded data and a first download of the previously uploaded data. The one or more processors may be configured to predict an amount of time until the uploaded data is downloaded for a first time, and instruct the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions, based on the predicted amount of time.
- In some examples, the one or more processors may be configured to instruct the uploaded data to be transmitted to one or more caching servers located at the one or more predicted geographic regions based on a determination that the uploaded data is broadcast data based on the metadata, and for each given predicted geographic region, the one or more processors may be configured to instruct the uploaded data to be transferred to at least one caching server of each datacenter of the given predicted geographic region.
- In some examples, the one or more processors may be configured to instruct the uploaded data to be included in a file including previously uploaded data having a common predicted geographic region, and instruct the file to be transferred to one or more second datacenters located at the common predicted geographic region.
- In some examples, the file is initially stored at one or more source servers located at the first datacenter. The one or more processors may be configured to instruct data servers of the one or more second datacenters located at the common predicted geographic region to pull the file from the one or more source servers.
- FIG. 1 is a block diagram illustrating an example system according to aspects of the disclosure.
- FIG. 2 is a block diagram illustrating an example computing system according to aspects of the disclosure.
- FIGS. 3 and 4 are block diagrams illustrating an example data distribution scheme of a system according to aspects of the disclosure.
- FIG. 5 is a flow diagram illustrating an example method according to aspects of the disclosure.
- FIG. 6 is a flow diagram illustrating aspects of the flow diagram of FIG. 5 .
- The technology relates generally to a system for efficiently storing uploaded data across a distributed network. The system may include a location predictor or prediction program that predicts the location or locations from which an uploaded data file may be accessed in the future. The prediction may be based on uploading and downloading patterns, also referred to as “access patterns,” of previously uploaded data. Predicting access patterns of newly uploaded data can improve storage efficiency of the uploaded data, since the data can be strategically stored close to those locations from which it will be accessed in the future.
- In some implementations, the prediction program can be stored in a distributed network having multiple virtual machines across one or more datacenters. The uploaded data may begin by being stored in any datacenter of the network. Subsequently, the uploaded data is analyzed by the prediction program, and migrated to one or more other datacenters at which the uploaded data is predicted to be downloaded.
- Predictions may be based on metadata included in each of the uploaded data and the previously uploaded data. For example, metadata of previously uploaded data may be used to train a predictive model, whereby the metadata may be related to various predictors of the model. Metadata of the uploaded data may be the same or similar to that of the previously uploaded data, whereby these similarities between the uploaded data and certain previously uploaded data may indicate a future access pattern of the uploaded data.
- In some implementations, the uploaded data may be transferred to the datacenters using offline data migration techniques. For example, separate large files may be set up to store uploaded data based on the destination datacenter of the data, whereby each large file has a different destination datacenter. Newly uploaded data may then be appended to one or more large files based on the datacenters at which the uploaded data is predicted to be downloaded. Data migration of the uploaded data may then be performed on a per-file basis, for instance, when a given file reaches a certain size limit, or after a predefined amount of time, such as 12 hours, has elapsed.
- In some implementations, the metadata may also be used to predict an urgency for migrating the uploaded data to its destination. For example, broadcast data, such as a streamed file broadcast from one user and made immediately accessible to other users worldwide, may be in high demand across multiple regions both immediately and at a later time; in such a case, offline data migration may take too long to deliver the uploaded data to its destination datacenter. Identifying an urgency of migrating the data may thus be used to initiate a cache injection of the uploaded data, whereby the data is transferred to a caching server at the destination datacenter from which the data may be served to users locally. The cache injection may be performed in addition to, and prior to, the offline data migration.
- The above implementations can improve storage service of unstructured data within the distributed network, particularly for distributed networks having multiple datacenters spread out across multiple geographic regions. The improved storage service may make data that is uploaded in one part of the world more readily accessible in other parts of the world where the data is commonly accessed. This in turn can result in cost and time savings for users and service providers, since accessing data from a distant location is generally more costly and more time consuming than accessing data from a nearby location.
-
FIG. 1 is a block diagram illustrating an example system including a distributed computing environment. The system 100 may be a cloud storage service providing users with the ability to upload data 101 to servers distributed across multiple geographic regions of the system 100. Each geographic region may include one or more datacenters. FIG. 1 shows datacenters in each of the geographic regions, and each datacenter may include one or more data servers 145 configured to store the uploaded data. - The datacenters may receive requests to access the uploaded data 101. Accessing data may include downloading the data, streaming the data, copying data from one folder or directory to another, or any other means by which data is made accessible in response to a user request received at a server of the system 100. In some examples, the datacenters may further communicate with a controller (not shown); thus, accessing the data may include making the data accessible in response to an instruction from the controller. - The datacenters may be distributed among the geographic regions of the system. - As shown in
FIG. 2 , each datacenter may include one or more computing devices 210, such as processors 220, servers, shards, cells, or the like. It should be understood that each datacenter may include any number of computing devices, that the number of computing devices in one datacenter may differ from a number of computing devices in another datacenter, and that the number of computing devices in a given datacenter may vary over time, for example, as hardware is removed, replaced, upgraded, or expanded. - Each datacenter may also include a number of storage devices or memory 230, such as hard drives, random access memory, disks, disk arrays, tape drives, or any other types of storage devices. The datacenters may implement any of a number of architectures and technologies, including, but not limited to, direct attached storage (DAS), network attached storage (NAS), storage area networks (SANs), fibre channel (FC), fibre channel over Ethernet (FCoE), mixed architecture networks, or the like. The datacenters may include a number of other devices in addition to the storage devices, such as
communication devices 260 to enable input and output between the computing devices of the same datacenter or different datacenters, between computing devices of the datacenters and controllers (not shown) of the network system, and between the computing devices of the datacenters and client computing devices (not shown) connected to the network system, such as cabling, routers, etc. - Memory 230 of each of the computing devices can store information accessible by the one or
more processors 220, including data 240 that is received at or generated by the one or more computing devices 210, and instructions 250 that can be executed by the one or more processors 220. - The
data 240 may include stored data 242 such as uploaded data objects, a metadata log 244 tracking metadata of the uploaded data objects 242, as well as one or more migration files 246 and cached data files 248 at which uploaded data objects 242 may be stored before being transferred from one datacenter to another. Details of the above examples of stored data are discussed in greater detail below. - The
instructions 250 may include a location access prediction program 252 configured to predict the location or locations at which a given data object file is likely to be accessed. Such locations may be one or more regions of the distributed network. The instructions 250 may further include a data migration program 254 and a data caching program 256 configured to execute the transfer of data object files stored in the one or more migration files 246 and cached data files 248, respectively. Details of the above examples of stored programs are also discussed in greater detail below. - In some examples, the controller may communicate with the computing devices in the datacenters, and may facilitate the execution of programs. For example, the controller may track the capacity, status, workload, or other information of each computing device, and use such information to assign tasks. The controller may include a processor and memory, including data and instructions. In other examples, such operations may be performed by one or more of the computing devices in one of the datacenters, and an independent controller may be omitted from the system.
- The uploaded
data 101 uploaded to the datacenters may include metadata indicating various properties of the uploaded data. The metadata may be logged at the datacenter to which the data is uploaded (in the example of FIG. 1 , datacenter 140 b), stored, or both. In the example of FIG. 1 , a metadata log 150 is provided in datacenter 140 b to store the metadata 155 of the uploaded data. - The
metadata 155 may include an identification of a region, datacenter, or both, to which the data is uploaded. As discussed in greater detail below, because uploaded data is strategically migrated between geographical regions, in some cases the location at which the data is uploaded may differ from the location at which the data is stored. Additionally, in some cases, data may be stored in multiple locations, including or excluding the location to which it is uploaded. In such cases, the metadata may further include an identification of the region, datacenter, or both, at which the data is stored. - Other metadata may include, but is not limited to, a customer identification of the uploading party, an object name, an object type, an object size, a name of a directory or folder to which the data is uploaded (such as a bucket name used to store the object), an object name prefix (such as a file path of the uploaded object if more than one level of directory hierarchy is used to store the object), a time of upload, a time of first download, subsequent downloads and their times, a number of access requests, and so on. Essentially, the metadata may include both properties of the object, as well as a running access log for the object.
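The kinds of metadata described above, object properties plus a running access log, could be represented roughly as follows. The field names are illustrative assumptions, since the text lists the properties without naming a schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObjectMetadata:
    """Hypothetical record of the metadata tracked per uploaded object."""
    customer_id: str                 # customer identification of the uploader
    object_name: str
    object_type: str
    object_size: int
    bucket_name: str                 # directory or folder holding the object
    name_prefix: str                 # e.g. a file path in a deeper hierarchy
    upload_region: str               # region to which the data was uploaded
    upload_time: float
    first_download_time: Optional[float] = None
    access_times: List[float] = field(default_factory=list)  # running access log

    def record_access(self, t: float) -> None:
        # Track the time of first download separately, then append to the log.
        if self.first_download_time is None:
            self.first_download_time = t
        self.access_times.append(t)
```

A record of this shape captures both the static properties logged at upload time and the per-location access history that later feeds the aggregator.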
- Additionally, in those cases where the data is stored in multiple locations, it should be recognized that each location may be associated with different metadata for the same data object. For example, the object stored in
Region 1 may be downloaded sooner than the same object that is stored in Region 2. For further example, the number of access requests for the object may vary from one location to the next. As such, metadata from the uploaded object may be separately logged at each location where the object is ultimately stored. In the example of FIG. 1 , each of the datacenters may include its own metadata log (such as the metadata log 150 of datacenter 140 b), which may store a log of data objects or files that have been uploaded to that datacenter, including the metadata of those objects. - Metadata for each uploaded object may be tracked across the multiple regions using a metadata aggregator 160. The aggregator 160 may be capable of collecting metadata from the metadata logs of each datacenter on a regular basis, such as according to an aggregation schedule. The aggregated metadata may be timestamped to enable changes in the metadata to be tracked over time. For instance, aggregated logs collected by the metadata log aggregator may be categorized according to a duration of time represented by each aggregated log, such as metadata from the previous week, metadata from the previous month, metadata from the previous three months, metadata from a time period longer than the previous three months, and so on. Differences in metadata across the categorized logs may indicate changes in the uploaded data over time, such as an increasing or a decreasing interest in accessing the uploaded data. Additionally, metadata for a given uploaded data object may be tracked as a whole, or per storage location. Thus, the aggregated logs may indicate overall changes in metadata for a given uploaded data object, as well as region-specific or datacenter-specific changes for the uploaded data object. - The aggregated data may then be fed to a
predictive model 170 in order to train the model 170 to predict where future uploaded data is most likely to be accessed. The predictive model 170 may be a machine learning algorithm stored in the system 100, such as at one of the datacenters of the system. The predictive model may be a decision tree model, whereby the aggregated data may include information about how often and from where previously uploaded data was accessed, and thus may associate a cost with the placement of the previously uploaded data in the system. Based on this information, the predictive model 170 may determine strategic placements for future uploaded data based on the access patterns of the past uploaded data. - In other examples, other types of machine learning algorithms may be applied in order to build the predictive model. Also, more heuristic methods may be possible, whereby certain information may dictate placement of future uploaded data. For example, if several previously uploaded data objects uploaded by a given user have a threshold number of downloads in certain regions, this may warrant future data objects uploaded by that user to be sent to and stored at the regions having the threshold number of downloads.
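The heuristic alternative described here, routing a customer's future uploads to regions where that customer's past uploads crossed a download threshold, might look like the sketch below. This is a stand-in for the decision tree model, not the patented implementation, and all names are assumed:

```python
from collections import defaultdict

def train_region_predictor(history, threshold):
    """Aggregate per-customer download counts by region from past uploads.

    history: iterable of (customer_id, region, download_count) rows.
    Returns a mapping from customer to the regions whose accumulated
    downloads meet the threshold.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for customer, region, downloads in history:
        totals[customer][region] += downloads
    return {customer: {r for r, n in regions.items() if n >= threshold}
            for customer, regions in totals.items()}

def predict_regions(model, customer, upload_region):
    """Predict access regions for a new upload; unknown customers (or those
    below the threshold everywhere) default to the region of upload."""
    return model.get(customer, set()) or {upload_region}
```

In a fuller system the same aggregated history would instead be encoded as features for the decision tree, with placement cost as the training signal.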
- The
predictive model 170 may be dynamic. For example, the aggregation of metadata and access logs by the aggregator 160 may occur on a consistent, and possibly scheduled, basis, such as once every week, once every two weeks, once every month, or less or more frequently. The frequency at which the model is updated may further depend on the nature of data stored and shared on the particular network to which this system 100 is being applied. That is, for some products having relatively slow change, once a month may be sufficient. However, for other platforms or products where user tendencies are constantly changing and access patterns are constantly shifting, a once-a-month update may be insufficient, and more regular updates of the model, such as once a week, may be preferable. - Once the
predictive model 170 has been trained on previously uploaded and stored data, it may be used to predict where current or future uploaded data will most likely be accessed. In the example of FIG. 1 , when a given unit of uploaded data 101 is received at the datacenter 140 b, the data is initially stored in the data server 145 as it is processed by the access location predictor 180. The access location predictor 180 may determine whether the data is likely to be accessed in the same region as that to which it was uploaded, or whether the data is likely to be accessed in another region of the system. The access location predictor 180 may further predict an amount of time between the uploaded data 101 being uploaded and it being downloaded. Based on these determinations, the uploaded data 101 may remain stored in the data server 145, may be stored at one or more migration files 190, may be stored in the caching server 195, or any combination thereof. Storage determination operations are discussed in greater detail below in connection with FIG. 5 . - Storage in the
data server 145 is generally permanent, meaning that the data is intended to be stored there and not moved, and thus may remain stored indefinitely or until manually deleted. By contrast, storage in the migration files 190 and caching server 195 is generally temporary, meaning that the data is intended to be transferred to another location and may be deleted automatically at a time after the intended transfer. Data transfer operations are discussed in greater detail below in connection with FIG. 6 . - Although the migration files 190 are shown separately from the data server in
FIG. 1 , it should be recognized that the migration files may actually be stored at one or more servers of the datacenter 140 b, and thus may be stored at the data server. Additionally, while the contents of the migration files may be regularly deleted, such as after the migration, the file itself may be permanent. Furthermore, the file may include permanent information, such as header information, indicating a destination to which the contents of the file are to be sent. The migration files may be files of a distributed file system of the network. It should also be recognized that data stored in the data servers may also be stored in large files of the distributed file system. Effectively, the large files of the data server may be the files having information indicating that the contents of the file are at their intended destination. For example, a file written to the data server 145 of datacenter 140 b may have a header indicating a destination of Region 4, whereby it may be determined that the contents of the file do not need to be sent to a different region. - The
caching servers 195 may be used by the datacenter both for predictive injection caching as described herein, and for on-demand caching of actual data requests (as compared to the speculated requests that trigger injection, as described herein). - For purposes of illustration, the
metadata aggregator 160 and the predictive model 170 of FIG. 1 are shown separately from each of the geographical regions, whereas the access location predictor 180 is shown as being included in datacenter 140 b. However, it should be recognized that the data stored at, and the instructions executed by, the aggregator 160, the predictive model 170, and the access location predictor 180 may be anywhere in the system, such as in the regions and datacenters shown, or in other regions or datacenters not shown, or any combination thereof, including distributed across multiple datacenters of a geographical region, or distributed across multiple geographical regions of the system. -
FIG. 3 is a flow diagram illustrating an example routine 300 for storing data in a distributed network. The network may include multiple datacenters, such as the datacenters of FIG. 1 , distributed over various geographic regions, such as the regions of FIG. 1 . Some of the operations in the method may be carried out by processors of the datacenters from and to which the data is being transferred, whereas some operations may be carried out by processors of other datacenters, or processors and servers independent of the datacenters or geographical regions. - At
block 310, data may be uploaded to a datacenter belonging to a first region of the network. The uploaded data may include a data object as well as metadata of the uploaded data object, such as a time of upload, a type of data object, a location from which the object is uploaded, and so on. - At
block 320, access information about previously uploaded data that was previously stored in the network may be received and analyzed. The previously uploaded data may have also included metadata at the time of its upload, and may further have additional metadata that was gathered after the upload, such as metadata indicating a time that the previously uploaded data was accessed, locations from which the previously uploaded data was accessed, and so on. - At
block 330, a prediction as to the geographical regions of the network from which the uploaded data is likely to be accessed may be made. This prediction or determination may be based on the metadata of the currently uploaded data, as well as the access information about the previously uploaded data. Additionally, the prediction may precede the currently uploaded data being accessed, whereby the metadata of the currently uploaded data at a time of upload may be sufficient for the prediction. Patterns recognized in the information of the previously uploaded data may indicate a likely outcome for access of the currently uploaded data, and thus may be used to predict an ideal location to store the currently uploaded data. In many cases, it may be preferable to store the uploaded data at the datacenter to which it is originally uploaded. However, in other cases, it may be preferable to additionally, or alternatively, store the uploaded data in a different datacenter, or even in a different geographical region. - At
block 340, the uploaded data is directed to be transferred from the originating datacenter to other datacenters at which the data is likely to be accessed. The other datacenters may be located in geographical regions other than the first geographical region, thus making access to the uploaded data at those other regions more efficient. Efficiency may be a measure of accessing data faster, costing less overall bandwidth, being performed over a connection having more available bandwidth, or any combination of these and other factors. -
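The flow of blocks 310 through 340 can be summarized in a short sketch; the function and callback names are illustrative assumptions rather than the claimed method:

```python
def handle_upload(data_object, metadata, predictor, upload_region, transfer):
    """Sketch of routine 300: data arrives at a first-region datacenter
    (block 310), access regions are predicted from metadata and historical
    access information (blocks 320-330), and the data is directed to the
    other datacenters where access is likely (block 340)."""
    predicted = predictor(metadata)              # block 330
    destinations = predicted - {upload_region}   # origin already stores it
    for region in sorted(destinations):
        transfer(data_object, region)            # block 340
    return destinations
```

Here `predictor` would wrap the trained model's output and `transfer` would enqueue the object for migration or injection, as elaborated below.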
FIG. 4 is a flow diagram illustrating an example subroutine 400 of the routine 300 of FIG. 3 . The subroutine 400 shows example operations that may be performed to carry out the determination of one or more destination regions for an uploaded data object, as well as distribution of the data object to one or more datacenters of the determined destination regions. - At
block 410, one or more geographical regions from which the currently uploaded data will be accessed are predicted. This prediction may be made by an access prediction program, such as the access location predictor 180 shown in FIG. 1 , and may be based on an output from a prediction model that has been trained on access logs of previously uploaded data, such as the predictive model 170 shown in FIG. 1 . The prediction model may recognize access patterns of the previously uploaded data, and may predict based on those patterns where the currently uploaded data is likely to be accessed. - At
block 420, the currently uploaded data may be sent, such as copied or moved, to one or more files designated for migration of uploaded data to a destination region other than the first region, such as the migration files 190 of FIG. 1 . The files to which the currently uploaded data is sent may be based on the prediction of block 410, whereby the data will be transferred to those regions at which it is expected to be accessed. - At
block 430, an amount of time until the currently uploaded data will be accessed at the predicted geographical regions is predicted. This prediction may also be based on an output from the prediction model. In this case, the access logs of previously uploaded data fed to the prediction model as training data should include information from which a duration between upload and a first access of the previously uploaded data can be determined or derived, such as an upload time, and a log of a first time or every time at which the data is accessed. The prediction model may recognize access patterns of the previously uploaded data, and may predict based on those patterns when the currently uploaded data is likely to be accessed. - At
block 440, the prediction of when the currently uploaded data is likely to be accessed may be compared to a threshold value, such as an amount of time from a current time. If the predicted amount of time until the currently uploaded data is accessed exceeds or is equal to the threshold amount, meaning that the currently uploaded data is not expected to be accessed on a relatively immediate basis, then operations may conclude, and the data may be migrated at a relatively slow pace using the one or more migration files. Conversely, if the predicted amount of time until the currently uploaded data is accessed is less than the threshold amount, meaning that the currently uploaded data is expected to be accessed on a relatively immediate basis, then operations may continue at block 450, whereby the currently uploaded data is sent (copied or moved) to a caching server, such as the caching server 195 of FIG. 1 , to be injected into caching servers of a remote datacenter, either in the same geographical region or in one or more different geographical regions. As with the migration files, the determination of which datacenters, regions, or both to which the data is injected may be based on the determinations of the access location predictor at block 410. - In the example of
FIG. 4 , the decision to send uploaded data to a migration file is shown as having been made prior to determining an urgency or priority of transferring the data. Thus, the determination to migrate the data is made regardless of whether the data is needed sooner or later, that is, regardless of whether the data is also injected to the destinations. However, in other examples, there may be data that is likely to be accessed only on an immediate basis and not accessed at later times, such as a live streamed video with a relatively short shelf life. In such cases, the system may determine to inject the uploaded data to caching servers of remote datacenters, but not to add the uploaded data to migration files, in order to avoid the data needlessly being stored permanently at the remote datacenters. -
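Blocks 430 and 440, predicting an amount of time until first access and comparing it to a threshold, might be approximated as below. The median-of-history predictor is an assumption standing in for the trained prediction model:

```python
def predicted_delay(history_delays):
    """Stand-in for block 430: estimate time-to-first-access as the median
    of observed upload-to-first-download delays of similar past objects.
    Objects never accessed (None) carry no urgency signal."""
    delays = sorted(d for d in history_delays if d is not None)
    if not delays:
        return float("inf")
    return delays[len(delays) // 2]

def needs_cache_injection(history_delays, threshold):
    """Block 440: inject into remote caching servers only when the data is
    expected to be accessed before the threshold amount of time elapses."""
    return predicted_delay(history_delays) < threshold
```

How "similar past objects" are selected (by customer, object type, region, and so on) would be determined by the trained model; the sketch assumes that selection has already happened.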
FIG. 5 is an operational block diagram 500 showing an example operation of an access location predictor 520 predicting geographical regions of the network from which uploaded data is likely to be accessed, and directing the uploaded data to be transferred to the predicted geographical regions, such as is shown in blocks 330 and 340 of routine 300 of FIG. 3 . The access location predictor 520 selectively moves or copies data objects 501 1-501 N uploaded to a datacenter of Region 4 to one or more migration files, to caching servers 550, or both. Also as noted above, the access location predictor 520 may determine a placement of each uploaded data object 501 1-501 N based on metadata from the object and an output of the predictive model used to predict the location or locations at which the uploaded data is likely to be accessed. - In the example of
FIG. 5 , each of the uploaded data objects 501 1-501 N has different metadata. Thus, the access location predictor 520 determines a different placement strategy for each of the uploaded data objects 501 1-501 N. - In the example of Uploaded object 1 (501 1), the access location predictor 520 determines that this object is likely to be accessed at Region 2. Therefore, object 1 is moved from the data server to migration file 534, which may be a file dedicated for objects that are to be migrated from the datacenter at which object 1 is uploaded to Region 2. - As shown in
FIG. 5 , each migration file may include common metadata 542, such as a header to the file, which may indicate a destination of the file. Presenting common metadata avoids the need for this metadata to be separately written to the migration file for each appended object 544, which may save space in the migration file and may further reduce processing requirements for moving the objects from the data server to the migration file. In the case of a destination region, there is also no need for this metadata to be rewritten to the object after the migration, since the destination region will remain common metadata of the object and all the other objects stored at the region. Aside from the common metadata, such as the destination region, the remaining metadata of each object may be moved or copied with the object so as to preserve the object metadata during the migration. - In some examples, the migration file may have a predetermined capacity, whereby when moving or copying an object to the migration file causes the migration file to meet or exceed the predetermined capacity, the migration file may be transferred to one or more datacenters of the destination region. Additionally, or alternatively, the migration file may be transferred to one or more datacenters of the destination region after a predetermined amount of time has elapsed since creation of the migration file. Operations for transferring the objects are described in greater detail below in connection with FIG. 4 . - If it is determined that the object is likely to be accessed from only one region, then the object may be moved to a single destination, such as in the example of Object 1 (501 1). However, in other cases, it may be determined that an object is likely to be accessed from more than one region, including or excluding the region at which the object is uploaded. In such a case, the
access location predictor 520 may determine to move or copy the object to more than one migration file. - In the example of Uploaded object 2 (501 2), the access location predictor 520 determines that this object is likely to be accessed at more than one region. Therefore, object 2 is copied from the data server to each of the migration files designated for the predicted regions, in addition to remaining stored at the originating Region 4. - As noted above, migration of objects from one region to another may begin after the file has been filled with several data objects. Since the migration file may be large, it may take time before the migration file is filled. However, in some cases, a data object may be in high demand at a remote region of the network, but only for a time before the data migration occurs. In this case, the slow-paced strategic relocation of data objects using migration files would undermine the ability of users in the other regions to efficiently access the data object while it is in high demand. For example, if a user is streaming live video in Europe and several users in the United States wish to access the streamed video immediately, waiting for the video file to migrate from a datacenter in Europe to a datacenter in the United States would not be helpful, and there may not even be a demand for the streamed video anymore after it is stored in the datacenter in the United States.
-
- In order to address this challenge, the object file may include metadata from which it may be determined how soon after the data is uploaded it will be accessed. For example, in the system of FIG. 1 , the metadata aggregator 160 can collect metadata showing both a time of upload and a time of earliest download for each stored object. This data may then be used to train the predictive model 170 to predict whether a future uploaded data object will be accessed soon after or long after the object is uploaded to the system. In turn, this information may be used by the access location predictor 180 to determine, for any given uploaded data object, an amount of time until the object is likely to be downloaded. The system may further store a threshold time value, whereby if the predicted amount of time until an object is likely to be downloaded is equal to or less than the threshold amount, then the object may bypass the usual data migration scheme via the migration files, and be moved or copied to a caching server for a relatively faster transfer of the uploaded data object to other regions of the system. - In the example of Uploaded object 3 (501 3), the access location predictor 520 determines that this object is likely to be in high demand within a threshold amount of time. Therefore, object 3 is copied from the data server to a caching server 550, so that the file may be injected to the caching servers of remote datacenters in Region 4 as well as in other regions, including but not limited to Region 1, Region 2 and Region 3, based on the determinations of the access location predictor 520. Object 3 may remain stored at the data server of the datacenter. - In many cases, it may be determined that an uploaded object is likely to be accessed from only the region at which it was uploaded. This may be the case for personally stored files that are not shared among users, or for files that are shared among a group of users in close geographic proximity to one another. In such cases, the
access location predictor 520 may determine that the uploaded object should not be copied to either a migration file or the caching server 550. Instead, the object remains stored at a data server of the datacenter where the object was uploaded. In the example of Uploaded object 4 (501 4), the access location predictor 520 determines that this object is likely to be accessed at only the originating Region 4. Therefore, object 4 is not copied from the data server. - It should be noted that keeping a file at a data server, copying the file to a migration file, or copying the file to a caching server, may be treated as independent operations. Thus, the decision to perform one operation does not preclude any other operation from being performed. For example, broadcast data, such as a streamed video file, may be in high demand across multiple regions both immediately as well as at a later time. In such a case, it may be determined that the file should remain in the originating data server, as well as be copied to both a migration file and a caching server to address both long term and short term demand. The determination may be based at least in part on the uploaded data being broadcast data, as well as on other access patterns detected by the predictive model.
-
- In the example of Uploaded object N (501 N), the access location predictor 520 determines that this object is likely to be in high demand within a threshold amount of time, as well as at a later time, at Region 3. Therefore, object N is copied from the data server to both migration file 536 and caching server 550, so that the object may be injected to the caching servers of Region 3 to address immediate demand, as well as migrated to permanent storage of Region 3 to address long term demand. -
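The placement decisions for the example objects above can be restated as independent operations, as the text emphasizes that no one operation precludes another; the flag names are assumptions:

```python
def placement_operations(remote_regions, urgent, long_term_demand):
    """Decide each operation independently: migrate for lasting demand at
    remote regions, inject for demand expected within the threshold time."""
    ops = set()
    if remote_regions and long_term_demand:
        ops.add("migrate")   # append to per-destination migration files
    if remote_regions and urgent:
        ops.add("inject")    # push to caching servers of remote datacenters
    return ops
```

Under this framing, object 1 maps to migration only, object 3 to injection only, object 4 to no transfer at all, and object N to both migration and injection.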
FIG. 6 is a block diagram showing an example operation for distributing uploaded data objects according to the determinations made by the access location predictor, such as access location predictor 180 of FIG. 1 or access location predictor 520 of FIG. 5 . The example of FIG. 6 shows an uploaded object that is both migrated (long term) and injected (short term) from Region 4 (640) to both Region 1 (610) and Region 2 (620) of a distributed network system 600. Each region may include one or more datacenters having respective processors and servers. In the example shown, the migration files 644 and the caching server 646 are included at the datacenter 640 of Region 4, although migration files may also be stored at the datacenters of other regions to facilitate transferring objects that are uploaded at those other regions throughout the network. - In operation, moving the contents of the migration files 644 of the
Region 4 datacenter 640 a may begin with adata migration controller 650 of thesystem 600 executing a program whereby thedatacenter 610 a is queried for files of data stored in its servers. Theprocessor 642 may receive the query, and in response may provide information indicating the destination of each file. In the case of files to be stored atRegion 4, whereby the destination isRegion 4, thedata migration controller 650 may take no further action. Conversely, in the case of files to be stored at other regions, such asRegions Region 4, thedata migration controller 650 may determine to initiate a migration of data from the files to the identified destinations. Thedata migration controller 650 may transmit an instruction to one or more processors of each identified destination, such asprocessors Region 4. After reading the migration file 644, the contents of the migration file 644 may be deleted. This process may be repeated for future data objects that are uploaded to thedatacenter 610 a in Region 4 (640). - Moving the contents of the
caching server 646 of theRegion 4 datacenter 640 a may begin with theserver 646 prefetching the data to be transferred, and injecting it to cachingservers - The above examples refer to data distribution schemes according to datacenters and regions. However, those skilled in the art will readily recognize that the same principles may be applied to other systems in which data is organized differently. The underlying principle is that some portions of a large scale network, such as a global network may be closer to any given user than other parts of the network, and to the extent that one may predict the locations from which data will be accessed, it may be advantageous to move the data after it has been uploaded to closer to those locations where it will be accessed. To this extent, if it can be predicted that a particular one or group of users are likely to access the data in the future, that the data can be moved to datacenters, servers, or other units storage that are closer to the predicted accessing users.
- Additionally, the above examples generally describe locations as being closer to a user when that location is geographically closer. However, those skilled in the art will recognize that “closeness” of data is not necessarily a measure of geographic distance, but rather a measure of cost to access data. Thus, when data is positioned “closer” to a user or group of user, or “closer” to a location from which the data is predicted to be accessed, the location may be chosen so as to reduce overall costs for accessing the data, such as bandwidth, time, fees for bandwidth use being contracting parties, any combination of these or other factors.
- Although the technology herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present technology. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present technology as defined by the appended claims.
Claims (20)
1. A method for storing a plurality of data items in a distributed network having a plurality of datacenters distributed over a plurality of geographic regions, the method comprising:
receiving, by one or more processors, a plurality of first data items uploaded to the distributed network from a plurality of first users, each first data item including metadata, the metadata including an upload geographic region at which the first data item is uploaded and one or more accessed geographic regions at which the first data item is accessed;
training, by the one or more processors, a predictive model using the metadata of the plurality of first data items;
after training the predictive model using the metadata of the plurality of first data items, receiving, by the one or more processors, a second data item uploaded to the distributed network by a second user;
determining, by the one or more processors, one or more storage geographic regions at which the second data item is to be stored based at least in part on the predictive model, wherein at least one of the one or more storage geographic locations at which the second data item is to be stored is different from the upload geographic region at which the second data item was uploaded; and
instructing, by the one or more processors, the second data item to be transferred from the upload geographic region of the second data item to one or more datacenters of the one or more storage geographic regions at which the second data item is to be stored.
2. The method of claim 1 , further comprising predicting, by the one or more processors, one or more access geographic regions at which the second data item is predicted to be accessed based on the predictive model, wherein the one or more storage geographic regions at which the second data item is to be stored are determined based on the predicted one or more access geographic regions.
3. The method of claim 1 , wherein the predictive model is a decision tree model.
4. The method of claim 1 , wherein the metadata further includes, and the predictive model is trained with, at least one of:
an identification of a datacenter to which the first data item is uploaded;
an identification of an uploading user;
a time of upload;
a size of the first data item; or
a name of the first data item.
5. The method of claim 1 , wherein the metadata further includes, and the predictive model is trained with, at least one of:
one or more second storage geographic regions at which the first data item is stored; or
one or more times at which the first data item is accessed; or
a number of access requests for the first data item.
6. The method of claim 1 , wherein the plurality of first data items includes at least one data file, wherein the metadata of the data file includes file characteristic data, wherein the file characteristic data includes at least one of: a name of the file; a size of the file; or an identification of a directory or a file path at which the file is stored, and wherein the predictive model is trained at least in part using the file characteristic data.
7. The method of claim 1 , further comprising:
predicting, by the one or more processors, an amount of time until the second data item will be accessed for a first time;
for at least one of the determined storage geographic regions of the second data item, selecting, by the one or more processors, one of a first transfer protocol or a second transfer protocol for transferring the second data item to the at least one storage geographic region, based on the predicted amount of time, wherein an average time for the second data item to arrive at the at least one storage geographic region using the first transfer protocol is less than an average time for the second data item to arrive at the at least one storage geographic region using the second transfer protocol; and
transferring, by the one or more processors, the second data item from the upload geographic region of the second data item to the at least one storage geographic region of the second data item according to the selected first or second transfer protocol.
8. The method of claim 7 , wherein the first transfer protocol comprises cache injection of the second data item to one or more caching servers located at the at least one storage geographic region.
9. The method of claim 8 , wherein the second transfer protocol comprises:
instructing, by the one or more processors, the second data item to be included in a file including other uploaded data items having a common storage geographic region as the second data item; and
instructing, by the one or more processors, the file to be transferred to one or more datacenters located at the common storage geographic region.
10. The method of claim 9 , wherein the first transfer protocol comprises transferring the second data item according to the second transfer protocol in addition to the cache injection.
11. A system for storing a plurality of data items in a distributed network having a plurality of datacenters distributed over a plurality of geographic regions, the system comprising:
one or more storage devices configured to store a plurality of first data items uploaded to the distributed network from a plurality of first users, each first data item including metadata, the metadata including an upload geographic region at which the first data item is uploaded and one or more accessed geographic regions at which the first data item is accessed; and
one or more processors in communication with the one or more storage devices, the one or more processors configured to:
train a predictive model using the metadata of the plurality of first data items;
after training the predictive model using the metadata of the plurality of first data items, for a second data item uploaded to the distributed network by a second user:
determine one or more storage geographic regions at which the second data item is to be stored based at least in part on the predictive model, wherein at least one of the one or more storage geographic locations at which the second data item is to be stored is different from the upload geographic region at which the second data item was uploaded; and
instruct the second data item to be transferred from the upload geographic region of the second data item to one or more datacenters of the one or more storage geographic regions at which the second data item is to be stored.
12. The system of claim 11 , wherein the one or more processors are configured to predict one or more access geographic regions at which the second data item is predicted to be accessed based on the predictive model, wherein the one or more storage geographic regions at which the second data item is to be stored are determined based on the predicted one or more access geographic regions.
13. The system of claim 11 , wherein the predictive model is a decision tree model.
14. The system of claim 11 , wherein the metadata further includes, and the predictive model is trained on, at least one of:
an identification of a datacenter to which the first data item is uploaded;
an identification of an uploading user;
a time of upload;
a size of the first data item; or
a name of the first data item.
15. The system of claim 1 , wherein the metadata further includes, and the predictive model is trained on, at least one of:
one or more second storage geographic regions at which the first data item is stored; or
one or more times at which the first data item is accessed; or
a number of access requests for the first data item.
16. The system of claim 14 , wherein the plurality of first data items includes at least one data file, wherein the metadata of the data file includes file characteristic data, wherein the file characteristic data includes at least one of: a name of the file; a size of the file; or an identification of a directory or a file path at which the file is stored, and wherein the one or more processors are configured to train the predictive model based at least in part on the file characteristic data.
17. The system of claim 11 , wherein the one or more processors are configured to:
predict an amount of time until the second data item will be accessed for a first time; and
for at least one of the determined storage geographic regions of the second data item, select one of a first transfer protocol or a second transfer protocol for transferring the second data item to the at least one storage geographic region, based on the predicted amount of time, wherein an average time for the second data item to arrive at the at least one storage geographic region using the first transfer protocol is less than an average time for the second data item to arrive at the at least one storage geographic region using the second transfer protocol; and
transfer the second data item from the upload geographic region of the second data item to the at least one storage geographic region of the second data item according to the selected first or second transfer protocol.
18. The system of claim 17 , wherein the first transfer protocol comprises cache injection of the second data item to one or more caching servers located at the at least one storage geographic region.
19. The system of claim 11 , wherein the second transfer protocol comprises:
instruction of the second data item to be included in a file including other uploaded data items having a common storage geographic region as the second data item; and
instruction of the file to be transferred to one or more datacenters located at the common storage geographic region.
20. The system of claim 19 , wherein the first transfer protocol comprises performance of the second transfer protocol in addition to the cache injection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/156,541 US20230164219A1 (en) | 2019-11-04 | 2023-01-19 | Access Pattern Driven Data Placement in Cloud Storage |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/673,128 US11588891B2 (en) | 2019-11-04 | 2019-11-04 | Access pattern driven data placement in cloud storage |
US18/156,541 US20230164219A1 (en) | 2019-11-04 | 2023-01-19 | Access Pattern Driven Data Placement in Cloud Storage |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/673,128 Continuation US11588891B2 (en) | 2019-11-04 | 2019-11-04 | Access pattern driven data placement in cloud storage |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230164219A1 true US20230164219A1 (en) | 2023-05-25 |
Family
ID=73598977
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/673,128 Active US11588891B2 (en) | 2019-11-04 | 2019-11-04 | Access pattern driven data placement in cloud storage |
US18/156,541 Pending US20230164219A1 (en) | 2019-11-04 | 2023-01-19 | Access Pattern Driven Data Placement in Cloud Storage |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/673,128 Active US11588891B2 (en) | 2019-11-04 | 2019-11-04 | Access pattern driven data placement in cloud storage |
Country Status (6)
Country | Link |
---|---|
US (2) | US11588891B2 (en) |
EP (1) | EP4026303A1 (en) |
JP (1) | JP7454661B2 (en) |
KR (2) | KR20240055889A (en) |
CN (1) | CN114651433A (en) |
WO (1) | WO2021091851A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220197506A1 (en) * | 2020-12-17 | 2022-06-23 | Advanced Micro Devices, Inc. | Data placement with packet metadata |
US12131065B2 (en) * | 2021-08-19 | 2024-10-29 | Micron Technology, Inc. | Memory device overhead reduction using artificial intelligence |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
SE510048C3 (en) * | 1997-07-24 | 1999-05-03 | Mirror Image Internet Ab | Internet caching system |
US20050259682A1 (en) * | 2000-02-03 | 2005-11-24 | Yuval Yosef | Broadcast system |
US7991957B2 (en) * | 2008-05-27 | 2011-08-02 | Microsoft Corporation | Abuse detection using distributed cache |
US20130091207A1 (en) * | 2011-10-08 | 2013-04-11 | Broadcom Corporation | Advanced content hosting |
US9804928B2 (en) * | 2011-11-14 | 2017-10-31 | Panzura, Inc. | Restoring an archived file in a distributed filesystem |
US9298719B2 (en) * | 2012-09-04 | 2016-03-29 | International Business Machines Corporation | On-demand caching in a WAN separated distributed file system or clustered file system cache |
US9560127B2 (en) | 2013-01-18 | 2017-01-31 | International Business Machines Corporation | Systems, methods and algorithms for logical movement of data objects |
CN103795781B (en) * | 2013-12-10 | 2017-03-08 | 西安邮电大学 | A kind of distributed caching method based on file prediction |
TWI533678B (en) * | 2014-01-07 | 2016-05-11 | 緯創資通股份有限公司 | Methods for synchronization of live streaming broadcast and systems using the same |
US9607004B2 (en) * | 2014-06-18 | 2017-03-28 | International Business Machines Corporation | Storage device data migration |
CN107810490A (en) * | 2015-06-18 | 2018-03-16 | 华为技术有限公司 | System and method for the buffer consistency based on catalogue |
US10742767B2 (en) * | 2016-02-02 | 2020-08-11 | Sony Interactive Entertainment LLC | Systems and methods for downloading and updating save data to a data center |
US9912687B1 (en) * | 2016-08-17 | 2018-03-06 | Wombat Security Technologies, Inc. | Advanced processing of electronic messages with attachments in a cybersecurity system |
CN106713265B (en) * | 2016-11-21 | 2019-05-28 | 清华大学深圳研究生院 | CDN node distribution method and device, CDN node distribution server and CDN network system |
US10645534B1 (en) * | 2019-02-01 | 2020-05-05 | Tile, Inc. | User presence-enabled tracking device functionality |
US11895223B2 (en) * | 2019-02-06 | 2024-02-06 | International Business Machines Corporation | Cross-chain validation |
-
2019
- 2019-11-04 US US16/673,128 patent/US11588891B2/en active Active
-
2020
- 2020-11-03 CN CN202080071465.XA patent/CN114651433A/en active Pending
- 2020-11-03 KR KR1020247012756A patent/KR20240055889A/en active Search and Examination
- 2020-11-03 KR KR1020227012107A patent/KR102659627B1/en active IP Right Grant
- 2020-11-03 WO PCT/US2020/058641 patent/WO2021091851A1/en unknown
- 2020-11-03 JP JP2022522700A patent/JP7454661B2/en active Active
- 2020-11-03 EP EP20816018.4A patent/EP4026303A1/en active Pending
-
2023
- 2023-01-19 US US18/156,541 patent/US20230164219A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN114651433A (en) | 2022-06-21 |
US11588891B2 (en) | 2023-02-21 |
JP2023501084A (en) | 2023-01-18 |
US20210136150A1 (en) | 2021-05-06 |
KR20240055889A (en) | 2024-04-29 |
KR102659627B1 (en) | 2024-04-22 |
WO2021091851A1 (en) | 2021-05-14 |
JP7454661B2 (en) | 2024-03-22 |
EP4026303A1 (en) | 2022-07-13 |
KR20220064391A (en) | 2022-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230164219A1 (en) | Access Pattern Driven Data Placement in Cloud Storage | |
US10298670B2 (en) | Real time cloud workload streaming | |
US8612668B2 (en) | Storage optimization system based on object size | |
US20200351344A1 (en) | Data tiering for edge computers, hubs and central systems | |
US20150331635A1 (en) | Real Time Cloud Bursting | |
EP2939145B1 (en) | System and method for selectively routing cached objects | |
US11770451B2 (en) | System and method for automatic block storage volume tier tuning | |
CA2588704A1 (en) | System and method for managing quality of service for a storage system | |
CN108139974B (en) | Distributed cache live migration | |
US10810054B1 (en) | Capacity balancing for data storage system | |
US11662910B2 (en) | Workload and interface cognizant heat-tiered storage | |
US10360189B2 (en) | Data object storage across multiple storage nodes | |
CN110324406B (en) | Method for acquiring business data and cloud service system | |
US11489911B2 (en) | Transmitting data including pieces of data | |
KR101329759B1 (en) | Network block device providing personalized virtual machine in cloud computing environment and control method thereof | |
KR20220078244A (en) | Method and edge server for managing cache file for content fragments caching | |
Katsipoulakis et al. | Adaptive live VM migration in share-nothing IaaS-clouds with LiveFS | |
US10168763B2 (en) | Modification of when workloads access data units and/or on which storage devices data units are stored to conserve power | |
Lee et al. | Dtstorage: Dynamic tape-based storage for cost-effective and highly-available streaming service | |
US10958760B2 (en) | Data processing system using pre-emptive downloading | |
JP2017228211A (en) | Access program, data access method, and information processing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: GOOGLE LLC, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ZHANG, WANGYUAN;ZHANG, VIVIENNE;GAUD, PRAMOD;AND OTHERS;SIGNING DATES FROM 20191106 TO 20191110;REEL/FRAME:062530/0044 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |