Cloud Application
Architecture
Guide
PUBLISHED BY
Microsoft Press
A division of Microsoft Corporation
One Microsoft Way
Redmond, Washington 98052-6399
All rights reserved. No part of the content of this book may be reproduced or transmitted in any form
or by any means without the written permission of the publisher.
Microsoft Press books are available from booksellers and distributors worldwide. If you need
support related to this book, email Microsoft Press Support at mspinput@microsoft.com. Please tell
us what you think of this book at http://aka.ms/tellpress.
This book is provided “as-is” and expresses the author’s views and opinions. The views, opinions and
information expressed in this book, including URL and other Internet website references, may change
without notice.
Some examples depicted herein are provided for illustration only and are fictitious. No real
association or connection is intended or should be inferred.
Microsoft and the trademarks listed at http://www.microsoft.com on the “Trademarks” webpage are
trademarks of the Microsoft group of companies. All other marks are property of their respective
owners.
Acquisitions Editor:
Christopher Bennage
Developmental Editors:
Mike Wasson, Masashi Narumoto and the Microsoft Patterns and Practices team
Editorial Production:
Phil Evans
Copyeditor:
Jamie Letain
Contents
Overview …...……...…….....................................................................….……………..…………………… vii
Introduction .................................................................................................................................................................... viii
Chapter 1: Choose an architecture style …...……...……….……………..…………………………………… 1
A quick tour of the styles .............................................................................................................................................. 2
Architecture styles as constraints .............................................................................................................................. 4
Consider challenges and benefits ............................................................................................................................. 5
Chapter 1a: N-tier architecture style …...…………………………………........……………………………… 6
When to use this architecture ..................................................................................................................................... 7
Benefits ................................................................................................................................................................................ 7
Challenges .......................................................................................................................................................................... 7
Best practices .................................................................................................................................................................... 8
N-tier architecture on virtual machines .................................................................................................................. 8
Additional considerations ............................................................................................................................................ 9
Chapter 1b: Web-Queue-Worker architecture style ……...............................................…………… 10
When to use this architecture ................................................................................................................................... 11
Benefits .............................................................................................................................................................................. 11
Challenges ........................................................................................................................................................................ 11
Best practices .................................................................................................................................................................. 11
Web-Queue-Worker on Azure App Service ........................................................................................................ 12
Additional considerations .......................................................................................................................................... 12
Chapter 1c: Microservices architecture style ……..........................................................…………… 14
When to use this architecture ................................................................................................................................... 15
Benefits .............................................................................................................................................................................. 15
Challenges ........................................................................................................................................................................ 16
Best practices .................................................................................................................................................................. 17
Microservices using Azure Container Service ..................................................................................................... 19
Chapter 1d: CQRS architecture style …...............….......................................................….........…… 20
When to use this architecture ................................................................................................................................... 21
Benefits .............................................................................................................................................................................. 21
Challenges ........................................................................................................................................................................ 22
Best practices .................................................................................................................................................................. 22
CQRS in microservices ................................................................................................................................................. 22
Chapter 1e: Event-driven architecture style …...……...……………………..…………………..……….… 24
When to use this architecture .................................................................................................................................. 25
Benefits .............................................................................................................................................................................. 25
Challenges ....................................................................................................................................................................... 25
IoT architectures ............................................................................................................................................................ 26
Chapter 1f: Big data architecture style ……....................................................................…………… 27
Benefits .............................................................................................................................................................................. 29
Challenges ........................................................................................................................................................................ 29
Best practices .................................................................................................................................................................. 30
Chapter 1g: Big compute architecture style ……...........................................................…………… 31
When to use this architecture ................................................................................................................................... 32
Benefits .............................................................................................................................................................................. 32
Challenges ........................................................................................................................................................................ 32
Big compute using Azure Batch .............................................................................................................................. 33
Big compute running on Virtual Machines ............................................................................................................. 33
Chapter 2: Choose compute and data store technologies ….........….......................................…… 35
Chapter 2a: Overview of compute options ….........…........................................….......................… 37
Chapter 2b: Compute comparison ….........….......................................…......................................… 39
Hosting model ................................................................................................................................................................ 39
DevOps .............................................................................................................................................................................. 40
Scalability .......................................................................................................................................................................... 41
Availability ........................................................................................................................................................................ 41
Security .............................................................................................................................................................................. 42
Other .................................................................................................................................................................................. 42
Chapter 2c: Data store overview …......................................................…......................................… 43
Relational database management systems.......................................................................................................... 44
Key/value stores ............................................................................................................................................................. 44
Document databases ................................................................................................................................................... 45
Graph databases ............................................................................................................................................................ 46
Column-family databases .......................................................................................................................................... 47
Data analytics .................................................................................................................................................................. 48
Search Engine Databases ............................................................................................................................................ 48
Time Series Databases ................................................................................................................................................. 48
Object storage ............................................................................................................................................................... 49
Shared files ....................................................................................................................................................................... 49
Chapter 2d: Data store comparison …...................................................….....................................… 50
Criteria for choosing a data store ............................................................................................................................ 50
General Considerations ............................................................................................................................................... 50
Relational database management systems (RDBMS) ...................................................................................... 52
Document databases .................................................................................................................................................... 53
Key/value stores ............................................................................................................................................................. 54
Graph databases ........................................................................................................................................................... 55
Column-family databases .......................................................................................................................................... 56
Search engine databases ........................................................................................................................................... 57
Data warehouse ............................................................................................................................................................. 57
Time series databases ................................................................................................................................................. 58
Object storage ................................................................................................................................................................ 58
Shared files ...................................................................................................................................................................... 59
Chapter 3: Design your Azure application: design principles ……....................................………... 60
Chapter 3a: Design for self-healing …...........................................................................………...…… 62
Recommendations ........................................................................................................................................................ 62
Chapter 3b: Give all things redundancy …..........….....................................................................…… 64
Recommendations ........................................................................................................................................................ 64
Chapter 3c: Minimise coordination …...........….......................................…....................................… 66
Recommendations ........................................................................................................................................................ 67
Chapter 3d: Design to scale out …..........…........................................…..........................................… 69
Recommendations ......................................................................................................................................................... 69
Chapter 3e: Partition around limits …...........….................................…..........................................… 71
Recommendations ......................................................................................................................................................... 72
Chapter 3f: Design for operations …...........….................................…............................................… 73
Recommendations ......................................................................................................................................................... 73
Chapter 3g: Use managed services …..........…..................................…..........................................… 75
Chapter 3h: Use the best data store for the job …...........…..........................................................… 76
Recommendations ......................................................................................................................................................... 77
Chapter 3i: Design for evolution …...........…...................................................................................… 78
Recommendations ......................................................................................................................................................... 78
Chapter 3j: Build for the needs of business …...........….................................................................… 80
Recommendations ......................................................................................................................................................... 80
Chapter 3k: Designing resilient applications for Azure …..........…..............................................… 82
What is resilience? .......................................................................................................................................................... 82
Process to achieve resilience ...................................................................................................................................... 83
Defining your resilience requirements ................................................................................................................... 83
Designing for resilience................................................................................................................................................ 87
Resilience strategies....................................................................................................................................................... 87
Resilient deployment..................................................................................................................................................... 91
Monitoring and diagnostics ....................................................................................................................................... 92
Manual failure responses............................................................................................................................................. 93
Summary ........................................................................................................................................................................... 94
Chapter 4: Design your Azure application: Use these pillars of quality …...........…...................… 95
Scalability .......................................................................................................................................................................... 96
Availability ......................................................................................................................................................................... 98
Resilience ........................................................................................................................................................................... 99
Management and DevOps ..................................................................................................................................... 100
Security ........................................................................................................................................................................... 101
Chapter 5: Design your Azure application: Design patterns …..........…....................................… 103
Challenges in cloud development ........................................................................................................................ 103
Data Management ....................................................................................................................................................... 104
Design and Implementation .................................................................................................................................... 104
Messaging ...................................................................................................................................................................... 105
Management and Monitoring ................................................................................................................................ 106
Performance and Scalability .................................................................................................................................... 107
Resilience ........................................................................................................................................................................ 108
Security ............................................................................................................................................................................ 109
Chapter 6: Catalogue of patterns .…......................................................…...................................… 110
Ambassador pattern .................................................................................................................................................. 110
Anti-Corruption Layer pattern ................................................................................................................................ 112
Backends for Frontends pattern ............................................................................................................................. 114
Bulkhead pattern ......................................................................................................................................................... 116
Cache-Aside pattern .................................................................................................................................................. 119
Circuit Breaker pattern .............................................................................................................................................. 124
CQRS pattern ................................................................................................................................................................ 132
Compensating Transaction pattern ...................................................................................................................... 139
Competing Consumers pattern ............................................................................................................................. 143
Compute Resource Consolidation pattern ........................................................................................................ 148
Event Sourcing pattern .............................................................................................................................................. 156
External Configuration Store pattern ................................................................................................................... 162
Federated Identity pattern ....................................................................................................................................... 170
Gatekeeper pattern ..................................................................................................................................................... 174
Gateway Aggregation pattern ................................................................................................................................ 176
Gateway Offloading pattern .................................................................................................................................... 180
Gateway Routing pattern ......................................................................................................................................... 182
Health Endpoint Monitoring pattern ................................................................................................................... 185
Index Table pattern ..................................................................................................................................................... 191
Leader Election pattern .............................................................................................................................................. 197
Materialised View pattern ........................................................................................................................................ 204
Pipes and Filters pattern ........................................................................................................................................... 208
Priority Queue pattern .............................................................................................................................................. 215
Queue-Based Load Levelling pattern .................................................................................................................. 221
Retry pattern ................................................................................................................................................................. 224
Scheduler Agent Supervisor pattern ................................................................................................................... 227
Sharding pattern ......................................................................................................................................................... 234
Sidecar pattern ............................................................................................................................................................. 243
Static Content Hosting pattern .............................................................................................................................. 246
Strangler pattern .......................................................................................................................................................... 250
Throttling pattern ........................................................................................................................................................ 252
Valet Key pattern ......................................................................................................................................................... 256
Chapter 7: Design review checklists .…...........................................................…..........................… 263
DevOps checklist ......................................................................................................................................................... 264
Availability checklist ................................................................................................................................................... 270
Scalability checklist ..................................................................................................................................................... 276
Resilience checklist ...................................................................................................................................................... 276
Azure services ............................................................................................................................................................... 286
Chapter 8: Summary..........................................................................................…..........................… 291
Chapter 9: Azure reference architectures .......................................................…..........................… 292
Identity management …..............................................................................................…........................................… 293
Hybrid network …...........................................................................................................…........................................… 298
Network DMZ …...........................................................................................................…...........................................… 303
Managed web application …...................................................................................…...........................................… 306
Running Linux VM workloads …..................................................................................….....................................… 310
Running Windows VM workloads …..........................................................................….....................................… 315
Cloud Application
Architecture
Guide
This guide presents a structured approach for designing cloud
applications that are scalable, resilient and highly available. The guidance
in this eBook is intended to help you make architectural decisions regardless
of your cloud platform, though we will be using Azure so we can share
the best practices that we have learned from many years of customer
engagements.
The design process includes the following steps:
1. Choosing the right architecture style for your application based on the kind of solution you are building.
2. Choosing the most appropriate compute and data store technologies.
3. Incorporating the ten high-level design principles to ensure your application is scalable, resilient and manageable.
4. Utilising the five pillars of software quality to build a successful cloud application.
5. Applying design patterns specific to the problem you are trying to solve.
Introduction
The cloud is changing the way applications are designed. Instead of monoliths, applications are
deconstructed into smaller, decentralised services. These services communicate through APIs or by
using asynchronous messaging or eventing. Applications scale horizontally, adding new instances as
demand requires.
These trends bring new challenges. Application state is distributed. Operations are done in parallel
and asynchronously. The system as a whole must be resilient when failures occur. Deployments must
be automated and predictable. Monitoring and telemetry are critical for gaining insight into the
system. The Cloud Application Architecture Guide is designed to help you navigate these changes.
Architecture Styles. The first decision point is the most fundamental. What kind of architecture are
you building? It might be a microservices architecture, a more traditional N-tier application, or a big
data solution. We have identified seven distinct architecture styles. There are benefits and challenges
to each.
Technology Choices. Two technology choices should be decided early on, because they affect the
entire architecture. These are the choice of compute and storage technologies. The term compute
refers to the hosting model for the computing resources that your applications run on. Storage
includes databases but also storage for message queues, caches, IoT data, unstructured log data
and anything else that an application might persist to storage.
• Compute options and Storage options provide detailed comparison criteria for selecting
compute and storage services.
Design Principles. Throughout the design process, keep these ten high-level design principles
in mind.
• For best practices articles that provide specific guidance on auto-scaling, caching,
data partitioning, API design and more, go to
https://docs.microsoft.com/en-us/azure/architecture/best-practices/index.
Pillars. A successful cloud application will focus on these five pillars of software quality: scalability,
availability, resilience, management and security.
• Use our Design review checklists to review your design according to these quality pillars.
Cloud Design Patterns. These design patterns are useful for building reliable, scalable and secure
applications on Azure. Each pattern describes a problem, a pattern that addresses the problem and
an example based on Azure.
Choosing an
architecture style
The first decision you need to make when designing a cloud application
is the architecture. Choose the best architecture for the application you
are building based on its complexity, the type of domain, whether it's an
IaaS or PaaS application and what the application will do. Also consider
the skills of the development and DevOps teams, and whether the application
has an existing architecture.
An architecture style places constraints on the design, which guide the “shape” of the architecture
by restricting the choices available. These constraints provide both benefits and challenges for the
design. Use the information in this section to understand what the trade-offs are when adopting
any of these styles.
This section describes ten design principles to keep in mind as you build. Following these principles
will help you build an application that is more scalable, resilient and manageable.
We have identified a set of architecture styles that are commonly found in cloud applications.
The article for each style includes:
●● A description and logical diagram of the style.
●● Recommendations for when to choose this style.
●● Benefits, challenges and best practices.
●● A recommended deployment using relevant Azure services.
N-tier
N-tier is a traditional architecture for enterprise applications. Dependencies are managed by dividing
the application into layers that perform logical functions, such as presentation, business logic and
data access. A layer can only call into layers that sit below it. However, this horizontal layering can be
a liability. It can be hard to introduce changes in one part of the application without touching the rest
of the application. That makes frequent updates a challenge, limiting how quickly new features can
be added.
N-tier is a natural fit for migrating existing applications that already use a layered architecture.
For that reason, N-tier is most often seen in infrastructure as a service (IaaS) solutions or applications
that use a mix of IaaS and managed services.
Web-Queue-Worker
For a purely PaaS solution, consider a Web-Queue-Worker architecture. In this style, the application
has a web front end that handles HTTP requests and a back-end worker that performs CPU-
intensive tasks or long-running operations. The front end communicates to the worker through an
asynchronous message queue.
Web-queue-worker is suitable for relatively simple domains with some resource-intensive tasks. Like
N-tier, the architecture is easy to understand. The use of managed services simplifies deployment
and operations. But with complex domains, it can be hard to manage dependencies. The front end
and the worker can easily become large, monolithic components that are hard to maintain and
update. As with N-tier, this can reduce the frequency of updates and limit innovation.
Microservices
For applications with a more complex domain, consider a microservices architecture: a collection
of small, autonomous services, each implementing a single business capability.
Each service can be built by a small, focused development team. Individual services can be deployed
without a lot of coordination between teams, which encourages frequent updates. A microservice
architecture is more complex to build and manage than either N-tier or web-queue-worker.
It requires a mature development and DevOps culture. But done right, this style can lead
to higher release velocity, faster innovation and a more resilient architecture.
CQRS
The CQRS (Command and Query Responsibility Segregation) style separates read and write
operations into separate models. This isolates the parts of the system that update data from
the parts that read the data. Moreover, reads can be executed against a materialised view that
is physically separate from the write database. That lets you scale the read and write workloads
independently and optimise the materialised view for queries.
CQRS makes the most sense when it’s applied to a subsystem of a larger architecture. Generally,
you shouldn’t impose it across the entire application, as that will just create unneeded complexity.
Consider it for collaborative domains where many users access the same data.
Event-driven architecture
Consider an event-driven architecture for applications that ingest and process a large volume of data
with very low latency, such as IoT solutions. This style is also useful when different subsystems must
perform different types of processing on the same event data.
The following table summarises how each style manages dependencies and the types of domain
that are best suited for each.
●● N-tier. Dependency management: horizontal tiers divided by subnet. Domain type: traditional
business domain; frequency of updates is low.
●● Web-Queue-Worker. Dependency management: front and backend jobs, decoupled by async
messaging. Domain type: relatively simple domain with some resource-intensive tasks.
●● Microservices. Dependency management: vertically (functionally) decomposed services that call
each other through APIs. Domain type: complicated domain; frequent updates.
●● CQRS. Dependency management: read/write segregation; schema and scale are optimised
separately. Domain type: collaborative domain where lots of users access the same data.
●● Event-driven architecture. Dependency management: producer/consumer; independent view
per sub-system. Domain type: IoT and real-time systems.
●● Big data. Dependency management: divide a huge dataset into small chunks; parallel processing
on local datasets. Domain type: batch and real-time data analysis; predictive analysis using ML.
●● Big compute. Dependency management: data allocation to thousands of cores. Domain type:
compute-intensive domains such as simulation.
When weighing the benefits of a style against its challenges, consider questions such as:
• Manageability. How hard is it to manage the application, monitor, deploy updates and so on?
N-tier
architecture style
An N-tier architecture divides an application into logical layers and
physical tiers.
Layers are a way to separate responsibilities and manage dependencies. Each layer has a specific
responsibility. A higher layer can use services in a lower layer, but not the other way around.
Tiers are physically separated, running on separate machines. A tier can call to another tier directly
or use asynchronous messaging (message queue). Although each layer might be hosted in its
own tier, that’s not required. Several layers might be hosted on the same tier. Physically separating
the tiers improves scalability and resilience, but also adds latency from the additional network
communication.
A traditional three-tier application has a presentation tier, a middle tier and a database tier. The
middle tier is optional. More complex applications can have more than three tiers; for example,
an application might have two middle tiers, encapsulating different areas of functionality.
• In a closed layer architecture, a layer can only call the next layer immediately down.
• In an open layer architecture, a layer can call any of the layers below it.
A closed layer architecture limits the dependencies between layers. However, it might create
unnecessary network traffic, if one layer simply passes requests along to the next layer.
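To make the closed-layer rule concrete, here is a minimal, illustrative Python sketch (all class and method names are hypothetical): the presentation layer only calls the business layer, which in turn only calls the data layer.

```python
# Illustrative closed-layer sketch: each layer only calls the layer
# immediately below it (presentation -> business -> data).
class DataLayer:
    def get_customer(self, customer_id):
        # Stand-in for a real database query.
        return {"id": customer_id, "name": "Contoso Ltd"}

class BusinessLayer:
    def __init__(self, data):
        self._data = data  # depends only on the layer below

    def customer_summary(self, customer_id):
        customer = self._data.get_customer(customer_id)
        return f"{customer['name']} ({customer['id']})"

class PresentationLayer:
    def __init__(self, business):
        self._business = business  # never talks to DataLayer directly

    def render(self, customer_id):
        return f"<h1>{self._business.customer_summary(customer_id)}</h1>"

app = PresentationLayer(BusinessLayer(DataLayer()))
print(app.render("c-42"))
```

In an open layer variant, PresentationLayer would also be allowed to call DataLayer directly, avoiding the pass-through at the cost of looser dependency control.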
N-tier architectures are very common in traditional on-premises applications, so it’s a natural fit
for migrating existing workloads to Azure.
Benefits
• Portability between cloud and on-premises, and between cloud platforms.
• Less learning curve for most developers.
• Natural evolution from the traditional application model.
• Open to heterogeneous environments (Windows/Linux).
Challenges
• It’s easy to end up with a middle tier that just does CRUD operations on the database,
adding extra latency without doing any useful work.
• Monolithic design prevents independent deployment of features.
• Managing an IaaS application is more work than an application that uses only managed services.
• It can be difficult to manage network security in a large system.
N-tier architecture on virtual machines
This section describes a recommended N-tier architecture running on VMs. Each tier consists of two
or more VMs, placed in an availability set or VM scale set. Multiple VMs provide resilience in case one
VM fails. Load balancers are used to distribute requests across the VMs in a tier. A tier can be scaled
horizontally by adding more VMs to the pool.
Each tier is also placed inside its own subnet, meaning the internal IP addresses of the VMs in a
given tier fall within the same address range. That makes it easy to apply network security group
(NSG) rules and route tables
to individual tiers.
The web and business tiers are stateless. Any VM can handle any request for that tier. The data
tier should consist of a replicated database. For Windows, we recommend SQL Server, using Always
On Availability Groups for high availability. For Linux, choose a database that supports replication,
such as Apache Cassandra.
Network Security Groups (NSGs) restrict access to each tier. For example, the database tier
only allows access from the business tier.
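As a rough illustration of the intent of that rule (not the NSG API itself), the sketch below checks a source address against an assumed business-tier subnet; the address ranges are hypothetical.

```python
# Illustrative NSG-style rule: the database tier only accepts traffic whose
# source address falls within the business tier's subnet.
import ipaddress

BUSINESS_SUBNET = ipaddress.ip_network("10.0.2.0/24")  # assumed address plan

def db_tier_allows(source_ip: str) -> bool:
    return ipaddress.ip_address(source_ip) in BUSINESS_SUBNET

print(db_tier_allows("10.0.2.17"))  # True  - a business-tier VM
print(db_tier_allows("10.0.1.9"))   # False - web-tier traffic is denied
```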
Additional considerations
• N-tier architectures are not restricted to three tiers. For more complex applications, it is common
to have more tiers. In that case, consider using layer-7 routing to route requests to a particular tier.
• Tiers are the boundary of scalability, reliability and security. Consider having separate tiers
for services with different requirements in those areas.
• Look for places in the architecture where you can use a managed service without significant
refactoring. In particular, look at caching, messaging, storage and databases.
• For higher security, place a network DMZ in front of the application. The DMZ includes network
virtual appliances (NVAs) that implement security functionality such as firewalls and packet
inspection. For more information, see Network DMZ reference architecture.
• For high availability, place two or more NVAs in an availability set, with an external load balancer
to distribute Internet requests across the instances. For more information, see Deploy highly
available network virtual appliances.
• Do not allow direct RDP or SSH access to VMs that are running application code. Instead,
operators should log into a jumpbox, also called a bastion host. This is a VM on the network that
administrators use to connect to the other VMs. The jumpbox has an NSG that allows RDP or SSH
only from approved public IP addresses.
• You can extend the Azure virtual network to your on-premises network using a site-to-site
virtual private network (VPN) or Azure ExpressRoute. For more information, see Hybrid network
reference architecture.
• If your organisation uses Active Directory to manage identity, you may want to extend your
Active Directory environment to the Azure VNet. For more information, see Identity management
reference architecture.
• If you need higher availability than the Azure SLA for VMs provides, replicate the application
across two regions and use Azure Traffic Manager for failover. For more information, see Run
Windows VMs in multiple regions or Run Linux VMs in multiple regions.
Web-Queue-Worker
architecture style
The core components of this architecture are a web front end that serves
client requests, and a worker that performs resource-intensive tasks, long-
running workflows or batch jobs. The web front end communicates with
the worker through a message queue.
Other components that are commonly incorporated into this architecture include databases, caches
and content delivery networks (CDNs).
The front end might consist of a web API. On the client side, the web API can be consumed
by a single-page application that makes AJAX calls or by a native client application.
Benefits
• Relatively simple architecture that is easy to understand.
• Easy to deploy and manage.
• Clear separation of concerns.
• The front end is decoupled from the worker using asynchronous messaging.
• The front end and the worker can be scaled independently.
Challenges
• Without careful design, the front end and the worker can become large, monolithic components
that are difficult to maintain and update.
• There may be hidden dependencies, if the front end and worker share data schemas or code
modules.
Best practices
• Use polyglot persistence when appropriate. See Use the best data store for the job.
• For best practices articles that provide specific guidance on auto-scaling, caching, data
partitioning, API design and more, go to https://docs.microsoft.com/en-us/azure/architecture/
best-practices/index.
The front end is implemented as an Azure App Service web app and the worker is implemented as
a WebJob. The web app and the WebJob are both associated with an App Service plan that provides
the VM instances.
You can use either Azure Service Bus or Azure Storage queues for the message queue.
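As a rough sketch of the front-end/worker hand-off (assuming the azure-storage-queue v12 Python package, a hypothetical queue named "tasks" and a connection string supplied via the environment; visibility timeouts and poison-message handling are omitted):

```python
import json
import os
from azure.storage.queue import QueueClient

# Connection string is assumed to be supplied via the environment.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
queue = QueueClient.from_connection_string(conn_str, "tasks")

# Web front end: enqueue the work item and return to the caller immediately.
queue.send_message(json.dumps({"job": "resize-image", "blob": "photo1.png"}))

# Worker: receive messages, do the resource-intensive work, delete on success.
for msg in queue.receive_messages():
    job = json.loads(msg.content)
    # ... perform the long-running task described by `job` here ...
    queue.delete_message(msg)
```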
Azure Redis Cache stores session state and other data that needs low latency access.
Azure CDN is used to cache static content such as images, CSS or HTML.
For storage, choose the storage technologies that best fit the needs of the application. You might use
multiple storage technologies (polyglot persistence), such as Azure SQL Database alongside Azure
Cosmos DB.
Additional considerations
• Not every transaction has to go through the queue and worker to storage. The web front end can
perform simple read/write operations directly. Workers are designed for resource-intensive tasks
or long-running workflows. In some cases, you might not need a worker at all.
• Use the built-in autoscale feature of App Service to scale out the number of VM instances. If the
load on the application follows predictable patterns, use schedule-based autoscale. If the load
is unpredictable, use metrics-based autoscaling rules (see the sketch after this list).
• Consider putting the web app and the WebJob into separate App Service plans. That way,
they are hosted on separate VM instances and can be scaled independently.
• Use deployment slots to manage deployments. This lets you deploy an updated version
to a staging slot, then swap over to the new version. It also lets you swap back to the previous
version, if there was a problem with the update.
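The metrics-based rules mentioned in the autoscale bullet above amount to a simple control loop. The sketch below is purely conceptual; the thresholds and function are hypothetical and this is not the App Service autoscale API.

```python
# Conceptual metrics-based autoscale rule: scale out when average CPU over
# the sampling window is high, scale in when it is low.
def desired_instances(current: int, avg_cpu: float,
                      min_n: int = 2, max_n: int = 10) -> int:
    if avg_cpu > 0.70:
        return min(current + 1, max_n)  # scale out one instance at a time
    if avg_cpu < 0.25:
        return max(current - 1, min_n)  # scale in, keeping a minimum for HA
    return current

print(desired_instances(current=3, avg_cpu=0.82))  # -> 4
```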
Microservices
architecture style
A microservices architecture consists of a collection of small, autonomous
services. Each service is self-contained and should implement a single
business capability.
In some ways, microservices are the natural evolution of service-orientated architectures (SOA),
but there are differences between microservices and SOA. Here are some defining characteristics
of a microservice:
• Each service is a separate codebase, which can be managed by a small development team.
• Services can be deployed independently. A team can update an existing service without
rebuilding and redeploying the entire application.
• Services are responsible for persisting their own data or external state. This differs from the
traditional model, where a separate data layer handles data persistence.
• Services don’t need to share the same technology stack, libraries or frameworks.
Besides the services themselves, some other components appear in a typical microservices
architecture:
Service Discovery. Maintains a list of services and which nodes they are located on. Enables service
lookup to find the endpoint for a service.
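As an illustration of that lookup role (all names are hypothetical; a real discovery service also tracks health and expiry), a registry can be as simple as a map from service name to live endpoints:

```python
# Illustrative service-discovery sketch: a registry maps service names to the
# endpoints they currently run on; callers resolve an endpoint at call time.
import random

registry: dict[str, list[str]] = {}

def register(service: str, endpoint: str) -> None:
    registry.setdefault(service, []).append(endpoint)

def resolve(service: str) -> str:
    # Pick one instance; real registries also do health checks and TTLs.
    return random.choice(registry[service])

register("orders", "10.0.1.4:8080")
register("orders", "10.0.1.5:8080")
print(resolve("orders"))
```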
API Gateway. The API gateway is the entry point for clients. Clients don’t call services directly.
Instead, they call the API gateway, which forwards the call to the appropriate services on the
back end. The API gateway might aggregate the responses from several services and return
the aggregated response.
Using an API gateway has the following advantages:
• It decouples clients from services. Services can be versioned or refactored without needing
to update all of the clients.
• Services can use messaging protocols that are not web friendly, such as AMQP.
• The API Gateway can perform other cross-cutting functions such as authentication, logging,
SSL termination and load balancing.
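To illustrate the gateway's forwarding and aggregation role described above, here is a minimal sketch in which plain functions stand in for HTTP calls to back-end services; all names are hypothetical.

```python
# Illustrative API-gateway aggregation: the client makes one call; the
# gateway fans out to back-end services and merges their responses.
def catalogue_service(product_id):
    return {"id": product_id, "name": "Widget"}

def inventory_service(product_id):
    return {"in_stock": 42}

def gateway_get_product(product_id):
    # Cross-cutting concerns (authentication, logging, SSL termination)
    # would also be handled here, before the fan-out.
    details = catalogue_service(product_id)
    stock = inventory_service(product_id)
    return {**details, **stock}  # one aggregated response for the client

print(gateway_get_product("p-100"))
```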
Benefits
• Independent deployments. You can update a service without redeploying the entire application
and roll back or roll forward an update if something goes wrong. Bug fixes and feature releases
are more manageable and less risky.
• Independent development. A single development team can build, test and deploy a service.
The result is continuous innovation and a faster release cadence.
• Fault isolation. If a service goes down, it won’t take out the entire application. However,
that doesn’t mean you get resilience for free. You still need to follow resilience best practices
and design patterns. See Designing resilient applications for Azure.
• Mixed technology stacks. Teams can pick the technology that best fits their service.
• Granular scaling. Services can be scaled independently. At the same time, the higher density
of services per VM means that VM resources are fully utilised. Using placement constraints,
a service can be matched to a VM profile (high CPU, high memory and so on).
Challenges
• Complexity. A microservices application has more moving parts than the equivalent monolithic
application. Each service is simpler, but the entire system as a whole is more complex.
• Development and test. Developing against service dependencies requires a different approach.
Existing tools are not necessarily designed to work with service dependencies. Refactoring across
service boundaries can be difficult. It is also challenging to test service dependencies, especially
when the application is evolving quickly.
• Lack of governance. The decentralised approach to building microservices has advantages, but
it can also lead to problems. You may end up with so many different languages and frameworks
that the application becomes hard to maintain. It may be useful to put some project-wide
standards in place, without overly restricting teams’ flexibility. This especially applies to cross-
cutting functionality such as logging.
• Network congestion and latency. The use of many small, granular services can result in more
interservice communication. Also, if the chain of service dependencies gets too long (service
A calls B, which calls C...), the additional latency can become a problem. You will need to design
APIs carefully. Avoid overly chatty APIs, think about serialisation formats and look for places
to use asynchronous communication patterns.
• Data integrity. Each microservice is responsible for its own data persistence. As a result,
data consistency can be a challenge. Embrace eventual consistency where possible.
• Versioning. Updates to a service must not break services that depend on it. Multiple services
could be updated at any given time, so without careful design, you might have problems with
backward or forward compatibility.
• Skillset. Microservices are highly distributed systems. Carefully evaluate whether the team
has the skills and experience to be successful.
Best practices
• Decentralise everything. Individual teams are responsible for designing and building services.
Avoid sharing code or data schemas.
• Data storage should be private to the service that owns the data. Use the best storage for each
service and data type.
• Avoid coupling between services. Causes of coupling include shared database schemas and
rigid communication protocols.
• Offload cross-cutting concerns, such as authentication and SSL termination, to the gateway.
• Keep domain knowledge out of the gateway. The gateway should handle and route client
requests without any knowledge of the business rules or domain logic. Otherwise, the gateway
becomes a dependency and can cause coupling between services.
• Services should have loose coupling and high functional cohesion. Functions that are likely
to change together should be packaged and deployed together. If they reside in separate
services, those services end up being tightly coupled, because a change in one service will
require updating the other service. Overly chatty communication between two services may
be a symptom of tight coupling and low cohesion.
• Isolate failures. Use resilience strategies to prevent failures within a service from cascading.
See designing resilient applications.
For a list and summary of the resilience patterns available in Azure, go to https://docs.microsoft.com/
en-us/azure/architecture/patterns/category/resiliency.
Microservices using Azure Container Service
Public nodes. These nodes are reachable through a public-facing load balancer. The API gateway
is hosted on these nodes.
Backend nodes. These nodes run services that clients reach via the API gateway. These nodes don’t
receive Internet traffic directly. The backend nodes might include more than one pool of VMs, each
with a different hardware profile. For example, you could create separate pools for general compute
workloads, high CPU workloads and high memory workloads.
Management VMs. These VMs run the master nodes for the container orchestrator.
Networking. The public nodes, backend nodes and management VMs are placed in separate
subnets within the same virtual network (VNet).
Load balancers. An externally facing load balancer sits in front of the public nodes. It distributes
internet requests to the public nodes. Another load balancer is placed in front of the management
VMs, to allow secure shell (ssh) traffic to the management VMs, using NAT rules.
For reliability and scalability, each service is replicated across multiple VMs. However, because
services are also relatively lightweight (compared with a monolithic application), multiple services
are usually packed into a single VM. Higher density allows better resource utilisation. If a particular
service doesn’t use a lot of resources, you don’t need to dedicate an entire VM to running that
service.
Microservices using Azure Service Fabric
The Service Fabric Cluster is deployed to one or more VM scale sets. You might have more than one
VM scale set in the cluster, in order to have a mix of VM types. An API Gateway is placed in front
of the Service Fabric cluster, with an external load balancer to receive client requests.
The Service Fabric runtime performs cluster management, including service placement, node failover
and health monitoring. The runtime is deployed on the cluster nodes themselves. There isn’t a
separate set of cluster management VMs.
Services communicate with each other using the reverse proxy that is built into Service Fabric.
Service Fabric provides a discovery service that can resolve the endpoint for a named service.
CQRS
architecture style
Command and Query Responsibility Segregation (CQRS) is an architecture
style that separates read operations from write operations.
In traditional architectures, the same data model is used to query and update a database.
That’s simple and works well for basic CRUD operations. In more complex applications, however,
this approach can become unwieldy. For example, on the read side, the application may perform
many different queries, returning data transfer objects (DTOs) with different shapes. Object mapping
can become complicated. On the write side, the model may implement complex validation and
business logic. As a result, you can end up with an overly complex model that does too much.
Another potential problem is that read and write workloads are often asymmetrical, with very
different performance and scale requirements.
CQRS addresses these problems by separating reads and writes into separate models, using
commands to update data and queries to read data.
• Commands should be task based, rather than data centric. (“Book hotel room”, not “set
ReservationStatus to Reserved”.) Commands may be placed in a queue for asynchronous
processing, rather than being processed synchronously.
• Queries never modify the database. A query returns a DTO that does not encapsulate any domain
knowledge.
If separate read and write databases are used, they must be kept in sync. Typically this is
accomplished by having the write model publish an event whenever it updates the database.
Updating the database and publishing the event must occur in a single transaction.
Some implementations of CQRS use the Event Sourcing pattern. With this pattern, application state
is stored as a sequence of events. Each event represents a set of changes to the data. The current
state is constructed by replaying the events. In a CQRS context, one benefit of Event Sourcing is that
the same events can be used to notify other components – in particular, to notify the read model.
The read model uses the events to create a snapshot of the current state, which is more efficient
for queries. However, Event Sourcing adds complexity to the design.
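A minimal sketch of these ideas, with all names hypothetical: a task-based command updates the write side, which publishes an event; a projection applies the event to a materialised read view, and queries return plain DTOs from that view.

```python
# Illustrative CQRS sketch: commands mutate the write model and publish
# events; the read side projects events into a query-optimised view.
import uuid

event_log = []   # stands in for an event store / message bus
read_view = {}   # materialised view, optimised for queries

class BookHotelRoom:  # a task-based command, not "set status to Reserved"
    def __init__(self, guest, room):
        self.reservation_id = str(uuid.uuid4())
        self.guest, self.room = guest, room

def handle(command):
    # Write side: validate, persist and publish the event
    # (single-transaction semantics are elided in this sketch).
    event = {"type": "RoomBooked", "id": command.reservation_id,
             "guest": command.guest, "room": command.room}
    event_log.append(event)
    project(event)

def project(event):
    # Read side: shape the data for queries (a plain DTO, no domain logic).
    read_view[event["id"]] = {"guest": event["guest"], "room": event["room"],
                              "status": "Reserved"}

def get_reservation(reservation_id):
    # Queries never modify the database.
    return read_view.get(reservation_id)

cmd = BookHotelRoom("Alice", "101")
handle(cmd)
print(get_reservation(cmd.reservation_id))
```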
CQRS is not a top-level architecture that applies to an entire system. Apply CQRS only to those
subsystems where there is clear value in separating reads and writes. Otherwise, you are creating
additional complexity for no benefit.
Benefits
• Independent scaling. CQRS allows the read and write workloads to scale independently and
may result in less lock contention.
• Optimised data schemas. The read side can use a schema that is optimised for queries, while
the write side uses a schema that is optimised for updates.
• Security. It’s easier to ensure that only the right domain entities are performing writes on the data.
• Simpler queries. By storing a materialised view in the read database, the application can
avoid complex joins when querying.
Challenges
• Complexity. The basic idea of CQRS is simple. But it can lead to a more complex application
design, especially if it includes the Event Sourcing pattern.
• Messaging. Although CQRS does not require messaging, it’s common to use messaging to
process commands and publish update events. In that case, the application must handle message
failures or duplicate messages.
• Eventual consistency. If you separate the read and write databases, the read data may be stale.
Best practices
• For more information about implementing CQRS, go to https://docs.microsoft.com/en-us/azure/
architecture/patterns/cqrs.
• For information about using the Event Sourcing pattern to avoid update conflicts,
go to https://docs.microsoft.com/en-us/azure/architecture/patterns/event-sourcing.
• For information about using the Materialised View pattern for the read model, to optimise
the schema for queries, go to
https://docs.microsoft.com/en-us/azure/architecture/patterns/materialized-view.
CQRS in microservices
CQRS can be especially useful in a microservices architecture. One of the principles of microservices
is that a service cannot directly access another service’s data store. A query that needs data owned
by several services can therefore be hard to implement efficiently. With CQRS, the read model can
subscribe to events published by multiple services and maintain a materialised view that serves
such queries.
Event-driven
architecture style
An event-driven architecture consists of event producers that generate
a stream of events and event consumers that listen for the events.
Events are delivered in near real time, so consumers can respond immediately to events as they
occur. Producers are decoupled from consumers – a producer doesn’t know which consumers are
listening. Consumers are also decoupled from each other and every consumer sees all of the events.
This differs from a Competing Consumers pattern, where consumers pull messages from a queue
and a message is processed just once (assuming no errors). In some systems, such as IoT, events
must be ingested at very high volumes.
An event-driven architecture can use a pub/sub model or an event stream model.
• Pub/sub: The messaging infrastructure keeps track of subscriptions. When an event is published,
it sends the event to each subscriber. After an event is received, it cannot be replayed, and new
subscribers do not see the event.
• Event streaming: Events are written to a log. Events are strictly ordered (within a partition)
and durable. Clients don’t subscribe to the stream; instead, a client can read from any part of
the stream. The client is responsible for advancing its position in the stream. That means a client
can join at any time and can replay events.
On the consumer side, there are some common variations:
• Complex event processing. A consumer processes a series of events, looking for patterns in
the event data, using a technology such as Azure Stream Analytics or Apache Storm. For example,
you could aggregate readings from an embedded device over a time window and generate
a notification if the moving average crosses a certain threshold (see the sketch after this list).
• Event stream processing. Use a data streaming platform, such as Azure IoT Hub or Apache
Kafka, as a pipeline to ingest events and feed them to stream processors. The stream processors
act to process or transform the stream. There may be multiple stream processors for different
subsystems of the application. This approach is a good fit for IoT workloads.
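The moving-average example mentioned above can be sketched as follows (a simplification: in
production this windowing would run inside a stream-processing engine such as Stream Analytics
or Storm):

    from collections import deque

    WINDOW = 10         # number of readings in the rolling window
    THRESHOLD = 75.0    # notify when the moving average crosses this value

    readings = deque(maxlen=WINDOW)

    def on_reading(value: float) -> None:
        """Consume one device reading and raise a notification if needed."""
        readings.append(value)
        if len(readings) == WINDOW:
            average = sum(readings) / WINDOW
            if average > THRESHOLD:
                print(f"ALERT: moving average {average:.1f} exceeds {THRESHOLD}")

    for value in [70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90]:
        on_reading(value)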
The source of the events may be external to the system, such as physical devices in an IoT solution.
In that case, the system must be able to ingest the data at the volume and throughput that
is required by the data source.
In the logical diagram above, each type of consumer is shown as a single box. In practice,
it’s common to have multiple instances of a consumer, to avoid having the consumer become
a single point of failure in the system. Multiple instances might also be necessary to handle the
volume and frequency of events. Also, a single consumer might process events on multiple threads.
This can create challenges if events must be processed in order or require exactly-once semantics.
See Minimise Coordination.
Benefits
• Producers and consumers are decoupled.
• No point-to-point integrations. It’s easy to add new consumers to the system.
• Consumers can respond to events immediately as they arrive.
• Highly scalable and distributed.
• Subsystems have independent views of the event stream.
Challenges
• Guaranteed delivery. In some systems, especially in IoT scenarios, it’s crucial to guarantee
that events are delivered.
• Processing events in order or exactly once. Each consumer type typically runs in multiple
instances, for resilience and scalability. This can create a challenge if the events must be
processed in order (within a consumer type) or if the processing logic is not idempotent.
IoT architecture
The cloud gateway ingests device events at the cloud boundary, using a reliable, low-latency
messaging system.
Devices might send events directly to the cloud gateway or through a field gateway. A field gateway
is a specialised device or software, usually collocated with the devices, that receives events and
forwards them to the cloud gateway. The field gateway might also pre-process the raw device events,
performing functions such as filtering, aggregation or protocol transformation.
After ingestion, events go through one or more stream processors that can route the data
(for example, to storage) or perform analytics and other processing.
The following are some common types of processing. (This list is certainly not exhaustive.)
• Writing event data to cold storage, for archiving or batch analytics.
• Hot-path analytics, analysing the event stream in (near) real time to detect anomalies,
recognise patterns over rolling time windows or trigger alerts when a specific condition occurs.
• Handling special types of non-telemetry messages from devices, such as notifications and alarms.
• Machine learning.
The boxes that are shaded grey show components of an IoT system that are not directly related
to event streaming, but are included here for completeness.
• The device registry is a database of the provisioned devices, including the device IDs and usually
device metadata, such as location.
• The provisioning API is a common external interface for provisioning and registering new devices.
• Some IoT solutions allow command and control messages to be sent to devices.
This section has presented a very high-level view of IoT and there are many subtleties and challenges
to consider. For more information and a detailed reference architecture, go to
https://azure.microsoft.com/en-us/updates/microsoft-azure-iot-reference-architecture-available/
(PDF download).
Big data
architecture style
A big data architecture is designed to handle the ingestion, processing
and analysis of data that is too large or complex for traditional database
systems.
Big data solutions typically involve one or more of the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
• Data sources: All big data solutions start with one or more data sources. Examples include:
• Application data stores, such as relational databases.
• Static files produced by applications, such as web server log files.
• Real-time data sources, such as IoT devices.
• Data storage: Data for batch processing operations is typically stored in a distributed file store
that can hold high volumes of large files in various formats. This kind of store is often called
a data lake. Options for implementing this storage include Azure Data Lake Store or blob
containers in Azure Storage.
• Real-time message ingestion. If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream processing. This might be a
simple data store, where incoming messages are dropped into a folder for processing. However,
many solutions need a message ingestion store to act as a buffer for messages and to support
scale-out processing, reliable delivery and other message queuing semantics. Options include
Azure Event Hubs, Azure IoT Hub and Kafka.
• Stream processing. After capturing real-time messages, the solution must process them by
filtering, aggregating and otherwise preparing the data for analysis. The processed stream data
is then written to an output sink. Azure Stream Analytics provides a managed stream processing
service based on perpetually running SQL queries that operate on unbounded streams.
You can also use open source Apache streaming technologies like Storm and Spark Streaming
in an HDInsight cluster.
• Analytical data store. Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools. The analytical
data store used to serve these queries can be a Kimball-style relational data warehouse, as seen
in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented
through a low-latency NoSQL technology such as HBase or an interactive Hive database
that provides a metadata abstraction over data files in the distributed data store. Azure SQL
Data Warehouse provides a managed service for large-scale, cloud-based data warehousing.
HDInsight supports Interactive Hive, HBase and Spark SQL, which can also be used to serve
data for analysis.
• Analysis and reporting. The goal of most big data solutions is to provide insights into the
data through analysis and reporting. To empower users to analyse the data, the architecture
may include a data modelling layer, such as a multidimensional OLAP cube or tabular data
model in Azure Analysis Services. It might also support self-service BI, using the modelling
and visualisation technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting
can also take the form of interactive data exploration by data
scientists or data analysts. For these scenarios, many Azure services support analytical notebooks,
such as Jupyter, enabling these users to leverage their existing skills with Python or R. For large-
scale data exploration, you can use Microsoft R Server, either standalone or with Spark.
• Orchestration. Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple sources and
sinks, load the processed data into an analytical data store or push the results straight to a report
or dashboard. To automate these workflows, you can use an orchestration technology such as
Azure Data Factory or Apache Oozie and Sqoop.
Azure includes many services that can be used in a big data architecture. They fall roughly into
two categories:
• Managed services, including Azure Data Lake Store, Azure Data Lake Analytics, Azure SQL Data
Warehouse, Azure Stream Analytics, Azure Event Hubs, Azure IoT Hub and Azure Data Factory.
• Open source technologies based on the Apache Hadoop platform, including HDFS, HBase, Hive,
Pig, Spark, Storm, Oozie, Sqoop and Kafka. These technologies are available on Azure in the
Azure HDInsight service.
These options are not mutually exclusive and many solutions combine open source technologies
with Azure services.
Benefits
• Technology choices. You can mix and match Azure managed services and Apache technologies
in HDInsight clusters, to capitalise on existing skills or technology investments.
• Performance through parallelism. Big data solutions take advantage of parallelism, enabling
high-performance solutions that scale to large volumes of data.
• Elastic scale. All of the components in the big data architecture support scale-out provisioning,
so that you can adjust your solution to small or large workloads, and pay only for the resources
that you use.
• Interoperability with existing solutions. The components of the big data architecture are also
used for IoT processing and enterprise BI solutions, enabling you to create an integrated solution
across data workloads.
Challenges
• Complexity. Big data solutions can be extremely complex, with numerous components to handle
data ingestion from multiple data sources. It can be challenging to build, test and troubleshoot
big data processes. Moreover, there may be a large number of configuration settings across
multiple systems that must be set correctly in order to optimise performance.
• Skillset. Many big data technologies are highly specialised and use frameworks and languages
that are not typical of more general application architectures. On the other hand, big data
technologies are evolving new APIs that build on more established languages. For example,
the U-SQL language in Azure Data Lake Analytics is based on a combination of Transact-SQL
and C#. Similarly, SQL-based APIs are available for Hive, HBase and Spark.
• Technology maturity. Many of the technologies used in big data are evolving. While core
Hadoop technologies such as Hive and Pig have stabilised, emerging technologies such as Spark
introduce extensive changes and enhancements with each new release. Managed services such
as Azure Data Lake Analytics and Azure Data Factory are relatively young, compared with other
Azure services, and will likely evolve over time.
• Security. Big data solutions usually rely on storing all static data in a centralised data lake.
Securing access to this data can be challenging, especially when the data must be ingested
and consumed by multiple applications and platforms.
Best practices
• Partition data. Batch processing usually happens on a recurring schedule – for example, weekly
or monthly. Partition data files, and data structures such as tables, based on temporal periods
that match the processing schedule. That simplifies data ingestion and job scheduling, and
makes it easier to troubleshoot failures. Also, partitioning tables that are used in Hive, U-SQL
or SQL queries can significantly improve query performance.
• Apply schema-on-read semantics. Using a data lake lets you combine storage for files in
multiple formats, whether structured, semi-structured or unstructured. Use schema-on-read
semantics, which project a schema onto the data when the data is processed, not when the data
is stored. This builds flexibility into the solution and prevents bottlenecks during data ingestion
caused by data validation and type checking.
• Process data in-place. Traditional BI solutions often use an extract, transform and load (ETL)
process to move data into a data warehouse. With larger volumes of data, and a greater variety
of formats, big data solutions generally use variations of ETL, such as transform, extract and load
(TEL). With this approach, the data is processed within the distributed data store, transforming
it to the required structure, before moving the transformed data into an analytical data store.
• Balance utilisation and time costs. For batch processing jobs, it’s important to consider two
factors: the per-unit cost of the compute nodes and the per-minute cost of using those nodes
to complete the job (see the worked example after this list). For example, a batch job may take
eight hours with four cluster nodes.
However, it might turn out that the job uses all four nodes only during the first two hours, and
after that, only two nodes are required. In that case, running the entire job on two nodes would
increase the total job time, but would not double it, so the total cost would be less. In some
business scenarios, a longer processing time may be preferable to the higher cost of using
under-utilised cluster resources.
• Separate cluster resources. When deploying HDInsight clusters, you will normally achieve better
performance by provisioning separate cluster resources for each type of workload. For example,
although Spark clusters include Hive, if you need to perform extensive processing with both
Hive and Spark, you should consider deploying separate dedicated Spark and Hadoop clusters.
Similarly, if you are using HBase and Storm for low latency stream processing and Hive for batch
processing, consider separate clusters for Storm, HBase and Hadoop.
• Orchestrate data ingestion. In some cases, existing business applications may write data files
for batch processing directly into Azure storage blob containers, where they can be consumed by
HDInsight or Azure Data Lake Analytics. However, you will often need to orchestrate the ingestion
of data from on-premises or external data sources into the data lake. Use an orchestration
workflow or pipeline, such as those supported by Azure Data Factory or Oozie, to achieve this
in a predictable and centrally manageable fashion.
• Scrub sensitive data early. The data ingestion workflow should scrub sensitive data early in the
process, to avoid storing it in the data lake.
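To make the utilisation trade-off concrete, the following sketch works through the four-node
example from the list above, assuming a purely illustrative price per node-hour:

    PRICE_PER_NODE_HOUR = 1.0    # illustrative rate, not a real Azure price

    # Four nodes held for the full eight hours:
    cost_four_nodes = 4 * 8 * PRICE_PER_NODE_HOUR        # 32 node-hours

    # The same work on two nodes, assuming it parallelises cleanly:
    # the two busy hours of four-node work become four hours on two nodes,
    # followed by the six hours that only ever needed two nodes.
    runtime_two_nodes = (4 * 2) / 2 + 6                  # 10 hours, not 16
    cost_two_nodes = 2 * runtime_two_nodes * PRICE_PER_NODE_HOUR   # 20 node-hours

    print(cost_four_nodes, runtime_two_nodes, cost_two_nodes)      # 32.0 10.0 20.0

The job runs two hours longer, but the total cost falls by more than a third; whether that trade
is worthwhile depends on the business scenario.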
Big compute
architecture style
The term big compute describes large-scale workloads that require
a large number of cores, often numbering in the hundreds or thousands.
Scenarios include image rendering, fluid dynamics, financial risk
modelling, oil exploration, drug design and engineering stress analysis,
among others.
Big compute workloads typically have the following characteristics:
• The work can be split into discrete tasks, which can be run across many cores simultaneously.
• Each task is finite. It takes some input, does some processing and produces output. The entire
application runs for a finite amount of time (minutes to days). A common pattern is to provision
a large number of cores in a burst and then spin down to zero once the application completes.
• The application does not need to stay up 24/7. However, the system must handle node failures
or application crashes.
• For some applications, tasks are independent and can run in parallel. In other cases, tasks
are tightly coupled, meaning they must interact or exchange intermediate results. In that case,
consider using high-speed networking technologies such as InfiniBand and remote direct
memory access (RDMA).
• Depending on your workload, you might use compute-intensive VM sizes (H16r, H16mr and A9).
When to use this architecture
• Simulations that are computationally intensive and must be split across CPUs in multiple
computers (10s to 1,000s).
• Simulations that require too much memory for one computer and must be split across multiple
computers.
• Long-running computations that would take too long to complete on a single computer.
• Smaller computations that must be run hundreds or thousands of times, such as Monte Carlo
simulations.
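As a minimal sketch of the embarrassingly parallel pattern, the following Python fragment spreads
independent Monte Carlo tasks (here, estimating pi) across local cores with a process pool; in a big
compute solution the same decomposition would be spread across many VMs:

    import random
    from multiprocessing import Pool

    def estimate_pi(samples: int) -> float:
        """One discrete, finite task: takes input, computes, returns output."""
        hits = sum(1 for _ in range(samples)
                   if random.random() ** 2 + random.random() ** 2 <= 1.0)
        return 4.0 * hits / samples

    if __name__ == "__main__":
        with Pool() as pool:                               # one worker per available core
            results = pool.map(estimate_pi, [100_000] * 8) # eight independent tasks
        print(sum(results) / len(results))                 # aggregate the task outputs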
Benefits
• High performance with “embarrassingly parallel” processing.
• Can harness hundreds or thousands of computer cores to solve large problems faster.
• You can provision VMs as needed to do work and then tear them down.
Challenges
• Managing the VM infrastructure.
• For tightly coupled tasks, adding more cores can have diminishing returns. You may need
to experiment to find the optimum number of cores.
Using Azure Batch, you configure a VM pool, and upload the applications and data files. Then the
Batch service provisions the VMs, assigns tasks to the VMs, runs the tasks and monitors the progress.
Batch can automatically scale out the VMs in response to the workload. Batch also provides job
scheduling.
The head node provides management and job scheduling services to the cluster. For tightly coupled
tasks, use an RDMA network that provides very high bandwidth, low latency communication between
VMs. For more information, see Deploy an HPC Pack 2016 cluster in Azure.
Choose compute
and data store
technologies
Choose the right technologies for Azure applications.
When designing a solution for Azure, there are two technology choices that you should make early in
the design process, because they affect the entire architecture. These are the choice of compute and
data store technologies.
Your compute option is which hosting model you choose for the computing resources that your
application runs on. Broadly, the choice is between Infrastructure-as-a-Service (IaaS),
Platform-as-a-Service (PaaS) or Functions-as-a-Service (FaaS), and the spectrum in between. There are seven
main compute options currently available in Azure for you to choose from. To make your choice,
consider the appropriate features and limitations of the service, availability and scalability, cost and
considerations for DevOps. The comparison tables in this section will help you narrow down your
choices.
The data store includes any kind of data that your application needs to ingest, manage or generate,
or that users create. Business data, caches, IoT data, telemetry and unstructured log data are the most
common types, and applications often contain more than one data type. Different data types have
different processing requirements and so you need to choose the right store for each type for the
best results. Some data store technologies support multiple storage models. Use the information in
this section to first choose which storage model is best suited for your requirements. Then consider
a particular data store within that category, based on factors such as feature set, cost and ease
of management.
This section of the Application Architecture Guide contains the following topics:
• Compute options overview introduces some general considerations for choosing a compute
service in Azure.
• Criteria for choosing a compute option compares specific Azure compute services across several
axes, including hosting model, DevOps, availability and scalability.
• Comparison criteria for choosing a data store describes some of the factors to consider
when choosing a data store.
2a
Overview of
compute options
The term compute refers to the hosting model for the computing
resources that your application runs on.
At one end of the spectrum is Infrastructure-as-a-Service (IaaS). With IaaS, you provision the VMs
that you need, along with associated network and storage components. Then you deploy whatever
software and applications you want onto those VMs. This model is the closest to a traditional on-
premises environment, except that Microsoft manages the infrastructure. You still manage the
individual VMs.
Platform-as-a-Service (PaaS) provides a managed hosting environment, where you can deploy
your application without needing to manage VMs or networking resources. For example, instead of
creating individual VMs, you specify an instance count, and the service will provision, configure and
manage the necessary resources. Azure App Service is an example of a PaaS service.
There is a spectrum from IaaS to pure PaaS. For example, Azure VMs can auto-scale by using VM
Scale Sets. This automatic scaling capability isn’t strictly PaaS, but it’s the type of management feature
that might be found in a PaaS service.
Functions-as-a-Service (FaaS) goes even further in removing the need to worry about the hosting
environment. Instead of creating compute instances and deploying code to those instances, you
simply deploy your code, and the service automatically runs it. You don’t need to administer the
compute resources. These services make use of serverless architecture and seamlessly scale up or
down to whatever level necessary to handle the traffic. Azure Functions is a FaaS service.
IaaS gives the most control, flexibility and portability. FaaS provides simplicity, elastic scale and
potential cost savings, because you pay only for the time your code is running.
PaaS falls somewhere between the two. In general, the more flexibility a service provides, the more
you are responsible for configuring and managing the resources. FaaS services automatically manage
nearly all aspects of running an application, while IaaS solutions require you to provision, configure
and manage the VMs and network components you create.
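To illustrate the FaaS end of the spectrum, here is a minimal HTTP-triggered Azure Function written
against the Azure Functions Python programming model (the function.json binding configuration
that accompanies the code is omitted):

    import azure.functions as func

    def main(req: func.HttpRequest) -> func.HttpResponse:
        # Only this code is deployed. The platform provisions, scales and
        # bills the underlying compute for the time the code is running.
        name = req.params.get("name", "world")
        return func.HttpResponse(f"Hello, {name}!")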
The main compute options currently available in Azure include the following:
• Virtual Machines are an IaaS service, allowing you to deploy and manage VMs inside
a virtual network.
• App Service is a managed PaaS service for hosting web applications and web APIs.
• Azure Functions is a managed FaaS service for running event-driven code.
• Service Fabric is a distributed systems platform that can run in many environments, including
Azure or on premises. Service Fabric is an orchestrator of microservices across a cluster
of machines.
• Azure Container Service lets you create, configure and manage a cluster of VMs that are
preconfigured to run containerised applications.
• Azure Batch is a managed service for running large-scale parallel and high-performance
computing (HPC) applications.
• Cloud Services is a managed service for running cloud applications. It uses a PaaS hosting model.
When choosing a compute option, consider the following factors:
• Hosting model. How is the service hosted? What requirements and limitations are imposed
by this hosting environment?
• DevOps. Is there built-in support for application upgrades? What is the deployment model?
• Scalability. How does the service handle adding or removing instances? Can it auto-scale based
on load and other metrics?
• Cost. In addition to the cost of the service itself, consider the operations cost for managing
a solution built on that service. For example, IaaS solutions might have a higher operations cost.
2b
Compute
comparison
The term compute refers to the hosting model for the computing
resources that your application runs on. The following tables compare
Azure compute services across several axes. Refer to these tables when
selecting a compute option for your application.
Hosting model
• Minimum number of nodes – Virtual Machines: 1; App Service: 1; Service Fabric: 5;
Azure Functions: no dedicated nodes; Cloud Services: 2; Azure Container Services: 3; Azure Batch: 1.
• Web hosting – Virtual Machines: agnostic; App Service: built in; Service Fabric: self-host or
IIS in containers; Azure Functions: not applicable; Cloud Services: built in (IIS);
Azure Container Services: agnostic; Azure Batch: no.
• Can be deployed to dedicated VNet? – Supported for all options except Azure Functions.
• Hybrid connectivity – Supported for all options except Azure Functions.
DevOps
• Local debugging – Virtual Machines: agnostic; App Service: IIS Express and others (IIS Express
for ASP.NET or node.js (iisnode), PHP web server, Azure Toolkit for IntelliJ, Azure Toolkit for
Eclipse; App Service also supports remote debugging of deployed web apps); Service Fabric:
local node cluster; Azure Functions: Azure Functions CLI; Cloud Services: local emulator;
Azure Container Services: local container runtime; Azure Batch: not supported.
• Programming model – Virtual Machines: agnostic; App Service: web application, WebJobs for
background tasks; Service Fabric: guest executable, service model, actor model, containers;
Azure Functions: functions with triggers; Cloud Services: web role, worker role;
Azure Container Services: agnostic; Azure Batch: command-line application.
• Resource Manager – Supported for all options except Cloud Services (limited).
Scalability
• Scale limit – Virtual Machines: platform image, 1,000 nodes per VM scale set (custom image,
100 nodes per VM scale set); App Service: 20 instances by default, 50 with App Service
Environment; Service Fabric: 100 nodes per VM scale set; Azure Functions: infinite;
Cloud Services: 100; Azure Container Services: no defined limit, 200 maximum recommended;
Azure Batch: 20-core limit by default (contact customer service for an increase).
Availability
• Multiregion failover – Virtual Machines: Traffic Manager; App Service: Traffic Manager;
Service Fabric: Traffic Manager, multi-region cluster; Azure Functions: not supported;
Cloud Services: Traffic Manager; Azure Container Services: Traffic Manager;
Azure Batch: not supported.
Security
• SSL – Virtual Machines: configured in VM; App Service: supported; Service Fabric: supported;
Azure Functions: supported; Cloud Services: supported; Azure Container Services: configured
in VM; Azure Batch: supported.
• RBAC – Supported for all options except Cloud Services.
Other
• Cost – Virtual Machines: Windows and Linux VM pricing; App Service: App Service pricing;
Service Fabric: Service Fabric pricing; Azure Functions: Azure Functions pricing;
Cloud Services: Cloud Services pricing; Azure Container Services: Azure Container Service
pricing; Azure Batch: Azure Batch pricing.
• Suitable architecture styles – Virtual Machines: N-tier, big compute; App Service:
Web-Queue-Worker; Service Fabric: microservices, event-driven architecture; Azure Functions:
microservices, event-driven architecture; Cloud Services: Web-Queue-Worker; Azure Container
Services: microservices, event-driven architecture; Azure Batch: big compute.
Data store
overview
Choose the right data store.
Modern business systems manage increasingly large volumes of data. Data may be ingested
from external services, generated by the system itself or created by users. These data sets may have
extremely varied characteristics and processing requirements. Businesses use data to assess trends,
trigger business processes, audit their operations, analyse customer behaviour and many other
things.
This heterogeneity means that a single data store is usually not the best approach. Instead, it’s
often better to store different types of data in different data stores, each focused towards a specific
workload or usage pattern. The term polyglot persistence is used to describe solutions that use
a mix of data store technologies.
Selecting the right data store for your requirements is a key design decision. There are literally
hundreds of implementations to choose from among SQL and NoSQL databases. Data stores are
often categorised by how they structure data and the types of operations they support. This article
describes several of the most common storage models. Note that a particular data store technology
may support multiple storage models. For example, a relational database management system
(RDBMS) may also support key/value or graph storage. In fact, there is a general trend for so-called
multimodel support, where a single database system supports several models. But it’s still useful
to understand the different models at a high level.
Not all data stores in a given category provide the same feature set. Most data stores provide
server-side functionality to query and process data. Sometimes this functionality is built into the data
storage engine. In other cases, the data storage and processing capabilities are separated, and there
may be several options for processing and analysis. Data stores also support different programmatic
and management interfaces.
Generally, you should start by considering which storage model is best suited for your requirements.
Then consider a particular data store within that category, based on factors such as feature set, cost
and ease of management.
Relational database management systems
An RDBMS typically supports a schema-on-write model, where the data structure is defined ahead
of time, and all read or write operations must use the schema. This is in contrast to most NoSQL data
stores, particularly key/value types, where the schema-on-read model assumes that the client will
be imposing its own interpretive schema on data coming out of the database, and is agnostic to the
data format being written.
An RDBMS is very useful when strong consistency guarantees are important – where all changes
are atomic and transactions always leave the data in a consistent state. However, the underlying
structures do not lend themselves to scaling out by distributing storage and processing across
machines. Also, information stored in an RDBMS must be put into a relational structure by following
the normalisation process. While this process is well understood, it can lead to inefficiencies because
of the need to disassemble logical entities into rows in separate tables and then reassemble the data
when running queries.
Key/value stores
A key/value store is essentially a large hash table. You associate each data value with a unique
key and the key/value store uses this key to store the data by using an appropriate hashing function.
The hashing function is selected to provide an even distribution of hashed keys across the data storage.
Most key/value stores only support simple query, insert and delete operations. To modify a value
(either partially or completely), an application must overwrite the existing data for the entire value.
In most implementations, reading or writing a single value is an atomic operation. If the value is large,
writing may take some time.
An application can store arbitrary data as a set of values, although some key/value stores impose
limits on the maximum size of values. The stored values are opaque to the storage system software.
Any schema information must be provided and interpreted by the application. Essentially, values are
blobs and the key/value store simply retrieves or stores the value by key.
A single key/value store can be extremely scalable, as the data store can easily distribute data across
multiple nodes on separate machines.
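The programming model is correspondingly small, as this sketch using the redis-py client shows
(it assumes a Redis instance listening on localhost; any key/value store exposes a similar surface):

    import redis  # pip install redis

    store = redis.Redis(host="localhost", port=6379)

    # The value is an opaque blob to the store; any schema is the application's concern.
    store.set("session:42", b'{"user": "alice", "cart": [17, 23]}')

    raw = store.get("session:42")   # retrieval is by key only
    store.delete("session:42")      # modifying means overwriting or deleting the value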
Document databases
A document database is conceptually similar to a key/value store, except that it stores a collection
of named fields and data (known as documents), each of which could be simple scalar items or
compound elements such as lists and child collections. The data in the fields of a document can
be encoded in a variety of ways, including XML, YAML, JSON or BSON, or even stored as plain text.
Unlike key/value stores, the fields in documents are exposed to the storage management system,
enabling an application to query and filter data by using the values in these fields.
Typically, a document contains the entire data for an entity. What items constitute an entity are
application specific. For example, an entity could contain the details of a customer, an order or
a combination of both. A single document may contain information that would be spread across
several relational tables in an RDBMS.
A document store does not require that all documents have the same structure. This free-form
approach provides a great deal of flexibility. Applications can store different data in documents
as business requirements change.
The application can retrieve documents by using the document key. This is a unique identifier for the
document, which is often hashed, to help distribute data evenly. Some document databases create
the document key automatically. Others enable you to specify an attribute of the document to use
as the key. The application can also query documents based on the value of one or more fields. Some
document databases support indexing to facilitate fast lookup of documents based on one or more
indexed fields.
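The following sketch, using the pymongo client against a hypothetical sales database on a local
MongoDB instance, shows the difference from a key/value store: because fields are visible to the
database, documents can be queried and filtered by field values.

    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("localhost", 27017)
    customers = client.sales.customers   # database and collection names are invented

    customers.insert_one({"name": "Alice", "city": "London",
                          "orders": [{"id": 1, "total": 42.0}]})

    # Filter on a field inside the document, not just on the document key.
    for doc in customers.find({"city": "London"}):
        print(doc["name"])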
Graph databases
A graph database stores two types of information, nodes and edges. You can think of nodes
as entities. Edges specify the relationships between nodes. Both nodes and edges can
have properties that provide information about that node or edge, similar to columns in a table.
Edges can also have a direction indicating the nature of the relationship.
The purpose of a graph database is to allow an application to efficiently perform queries that
traverse the network of nodes and edges and to analyse the relationships between entities. The
following diagram shows an organisation’s personnel database structured as a graph. The entities are
employees and departments, and the edges indicate reporting relationships and the department in
which employees work. In this graph, the arrows on the edges show the direction of the relationships.
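The kind of multi-hop traversal a graph database optimises can be sketched with a plain adjacency
list (a real graph database would expose this through a query language such as Gremlin or Cypher):

    # Nodes are employees; each edge records a "reports to" relationship.
    reports_to = {
        "alice": "carol",
        "bob": "carol",
        "carol": "dave",
    }

    def management_chain(employee: str) -> list:
        """Traverse the edges hop by hop to find every manager above an employee."""
        chain = []
        while employee in reports_to:
            employee = reports_to[employee]
            chain.append(employee)
        return chain

    print(management_chain("alice"))   # ['carol', 'dave']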
Column-family databases
A column-family database organises data into rows and columns. In its simplest form, a column-
family database can appear very similar to a relational database, at least conceptually. The real
power of a column-family database lies in its denormalised approach to structuring sparse data.
You can think of a column-family database as holding tabular data with rows and columns, but
the columns are divided into groups known as column families. Each column family holds a set of
columns that are logically related together and are typically retrieved or manipulated as a unit. Other
data that is accessed separately can be stored in separate column families. Within a column family,
new columns can be added dynamically, and rows can be sparse (that is, a row doesn’t need to
have a value for every column).
The following diagram shows an example with two column families, Identity and Contact Information.
The data for a single entity has the same row key in each column-family. This structure, where the rows
for any given object in a column family can vary dynamically, is an important benefit of the column-
family approach, making this form of data store highly suited for storing structured, volatile data.
Unlike a key/value store or a document database, most column-family databases store data in key
order, rather than by computing a hash. Many implementations allow you to create indexes over
specific columns in a column-family. Indexes let you retrieve data by column value, rather than
row key.
Read and write operations for a row are usually atomic within a single column-family, although
some implementations provide atomicity across the entire row, spanning multiple column-families.
Search engine databases
The key characteristics of a search engine database are the ability to store and index information
very quickly, and provide fast response times for search requests. Indexes can be multi-dimensional
and may support free-text searches across large volumes of text data. Indexing can be performed
using a pull model, triggered by the search engine database, or using a push model, initiated by
external application code.
Searching can be exact or fuzzy. A fuzzy search finds documents that match a set of terms and
calculates how closely they match. Some search engines also support linguistic analysis that can
return matches based on synonyms, genre expansions (for example, matching dogs to pets) and
stemming (matching words with the same root).
Time series databases
Time series databases are good for storing telemetry data. Scenarios include IoT sensors
or application/system counters.
Shared files
Sometimes, using simple flat files can be the most effective means of storing and retrieving
information. Using file shares enables files to be accessed across a network. Given appropriate
security and concurrent access control mechanisms, sharing data in this way can enable distributed
services to provide highly scalable data access for performing basic, low-level operations such
as simple read and write requests.
Data store
comparison
Criteria for choosing a data store
Azure supports many types of data storage solutions, each providing different features and
capabilities. This article describes the comparison criteria you should use when evaluating a
data store. The goal is to help you determine which data storage types can meet your solution’s
requirements.
General Considerations
To start your comparison, gather as much of the following information as you can about your data
needs. This information will help you to determine which data storage types will meet your needs.
Functional requirements
• Data format. What type of data are you intending to store? Common types include transactional
data, JSON objects, telemetry, search indexes or flat files.
• Data size. How large are the entities you need to store? Will these entities need to be
maintained as a single document or can they be split across multiple documents, tables,
collections and so forth?
• Scale and structure. What is the overall amount of storage capacity you need? Do you anticipate
partitioning your data?
• Consistency model. How important is it for updates made in one node to appear in other
nodes, before further changes can be made? Can you accept eventual consistency? Do you
need ACID guarantees for transactions?
• Schema flexibility. What kind of schemas will you apply to your data? Will you use a fixed
schema, a schema-on-write approach or a schema-on-read approach?
• Data movement. Will your solution need to perform ETL tasks to move data to other stores
or data warehouses?
• Data lifecycle. Is the data write-once, read-many? Can it be moved into cool or cold storage?
• Other supported features. Do you need any other specific features, such as schema validation,
aggregation, indexing, full-text search, MapReduce or other query capabilities?
Non-functional requirements
• Performance and scalability. What are your data performance requirements? Do you have
specific requirements for data ingestion rates and data processing rates? What are the acceptable
response times for querying and aggregation of data once ingested? How large will you need the
data store to scale up? Is your workload more read-heavy or write-heavy?
• Reliability. What overall SLA do you need to support? What level of fault-tolerance do you
need to provide for data consumers? What kind of backup and restore capabilities do you need?
• Replication. Will your data need to be distributed among multiple replicas or regions?
What kind of data replication capabilities do you require?
• Limits. Will the limits of a particular data store support your requirements for scale, number
of connections and throughput?
• Region availability. For managed services, is the service available in all Azure regions?
Does your solution need to be hosted in certain Azure regions?
• Portability. Will your data need to be migrated to on-premises, external datacentres or other cloud
hosting environments?
• Licensing. Do you have a preference for a proprietary versus OSS licence type? Are there any
other external restrictions on what type of licence you can use?
• Overall cost. What is the overall cost of using the service within your solution? How many
instances will need to run, to support your uptime and throughput requirements? Consider
operations costs in this calculation. One reason to prefer managed services is the reduced
operational cost.
• Cost effectiveness. Can you partition your data, to store it more cost effectively? For example,
can you move large objects out of an expensive relational database into an object store?
• Networking requirements. Do you need to restrict or otherwise manage access to your data
from other network resources? Does data need to be accessible only from inside the Azure
environment? Does the data need to be accessible from specific IP addresses or subnets? Does
it need to be accessible from applications or services hosted on-premises or in other external
datacentres?
DevOps
• Skill set. Are there particular programming languages, operating systems or other technology
that your team is particularly adept at using? Are there others that would be difficult for your
team to work with?
The following sections compare various data store models in terms of workload profile, data types
and example use cases.
Relational database management systems
Data type
• Data is highly normalised.
• Constraints are defined in the schema and imposed on any data in the database.
• Data requires high integrity. Indexes and relationships need to be maintained accurately.
Examples
• Line of business (human capital management, customer relationship management,
enterprise resource planning)
• Inventory management
• Reporting database
• Accounting
• Asset management
• Fund management
• Order management
Document databases
Workload
• General purpose.
• Insert and update operations are common. Both the creation of new records and updates
to existing data happen regularly.
• No object-relational impedance mismatch. Documents can better match the object structures
used in application code.
Data type
• Data can be managed in denormalised way.
• Size of individual document data is relatively small.
• Each document type can use its own schema.
• Documents can include optional fields.
• Document data is semi-structured, meaning that data types of each field are not strictly defined.
• Data aggregation is supported.
Key/value stores
Workload
• Data is identified and accessed using a single ID key, like a dictionary.
• Massively scalable.
• No joins, lock or unions are required.
• No aggregation mechanisms are used.
• Secondary indexes are generally not used.
Data type
• Data size tends to be large.
• Each key is associated with a single value, which is an unmanaged data BLOB.
• There is no schema enforcement.
• No relationships between entities.
Examples
• Data caching
• Session management
• User preference and profile management
• Product recommendation and ad serving
• Dictionaries
Graph databases
Workload
• The relationships between data items are very complex, involving many hops between related
data items.
• The relationship between data items are dynamic and change over time.
• Relationships between objects are first-class citizens, without requiring foreign-keys and
joins to traverse.
Data type
• Data consists of nodes and edges (relationships).
• Relationships are just as important as the nodes themselves and are exposed directly
by the query model.
Column-family databases
Workload
• Most column-family databases perform write operations extremely quickly.
• Update and delete operations are rare.
• Designed to provide high throughput and low-latency access.
• Supports easy query access to a particular set of fields within a much larger record.
• Massively scalable.
Data type
• Data is stored in tables consisting of a key column and one or more column families.
• Specific columns can vary by individual rows.
• Individual cells are accessed via get and put commands.
• Multiple rows are returned using a scan command.
Examples
• Recommendations
• Personalisation
• Sensor data
• Telemetry
• Messaging
• Social media analytics
• Web analytics
• Activity monitoring
• Weather and other time-series data
Search engine databases
Data type
• Semi-structured or unstructured
• Text
• Text with reference to structured data
Examples
• Product Catalogues
• Site search
• Logging
• Analytics
• Shopping sites
Data warehouse
Workload
• Data analytics
• Enterprise BI
Data type
• Historical data from multiple sources.
• Usually denormalised in a “star” or “snowflake” schema, consisting of fact and dimension tables.
• Usually loaded with new data on a scheduled basis.
• Dimension tables often include multiple historic versions of an entity, referred to as a
slowly changing dimension.
Examples
• An enterprise data warehouse that provides data for analytical models, reports and dashboards.
Time series databases
Data type
• A time stamp that is used as the primary key and sorting mechanism.
• Measurements from the entry or descriptions of what the entry represents.
• Tags that define additional information about the type, origin and other information
about the entry.
Examples
• Monitoring and event telemetry.
• Sensor or other IoT data.
Object storage
Workload
• Identified by key.
• Objects may be publicly or privately accessible.
• Content is typically an asset such as a spreadsheet, image or video file.
• Content must be durable (persistent) and external to any application tier or virtual machine.
Data type
• Data size is large.
• Blob data.
• Value is opaque.
Shared files
Workload
• Migration from existing apps that interact with the file system.
• Requires SMB interface.
Data type
• Files in a hierarchical set of folders.
• Accessible with standard I/O libraries.
Examples
• Legacy files.
• Shared content accessible among a number of VMs or app instances.
Design
your Azure
application:
design principles
Now that you have chosen your architecture and your compute and data
store technologies, you are ready to start designing and building your
cloud application. This section and the two following it provide guidance
and resources for optimal application design for the cloud.
This section describes ten design principles to keep in mind as you build. Following these principles
will help you build an application that is more scalable, resilient and manageable.
1. Design for self-healing. In a distributed system, failures happen. Design your application
to be self-healing when failures occur.
2. Give all things redundancy. Build redundancy into your application, to avoid having single
points of failure.
3. Minimise coordination. Minimise coordination between application services to achieve
scalability.
4. Design to scale out. Design your application so that it can scale horizontally, adding
or removing new instances as demand requires.
5. Partition around limits. Use partitioning to work around database, network and compute limits.
6. Design for operations. Design your application so that the operations team has the tools
they need.
7. Use managed services. When possible, use platform as a service (PaaS) rather than
infrastructure as a service (IaaS).
8. Use the best data store for the job. Pick the storage technology that is the best fit for your
data and how it will be used.
9. Design for evolution. All successful applications change over time. An evolutionary design
is key for continuous innovation.
10. Build for the needs of business. Every design decision must be justified by a business
requirement.
Design for self-healing
In a distributed system, failures happen. Design your application to be self-healing when failures
occur. This requires a three-pronged approach:
• Detect failures.
• Respond to failures gracefully.
• Log and monitor failures, to give operational insight.
How you respond to a particular type of failure may depend on your application’s availability
requirements. For example, if you require very high availability, you might automatically fail over
to a secondary region during a regional outage. However, that will incur a higher cost than a single-
region deployment.
Also, don’t just consider big events like regional outages, which are generally rare. You should focus
as much, if not more, on handling local, short-lived failures, such as network connectivity failures
or failed database connections.
Recommendations
Retry failed operations. For more information, see Retry Pattern and go to https://docs.microsoft.
com/en-us/azure/architecture/best-practices/transient-faults.
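A minimal sketch of a retry with exponential backoff (illustrative only; the linked guidance covers
which faults are safe to retry, how often and for how long):

    import time

    def with_retries(operation, attempts=4, base_delay=0.5):
        """Retry a transient failure, doubling the delay between attempts."""
        for attempt in range(attempts):
            try:
                return operation()
            except ConnectionError:
                if attempt == attempts - 1:
                    raise                              # give up after the final attempt
                time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...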
Protect failing remote services (Circuit Breaker). It’s good to retry after a transient failure, but if
the failure persists, you can end up with too many callers hammering a failing service. This can lead
to cascading failures, as requests back up. Use the Circuit Breaker Pattern to fail fast (without making
the remote call) when an operation is likely to fail.
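A much-simplified circuit breaker, shown only to illustrate the fail-fast behaviour (real
implementations add a half-open state, thread safety and richer failure classification):

    import time

    class CircuitBreaker:
        def __init__(self, threshold=5, reset_after=30.0):
            self.threshold, self.reset_after = threshold, reset_after
            self.failures, self.opened_at = 0, None

        def call(self, operation):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open: failing fast")  # skip the remote call
                self.failures, self.opened_at = 0, None               # allow a trial call
            try:
                result = operation()
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()                 # trip the breaker
                raise
            self.failures = 0
            return result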
Isolate critical resources (Bulkhead). Failures in one subsystem can sometimes cascade. This can
happen if a failure causes some resources, such as threads or sockets, not to get freed in a timely
manner, leading to resource exhaustion. To avoid this, partition a system into isolated groups,
so that a failure in one partition does not bring down the entire system.
Failover. If an instance can’t be reached, fail over to another instance. For things that are stateless,
like a web server, put several instances behind a load balancer or traffic manager. For things
that store state, like a database, use replicas and failover. Depending on the data store and how
it replicates, this may require the application to deal with eventual consistency.
Degrade gracefully. Sometimes you can’t work around a problem, but you can provide reduced
functionality that is still useful. Consider an application that shows a catalogue of books. If the
application can’t retrieve the thumbnail image for the cover, it might show a placeholder image.
Entire subsystems might be noncritical for the application. For example, in an e-commerce site,
showing product recommendations is probably less critical than processing orders.
Throttle clients. Sometimes a small number of users create excessive load, which can reduce your
application’s availability for other users. In this situation, throttle the client for a certain period of
time. See Throttling Pattern.
Block bad actors. Just because you throttle a client, it doesn’t mean the client was acting maliciously.
It just means the client exceeded their service quota. But if a client consistently exceeds their quota
or otherwise behaves badly, you might block them. Define an out-of-band process for users to request
getting unblocked.
Use leader election. When you need to coordinate a task, use Leader Election to select a
coordinator. That way, the coordinator is not a single point of failure. If the coordinator fails, a new
one is selected. Rather than implement a leader election algorithm from scratch, consider an off-the-
shelf solution such as Zookeeper.
Test with fault injection. All too often, the success path is well tested, but not the failure path.
A system could run in production for a long time before a failure path is exercised. Use fault injection
to test the resilience of the system to failures, either by triggering actual failures or by simulating them.
Embrace chaos engineering. Chaos engineering extends the notion of fault injection, by randomly
injecting failures or abnormal conditions into production instances.
For a structured approach to making your applications self-healing, see Design resilient applications
for Azure.
Give all things redundancy
Build redundancy into your application, to avoid having single points of failure.
Recommendations
Consider business requirements. The amount of redundancy built into a system can affect both
cost and complexity. Your architecture should be informed by your business requirements, such
as recovery time objective (RTO). For example, a multi-region deployment is more expensive
than a single-region deployment and is more complicated to manage. You will need operational
procedures to handle failover and failback. The additional cost and complexity might be justified
for some business scenarios and not others.
Place VMs behind a load balancer. Don’t use a single VM for mission-critical workloads. Instead,
place multiple VMs behind a load balancer. If any VM becomes unavailable, the load balancer
distributes traffic to the remaining healthy VMs. To learn how to deploy this configuration,
see Multiple VMs for scalability and availability.
Enable geo-replication. Geo-replication for Azure SQL Database and Cosmos DB creates secondary
readable replicas of your data in one or more secondary regions. In the event of an outage, the
database can fail over to the secondary region for writes. For more information about Azure SQL
database, go to https://docs.microsoft.com/en-us/azure/sql-database/sql-database-geo-replication-
overview. For more information about Cosmos DB, go to https://docs.microsoft.com/en-us/azure/
documentdb/documentdb-distribute-data-globally.
Partition for availability. Database partitioning is often used to improve scalability, but it can also
improve availability. If one shard goes down, the other shards can still be reached. A failure in one
shard will only disrupt a subset of the total transactions.
Deploy to more than one region. For the highest availability, deploy the application to more
than one region. That way, in the rare case when a problem affects an entire region, the application
can fail over to another region. The following diagram shows a multi-region application that uses
Azure Traffic Manager to handle failover.
Synchronise front and backend failover. Use Azure Traffic Manager to fail over the frontend.
If the frontend becomes unreachable in one region, Traffic Manager will route new requests to the
secondary region. Depending on your database solution, you may need to coordinate failing over
the database.
Use automatic failover, but manual failback. Use Traffic Manager for automatic failover, but not
for automatic failback. Automatic failback carries a risk that you might switch to the primary region
before the region is completely healthy. Instead, verify that all application subsystems are healthy
before manually failing back. Also, depending on the database, you might need to check data
consistency before failing back.
Include redundancy for Traffic Manager. Traffic Manager is a possible failure point. Review
the Traffic Manager SLA and determine whether using Traffic Manager alone meets your business
requirements for high availability. If not, consider adding another traffic management solution as
a failback. If the Azure Traffic Manager service fails, change your CNAME records in DNS to point
to the other traffic management service.
Minimise
coordination
Minimise coordination between application services to achieve scalability.
Most cloud applications consist of multiple application services – web front ends, databases, business
processes, reporting and analysis and so on. To achieve scalability and reliability, each of those
services should run on multiple instances.
What happens when two instances try to perform concurrent operations that affect some shared
state? In some cases, there must be coordination across nodes, for example to preserve ACID
guarantees. In this diagram, Node2 is waiting for Node1 to release a database lock.
Coordination limits the benefits of horizontal scale and creates bottlenecks. In this example, as you
scale out the application and add more instances, you’ll see increased lock contention. In the worst
case, the frontend instances will spend most of their time waiting on locks. "Exactly once" semantics
are another frequent source of coordination. For example, an order must be processed exactly once.
Two workers are listening for new orders. Worker 1 picks up an order for processing. The application
must ensure that Worker 2 doesn’t duplicate the work, but also if Worker 1 crashes, that the order
isn’t dropped.
Recommendations
Embrace eventual consistency. When data is distributed, it takes coordination to enforce strong
consistency guarantees. For example, suppose an operation updates two databases. Instead
of putting it into a single transaction scope, it’s better if the system can accommodate eventual
consistency, perhaps by using the Compensating Transaction pattern to logically roll back after
a failure.
Use domain events to synchronise state. A domain event is an event that records when something
happens that has significance within the domain. Interested services can listen for the event, rather
than using a global transaction to coordinate across multiple services. If this approach is used,
the system must tolerate eventual consistency (see previous item).
Consider patterns such as CQRS and event sourcing. These two patterns can help to reduce
contention between read workloads and write workloads.
• The CQRS pattern separates read operations from write operations. In some implementations,
the read data is physically separated from the write data.
• In the Event Sourcing pattern, state changes are recorded as a series of events to an append-only
data store. Appending an event to the stream is an atomic operation, requiring minimal locking.
These two patterns complement each other. If the write-only store in CQRS uses event sourcing,
the read-only store can listen for the same events to create a readable snapshot of the current state,
optimised for queries. Before adopting CQRS or event sourcing, however, be aware of the challenges
of this approach. For more information, see CQRS architecture style.
Partition data. Avoid putting all of your data into one data schema that is shared across many application
services. A microservices architecture enforces this principle by making each service responsible for its
own data store. Within a single database, partitioning the data into shards can improve concurrency,
because a service writing to one shard does not affect a service writing to a different shard.
Design idempotent operations. When possible, design operations to be idempotent. That way,
they can be handled using at-least-once semantics. For example, you can put work items in a queue.
If a worker crashes in the middle of an operation, another worker simply picks up the work item.
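To make this concrete, here is a minimal sketch of an idempotent worker in Python. It is illustrative
only: the in-memory processed_ids set stands in for a durable store, and the work item format is
hypothetical.

```python
# Minimal sketch of an idempotent worker (illustrative only).
# The in-memory set stands in for a durable store, such as a database
# table keyed on the work item ID.
processed_ids = set()

def process_order(item: dict) -> None:
    """Handle a work item delivered at least once, with idempotent effect."""
    if item["id"] in processed_ids:
        return  # Duplicate delivery: already handled, safe to ignore.
    # ... perform the actual work here ...
    processed_ids.add(item["id"])

# Delivering the same item twice leaves the system in the same state.
order = {"id": "order-42", "amount": 100}
process_order(order)
process_order(order)  # No effect the second time.
```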
Use asynchronous parallel processing. If an operation requires multiple steps that are performed
asynchronously (such as remote service calls), you might be able to call them in parallel and then
aggregate the results. This approach assumes that each step does not depend on the results of the
previous step.
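For example, the following sketch fans out two independent calls with Python's asyncio and
aggregates the results. The fetch functions are hypothetical stand-ins for remote service calls.

```python
import asyncio

async def fetch_price(product_id: str) -> float:
    await asyncio.sleep(0.1)  # Stand-in for a remote service call.
    return 9.99

async def fetch_stock(product_id: str) -> int:
    await asyncio.sleep(0.1)  # Stand-in for a remote service call.
    return 42

async def get_product_view(product_id: str) -> dict:
    # The two steps are independent, so run them in parallel and
    # aggregate the results once both complete.
    price, stock = await asyncio.gather(
        fetch_price(product_id), fetch_stock(product_id))
    return {"id": product_id, "price": price, "stock": stock}

print(asyncio.run(get_product_view("p-1")))
```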
Use optimistic concurrency when possible. Pessimistic concurrency control uses database locks
to prevent conflicts. This can cause poor performance and reduce availability. With optimistic
concurrency control, each transaction modifies a copy or snapshot of the data. When the transaction
is committed, the database engine validates the transaction and rejects any transactions that would
affect database consistency.
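One common implementation is a version (or ETag) check at commit time. The sketch below shows
the idea in isolation; a real data store enforces the check atomically on the server.

```python
# Illustrative optimistic concurrency using a version number.
class ConflictError(Exception):
    pass

record = {"value": "initial", "version": 1}

def commit(read_version: int, new_value: str) -> None:
    # Reject the write if another transaction committed since we read.
    if record["version"] != read_version:
        raise ConflictError("record was modified; retry the transaction")
    record["value"] = new_value
    record["version"] += 1

v = record["version"]          # Read the record and remember its version.
commit(v, "updated")           # Succeeds: no concurrent change.
try:
    commit(v, "stale update")  # Fails: the version has moved on.
except ConflictError as err:
    print(err)
```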
Consider MapReduce or other parallel, distributed algorithms. Depending on the data and
type of work to be performed, you may be able to split the work into independent tasks that can
be performed by multiple nodes working in parallel. See Big Compute Architecture Style.
Use leader election for coordination. In cases where you need to coordinate operations, make
sure the coordinator does not become a single point of failure in the application. Using the Leader
Election pattern, one instance is the leader at any time and acts as the coordinator. If the leader fails,
a new instance is elected to be the leader.
Design to scale out
Design your application so that it can scale horizontally.
A primary advantage of the cloud is elastic scaling – the ability to use as much capacity as you need,
scaling out as load increases and scaling in when the extra capacity is not needed. Design your
application so that it can scale horizontally, adding or removing new instances as demand requires.
Recommendations
Avoid instance stickiness. Stickiness, or session affinity, is when requests from the same client are
always routed to the same server. Stickiness limits the application’s ability to scale out. For example,
traffic from a high-volume user will not be distributed across instances. Causes of stickiness include
storing session state in memory and using machine-specific keys for encryption. Make sure that
any instance can handle any request.
Identify bottlenecks. Scaling out isn’t a magic fix for every performance issue. For example, if your
backend database is the bottleneck, it won’t help to add more web servers. Identify and resolve the
bottlenecks in the system first, before throwing more instances at the problem. Stateful parts of the
system are the most likely cause of bottlenecks.
Offload resource-intensive tasks. Tasks that require a lot of CPU or I/O resources should be moved
to background jobs when possible, to minimise the load on the front end that is handling user
requests.
Use built-in autoscaling features. Many Azure compute services have built-in support
for autoscaling. If the application has a predictable, regular workload, scale out on a schedule.
For example, scale out during business hours. Otherwise, if the workload is not predictable, use
performance metrics such as CPU or request queue length to trigger autoscaling. For autoscaling
best practices, go to https://docs.microsoft.com/en-us/azure/architecture/best-practices/auto-scaling.
Design for scale in. Remember that with elastic scale, the application will have periods of scale
in, when instances get removed. The application must gracefully handle instances being removed.
Here are some ways to handle scaling:
• Listen for shutdown events (when available) and shut down cleanly; see the sketch after this list.
• Clients/consumers of a service should support transient fault handling and retry.
• For long-running tasks, consider breaking up the work, using checkpoints or the Pipes and
Filters pattern.
• Put work items in a queue so that another instance can pick up the work, if an instance
is removed in the middle of processing.
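The following sketch illustrates the first point: a worker loop that listens for SIGTERM and stops
taking new work so the current item can finish. How shutdown is signalled varies by compute
service, so treat this as an assumption-laden outline rather than a recipe.

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Stop taking new work; let the current item finish cleanly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def next_work_item():
    time.sleep(0.1)  # Stand-in for dequeuing a work item.
    return object()

while not shutting_down:
    item = next_work_item()
    # ... process the item; keep units of work short so shutdown is
    # prompt, and rely on the queue to redeliver unfinished items ...
```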
Partition around limits
Use partitioning to work around database, network and compute limits.
In the cloud, all services have limits in their ability to scale up. Azure service limits are documented
in Azure subscription and service limits, quotas and constraints. Limits include number of cores,
database size, query throughput and network throughput. If your system grows sufficiently large,
you may hit one or more of these limits. Use partitioning to work around these limits.
• In horizontal partitioning, also called sharding, each partition holds data for a subset of the total
data set. The partitions share the same data schema. For example, customers whose names start
with A-M go into one partition, N-Z into another partition.
• In vertical partitioning, each partition holds a subset of the fields for the items in the data store.
For example, put frequently accessed fields in one partition and less frequently accessed fields
in another.
• In functional partitioning, data is partitioned according to how it is used by each bounded
context in the system. For example, store invoice data in one partition and product inventory
data in another. The schemas are independent.
Design the partition key to avoid hot spots. If you partition a database, but one shard still gets the
majority of the requests, then you haven’t solved your problem. Ideally, load gets distributed evenly
across all the partitions. For example, hash by customer ID and not the first letter of the customer
name, because some letters are more frequent. The same principle applies when partitioning
a message queue. Pick a partition key that leads to an even distribution of messages across the
set of queues. For more information, see Sharding.
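As a sketch of that advice, the function below hashes a customer ID to pick a partition, so load
spreads evenly regardless of how names cluster in the alphabet. In practice you would normally
use the partitioning support built into the data store or queue.

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(customer_id: str) -> int:
    # Hash the customer ID so that load spreads evenly across partitions,
    # rather than keying on the first letter of the customer name.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

print(partition_for("customer-12345"))  # Deterministic value in 0-7.
```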
Partition around Azure subscription and service limits. Individual components and services
have limits, but there are also limits for subscriptions and resource groups. For very large
applications, you might need to partition around those limits.
Partition at different levels. Consider a database server deployed on a VM. The VM has a VHD that
is backed by Azure Storage. The storage account belongs to an Azure subscription. Notice that each
step in the hierarchy has limits. The database server may have a connection pool limit. VMs have
CPU and network limits. Storage has IOPS limits. The subscription has limits on the number of VM
cores. Generally, it’s easier to partition lower in the hierarchy. Only large applications should need
to partition at the subscription level.
Design for operations
Design an application so that the operations team has the tools they need.
The cloud has dramatically changed the role of the operations team. They are no longer responsible
for managing the hardware and infrastructure that hosts the application. That said, operations
is still a critical part of running a successful cloud application. Some of the important functions
of the operations team include:
• Deployment
• Monitoring
• Escalation
• Incident response
• Security auditing
Robust logging and tracing are particularly important in cloud applications. Involve the operations
team in design and planning, to ensure the application gives them the data and insight they need
to be successful.
Recommendations
Make all things observable. Once a solution is deployed and running, logs and traces are your
primary insight into the system. Tracing records a path through the system and is useful to pinpoint
bottlenecks, performance issues and failure points. Logging captures individual events such as
application state changes, errors and exceptions. Log in production or else you lose insight at
the very times when you need it the most.
Instrument for monitoring. Monitoring gives insight into how well (or poorly) an application is
performing, in terms of availability, performance and system health. For example, monitoring tells
you whether you are meeting your SLA. Monitoring happens during the normal operation of the
system. It should be as close to real-time as possible, so that the operations staff can react to issues
quickly. Ideally, monitoring can help avert problems before they lead to a critical failure. For more
information, go to https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring.
Use distributed tracing. Use a distributed tracing system that is designed for concurrency,
asynchrony and cloud scale. Traces should include a correlation ID that flows across service
boundaries. A single operation may involve calls to multiple application services. If an operation fails,
the correlation ID helps to pinpoint the cause of the failure.
Standardise logs and metrics. The operations team will need to aggregate logs from across the
various services in your solution. If every service uses its own logging format, it becomes difficult or
impossible to get useful information from them. Define a common schema that includes fields such
as correlation ID, event name, IP address of the sender and so forth. Individual services can derive
custom schemas that inherit the base schema and contain additional fields.
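As an illustration, a shared base schema might be built like this; the field names are hypothetical,
and the point is only that every service emits the same core fields and adds its own on top.

```python
import datetime
import json

def base_log_event(correlation_id: str, event_name: str,
                   sender_ip: str, **custom_fields) -> str:
    """Build a structured log entry that follows a common base schema."""
    event = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "correlationId": correlation_id,
        "eventName": event_name,
        "senderIp": sender_ip,
    }
    event.update(custom_fields)  # Service-specific extensions.
    return json.dumps(event)

print(base_log_event("abc-123", "OrderReceived", "10.0.0.4", orderId="42"))
```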
Treat configuration as code. Check configuration files into a version control system, so that you
can track and version your changes, and roll back if needed.
Use managed services
When possible, use platform as a service (PaaS) rather than infrastructure as a service (IaaS).
IaaS is like having a box of parts. You can build anything, but you have to assemble it yourself.
Managed services are easier to configure and administer. You don’t need to provision VMs, set up
VNets, manage patches and updates and all of the other overhead associated with running software
on a VM.
For example, suppose your application needs a message queue. You could set up your own
messaging service on a VM, using something like RabbitMQ. But Azure Service Bus already
provides reliable messaging as a service, and it’s simpler to set up. Just create a Service Bus namespace
(which can be done as part of a deployment script) and then call Service Bus using the client SDK.
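For instance, sending a message with the Python client (the azure-servicebus package) is a few
lines. This sketch assumes a namespace and a queue named "orders" already exist, and that you
supply a real connection string.

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<your Service Bus connection string>"  # Placeholder.
QUEUE_NAME = "orders"  # Assumed queue, created as part of deployment.

# Send a message without provisioning or patching any VMs.
with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_sender(queue_name=QUEUE_NAME) as sender:
        sender.send_messages(ServiceBusMessage("order-42 received"))
```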
Of course, your application may have specific requirements that make an IaaS approach more
suitable. However, even if your application is based on IaaS, look for places where it may be
natural to incorporate managed services. These include cache, queues and data storage.
Use the best data store for the job
Pick the storage technology that is the best fit for your data and how it will be used.
Gone are the days when you would just stick all of your data into a big relational SQL database.
Relational databases are very good at what they do – providing ACID guarantees for transactions
over relational data – but they come with costs.
In any large solution, it’s likely that a single data store technology won’t fill all your needs.
Alternatives to relational databases include key/value stores, document databases, search engine
databases, time series databases, column family databases and graph databases. Each has pros and
cons, and different types of data fit more naturally into one or another.
For example, you might store a product catalogue in a document database, such as Cosmos DB,
which allows for a flexible schema. In that case, each product description is a self-contained document.
For queries over the entire catalogue, you might index the catalogue and store the index in Azure
Search. Product inventory might go into a SQL database, because that data requires ACID guarantees.
Remember that data includes more than just the persisted application data. It also includes
application logs, events, messages and caches.
Recommendations
Embrace polyglot persistence. In any large solution, it’s likely that a single data store technology
won’t fill all your needs.
Consider the type of data. For example, put transactional data into SQL, put JSON documents
into a document database, put telemetry data into a time series database, put application logs
in Elasticsearch and put blobs in Azure Blob Storage.
Prefer availability over (strong) consistency. The CAP theorem implies that a distributed system
must make trade-offs between availability and consistency. (Network partitions, the other leg of the
CAP theorem, can never be completely avoided.) Often, you can achieve higher availability by
adopting an eventual consistency model.
Consider the skill set of the development team. There are advantages to using polyglot
persistence, but it’s possible to go overboard. Adopting a new data storage technology requires a
new set of skills. The development team must understand how to get the most out of the technology.
They must understand appropriate usage patterns, how to optimise queries, tune for performance
and so on. Factor this in when considering storage technologies.
Use compensating transactions. A side effect of polyglot persistence is that a single transaction
might write data to multiple stores. If something fails, use compensating transactions to undo
any steps that already completed.
Look at bounded contexts. Bounded context is a term from domain driven design. A bounded
context is an explicit boundary around a domain model and defines which parts of the domain the
model applies to. Ideally, a bounded context maps to a subdomain of the business domain. The
bounded contexts in your system are a natural place to consider polyglot persistence. For example,
“products” may appear in both the Product Catalogue subdomain and the Product Inventory
subdomain, but it’s very likely that these two subdomains have different requirements for storing,
updating and querying products.
Design for evolution
An evolutionary design is key for continuous innovation.
All successful applications change over time, whether to fix bugs, add new features, bring in new
technologies or make existing systems more scalable and resilient. If all the parts of an application
are tightly coupled, it becomes very hard to introduce changes into the system. A change in one part
of the application may break another part or cause changes to ripple through the entire codebase.
This problem is not limited to monolithic applications. An application can be deconstructed into
services, but still exhibit the sort of tight coupling that leaves the system rigid and brittle. But when
services are designed to evolve, teams can innovate and continuously deliver new features.
Microservices are becoming a popular way to achieve an evolutionary design, because they address
many of the considerations listed here.
Recommendations
Enforce high cohesion and loose coupling. A service is cohesive if it provides functionality that
logically belongs together. Services are loosely coupled if you can change one service without
changing the other. High cohesion generally means that changes in one function will require changes
in other related functions, so functions that change together should live in the same service. If you
find that updating a service requires coordinated updates to other services, it may be a sign that
your services are not cohesive. One of the goals of domain-driven design (DDD) is to identify those
boundaries.
Encapsulate domain knowledge. When a client consumes a service, the responsibility for enforcing
the business rules of the domain should not fall on the client. Instead, the service should encapsulate
all of the domain knowledge that falls under its responsibility. Otherwise, every client has to enforce
the business rules and you end up with domain knowledge spread across different parts of the
application.
Don’t build domain knowledge into a gateway. Gateways can be useful in a microservices
architecture, for things like request routing, protocol translation, load balancing or authentication.
However, the gateway should be restricted to this sort of infrastructure functionality. It should not
implement any domain knowledge, to avoid becoming a heavy dependency.
Design and test against service contracts. When services expose well-defined APIs, you can
develop and test against those APIs. That way, you can develop and test an individual service without
spinning up all of its dependent services. (Of course, you would still perform integration and load
testing against the real services.)
Abstract infrastructure away from domain logic. Don’t let domain logic get mixed up with
infrastructure-related functionality, such as messaging or persistence. Otherwise, changes in the
domain logic will require updates to the infrastructure layers and vice versa.
Offload cross-cutting concerns to a separate service. For example, if several services need to
authenticate requests, you could move this functionality into its own service. Then you could evolve
the authentication service – for example, by adding a new authentication flow – without touching
any of the services that use it.
Deploy services independently. When the DevOps team can deploy a single service independently
of other services in the application, updates can happen more quickly and safely. Bug fixes and new
features can be rolled out at a more regular cadence. Design both the application and the release
process to support independent updates.
Build for the needs of the business
Every design decision must be justified by a business requirement.
This design principle may seem obvious, but it’s crucial to keep in mind when designing a solution.
Do you anticipate millions of users or a few thousand? Is a one-hour application outage acceptable?
Do you expect large bursts in traffic or a very predictable workload? Ultimately, every design decision
must be justified by a business requirement.
Recommendations
Define business objectives, including the recovery time objective (RTO), recovery point objective (RPO)
and maximum tolerable outage (MTO). These numbers should inform decisions about the architecture.
For example, to achieve a low RTO, you might implement automated failover to a secondary region.
But if your solution can tolerate a higher RTO, that degree of redundancy might be unnecessary.
Document service level agreements (SLA) and service level objectives (SLO), including
availability and performance metrics. You might build a solution that delivers 99.95% availability.
Is that enough? The answer is a business decision.
Model the application around the business domain. Start by analysing the business requirements.
Use these requirements to model the application. Consider using a domain-driven design (DDD)
approach to create domain models that reflect the business processes and use cases.
Capture both functional and non-functional requirements. Functional requirements let you judge
whether the application does the right thing. Non-functional requirements let you judge whether the
application does those things well. In particular, make sure that you understand your requirements
for scalability, availability and latency. These requirements will influence design decisions and choice
of technology.
Deconstruct by workload. The term “workload” in this context means a discrete capability or
computing task, which can be logically separated from other tasks. Different workloads may have
different requirements for availability, scalability, data consistency and disaster recovery.
Manage costs. In a traditional on-premises application, you pay upfront for hardware (CAPEX).
In a cloud application, you pay for the resources that you consume. Make sure that you understand
the pricing model for the services that you consume. The total cost will include network bandwidth
usage, storage, IP addresses, service consumption and other factors. See Azure pricing for
more information. Also consider your operations costs. In the cloud, you don’t have to manage
the hardware or other infrastructure, but you still need to manage your applications, including
DevOps, incident response, disaster recovery and so forth.
Designing resilient applications for Azure
In a cloud environment, you scale out rather than buying higher-end hardware to scale up. Costs
are kept low through commodity hardware, and the goal is to minimise the effect of a failure.
In a distributed system, failures will happen. Hardware can fail. The network can have transient
failures. Rarely, an entire service or region may experience a disruption, but even those must
be planned for.
Building a reliable application in the cloud is different than building a reliable application in an
enterprise setting. While historically you may have purchased higher-end hardware to scale up,
in a cloud environment you must scale out instead of scaling up. Costs for cloud environments
are kept low through the use of commodity hardware. Instead of focusing on preventing failures
and optimising “mean time between failures”, in this new environment the focus shifts to “mean
time to restore”. The goal is to minimise the effect of a failure.
This article provides an overview of how to build resilient applications in Microsoft Azure. It starts
with a definition of the term resilience and related concepts. Then it describes a process for
achieving resilience, using a structured approach over the lifetime of an application, from design
and implementation to deployment and operations.
What is resilience?
Resilience is the ability of a system to recover from failures and continue to function. It’s not about
avoiding failures, but responding to failures in a way that avoids downtime or data loss. The goal
of resilience is to return the application to a fully functioning state following a failure.
Two important aspects of resilience are high availability and disaster recovery.
• High availability (HA) is the ability of the application to continue running in a healthy state,
without significant downtime. By “healthy state”, we mean the application is responsive and
users can connect to the application and interact with it.
• Disaster recovery (DR) is the ability to recover from rare but major incidents: non-transient,
wide-scale failures, such as a service disruption that affects an entire region.
One way to think about HA versus DR is that DR starts when the impact of a fault exceeds the ability
of the HA design to handle it. For example, putting several VMs behind a load balancer will provide
availability if one VM fails, but not if they all fail at the same time.
When you design an application to be resilient, you have to understand your availability
requirements. How much downtime is acceptable? This is partly a function of cost. How much will
potential downtime cost your business? How much should you invest in making the application
highly available? You also have to define what it means for the application to be available. For
example, is the application “down” if a customer can submit an order, but the system cannot process
it within the normal timeframe? Also consider the probability of a particular type of outage occurring
and whether a mitigation strategy is cost-effective.
Another common term is business continuity (BC), which is the ability to perform essential business
functions during and after adverse conditions, such as a natural disaster or a downed service. BC
covers the entire operation of the business, including physical facilities, people, communications,
transportation and IT. This article focuses on cloud applications, but resilience planning must be
done in the context of overall BC requirements. For more information, see the Contingency Planning
Guide from the National Institute of Standards and Technology (NIST).
In the remainder of this article, we discuss each of these steps in more detail.
Resilience planning starts with business requirements. One approach is to decompose the solution
by workload. Different workloads might have different requirements for availability, scalability,
data consistency, disaster recovery and so forth. Again, these are business decisions.
Also consider usage patterns. Are there certain critical periods when the system must be available?
For example, a tax-filing service can’t go down right before the filing deadline, a video streaming
service must stay up during a big sports event and so on. During the critical periods, you might have
redundant deployments across several regions, so the application could fail over if one region failed.
However, a multi-region deployment is more expensive, so during less critical times, you might run
the application in a single region.
• Recovery time objective (RTO) is the maximum acceptable time that the application can be
unavailable after an incident.
• Recovery point objective (RPO) is the maximum duration of data loss that is acceptable during
a disaster. For example, if you store data in a single database, with no replication to other
databases, and perform hourly backups, you could lose up to an hour of data.
RTO and RPO are business requirements. Conducting a risk assessment can help you define the
application’s RTO and RPO. Another common metric is mean time to recover (MTTR), which is the
average time that it takes to restore the application after a failure. MTTR is an empirical fact about
a system. If MTTR exceeds the RTO, then a failure in the system will cause an unacceptable business
disruption, because it won’t be possible to restore the system within the defined RTO.
SLAs
In Azure, the Service Level Agreement (SLA) describes Microsoft’s commitments for uptime and
connectivity. If the SLA for a particular service is 99.9%, it means you should expect the service
to be available 99.9% of the time.
Note: the Azure SLA also includes provisions for obtaining a service credit if the SLA is not met,
along with specific definitions of “availability” for each service. That aspect of the SLA acts as
an enforcement policy.
Of course, higher availability is better, everything else being equal. But as you strive for more 9s,
the cost and complexity to achieve that level of availability grows. An uptime of 99.99% translates
to about 5 minutes of total downtime per month. Is it worth the additional complexity and cost
to reach four 9s? The answer depends on the business requirements.
• To achieve four 9s (99.99%), you probably can’t rely on manual intervention to recover
from failures. The application must be self-diagnosing and self-healing.
• Beyond four 9s, it is challenging to detect outages quickly enough to meet the SLA.
• Think about the time window that your SLA is measured against. The smaller the window, the
tighter the tolerances. It probably doesn’t make sense to define your SLA in terms of hourly
or daily uptime.
Composite SLAs
Consider an App Service web app that writes to Azure SQL Database. At the time of this writing,
these Azure services have the following SLAs:
• App Service: 99.95%
• SQL Database: 99.99%
What is the maximum downtime you would expect for this application? If either service fails, the
whole application fails. In general, the probability of each service failing is independent, so the
composite SLA for this application is 99.95% × 99.99% = 99.94%. That’s lower than the individual
SLAs, which isn’t surprising, because an application that relies on multiple services has more potential
failure points.
On the other hand, you can improve the composite SLA by creating independent fallback paths.
For example, if SQL Database is unavailable, put transactions into a queue, to be processed later.
But there are trade-offs to this approach. The application logic is more complex, you are paying
for the queue, and there may be data consistency issues to consider.
SLA for multi-region deployments. Another HA technique is to deploy the application in more than
one region and use Azure Traffic Manager to fail over if the application fails in one region. For a two-
region deployment, the composite SLA is calculated as follows.
Let N be the composite SLA for the application deployed in one region. The expected chance that the
application will fail in both regions at the same time is (1 − N) × (1 − N).
Therefore,
• Combined SLA for both regions = 1 − (1 − N)(1 − N) = N + (1 − N)N
Finally, you must factor in the SLA for Traffic Manager itself, which at the time of this writing
is 99.99%.
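The arithmetic above can be checked with a short script; the 99.95% and 99.99% figures are the
example SLAs quoted earlier in this section.

```python
# Composite SLA arithmetic from the examples in this section.
app_service = 0.9995   # Example App Service SLA.
sql_db = 0.9999        # Example SQL Database SLA.

# Serial dependency: the app is up only if both services are up.
single_region = app_service * sql_db
print(f"Single region: {single_region:.4%}")        # ~99.94%

# Two independent regions fail together with probability (1 - N)^2.
two_regions = 1 - (1 - single_region) ** 2
print(f"Two regions:   {two_regions:.6%}")          # ~99.9999%

# Traffic Manager sits in front, so its SLA multiplies in as well.
traffic_manager = 0.9999
print(f"With Traffic Manager: {two_regions * traffic_manager:.4%}")
```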
Also, failing over is not instantaneous, which can result in some downtime during a failover.
See Traffic Manager endpoint monitoring and failover.
The calculated SLA number is a useful baseline, but it doesn’t tell the whole story about availability.
Often, an application can degrade gracefully when a non-critical path fails. Consider an application
that shows a catalogue of books. If the application can’t retrieve the thumbnail image for the
cover, it might show a placeholder image. In that case, failing to get the image does not reduce
the application’s uptime, although it affects the user experience.
During the design phase, you should perform a failure mode analysis (FMA). The goal of an FMA is to
identify possible points of failure and define how the application will respond to those failures.
For more information about the FMA process, with specific recommendations for Azure, see Azure
resilience guidance: Failure mode analysis.
Resilience strategies
This section provides a survey of some common resilience strategies. Most of these are not limited
to a particular technology. The descriptions in this section summarise the general idea behind each
technique, with links to further reading.
Retry transient failures
Transient failures can be caused by a momentary loss of network connectivity, a dropped database
connection or a timeout when a service is busy. Often, a transient failure can be resolved simply by
retrying the request. However, each retry attempt adds to the total latency. Also, too many failed
requests can cause a bottleneck, as pending requests accumulate in the queue. These blocked
requests might hold critical system resources such as memory, threads, database connections and
so on, which can cause cascading failures. To avoid this, increase the delay between each retry
attempt and limit the total number of failed requests.
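A minimal retry helper with exponential backoff might look like the following sketch; in practice,
prefer the retry support built into the relevant client SDK, and only treat errors you know to be
transient as retryable.

```python
import random
import time

class TransientError(Exception):
    pass

def call_with_retries(operation, max_attempts=5, base_delay=0.5):
    """Retry an operation with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure.
            # Back off exponentially to avoid hammering a struggling
            # service; jitter prevents synchronised retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```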
Load balance across instances
For scalability, a cloud application should be able to scale out by adding more instances.
This approach also improves resilience, because unhealthy instances can be removed from rotation.
For example:
• Put two or more VMs behind a load balancer. The load balancer distributes traffic to all the VMs.
See Run load-balanced VMs for scalability and availability.
• Scale out an Azure App Service app to multiple instances. App Service automatically balances
load across instances. See Basic web application.
• Use Azure Traffic Manager to distribute traffic across a set of endpoints.
Replicate data
Replicating data is a general strategy for handling non-transient failures in a data store. Many storage
technologies provide built-in replication, including Azure SQL Database, Cosmos DB and Apache
Cassandra.
It’s important to consider both the read and write paths. Depending on the storage technology,
you might have multiple writable replicas or a single writable replica and multiple read-only replicas.
To maximise availability, replicas can be placed in multiple regions. However, this increases the latency
when replicating the data. Typically, replicating across regions is done asynchronously, which implies
an eventual consistency model and potential data loss if a replica fails.
Degrade gracefully
If a service fails and there is no failover path, the application may be able to degrade gracefully
while still providing an acceptable user experience.
Throttle high-volume users
When a single client makes an excessive number of requests, the application might throttle the client
for a certain period of time. During the throttling period, the application refuses some or all of the
requests from that client (depending on the exact throttling strategy). The threshold for throttling
might depend on the customer’s service tier.
Throttling does not imply the client was necessarily acting maliciously, only that it exceeded its
service quota. In some cases, a consumer might consistently exceed their quota or otherwise behave
badly. In that case, you might go further and block the user. Typically, this is done by blocking an API
key or an IP address range.
Use a circuit breaker
The Circuit Breaker pattern can prevent an application from repeatedly trying an operation that is
likely to fail. This is similar to a physical circuit breaker, a switch that interrupts the flow of current
when a circuit is overloaded.
• Closed. This is the normal state. The circuit breaker sends requests to the service and a counter
tracks the number of recent failures. If the failure count exceeds a threshold within a given time
period, the circuit breaker switches to the Open state.
• Open. In this state, the circuit breaker immediately fails all requests, without calling the service.
The application should use a mitigation path, such as reading data from a replica or simply
returning an error to the user. When the circuit breaker switches to Open, it starts a timer.
When the timer expires, the circuit breaker switches to the Half-open state.
• Half-open. In this state, the circuit breaker lets a limited number of requests go through to the
service. If they succeed, the service is assumed to be recovered and the circuit breaker switches
back to the Closed state. Otherwise, it reverts to the Open state. The Half-open state prevents
a recovering service from suddenly being inundated with requests.
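The state machine above can be sketched in a few lines of Python. This is deliberately simplified:
a production circuit breaker also needs thread safety, per-endpoint state and a limited number of
Half-open trial requests rather than one.

```python
import time

class CircuitBreaker:
    """Simplified circuit breaker: Closed -> Open -> Half-open."""

    def __init__(self, failure_threshold=3, open_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is Closed.

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.open_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Timer expired: Half-open, allow one trial request through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # Trip to Open.
            raise
        # Success: reset to the Closed state.
        self.failures = 0
        self.opened_at = None
        return result
```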
Use load levelling to smooth out spikes in traffic
If requests arrive faster than the backend service can process them, requests back up and eventually
fail. To avoid this, you can use a queue as a buffer. When there is a new work item, instead of calling
the backend service immediately, the application queues a work item to run asynchronously. The
queue acts as a buffer that smooths out peaks in the load.
Isolate critical resources
A failure in one subsystem can sometimes cascade and affect the rest of the system. To avoid this,
you can partition a system into isolated groups, so that a failure in one partition does not bring
down the entire system. This technique is sometimes called the Bulkhead pattern.
Examples:
• Partition a database (for example, by tenant) and assign a separate pool of web server instances
for each partition.
• Use separate thread pools to isolate calls to different services. This helps to prevent cascading
failures if one of the services fails. For an example, see the Netflix Hystrix library.
• Use containers to limit the resources available to a particular subsystem.
Apply compensating transactions
A compensating transaction undoes the effects of another completed operation. For example,
to book a trip, a customer might reserve a car, a hotel room and a flight. If any of these
steps fails, the entire operation fails. Instead of trying to use a single distributed transaction for the
entire operation, you can define a compensating transaction for each step. For example, to undo a
car reservation, you cancel the reservation. In order to complete the whole operation, a coordinator
executes each step. If any step fails, the coordinator applies compensating transactions to undo any
steps that were completed.
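A minimal coordinator for that trip-booking example might look like the sketch below; the booking
and cancellation functions are hypothetical stand-ins.

```python
# Illustrative compensating-transaction coordinator.
def book_car(): print("car booked")
def cancel_car(): print("car cancelled")
def book_hotel(): print("hotel booked")
def cancel_hotel(): print("hotel cancelled")
def book_flight(): raise RuntimeError("no seats left")  # Simulated failure.

# Each step is paired with a compensation that undoes it.
steps = [(book_car, cancel_car),
         (book_hotel, cancel_hotel),
         (book_flight, None)]

completed = []
try:
    for action, compensate in steps:
        action()
        completed.append(compensate)
except RuntimeError as err:
    print(f"step failed ({err}); compensating")
    # Undo the completed steps in reverse order.
    for compensate in reversed(completed):
        if compensate is not None:
            compensate()
```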
Testing for resilience
Generally, you can’t test resilience in the same way that you test application functionality (by running
unit tests and so on). Instead, you must test how the end-to-end workload performs under failure
conditions that only occur intermittently.
Testing is an iterative process. Test the application, measure the outcome, analyse and address any
failures that result, and repeat the process.
Fault injection testing. Test the resilience of the system during failures, either by triggering actual
failures or by simulating them. Common failure scenarios to test include shutting down VM instances,
crashing processes, expiring certificates, changing access keys and limiting available system resources.
Measure the recovery times and verify that your business requirements are met. Test combinations
of failure modes as well. Make sure that failures don’t cascade and are handled in an isolated way.
This is another reason why it’s important to analyse possible failure points during the design phase.
The results of that analysis should be input into your test plan.
Load testing. Load test the application using a tool such as Visual Studio Team Services or Apache
JMeter. Load testing is crucial for identifying failures that only happen under load, such as the
backend database being overwhelmed or service throttling. Test for peak load, using production
data or synthetic data that is as close to production data as possible. The goal is to see how the
application behaves under real-world conditions.
Resilient deployment
Once an application is deployed to production, updates are a possible source of errors. In the worst
case, a bad update can cause downtime. To avoid this, the deployment process must be predictable
and repeatable. Deployment includes provisioning Azure resources, deploying application code and
applying configuration settings. An update may involve all three or a subset.
The crucial point is that manual deployments are prone to error. Therefore, it’s recommended to
have an automated, idempotent process that you can run on demand and re-run if something fails.
Another question is how to roll out an application update. We recommend techniques such as
blue-green deployment or canary releases, which push updates in a highly controlled way to minimise
possible impacts from a bad deployment.
Whatever approach you take, make sure that you can roll back to the last-known-good deployment,
in case the new version is not functioning. Also, if errors occur, the application logs must indicate
which version caused the error.
Monitoring and diagnostics
Monitoring a large-scale distributed system poses a significant challenge. Think about an application
that runs on a few dozen VMs – it’s not practical to log into each VM, one at a time, and look through
log files, trying to troubleshoot a problem. Moreover, the number of VM instances is probably not
static. VMs get added and removed as the application scales in and out, and occasionally an instance
may fail and need to be reprovisioned. In addition, a typical cloud application might use multiple
data stores (Azure storage, SQL Database, Cosmos DB, Redis cache) and a single user action may
span multiple subsystems.
You can think of the monitoring and diagnostics process as a pipeline with several distinct stages:
• Collection and storage. Raw instrumentation data can be held in various locations and with
various formats (e.g., application trace logs, IIS logs, performance counters). These disparate
sources are collected, consolidated and put into reliable storage.
• Analysis and diagnosis. After the data is consolidated, it can be analysed to troubleshoot
issues and provide an overall view of application health.
• Visualisation and alerts. In this stage, telemetry data is presented in such a way that an
operator can quickly notice problems or trends. Examples include dashboards or email alerts.
Monitoring is not the same as failure detection. For example, your application might detect
a transient error and retry, resulting in no downtime. But it should also log the retry operation,
so that you can monitor the error rate, in order to get an overall picture of application health.
Application logs are an important source of diagnostics data. Best practices for application logging
include:
• Log in production. Otherwise, you lose insight where you need it most.
• Log events at service boundaries. Include a correlation ID that flows across service boundaries.
If a transaction flows through multiple services and one of them fails, the correlation ID will help
you pinpoint why the transaction failed.
• Use semantic logging, also known as structured logging. Unstructured logs make it hard
to automate the consumption and analysis of the log data, which is needed at cloud scale.
• Use asynchronous logging. Otherwise, the logging system itself can cause the application
to fail by causing requests to back up, as they block while waiting to write a logging event.
• Application logging is not the same as auditing. Auditing may be done for compliance or
regulatory reasons. As such, audit records must be complete and it’s not acceptable to drop
any while processing transactions. If an application requires auditing, this should be kept
separate from diagnostics logging.
For more information about monitoring and diagnostics, see Monitoring and diagnostics guidance.
• Alerts. Monitor your application for warning signs that may require proactive intervention.
For example, if you see that SQL Database or Cosmos DB consistently throttles your application,
you might need to increase your database capacity or optimise your queries. In this example,
even though the application might handle the throttling errors transparently, your telemetry
should still raise an alert so that you can follow up.
• Manual failover. Some systems cannot fail over automatically and require a manual failover.
• Disaster recovery. Document and test your disaster recovery plan. Evaluate the business impact
of application failures. Automate the process as much as possible and document any manual steps,
such as manual failover or data restoration from backups. Regularly test your disaster recovery
process to validate and improve the plan.
Summary
This article discussed resilience from a holistic perspective, emphasising some of the unique
challenges of the cloud. These include the distributed nature of cloud computing, the use
of commodity hardware and the presence of transient network faults.
Here are the major points to take away from this article:
• Resilience leads to higher availability and lower mean time to recover from failures.
• Achieving resilience in the cloud requires a different set of techniques from traditional
on-premises solutions.
• Resilience does not happen by accident. It must be designed and built in from the start.
• Resilience touches every part of the application lifecycle, from planning and coding to
operations.
• Test and monitor.
Design your Azure application: Use these pillars of quality
Scalability, availability, resilience, management and security are the five pillars of quality software.
Focusing on these pillars will help you design a successful cloud application. You can use the
checklists in this guide to review your application against these pillars.
Availability Define a service level objective (SLO) that clearly states the expected availability
and how it is measured; use the critical path to define what “available” means.
An extra percentage point of availability can add up to additional hours or days
of uptime over a year.
Resilience Cloud applications have occasional failures and must be built to recover from them.
Build resilience mitigations into your application at all levels. Focus on tactical
mitigations first and include monitoring.
Management Automate deployments and make them a fast and routine process to speed the
release of new features or bug fixes. Build in roll-back or roll-forward operations,
and include monitoring and diagnostics.
Scalability
Scalability is the ability of a system to handle increased load. There are two main ways that an
application can scale. Vertical scaling (scaling up) means increasing the capacity of a resource,
for example by using a larger VM size. Horizontal scaling (scaling out) is adding new instances
of a resource, such as VMs or database replicas.
An advantage of vertical scaling is that you can do it without making any changes to the application.
But at some point you’ll hit a limit, where you can’t scale up anymore. At that point, any further
scaling must be horizontal.
Horizontal scale must be designed into the system. For example, you can scale out VMs by placing
them behind a load balancer. But each VM in the pool must be able to handle any client request,
so the application must be stateless or store state externally (say, in a distributed cache). Managed
PaaS services often have horizontal scaling and auto-scaling built in. The ease of scaling these
services is a major advantage of using PaaS services.
Always conduct performance and load testing to find potential bottlenecks. The stateful
parts of a system, such as databases, are the most common cause of bottlenecks, and require
careful design to scale horizontally. Resolving one bottleneck may reveal other bottlenecks elsewhere.
Use the Scalability checklist to review your design from a scalability standpoint.
Scalability guidance
• Design patterns for scalability and performance
Best practices
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/auto-scaling
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/background-jobs
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/caching
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/cdn
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/data-partitioning
A cloud application should have a service level objective (SLO) that clearly defines the expected
availability and how the availability is measured. When defining availability, look at the critical path.
The web front-end might be able to service client requests, but if every transaction fails because
it can’t connect to the database, the application is not available to users.
Availability is often described in terms of “9s” – for example, “four 9s” means 99.99% uptime.
The following table shows the potential cumulative downtime at different availability levels.
SLA Downtime per week Downtime per month Downtime per year
99% 1.68 hours 7.2 hours 3.65 days
99.9% 10.1 minutes 43.2 minutes 8.76 hours
99.95% 5 minutes 21.6 minutes 4.38 hours
99.99% 1.01 minutes 4.32 minutes 52.56 minutes
99.999% 6 seconds 25.9 seconds 5.26 minutes
Notice that 99% uptime could translate to an almost two-hour service outage per week. For many
applications, especially consumer-facing applications, that is not an acceptable SLO. On the other
hand, five 9s (99.999%) means no more than 5 minutes of downtime in a year. It’s challenging enough
just detecting an outage that quickly, let alone resolving the issue. To get very high availability
(99.99% or higher), you can’t rely on manual intervention to recover from failures. The application
must be self-diagnosing and self-healing, which is where resilience becomes crucial.
Availability guidance
• Design patterns for availability
Best practices
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/auto-scaling
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/background-jobs
In traditional application development, there has been a focus on reducing mean time between
failures (MTBF). Effort was spent trying to prevent the system from failing. In cloud computing,
a different mind-set is required, due to several factors:
• Distributed systems are complex and a failure at one point can potentially cascade throughout
the system.
• Costs for cloud environments are kept low through the use of commodity hardware,
so occasional hardware failures must be expected.
• Applications often depend on external services, which may become temporarily unavailable
or throttle high-volume users.
• Today’s users expect an application to be available 24/7 without ever going offline.
All of these factors mean that cloud applications must be designed to expect occasional failures and
recover from them. Azure has many resilience features already built into the platform. For example,
• Azure Storage, SQL Database and Cosmos DB all provide built-in data replication, both within
a region and across regions.
• Azure Managed Disks are automatically placed in different storage scale units, to limit the effects
of hardware failures.
• VMs in an availability set are spread across several fault domains. A fault domain is a group
of VMs that share a common power source and network switch. Spreading VMs across fault
domains limits the impact of physical hardware failures, network outages or power interruptions.
That said, you still need to build resilience into your application. Resilience strategies can be applied
at all levels of the architecture. Some mitigations are more tactical in nature – for example, retrying a
remote call after a transient network failure. Other mitigations are more strategic, such as failing over
the entire application to a secondary region. Tactical mitigations can make a big difference. While it’s
rare for an entire region to experience a disruption, transient problems such as network congestion
are more common – so target these first. Having the right monitoring and diagnostics is also
important, both to detect failures when they happen and to find the root causes.
When designing an application to be resilient, you must understand your availability requirements.
How much downtime is acceptable? This is partly a function of cost. How much will potential
downtime cost your business? How much should you invest in making the application highly
available?
Use the Resilience checklist to review your design from a resilience standpoint.
Best practices
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/transient-faults
• https://docs.microsoft.com/en-us/azure/architecture/best-practices/retry-service-specific
Monitoring and diagnostics are crucial. For best practices for monitoring and diagnostics, go to
https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring. Cloud applications
run in a remote datacentre where you do not have full control of the infrastructure or, in some cases,
the operating system. In a large application, it’s not practical to log into VMs to troubleshoot an
issue or sift through log files. With PaaS services, there may not even be a dedicated VM to log into.
Monitoring and diagnostics give insight into the system, so that you know when and where failures
occur. All systems must be observable. Use a common and consistent logging schema that lets you
correlate events across systems.
Use the DevOps checklist to review your design from a management and DevOps standpoint.
Security
You must think about security throughout the entire lifecycle of an application, from design and
implementation to deployment and operations. The Azure platform provides protections against
a variety of threats, such as network intrusion and DDoS attacks. But you still need to build security
into your application and into your DevOps processes.
Identity management
Consider using Azure Active Directory (Azure AD) to authenticate and authorise users. Azure AD
is a fully managed identity and access management service. You can use it to create domains that
exist purely on Azure or integrate with your on-premises Active Directory identities. Azure AD
also integrates with Office 365, Dynamics CRM Online and many third-party SaaS applications. For
consumer-facing applications, Azure Active Directory B2C lets users authenticate with their existing
social accounts (such as Facebook, Google or LinkedIn) or create a new user account that is managed
by Azure AD.
If you want to integrate an on-premises Active Directory environment with an Azure network, several
approaches are possible, depending on your requirements. For more information, see our Identity
Management reference architectures.
Application security
In general, the security best practices for application development still apply in the cloud. These
include things like using SSL everywhere, protecting against CSRF and XSS attacks, preventing SQL
injection attacks and so on.
Cloud applications often use managed services that have access keys. Never check these into source
control. Consider storing application secrets in Azure Key Vault.
Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt
keys and secrets by using keys that are protected by hardware security modules (HSMs).
Many Azure storage and DB services support data encryption at rest, including Azure Storage,
Azure SQL Database, Azure SQL Data Warehouse and Cosmos DB.
For more information, go to:
• https://docs.microsoft.com/en-us/azure/storage/storage-service-encryption
• https://docs.microsoft.com/en-us/azure/sql-database/sql-database-always-encrypted-azure-key-
vault
• https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-security-overview#data-
protection
• https://docs.microsoft.com/en-us/azure/cosmos-db/database-security
Security resources
• Azure Security Centre provides integrated security monitoring and policy management across
your Azure subscriptions. Go to https://azure.microsoft.com/en-us/services/security-center/.
• For information about how to protect your applications in the cloud, go to https://docs.microsoft.
com/en-us/azure/security/.
Chapter 5: Catalogue of patterns
Availability
Pattern Summary
Health Endpoint Monitoring Implement functional checks in an application that external tools
can access through exposed endpoints at regular intervals.
Queue-Based Load Levelling Use a queue that acts as a buffer between a task and a service
that it invokes in order to smooth intermittent heavy loads.
Data Management
Pattern Summary
Cache-Aside Load data on demand into a cache from a data store.
CQRS Segregate operations that read data from operations that update data
by using separate interfaces.
Event Sourcing Use an append-only store to record the full series of events that
describe actions taken on data in a domain.
Index Table Create indexes over the fields in data stores that are frequently
referenced by queries.
Materialised View Generate prepopulated views over the data in one or more data stores
when the data isn’t ideally formatted for required query operations.
Static Content Hosting Deploy static content to a cloud-based storage service that can deliver
it directly to the client.
Valet Key Use a token or key that provides clients with restricted direct access
to a specific resource or service.
Design and Implementation
Pattern Summary
Ambassador Create helper services that send network requests on behalf
of a consumer service or application.
Backends for Frontends Create separate backend services to be consumed by specific frontend
applications or interfaces.
Gateway Aggregation Use a gateway to aggregate multiple individual requests into a single
request.
Pipes and Filters Break down a task that performs complex processing into a series
of separate elements that can be reused.
Static Content Hosting Deploy static content to a cloud-based storage service that can deliver
it directly to the client.
Messaging
The distributed nature of cloud applications requires a messaging infrastructure that connects
the components and services, ideally in a loosely coupled manner in order to maximise scalability.
Asynchronous messaging is widely used, and provides many benefits, but also brings challenges
such as the ordering of messages, poison message management, idempotency and more.
Pattern Summary
Competing Consumers Enable multiple concurrent consumers to process messages received
on the same messaging channel.
Pipes and Filters Break down a task that performs complex processing into a series
of separate elements that can be reused.
Priority Queue Prioritise requests sent to services so that requests with a higher
priority are received and processed more quickly than those with
a lower priority.
Scheduler Agent Supervisor Coordinate a set of actions across a distributed set of services
and other remote resources.
Management and Monitoring
Pattern Summary
Ambassador Create helper services that send network requests on behalf
of a consumer service or application.
Performance and Scalability
Pattern Summary
Cache-Aside Load data on demand into a cache from a data store.
CQRS Segregate operations that read data from operations that update
data by using separate interfaces.
Event Sourcing Use an append-only store to record the full series of events that
describe actions taken on data in a domain.
Index Table Create indexes over the fields in data stores that are frequently
referenced by queries.
Materialised View Generate prepopulated views over the data in one or more data stores
when the data isn’t ideally formatted for required query operations.
Priority Queue Prioritise requests sent to services so that requests with a higher priority
are received and processed more quickly than those with a lower
priority.
Queue-Based Load Use a queue that acts as a buffer between a task and a service
Levelling that it invokes in order to smooth intermittent heavy loads.
Static Content Hosting Deploy static content to a cloud-based storage service that can
deliver them directly to the client.
Resiliency
• Circuit Breaker: Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.
• Compensating Transaction: Undo the work performed by a series of steps, which together define an eventually consistent operation.
• Queue-Based Load Levelling: Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.
• Scheduler Agent Supervisor: Coordinate a set of actions across a distributed set of services and other remote resources.
Security
• Federated Identity: Delegate authentication to an external identity provider.
• Valet Key: Use a token or key that provides clients with restricted direct access to a specific resource or service.
Catalogue of patterns
Ambassador pattern
Create helper services that send network requests on behalf of a consumer service or application.
An ambassador service can be thought of as an out-of-process proxy that is co-located with the client.
This pattern can be useful for offloading common client connectivity tasks such as monitoring,
logging, routing, security (such as TLS) and resilience patterns in a language agnostic way. It is often
used with legacy applications, or other applications that are difficult to modify, in order to extend
their networking capabilities. It can also enable a specialised team to implement those features.
Network calls may also require substantial configuration for connection, authentication and
authorisation. If these calls are used across multiple applications, built using multiple languages and
frameworks, the calls must be configured for each of these instances. In addition, network and security
functionality may need to be managed by a central team within your organisation. With a large code
base, it can be risky for that team to update application code they aren’t familiar with.
Solution
Put client frameworks and libraries into an external process that acts as a proxy between
your application and external services. Deploy the proxy on the same host environment as
your application to allow control over routing, resilience, security features and to avoid any host-
related access restrictions. You can also use the ambassador pattern to standardise and extend
instrumentation. The proxy can monitor performance metrics such as latency or resource usage,
and this monitoring happens in the same host environment as the application.
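The guide's original sample is not included in this extract; as a rough illustration, the following C# sketch implements a minimal ambassador. It listens on a local port (8181, an arbitrary choice), forwards GET requests to a placeholder upstream service, logs the latency of every call and retries once if the request throws:

using System;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class Ambassador
{
    static readonly HttpClient client = new HttpClient();
    // Placeholder upstream address; a real ambassador would make this configurable.
    static readonly Uri upstream = new Uri("https://service.example.com");

    static async Task Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8181/"); // The application calls this address.
        listener.Start();
        while (true)
        {
            HttpListenerContext context = await listener.GetContextAsync();
            _ = HandleAsync(context); // Handle each request concurrently.
        }
    }

    static async Task HandleAsync(HttpListenerContext context)
    {
        var target = new Uri(upstream, context.Request.Url.PathAndQuery);
        var watch = Stopwatch.StartNew();
        HttpResponseMessage response = null;

        // Retry once if the call throws. As noted above, this is only safe
        // when the operation is idempotent (GET requests usually are).
        for (int attempt = 0; attempt < 2 && response == null; attempt++)
        {
            try { response = await client.GetAsync(target); }
            catch (HttpRequestException) { /* swallow and retry once */ }
        }

        // Centralised instrumentation: log latency for every forwarded call.
        Console.WriteLine($"{target} -> {(int?)response?.StatusCode} in {watch.ElapsedMilliseconds} ms");

        context.Response.StatusCode = response == null ? 502 : (int)response.StatusCode;
        byte[] body = response == null
            ? new byte[0]
            : await response.Content.ReadAsByteArrayAsync();
        await context.Response.OutputStream.WriteAsync(body, 0, body.Length);
        context.Response.Close();
    }
}

Because the proxy runs as a separate process, the same binary can sit beside applications written in any language, which is the point of the pattern.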
Issues and considerations
• Consider the possible impact of including generalised features in the proxy. For example,
the ambassador could handle retries, but that might not be safe unless all operations are
idempotent.
• Consider a mechanism to allow the client to pass some context to the proxy, as well as back to
the client. For example, include HTTP request headers to opt out of retry or specify the maximum
number of times to retry.
• Consider whether to use a single shared instance for all clients or an instance for each client.
When to use this pattern
Use this pattern when you:
• Need to build a common set of client connectivity features for multiple languages or frameworks.
• Need to offload cross-cutting client connectivity concerns to infrastructure developers
or other more specialised teams.
• Need to support cloud or cluster connectivity requirements in a legacy application
or an application that is difficult to modify.
This pattern may not be suitable:
• When network request latency is critical. A proxy will introduce some overhead, although
minimal, and in some cases this may affect the application.
• When client connectivity features are consumed by a single language. In that case, a better
option might be a client library that is distributed to the development teams as a package.
• When connectivity features cannot be generalised and require deeper integration with the client
application.
Anti-Corruption Layer pattern
Implement a façade or adapter layer between a modern application and a legacy system that it depends on. This layer translates requests between the two systems.
Often these legacy systems suffer from quality issues such as convoluted data schemas or obsolete
APIs. The features and technologies used in legacy systems can vary widely from more modern
systems. To interoperate with the legacy system, the new application may need to support outdated
infrastructure, protocols, data models, APIs or other features that you wouldn’t otherwise put into
a modern application.
Maintaining access between new and legacy systems can force the new system to adhere to at least
some of the legacy system’s APIs or other semantics. When these legacy features have quality issues,
supporting them “corrupts” what might otherwise be a cleanly designed modern application.
Solution
Isolate the legacy and modern systems by placing an anti-corruption layer between them. This layer
translates communications between the two systems, allowing the legacy system to remain unchanged
while the modern application can avoid compromising its design and technological approach.
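As a minimal sketch of the idea (every type name below is illustrative rather than taken from the guide), the layer can be implemented as an adapter that exposes the modern application's clean interface while translating identifiers and models to and from the legacy system's schema:

using System;
using System.Threading.Tasks;

// The modern application's clean interface and model.
public record Customer(Guid Id, string FullName);

public interface ICustomerService
{
    Task<Customer> GetCustomerAsync(Guid id);
}

// The legacy system's client and convoluted schema (left unchanged).
public class LegacyCustomerRecord
{
    public string CUST_ID;    // e.g. "000042"
    public string NAME_FIRST;
    public string NAME_LAST;
}

public interface ILegacyCustomerClient
{
    Task<LegacyCustomerRecord> FetchAsync(string custId);
}

// The anti-corruption layer: translates calls and models in both directions,
// so legacy semantics never leak into the modern application.
public class CustomerAntiCorruptionLayer : ICustomerService
{
    private readonly ILegacyCustomerClient legacy;

    public CustomerAntiCorruptionLayer(ILegacyCustomerClient legacy)
    {
        this.legacy = legacy;
    }

    public async Task<Customer> GetCustomerAsync(Guid id)
    {
        // Translate the modern identifier into the legacy key format.
        string custId = MapToLegacyId(id);
        LegacyCustomerRecord record = await legacy.FetchAsync(custId);

        // Translate the legacy record into the modern model.
        return new Customer(id, $"{record.NAME_FIRST} {record.NAME_LAST}");
    }

    private static string MapToLegacyId(Guid id) =>
        // Placeholder: a real layer would maintain an ID mapping table.
        id.ToString("N").Substring(0, 6);
}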
Issues and considerations
• The anti-corruption layer adds an additional service that must be managed and maintained.
• Consider whether you need more than one anti-corruption layer. You may want to decompose
functionality into multiple services using different technologies or languages, or there may be
other reasons to partition the anti-corruption layer.
• Consider how the anti-corruption layer will be managed in relation with your other applications
or services. How will it be integrated into your monitoring, release and configuration processes?
• Make sure transaction and data consistency are maintained and can be monitored.
• Consider whether the anti-corruption layer needs to handle all communication between legacy
and modern systems, or just a subset of features.
• Consider whether the anti-corruption layer is meant to be permanent or eventually retired once
all legacy functionality has been migrated.
When to use this pattern
Use this pattern when:
• A migration is planned to happen over multiple stages, but integration between new and legacy
systems needs to be maintained.
• New and legacy systems have different semantics, but still need to communicate.
This pattern may not be suitable if there are no significant semantic differences between new and
legacy systems.
Backends for Frontends pattern
Create separate backend services to be consumed by specific frontend applications or interfaces.
An application may initially be targeted at a desktop web UI, with a backend service developed in parallel to provide the features that UI needs. As the user base grows, a mobile application is often developed that must interact with the same backend.
But the capabilities of a mobile device differ significantly from a desktop browser, in terms of screen
size, performance and display limitations. As a result, the requirements for a mobile application
backend differ from the desktop web UI.
These differences result in competing requirements for the backend. The backend requires regular
and significant changes to serve both the desktop web UI and the mobile application. Often,
separate interface teams work on each frontend, causing the backend to become a bottleneck
in the development process. Conflicting update requirements, and the need to keep the service
working for both frontends, can result in spending a lot of effort on a single deployable resource.
As the development activity focuses on the backend service, a separate team may be created to
manage and maintain the backend. Ultimately, this results in a disconnect between the interface
and backend development teams, placing a burden on the backend team to balance the competing
requirements of the different UI teams. When one interface team requires changes to the backend,
those changes must be validated with other interface teams before they can be integrated into the
backend.
Solution
Create one backend per user interface and fine-tune the behaviour and performance of each backend to match the needs of its frontend environment.
Because each backend is specific to one interface, it can be optimised for that interface. As a result,
it will be smaller, less complex and likely faster than a generic backend that tries to satisfy the
requirements for all interfaces. Each interface team has autonomy to control their own backend
and doesn’t rely on a centralised backend development team. This gives the interface team flexibility
in language selection, release cadence, prioritisation of workload and feature integration in their
backend.
Issues and considerations
• If different interfaces (such as mobile clients) will make the same requests, consider whether
it is necessary to implement a backend for each interface or if a single backend will suffice.
• Code duplication across services is highly likely when implementing this pattern.
• Frontend-focused backend services should only contain client-specific logic and behaviour.
General business logic and other global features should be managed elsewhere in your
application.
• Think about how this pattern might be reflected in the responsibilities of a development team.
• Consider how long it will take to implement this pattern. Will the effort of building the new
backends incur technical debt, while you continue to support the existing generic backend?
When to use this pattern
Use this pattern when:
• A shared or general purpose backend service must be maintained with significant development
overhead.
• You want to optimise the backend for the requirements of specific client interfaces.
• An alternative language is better suited for the backend of a different user interface.
Related guidance
• Gateway Aggregation pattern
Bulkhead pattern
Isolate elements of an application into pools so that if one fails, the others will continue to function.
This pattern is named Bulkhead because it resembles the sectioned partitions of a ship’s hull. If the
hull of a ship is compromised, only the damaged section fills with water, which prevents the ship
from sinking.
A consumer may send requests to multiple services simultaneously, using resources
for each request. When the consumer sends a request to a service that is misconfigured or
not responding, the resources used by the client’s request may not be freed in a timely manner.
As requests to the service continue, those resources may be exhausted. For example, the client’s
connection pool may be exhausted. At that point, requests by the consumer to other services
are impacted. Eventually the consumer can no longer send requests to other services, not just
the original unresponsive service.
The same issue of resource exhaustion affects services with multiple consumers. A large number of
requests originating from one client may exhaust available resources in the service. Other consumers
are no longer able to consume the service, causing a cascading failure effect.
Solution
Partition service instances into different groups, based on consumer load and availability requirements. This design helps to isolate failures and allows you to sustain service functionality for some consumers, even during a failure.
A consumer can also partition resources, to ensure that resources used to call one service don't
affect the resources used to call another service. For example, a consumer that calls multiple services
may be assigned a connection pool for each service. If a service begins to fail, it only affects the
connection pool assigned for that service, allowing the consumer to continue using the other
services.
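A minimal C# sketch of this idea follows (the service URLs and pool sizes are placeholder assumptions): each downstream service gets its own HttpClient with a bounded connection pool and its own semaphore, so resource exhaustion caused by one unresponsive service can't starve calls to the others.

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class BulkheadClient
{
    private readonly HttpClient client;
    private readonly SemaphoreSlim slots;

    public BulkheadClient(int maxConnections, int maxConcurrent)
    {
        client = new HttpClient(new SocketsHttpHandler
        {
            MaxConnectionsPerServer = maxConnections // Bound the connection pool.
        });
        slots = new SemaphoreSlim(maxConcurrent);    // Bound in-flight requests.
    }

    public async Task<string> GetAsync(string url)
    {
        await slots.WaitAsync();
        try { return await client.GetStringAsync(url); }
        finally { slots.Release(); }
    }
}

class Program
{
    static async Task Main()
    {
        // One bulkhead per downstream service; a failure of Service A can only
        // exhaust Service A's pool.
        var serviceA = new BulkheadClient(maxConnections: 10, maxConcurrent: 10);
        var serviceB = new BulkheadClient(maxConnections: 10, maxConcurrent: 10);

        var a = serviceA.GetAsync("https://service-a.example.com/health");
        var b = serviceB.GetAsync("https://service-b.example.com/health");
        var results = await Task.WhenAll(a, b);
        Console.WriteLine($"{results.Length} calls completed");
    }
}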
The following diagram shows bulkheads structured around connection pools that call individual
services. If Service A fails or causes some other issue, the connection pool is isolated, so only
workloads using the thread pool assigned to Service A are affected. Workloads that use Service B
and C are not affected and can continue working without interruption.
The next diagram shows multiple clients calling a single service. Each client is assigned a separate
service instance. Client 1 has made too many requests and overwhelmed its instance. Because each
service instance is isolated from the others, the other clients can continue making calls.
Issues and considerations
• When partitioning services or consumers into bulkheads, consider the level of isolation offered
by the technology as well as the overhead in terms of cost, performance and manageability.
• Consider combining bulkheads with retry, circuit breaker and throttling patterns to provide more
sophisticated fault handling.
• When partitioning consumers into bulkheads, consider using processes, thread pools and
semaphores. Projects like Netflix Hystrix and Polly offer a framework for creating consumer
bulkheads.
• When partitioning services into bulkheads, consider deploying them into separate virtual
machines, containers or processes. Containers offer a good balance of resource isolation with
fairly low overhead.
• Services that communicate using asynchronous messages can be isolated through different sets
of queues. Each queue can have a dedicated set of instances processing messages on the queue
or a single group of instances using an algorithm to dequeue and dispatch processing.
• Determine the level of granularity for the bulkheads. For example, if you want to distribute
tenants across partitions, you could place each tenant into a separate partition or several
tenants into one partition.
Example
The following Kubernetes pod configuration runs a single service in an isolated container, with its own CPU and memory requests and limits:
apiVersion: v1
kind: Pod
metadata:
  name: drone-management
spec:
  containers:
  - name: drone-management-container
    image: drone-service
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "1"
Related guidance
Cache-Aside pattern
Load data on demand into a cache from a data store. This can improve performance and also helps
to maintain consistency between data held in the cache and data in the underlying data store.
Solution
Many commercial caching systems provide read-through and write-through/write-behind operations.
In these systems, an application retrieves data by referencing the cache. If the data isn’t in the cache,
it’s retrieved from the data store and added to the cache. Any modifications to data held in the cache
are automatically written back to the data store as well.
An application can emulate the functionality of read-through caching by implementing the cache-
aside strategy. This strategy loads data into the cache on demand. The figure illustrates using the
Cache-Aside pattern to store data in the cache.
If an application updates information, it can write the modification to the data store and invalidate the corresponding item in the cache.
When the item is next required, using the Cache-Aside strategy will cause the updated data
to be retrieved from the data store and added back into the cache.
Lifetime of cached data. Many caches implement an expiration policy that invalidates data and
removes it from the cache if it’s not accessed for a specified period. For Cache-Aside to be effective,
ensure that the expiration policy matches the pattern of access for applications that use the data.
Don’t make the expiration period too short because this can cause applications to continually retrieve
data from the data store and add it to the cache. Similarly, don’t make the expiration period so long
that the cached data is likely to become stale. Remember that caching is most effective for relatively
static data or data that is read frequently.
Evicting data. Most caches have a limited size compared to the data store where the data originates
and they’ll evict data if necessary. Most caches adopt a least-recently-used policy for selecting
items to evict, but this might be customisable. Configure the global expiration property and other
properties of the cache, and the expiration property of each cached item, to ensure that the cache
is cost effective. It isn’t always appropriate to apply a global eviction policy to every item in the cache.
For example, if a cached item is very expensive to retrieve from the data store, it can be beneficial
to keep this item in the cache at the expense of more frequently accessed but less costly items.
Priming the cache. Many solutions prepopulate the cache with the data that an application is likely
to need as part of the start-up processing. The Cache-Aside pattern can still be useful if some of this
data expires or is evicted.
Consistency. Implementing the Cache-Aside pattern doesn’t guarantee consistency between the data
store and the cache. An item in the data store can be changed at any time by an external process and
this change might not be reflected in the cache until the next time the item is loaded. In a system that
replicates data across data stores, this problem can become serious if synchronisation occurs frequently.
Local (in-memory) caching. A cache could be local to an application instance and stored in-
memory. Cache-Aside can be useful in this environment if an application repeatedly accesses the
same data. However, a local cache is private and so different application instances could each have
a copy of the same cached data. This data could quickly become inconsistent between caches, so
it might be necessary to expire data held in a private cache and refresh it more frequently. In these
scenarios, consider investigating the use of a shared or a distributed caching mechanism.
When to use this pattern
Use this pattern when:
• Resource demand is unpredictable. This pattern enables applications to load data on demand.
It makes no assumptions about which data an application will require in advance.
Example
In Microsoft Azure you can use Azure Redis Cache to create a distributed cache that can be shared
by multiple instances of an application.
To connect to an Azure Redis Cache instance, call the static Connect method and pass in the
connection string. The method returns a ConnectionMultiplexer that represents the connection. One
approach to sharing a ConnectionMultiplexer instance in your application is to have a static property
that returns a connected instance, similar to the following example. This approach provides a thread-
safe way to initialise only a single connected instance.
The GetMyEntityAsync method in the following code example shows an implementation of the
Cache-Aside pattern based on Azure Redis Cache. This method retrieves an object from the cache
using the read-through approach.
An object is identified by using an integer ID as the key. The GetMyEntityAsync method tries to
retrieve an item with this key from the cache. If a matching item is found, it’s returned. If there’s
no match in the cache, the GetMyEntityAsync method retrieves the object from a data store, adds
it to the cache and then returns it. The code that actually reads the data from the data store is not
shown here, because it depends on the data store. Note that the cached item is configured to expire
to prevent it from becoming stale if it’s updated elsewhere.
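The original listing is not reproduced in this extract; a minimal sketch that matches the description, using the StackExchange.Redis client (the connection string, key format and five-minute expiration are illustrative assumptions), might look like this:

using System;
using System.Threading.Tasks;
using Newtonsoft.Json;
using StackExchange.Redis;

public static class CacheAside
{
    // Share a single connected multiplexer in a thread-safe, lazy way.
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect("<your cache connection string>"));

    public static async Task<MyEntity> GetMyEntityAsync(int id)
    {
        IDatabase cache = Connection.Value.GetDatabase();
        string key = $"MyEntity:{id}";

        // 1. Try the cache first.
        RedisValue cached = await cache.StringGetAsync(key);
        if (cached.HasValue)
        {
            return JsonConvert.DeserializeObject<MyEntity>(cached);
        }

        // 2. On a miss, read from the data store (implementation not shown;
        //    it depends on the data store).
        MyEntity value = await QueryDataStoreAsync(id);

        // 3. Add the item to the cache with an expiration time so that it
        //    doesn't become stale if it's updated elsewhere.
        await cache.StringSetAsync(key, JsonConvert.SerializeObject(value),
            TimeSpan.FromMinutes(5));
        return value;
    }

    private static Task<MyEntity> QueryDataStoreAsync(int id) =>
        throw new NotImplementedException("Depends on the data store.");
}

public class MyEntity { public int Id { get; set; } }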
The examples use the Azure Redis Cache API to access the store and retrieve information from the
cache. For more information, see Using Microsoft Azure Redis Cache and How to create a Web App
with Redis Cache.
Note:
The order of the steps is important. Update the data store before removing the item from the cache. If you remove
the cached item first, there is a small window of time when a client might fetch the item before the data store is
updated. That will result in a cache miss (because the item was removed from the cache), causing the earlier version
of the item to be fetched from the data store and added back into the cache. The result will be stale cache data.
• Caching Guidance. Provides additional information on how you can cache data in a cloud
solution and the issues that you should consider when you implement a cache.
• Data Consistency Primer. Cloud applications typically use data that’s spread across data stores.
Managing and maintaining data consistency in this environment is a critical aspect of the system,
particularly the concurrency and availability issues that can arise. This primer describes issues
about consistency across distributed data and summarises how an application can implement
eventual consistency to maintain the availability of data.
Circuit Breaker pattern
Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource. This can improve the stability and resiliency of an application.
In a distributed environment, calls to remote resources and services can fail due to transient faults that typically correct themselves after a short time, and a strategy such as the Retry pattern can handle them.
However, there can also be situations where faults are due to unanticipated events that might
take much longer to fix. These faults can range in severity from a partial loss of connectivity to the
complete failure of a service. In these situations it might be pointless for an application to continually
retry an operation that is unlikely to succeed and instead the application should quickly accept that
the operation has failed and handle this failure accordingly.
Additionally, if a service is very busy, failure in one part of the system might lead to cascading
failures. For example, an operation that invokes a service could be configured to implement a
timeout and reply with a failure message if the service fails to respond within this period. However,
this strategy could cause many concurrent requests to the same operation to be blocked until the
timeout period expires. These blocked requests might hold critical system resources such as memory,
threads, database connections and so on. Consequently, these resources could become exhausted,
causing failure of other possibly unrelated parts of the system that need to use the same resources.
In these situations, it would be preferable for the operation to fail immediately and only attempt to
invoke the service if it’s likely to succeed. Note that setting a shorter timeout might help to resolve
this problem, but the timeout shouldn’t be so short that the operation fails most of the time, even
if the request to the service would eventually succeed.
The purpose of the Circuit Breaker pattern is different from that of the Retry pattern. The Retry pattern
enables an application to retry an operation in the expectation that it’ll succeed. The Circuit Breaker
pattern prevents an application from performing an operation that is likely to fail. An application
can combine these two patterns by using the Retry pattern to invoke an operation through a circuit
breaker. However, the retry logic should be sensitive to any exceptions returned by the circuit breaker
and abandon retry attempts if the circuit breaker indicates that a fault is not transient.
Solution
A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number
of recent failures that have occurred, and use this information to decide whether to allow the
operation to proceed or simply return an exception immediately.
The proxy can be implemented as a state machine with the following states that mimic the
functionality of an electrical circuit breaker:
• Closed: The request from the application is routed to the operation. The proxy maintains
a count of the number of recent failures and if the call to the operation is unsuccessful the
proxy increments this count. If the number of recent failures exceeds a specified threshold
within a given time period, the proxy is placed into the Open state. At this point the proxy
starts a timeout timer and when this timer expires the proxy is placed into the Half-Open state.
• The purpose of the timeout timer is to give the system time to fix the problem that
caused the failure before allowing the application to try to perform the operation again.
• Open: The request from the application fails immediately and an exception is returned to the
application.
• Half-Open: A limited number of requests from the application are allowed to pass through
and invoke the operation. If these requests are successful, it’s assumed that the fault that was
previously causing the failure has been fixed and the circuit breaker switches to the Closed state
(the failure counter is reset). If any request fails, the circuit breaker assumes that the fault is still
present so it reverts back to the Open state and restarts the timeout timer to give the system
a further period of time to recover from the failure.
• The Half-Open state is useful to prevent a recovering service from suddenly being
flooded with requests. As a service recovers, it might be able to support a limited volume
of requests until the recovery is complete, but while recovery is in progress a flood of
work can cause the service to time out or fail again.
How the system recovers is handled externally, possibly by restoring or restarting a failed component
or repairing a network connection.
The Circuit Breaker pattern provides stability while the system recovers from a failure and minimises
the impact on performance. It can help to maintain the response time of the system by quickly
rejecting a request for an operation that’s likely to fail, rather than waiting for the operation to time
out or never return. If the circuit breaker raises an event each time it changes state, this information
can be used to monitor the health of the part of the system protected by the circuit breaker or to
alert an administrator when a circuit breaker trips to the Open state.
The pattern is customisable and can be adapted according to the type of the possible failure.
For example, you can apply an increasing timeout timer to a circuit breaker. You could place the
circuit breaker in the Open state for a few seconds initially and then if the failure hasn’t been
resolved increase the timeout to a few minutes, and so on. In some cases, rather than the Open
state returning failure and raising an exception, it could be useful to return a default value that
is meaningful to the application.
Issues and considerations
Types of exceptions. A request might fail for many reasons, some of which might indicate a more
severe type of failure than others. For example, a request might fail because a remote service has
crashed and will take several minutes to recover or because of a timeout due to the service being
temporarily overloaded. A circuit breaker might be able to examine the types of exceptions that occur
and adjust its strategy depending on the nature of these exceptions. For example, it might require
a larger number of timeout exceptions to trip the circuit breaker to the Open state compared to the
number of failures due to the service being completely unavailable.
Logging. A circuit breaker should log all failed requests (and possibly successful requests) to enable
an administrator to monitor the health of the operation.
Recoverability. You should configure the circuit breaker to match the likely recovery pattern of
the operation it’s protecting. For example, if the circuit breaker remains in the Open state for a long
period, it could raise exceptions even if the reason for the failure has been resolved. Similarly, a circuit
breaker could fluctuate and reduce the response times of applications if it switches from the Open
state to the Half-Open state too quickly.
Testing failed operations. In the Open state, rather than using a timer to determine when to switch
to the Half-Open state, a circuit breaker can instead periodically ping the remote service or resource
to determine whether it’s become available again. This ping could take the form of an attempt to
invoke an operation that had previously failed or it could use a special operation provided by the
remote service specifically for testing the health of the service, as described by the Health Endpoint
Monitoring pattern.
Manual override. In a system where the recovery time for a failing operation is extremely variable,
it’s beneficial to provide a manual reset option that enables an administrator to close a circuit breaker
(and reset the failure counter). Similarly, an administrator could force a circuit breaker into the Open
state (and restart the timeout timer) if the operation protected by the circuit breaker is temporarily
unavailable.
Concurrency. The same circuit breaker could be accessed by a large number of concurrent instances
of an application. The implementation shouldn’t block concurrent requests or add excessive overhead
to each call to an operation.
Resource differentiation. Be careful when using a single circuit breaker for one type of resource if
there might be multiple underlying independent providers. For example, in a data store that contains
multiple shards, one shard might be fully accessible while another is experiencing a temporary issue.
If the error responses in these scenarios are merged, an application might try to access some shards
even when failure is highly likely, while access to other shards might be blocked even though it’s
likely to succeed.
Accelerated circuit breaking. Sometimes a failure response can contain enough information for
the circuit breaker to trip immediately and stay tripped for a minimum amount of time. For example,
the error response from a shared resource that’s overloaded could indicate that an immediate retry
isn’t recommended and that the application should instead try again in a few minutes.
Replaying failed requests. In the Open state, rather than simply failing quickly, a circuit breaker
could also record the details of each request to a journal and arrange for these requests to be
replayed when the remote resource or service becomes available.
Inappropriate timeouts on external services. A circuit breaker might not be able to fully
protect applications from operations that fail in external services that are configured with a lengthy
timeout period. If the timeout is too long, a thread running a circuit breaker might be blocked for
an extended period before the circuit breaker indicates that the operation has failed. In this time,
many other application instances might also try to invoke the service through the circuit breaker
and tie up a significant number of threads before they all fail.
Example
In a web application, several of the pages are populated with data retrieved from an external service.
If the system implements minimal caching, most hits to these pages will cause a round trip to the
service. Connections from the web application to the service could be configured with a timeout
period (typically 60 seconds) and if the service doesn’t respond in this time the logic in each web
page will assume that the service is unavailable and throw an exception.
However, if the service fails and the system is very busy, users could be forced to wait for up to
60 seconds before an exception occurs. Eventually resources such as memory, connections and
threads could be exhausted, preventing other users from connecting to the system, even if they
aren’t accessing pages that retrieve data from the service.
Scaling the system by adding further web servers and implementing load balancing might delay
when resources become exhausted, but it won’t resolve the issue because user requests will still
be unresponsive and all web servers could still eventually run out of resources.
Wrapping the logic that connects to the service and retrieves the data in a circuit breaker could help
to solve this problem and handle the service failure more elegantly. User requests will still fail, but
they’ll fail more quickly and the resources won’t be blocked.
interface ICircuitBreakerStateStore
{
    CircuitBreakerStateEnum State { get; }
    Exception LastException { get; }
    DateTime LastStateChangedDateUtc { get; }
    bool IsClosed { get; }
    void Trip(Exception ex);
    void Reset();
    void HalfOpen();
}
The State property indicates the current state of the circuit breaker, and will be either Open, HalfOpen
or Closed as defined by the CircuitBreakerStateEnum enumeration. The IsClosed property should
be true if the circuit breaker is closed, but false if it’s open or half open. The Trip method switches
the state of the circuit breaker to the open state and records the exception that caused the change
in state, together with the date and time that the exception occurred. The LastException and the
LastStateChangedDateUtc properties return this information. The Reset method closes the circuit
breaker and the HalfOpen method sets the circuit breaker to half open.
The ExecuteAction method in the CircuitBreaker class wraps an operation, specified as an Action
delegate. If the circuit breaker is closed, ExecuteAction invokes the Action delegate. If the operation
fails, an exception handler calls TrackException, which sets the circuit breaker state to open.
The following code example highlights this flow.
public class CircuitBreaker
{
    private readonly ICircuitBreakerStateStore stateStore =
        CircuitBreakerStateStoreFactory.GetCircuitBreakerStateStore();
    ...
}
The following example shows the code (omitted from the previous example) that is executed if the
circuit breaker isn’t closed. It first checks if the circuit breaker has been open for a period longer than
the time specified by the local OpenToHalfOpenWaitTime field in the CircuitBreaker class. If this is
the case, the ExecuteAction method sets the circuit breaker to half open, then tries to perform the
operation specified by the Action delegate.
If the operation is successful, the circuit breaker is reset to the closed state. If the operation fails,
it is tripped back to the open state and the time the exception occurred is updated so that the
circuit breaker will wait for a further period before trying to perform the operation again.
If the circuit breaker has only been open for a short time, less than the OpenToHalfOpenWaitTime
value, the ExecuteAction method simply throws a CircuitBreakerOpenException exception and returns
the error that caused the circuit breaker to transition to the open state.
Additionally, it uses a lock to prevent the circuit breaker from trying to perform concurrent calls
to the operation while it’s half open. A concurrent attempt to invoke the operation will be handled
as if the circuit breaker was open and it’ll fail with an exception as described later.
                // If this action succeeds, reset the state and allow other operations.
                // In reality, instead of immediately returning to the Closed state, a counter
                // here would record the number of successful operations and return the
                // circuit breaker to the Closed state only after a specified number succeed.
                this.stateStore.Reset();
                return;
            }
            catch (Exception ex)
            {
                // If there's still an exception, trip the breaker again immediately.
                this.stateStore.Trip(ex);
                // Throw the exception so that the caller knows which exception occurred.
                throw;
            }
            finally
            {
                if (lockTaken)
                {
                    Monitor.Exit(halfOpenSyncObject);
                }
            }
        }
    }

    // The Open timeout hasn't yet expired. Throw a CircuitBreakerOpen exception to
    // inform the caller that the call was not actually attempted,
    // and return the most recent exception received.
    throw new CircuitBreakerOpenException(stateStore.LastException);
}
...
try
{
    breaker.ExecuteAction(() =>
    {
        // Operation protected by the circuit breaker.
        ...
    });
}
catch (CircuitBreakerOpenException ex)
{
    // Perform some different action when the breaker is open.
    // Last exception details are in the inner exception.
    ...
}
catch (Exception ex)
{
    ...
}
The following patterns might also be useful when implementing this pattern:
• Retry Pattern. Describes how an application can handle anticipated temporary failures when
it tries to connect to a service or network resource by transparently retrying an operation that
has previously failed.
• Health Endpoint Monitoring Pattern. A circuit breaker might be able to test the health of a
service by sending a request to an endpoint exposed by the service. The service should return
information indicating its status.
CQRS pattern
Segregate operations that read data from operations that update data by using separate interfaces.
Traditional CRUD designs work well when only limited business logic is applied to the data
operations. Scaffold mechanisms provided by development tools can create data access code
very quickly, which can then be customised as required.
For a deeper understanding of the limits of the CRUD approach see CRUD, Only When You Can Afford It.
Solution
Command and Query Responsibility Segregation (CQRS) is a pattern that segregates the operations
that read data (queries) from the operations that update data (commands) by using separate
interfaces. This means that the data models used for querying and updates are different. The models
can then be isolated, as shown in the following figure, although that’s not an absolute requirement.
The query model for reading data and the update model for writing data can access the same
physical store, perhaps by using SQL views or by generating projections on the fly. However, it’s
common to separate the data into different physical stores to maximise performance, scalability
and security, as shown in the next figure.
The read store can be a read-only replica of the write store or the read and write stores can have
a different structure altogether. Using multiple read-only replicas of the read store can greatly
increase query performance and application UI responsiveness, especially in distributed scenarios
where read-only replicas are located close to the application instances. Some database systems
(such as SQL Server) provide additional features, such as failover replicas, to maximise availability.
Separation of the read and write stores also allows each to be scaled appropriately to match the load.
For example, read stores typically encounter a much higher load than write stores.
When the query/read model contains denormalised data (see Materialised View pattern),
performance is maximised when reading data for each of the views in an application or when
querying the data in the system.
• Consider applying CQRS to limited sections of your system where it will be most valuable.
Event Sourcing and CQRS
Using the stream of events as the write store rather than the actual data at a point in time avoids
update conflicts on a single aggregate and maximises performance and scalability. The events can
be used to asynchronously generate materialised views of the data that are used to populate the
read store.
Because the event store is the official source of information, it is possible to delete the materialised
views and replay all past events to create a new representation of the current state when the system
evolves, or when the read model must change. The materialised views are in effect a durable read-
only cache of the data.
When using CQRS combined with the Event Sourcing pattern, consider the following:
• As with any system where the write and read stores are separate, systems based on this pattern
are only eventually consistent. There will be some delay between the event being generated
and the data store being updated.
• The pattern adds complexity because code must be created to initiate and handle events
and assemble or update the appropriate views or objects required by queries or a read model.
The complexity of the CQRS pattern when used with the Event Sourcing pattern can make a
successful implementation more difficult and requires a different approach to designing systems.
However, event sourcing can make it easier to model the domain and makes it easier to rebuild
views or create new ones because the intent of the changes in the data is preserved.
• Generating materialised views for use in the read model or projections of the data by replaying
and handling the events for specific entities or collections of entities can require significant
processing time and resource usage. This is especially true if it requires summation or analysis
of values over long periods, because all the associated events might need to be examined.
Resolve this by implementing snapshots of the data at scheduled intervals, such as a total
count of the number of times a specific action has occurred or the current state of an entity.
Example
The following code shows some extracts from an example of a CQRS implementation that uses
different definitions for the read and the write models. The model interfaces don’t dictate any
features of the underlying data stores. They can evolve and be fine-tuned independently because
these interfaces are separated. The following code shows the read model definition.
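The original listing isn't reproduced in this extract; a minimal read-model sketch consistent with that description might look like this (the member names are illustrative assumptions):

using System.Collections.Generic;

// Read model: a query-only interface and a DTO shaped for display.
public interface IProductsReadModel
{
    ProductDisplay FindById(int productId);
    ICollection<ProductDisplay> FindByName(string name);
}

public class ProductDisplay
{
    public int Id { get; set; }
    public string Name { get; set; }
    public decimal Price { get; set; }
    public double RatingAverage { get; set; }
}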
The system allows users to rate products. The application code does this using the RateProduct
command shown in the following code.
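A hypothetical shape for that command, based on the description, might be:

// Write model: a command carrying just the data needed to rate a product.
public class RateProduct
{
    public int ProductId { get; set; }
    public int Rating { get; set; }
    public string UserId { get; set; }
}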
The system uses the ProductsCommandHandler class to handle commands sent by the
application. Clients typically send commands to the domain through a messaging system such
as a queue. The command handler accepts these commands and invokes methods of the domain
interface. The granularity of each command is designed to reduce the chance of conflicting requests.
The following code shows an outline of the ProductsCommandHandler class.
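Again the original listing is omitted from this extract; an outline consistent with the description could be:

using System.Threading.Tasks;

public class ProductsCommandHandler
{
    private readonly IProductsDomain domain;

    public ProductsCommandHandler(IProductsDomain domain)
    {
        this.domain = domain;
    }

    // One narrowly scoped handler per command reduces the chance of
    // conflicting requests.
    public Task HandleAsync(RateProduct command)
    {
        return domain.RateProductAsync(command.ProductId, command.Rating, command.UserId);
    }
}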
The following code shows the IProductsDomain interface from the write model.
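A corresponding sketch of the write-model interface (the method names are assumptions):

using System.Threading.Tasks;

// Write model: fine-grained, intention-revealing operations rather than
// generic CRUD updates.
public interface IProductsDomain
{
    Task AddNewProductAsync(int id, string name, string description, decimal price);
    Task RateProductAsync(int productId, int rating, string userId);
}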
Compensating Transaction pattern
Undo the work performed by a series of steps, which together define an eventually consistent operation.
If an operation that implements eventual consistency spans several heterogeneous data stores,
undoing the steps in the operation will require visiting each data store in turn. The work performed
in every data store must be undone reliably to prevent the system from remaining inconsistent.
Not all data affected by an operation that implements eventual consistency might be held in a
database. In a service oriented architecture (SOA) environment an operation could invoke an action
in a service and cause a change in the state held by that service. To undo the operation, this state
change must also be undone. This can involve invoking the service again and performing another
action that reverses the effects of the first.
Solution
The solution is to implement a compensating transaction. The steps in a compensating transaction
must undo the effects of the steps in the original operation. A compensating transaction might
not be able to simply replace the current state with the state the system was in at the start of the
operation because this approach could overwrite changes made by other concurrent instances
of an application. Instead, it must be an intelligent process that takes into account any work done
by concurrent instances. This process will usually be application specific, driven by the nature of
the work performed by the original operation.
This approach is similar to the Sagas strategy discussed in Clemens Vasters’ blog.
A compensating transaction is also an eventually consistent operation and it could also fail.
The system should be able to resume the compensating transaction at the point of failure and
continue. It might be necessary to repeat a step that’s failed so the steps in a compensating
transaction should be defined as idempotent commands. For more information, see Idempotency
Patterns on Jonathan Oliver’s blog.
In some cases it might not be possible to recover from a step that has failed except through manual
intervention. In these situations the system should raise an alert and provide as much information
as possible about the reason for the failure.
Issues and considerations
You should define the steps in a compensating transaction as idempotent commands. This enables
the steps to be repeated if the compensating transaction itself fails.
The infrastructure that handles the steps in the original operation and the compensating transaction
must be resilient. It must not lose the information required to compensate for a failing step and
it must be able to reliably monitor the progress of the compensation logic.
A compensating transaction doesn’t necessarily return the data in the system to the state it was in
at the start of the original operation. Instead, it compensates for the work performed by the steps
that completed successfully before the operation failed.
The order of the steps in the compensating transaction doesn’t necessarily have to be the exact
opposite of the steps in the original operation. For example, one data store might be more sensitive
to inconsistencies than another and so the steps in the compensating transaction that undo the
changes to this store should occur first.
Placing a short-term timeout-based lock on each resource that’s required to complete an operation
and obtaining these resources in advance can help increase the likelihood that the overall activity will
succeed. The work should be performed only after all the resources have been acquired. All actions
must be finalised before the locks expire.
Consider using retry logic that is more forgiving than usual to minimise failures that trigger
a compensating transaction. If a step in an operation that implements eventual consistency fails,
try handling the failure as a transient exception and repeat the step. Only stop the operation and
initiate a compensating transaction if a step fails repeatedly or irrecoverably.
Many of the challenges of implementing a compensating transaction are the same as those with
implementing eventual consistency. See the section Considerations for Implementing Eventual
Consistency in the Data Consistency Primer for more information.
Example
Customers use a travel website to book itineraries. A single itinerary can comprise a series of steps, such as booking seats on several flights and reserving a room at a hotel.
These steps constitute an eventually consistent operation, although each step is a separate action.
Therefore, as well as performing these steps, the system must also record the counter operations
necessary to undo each step in case the customer decides to cancel the itinerary. The steps necessary
to perform the counter operations can then run as a compensating transaction.
Notice that the steps in the compensating transaction might not be the exact opposite of the
original steps and the logic in each step in the compensating transaction must take into account
any business-specific rules. For example, unbooking a seat on a flight might not entitle the customer
to a complete refund of any money paid. The figure illustrates generating a compensating transaction
to undo a long-running transaction to book a travel itinerary.
It might be possible for the steps in the compensating transaction to be performed in parallel,
depending on how you’ve designed the compensating logic for each step.
In many business solutions, failure of a single step doesn’t always necessitate rolling the system
back by using a compensating transaction. For example, if – after having booked flights F1, F2
and F3 in the travel website scenario – the customer is unable to reserve a room at hotel H1, it’s
preferable to offer the customer a room at a different hotel in the same city rather than cancelling
the flights. The customer can still decide to cancel (in which case the compensating transaction runs
and undoes the bookings made on flights F1, F2 and F3), but this decision should be made by the
customer rather than by the system.
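As an illustrative sketch (not from the guide), the pattern can be expressed as a sequence of steps, each paired with an idempotent counter operation that is recorded as the step succeeds; the travel-booking methods in the usage comment are placeholders:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class CompensatingOperation
{
    // Counter operations recorded for each completed step, so they can be
    // replayed if the overall operation must be undone.
    private readonly Stack<Func<Task>> compensations = new Stack<Func<Task>>();

    public async Task RunStepAsync(Func<Task> step, Func<Task> compensation)
    {
        await step();
        compensations.Push(compensation);
    }

    public async Task CompensateAsync()
    {
        // Note: the order doesn't have to be the exact reverse of the original
        // steps; a store that is more sensitive to inconsistency could be
        // compensated first.
        while (compensations.Count > 0)
        {
            var undo = compensations.Pop();
            await undo(); // Each counter operation must be idempotent.
        }
    }
}

// Usage in the travel-itinerary example (BookFlightAsync etc. are placeholders):
// var op = new CompensatingOperation();
// await op.RunStepAsync(() => BookFlightAsync("F1"), () => CancelFlightAsync("F1"));
// await op.RunStepAsync(() => BookFlightAsync("F2"), () => CancelFlightAsync("F2"));
// await op.RunStepAsync(() => BookHotelAsync("H1"), () => CancelHotelAsync("H1"));
// // If the customer decides to cancel, run the compensating transaction:
// await op.CompensateAsync();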
Competing Consumers pattern
Enable multiple concurrent consumers to process messages received on the same messaging channel.
The number of requests can vary significantly over time for many reasons. A sudden increase in user
activity or aggregated requests coming from multiple tenants can cause an unpredictable workload.
At peak hours a system might need to process many hundreds of requests per second, while at other
times the number could be very small. Additionally, the nature of the work performed to handle
these requests might be highly variable. Using a single instance of the consumer service can cause
that instance to become flooded with requests or the messaging system might be overloaded by
an influx of messages coming from the application. To handle this fluctuating workload, the system
can run multiple instances of the consumer service. However, these consumers must be coordinated
to ensure that each message is only delivered to a single consumer. The workload also needs to be
load balanced across consumers to prevent an instance from becoming a bottleneck.
Solution
Use a message queue to implement the communication channel between the application and the
instances of the consumer service. The application posts requests in the form of messages to the
queue and the consumer service instances receive messages from the queue and process them.
This approach enables the same pool of consumer service instances to handle messages from
any instance of the application. The figure illustrates using a message queue to distribute work
to instances of a service.
Issues and considerations
• Designing services for resilience. If the system is designed to detect and restart failed service
instances, it might be necessary to implement the processing performed by the service instances
as idempotent operations to minimise the effects of a single message being retrieved and
processed more than once.
• Detecting poison messages. A malformed message or a task that requires access to resources
that aren’t available, can cause a service instance to fail. The system should prevent such
messages being returned to the queue and instead capture and store the details of these
messages elsewhere so that they can be analysed if necessary.
• Handling results. The service instance handling a message is fully decoupled from the
application logic that generates the message and they might not be able to communicate
directly. If the service instance generates results that must be passed back to the application
logic, this information must be stored in a location that’s accessible to both. In order to prevent
the application logic from retrieving incomplete data the system must indicate when processing
is complete.
If you’re using Azure, a worker process can pass results back to the application logic by using a
dedicated message reply queue. The application logic must be able to correlate these results with the
original message. This scenario is described in more detail in the Asynchronous Messaging Primer.
• Scaling the messaging system. In a large-scale solution, a single message queue could be
overwhelmed by the number of messages and become a bottleneck in the system. In this
situation, consider partitioning the messaging system to send messages from specific producers
to a particular queue or use load balancing to distribute messages across multiple message
queues.
Example
Azure provides storage queues and Service Bus queues that can act as a mechanism for
implementing this pattern. The application logic can post messages to a queue and consumers
implemented as tasks in one or more roles can retrieve messages from this queue and process them.
For resilience, a Service Bus queue enables a consumer to use PeekLock mode when it retrieves a
message from the queue. This mode doesn’t actually remove the message, but simply hides it from
other consumers. The original consumer can delete the message when it’s finished processing it.
If the consumer fails, the peek lock will time out and the message will become visible again allowing
another consumer to retrieve it.
For detailed information on using Azure Service Bus queues, see Service Bus queues, topics and
subscriptions. For information on using Azure storage queues, see Get started with Azure Queue
storage using .NET.
The following code from the QueueManager class in the CompetingConsumers solution available
on GitHub shows how you can create a queue by using a QueueClient instance in the Start event
handler in a web or worker role.
// Set the maximum delivery count for messages in the queue. A message
// is automatically dead-lettered after this number of deliveries. The
// default value for dead letter count is 10.
queueDescription.MaxDeliveryCount = 3;
await manager.CreateQueueAsync(queueDescription);
}
...
The next code snippet shows how an application can create and send a batch of messages
to the queue.
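That snippet is not reproduced in this extract; a sketch using the same legacy Microsoft.ServiceBus.Messaging API (the QueueClient is assumed to be created elsewhere, as in the previous snippet) might look like this:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class Producer
{
    public static async Task SendBatchAsync(QueueClient queueClient)
    {
        var messages = new List<BrokeredMessage>();
        for (int i = 0; i < 10; i++)
        {
            messages.Add(new BrokeredMessage($"Message {i}"));
        }
        // Send the whole batch in a single call.
        await queueClient.SendBatchAsync(messages);
    }
}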
The following code shows how a consumer service instance can receive messages from the queue
by following an event-driven approach. The processMessageTask parameter to the ReceiveMessages
method is a delegate that references the code to run when a message is received. This code is run
asynchronously.
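Again the listing is missing from this extract; a sketch consistent with the description, using PeekLock semantics via OnMessageAsync (the option values are illustrative), could be:

using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

public static class Consumer
{
    public static void ReceiveMessages(QueueClient queueClient,
        Func<BrokeredMessage, Task> processMessageTask)
    {
        var options = new OnMessageOptions
        {
            AutoComplete = false,    // Complete (or abandon) explicitly below.
            MaxConcurrentCalls = 10  // Process up to 10 messages in parallel.
        };

        // The callback runs asynchronously each time a message arrives.
        queueClient.OnMessageAsync(async message =>
        {
            try
            {
                await processMessageTask(message);
                await message.CompleteAsync(); // Remove the message (PeekLock).
            }
            catch (Exception)
            {
                await message.AbandonAsync();  // Make it visible to others again.
            }
        }, options);
    }
}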
Compute Resource Consolidation pattern
Consolidate multiple tasks or operations into a single computational unit. This can increase compute resource utilisation and reduce the costs and management overhead of cloud-hosted applications.
A cloud application often divides its operations into separate computational units, each provisioned and deployed individually. Each computational unit consumes chargeable resources, even when it's idle or lightly used.
Therefore, this isn’t always the most cost-effective solution.
In Azure, this concern applies to roles in a Cloud Service, App Services and Virtual Machines.
These items run in their own virtual environment. Running a collection of separate roles, websites
or virtual machines that are designed to perform a set of well-defined operations, but that need
to communicate and cooperate as part of a single solution, can be an inefficient use of resources.
Solution
To help reduce costs, increase utilisation, improve communication speed and reduce management
it’s possible to consolidate multiple tasks or operations into a single computational unit.
Tasks can be grouped according to criteria based on the features provided by the environment and
the costs associated with these features. A common approach is to look for tasks that have a similar
profile concerning their scalability, lifetime and processing requirements. Grouping these together
allows them to scale as a unit. The elasticity provided by many cloud environments enables additional
instances of a computational unit to be started and stopped according to the workload. For example,
Azure provides autoscaling that you can apply to roles in a Cloud Service, App Services and Virtual
Machines. For more information, see Autoscaling Guidance.
As a counterexample to show how scalability can be used to determine which operations shouldn't
be grouped together, consider the following two tasks:
• Task 1 polls for infrequent, time-insensitive messages sent to a queue.
• Task 2 handles high-volume bursts of network traffic.
The second task requires elasticity that can involve starting and stopping a large number of instances
of the computational unit. Applying the same scaling to the first task would simply result in more
tasks listening for infrequent messages on the same queue and is a waste of resources.
In many cloud environments it’s possible to specify the resources available to a computational unit
in terms of the number of CPU cores, memory, disk space and so on. Generally, the more resources
specified, the greater the cost. To save money, it’s important to maximise the work an expensive
computational unit performs and not let it become inactive for an extended period.
Scalability and elasticity. Many cloud solutions implement scalability and elasticity at the level of
the computational unit by starting and stopping instances of units. Avoid grouping tasks that have
conflicting scalability requirements in the same computational unit.
Lifetime. The cloud infrastructure periodically recycles the virtual environment that hosts a
computational unit. When there are many long-running tasks inside a computational unit, it might
be necessary to configure the unit to prevent it from being recycled until these tasks have finished.
Alternatively, design the tasks by using a check-pointing approach that enables them to stop cleanly
and continue at the point they were interrupted when the computational unit is restarted.
Security. Tasks in the same computational unit might share the same security context and be
able to access the same resources. There must be a high degree of trust between the tasks, and
confidence that one task isn’t going to corrupt or adversely affect another. Additionally, increasing
the number of tasks running in a computational unit increases the attack surface of the unit.
Each task is only as secure as the one with the most vulnerabilities.
Fault tolerance. If one task in a computational unit fails or behaves abnormally, it can affect the
other tasks running within the same unit. For example, if one task fails to start correctly it can cause
the entire start-up logic for the computational unit to fail and prevent other tasks in the same unit
from running.
Contention. Avoid introducing contention between tasks that compete for resources in the same
computational unit. Ideally, tasks that share the same computational unit should exhibit different
resource utilisation characteristics. For example, two compute-intensive tasks should probably not
reside in the same computational unit and neither should two tasks that consume large amounts
of memory. However, mixing a compute intensive task with a task that requires a large amount
of memory is a workable combination.
Complexity. Combining multiple tasks into a single computational unit adds complexity to the
code in the unit, possibly making it more difficult to test, debug and maintain.
Stable logical architecture. Design and implement the code in each task so that it shouldn’t
need to change, even if the physical environment the task runs in does change.
This pattern might not be suitable for tasks that perform critical fault-tolerant operations or tasks that
process highly sensitive or private data and require their own security context. These tasks should run
in their own isolated environment, in a separate computational unit.
Example
When building a cloud service on Azure, it’s possible to consolidate the processing performed
by multiple tasks into a single role. Typically this is a worker role that performs background or
asynchronous processing tasks.
In some cases it’s possible to include background or asynchronous processing tasks in the web role.
This technique helps to reduce costs and simplify deployment, although it can impact the scalability
and responsiveness of the public-facing interface provided by the web role. The article Combining
Multiple Azure Worker Roles into an Azure Web Role contains a detailed description of implementing
background or asynchronous processing tasks in a web role.
The role is responsible for starting and stopping the tasks. When the Azure fabric controller loads
a role, it raises the Start event for the role. You can override the OnStart method of the WebRole or
WorkerRole class to handle this event, perhaps to initialise the data and other resources that the
tasks will depend on.
When the OnStart method completes, the role can start responding to requests. You can find more
information and guidance about using the OnStart and Run methods in a role in the Application
Start-up Processes section in the patterns & practices guide Moving Applications to the Cloud.
Keep the code in the OnStart method as concise as possible. Azure doesn’t impose any limit on the
time taken for this method to complete, but the role won’t be able to start responding to network
requests sent to it until this method completes.
When the OnStart method has finished, the role executes the Run method. At this point, the fabric
controller can start sending requests to the role.
Place the code that actually creates the tasks in the Run method. Note that the Run method defines
the lifetime of the role instance. When this method completes, the fabric controller will arrange for
the role to be shut down.
When a role shuts down or is recycled, the fabric controller prevents any more incoming requests
being received from the load balancer and raises the Stop event. You can capture this event by
overriding the OnStop method of the role and perform any tidying up required before the role
terminates.
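For illustration only, a minimal OnStop override along these lines might look like the following
sketch, which assumes the Stop helper method shown later in this example:
public override void OnStop()
{
    // Cancel the running tasks and wait up to five minutes for them to finish.
    Stop(TimeSpan.FromMinutes(5));
    base.OnStop();
}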
The MyWorkerTask1 and the MyWorkerTask2 methods illustrate how to perform different tasks
within the same worker role. The following code shows MyWorkerTask1. This is a simple task that
sleeps for 30 seconds and then outputs a trace message. It repeats this process until the task
is cancelled. The code in MyWorkerTask2 is similar.
try
{
    while (!ct.IsCancellationRequested)
    {
        // Wake up and do some background processing if not cancelled.
        // TASK PROCESSING CODE HERE
        Trace.TraceInformation("Doing Worker Task 1 Work");
        // Sleep until the next cycle; Task.Delay throws
        // OperationCanceledException if the token is cancelled.
        await Task.Delay(TimeSpan.FromSeconds(30), ct);
    }
}
catch (OperationCanceledException)
{
    throw; // Expected when the role instance is shutting down.
}
The sample code shows a common implementation of a background process. In a real world
application you can follow this same structure, except that you should place your own processing
logic in the body of the loop that waits for the cancellation request.
/// <summary>
/// The cancellation token source use to cooperatively cancel running tasks
/// </summary>
private readonly CancellationTokenSource cts = new CancellationTokenSource();
/// <summary>
/// List of running tasks on the role instance
/// </summary>
private readonly List<Task> tasks = new List<Task>();
// If there wasn’t a cancellation request, stop all tasks and return from Run()
// An alternative to cancelling and returning when a task exits would be to
// restart the task.
if (!cts.IsCancellationRequested)
{
Trace.TraceInformation("Task returned without cancellation request");
Stop(TimeSpan.FromMinutes(5));
}
}
...
In this example, the Run method waits for tasks to be completed. If a task is cancelled, the Run
method assumes that the role is being shut down and waits for the remaining tasks to be cancelled
before finishing (it waits for a maximum of five minutes before terminating). If a task fails due to
an unexpected exception, the Run method cancels the task.
The Stop method shown in the following code is called when the fabric controller shuts down
the role instance (it’s invoked from the OnStop method). The code stops each task gracefully
by cancelling it. If any task takes more than five minutes to complete, the cancellation processing
in the Stop method ceases waiting and the role is terminated.
// Stop running tasks and wait for tasks to complete before returning
// unless the timeout expires.
private void Stop(TimeSpan timeout)
{
    Trace.TraceInformation("Stop called. Cancelling tasks.");
    // Cancel running tasks.
    cts.Cancel();

    // Wait for all the tasks to complete before returning. Note that the
    // emulator currently allows 30 seconds and Azure allows five
    // minutes for processing to complete.
    try
    {
        Task.WaitAll(tasks.ToArray(), timeout);
    }
    catch (AggregateException ex)
    {
        Trace.TraceError(ex.Message);
        // Examine ex.InnerExceptions for details of the individual task failures.
    }
}
Event Sourcing pattern
For a deeper understanding of the limits of the CRUD approach, see CRUD, Only When You Can
Afford It.
Solution
The Event Sourcing pattern defines an approach to handling operations on data that’s driven
by a sequence of events, each of which is recorded in an append-only store. Application code
sends a series of events that imperatively describe each action that has occurred on the data to
the event store where they’re persisted. Each event represents a set of changes to the data (such
as AddedItemToOrder).
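For illustration, an event such as AddedItemToOrder might be modelled as a simple immutable class
along the following lines (a sketch only; the property names are assumptions):
public class AddedItemToOrder
{
    // Immutable data that describes what happened, captured when the event is created.
    public Guid OrderId { get; private set; }
    public string ProductId { get; private set; }
    public int Quantity { get; private set; }
    public DateTimeOffset OccurredAt { get; private set; }

    public AddedItemToOrder(Guid orderId, string productId, int quantity)
    {
        this.OrderId = orderId;
        this.ProductId = productId;
        this.Quantity = quantity;
        this.OccurredAt = DateTimeOffset.UtcNow;
    }
}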
The events are persisted in an event store that acts as the system of record (the authoritative data
source) about the current state of the data. The event store typically publishes these events so that
consumers can be notified and can handle them if needed. Consumers could, for example, initiate
tasks that apply the operations in the events to other systems or perform any other associated action
that’s required to complete the operation. Notice that the application code that generates the events
is decoupled from the systems that subscribe to the events.
Typical uses of the events published by the event store are to maintain materialised views of entities
as actions in the application change them, and for integration with external systems. For example,
a system can maintain a materialised view of all customer orders that's used to populate parts of
the UI. As the application adds new orders, adds or removes items on the order and adds shipping
information, the events that describe these changes can be handled and used to update the
materialised view.
The figure shows an overview of the pattern, including some of the options for using the event
stream such as creating a materialised view, integrating events with external applications and
systems and replaying events to create projections of the current state of specific entities.
Events are simple objects that describe some action that occurred together with any associated data
required to describe the action represented by the event. Events don't directly update a data store.
They're simply recorded for handling at the appropriate time. This can simplify implementation and
management.
Events typically have meaning for a domain expert, whereas object-relational impedance mismatch
can make complex database tables hard to understand. Tables are artificial constructs that represent
the current state of the system, not the events that occurred.
Event sourcing can help prevent concurrent updates from causing conflicts because it avoids the
requirement to directly update objects in the data store. However, the domain model must still
be designed to protect itself from requests that might result in an inconsistent state.
The event store raises events, and tasks perform operations in response to those events. This
decoupling of the tasks from the events provides flexibility and extensibility. Tasks know about the
type of event and the event data, but not about the operation that triggered the event. In addition,
multiple tasks can handle each event. This enables easy integration with other services and systems
that only listen for new events raised by the event store. However, the event sourcing events tend
to be very low level, and it might be necessary to generate specific integration events instead.
Event sourcing is commonly combined with the CQRS pattern by performing the data management
tasks in response to the events, and by materialising views from the stored events.
Notes:
See the Data Consistency Primer for information about eventual consistency.
The event store is the permanent source of information, and so the event data should never be
updated. The only way to update an entity to undo a change is to add a compensating event to
the event store. If the format (rather than the data) of the persisted events needs to change, perhaps
during a migration, it can be difficult to combine existing events in the store with the new version.
It might be necessary to iterate through all the events making changes so that they're compliant with
the new format, or add new events that use the new format. Consider using a version stamp on each
version of the event schema to maintain both the old and the new event formats.
Multi-threaded applications and multiple instances of applications might be storing events in the
event store. The consistency of events in the event store is vital, as is the order of events that affect a
specific entity (the order that changes occur to an entity affects its current state). Adding a timestamp
to every event can help to avoid issues. Another common practice is to annotate each event resulting
from a request with an incremental identifier. If two actions attempt to add events for the same entity
at the same time, the event store can reject an event that matches an existing entity identifier and
event identifier.
There's no standard approach, or existing mechanisms such as SQL queries, for reading the events to
obtain information. The only data that can be extracted is a stream of events using an event identifier
as the criteria. The event ID typically maps to individual entities. The current state of an entity can be
determined only by replaying all of the events that relate to it against the original state of that entity.
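As a rough sketch of what replay might look like in code (the event store API and the
SeatAvailability aggregate used in the example later in this pattern are assumptions here):
// Rebuild the current state of an entity by applying its ordered event
// stream to the entity's original (empty) state.
public SeatAvailability GetCurrentState(Guid conferenceId)
{
    var state = new SeatAvailability(conferenceId);

    // eventStore.ReadEvents is a hypothetical method that returns the
    // events for the entity in the order they were recorded.
    foreach (var evt in eventStore.ReadEvents(conferenceId))
    {
        state.Apply(evt);   // each event deterministically updates the state
    }

    return state;
}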
Even though event sourcing minimises the chance of conflicting updates to the data, the application
must still be able to deal with inconsistencies that result from eventual consistency and the lack
of transactions. For example, an event that indicates a reduction in stock inventory might arrive
in the data store while an order for that item is being placed, resulting in a requirement to reconcile
the two operations either by advising the customer or creating a back order.
Event publication might be "at least once", and so consumers of the events must be idempotent.
They must not reapply the update described in an event if the event is handled more than once.
For example, if multiple instances of a consumer maintain an aggregate of an entity's property, such
as the total number of orders placed, only one must succeed in incrementing the aggregate when
an order placed event occurs. While this isn't a key characteristic of event sourcing, it's the usual
implementation decision.
Use this pattern in the following scenarios:
• When it's vital to minimise or completely avoid the occurrence of conflicting updates to data.
• When you want to record events that occur, and be able to replay them to restore the state
of a system, roll back changes or keep a history and audit log. For example, when a task involves
multiple steps you might need to execute actions to revert updates and then replay some steps
to bring the data back into a consistent state.
• When using events is a natural feature of the operation of the application, and requires little
additional development or implementation effort.
• When you need to decouple the process of inputting or updating data from the tasks required
to apply these actions. This might be to improve UI performance, or to distribute events to other
listeners that take action when the events occur. For example, integrating a payroll system with
an expense submission website so that events raised by the event store in response to data
updates made in the website are consumed by both the website and the payroll system.
• When you want flexibility to be able to change the format of materialised models and entity data
if requirements change, or – when used in conjunction with CQRS – you need to adapt a read
model or the views that expose the data.
• When used in conjunction with CQRS, and eventual consistency is acceptable while a read model
is updated, or the performance impact of rehydrating entities and data from an event stream
is acceptable.
This pattern might not be useful in the following situations:
• Systems where consistency and real-time updates to the views of the data are required.
• Systems where audit trails, history and capabilities to roll back and replay actions are not
required.
• Systems where there's only a very low occurrence of conflicting updates to the underlying data.
For example, systems that predominantly add data rather than updating it.
Example
A conference management system needs to track the number of completed bookings for a
conference so that it can check whether there are seats still available when a potential attendee tries
to make a booking. The system could store the total number of bookings for a conference in at least
two ways:
• The system could store the information about the total number of bookings as a separate entity
in a database that holds booking information. As bookings are made or cancelled, the system
could increment or decrement this number as appropriate. This approach is simple in theory, but
can cause scalability issues if a large number of attendees are attempting to book seats during
a short period of time. For example, in the last day or so prior to the booking period closing.
• The system could store information about bookings and cancellations as events held in an
event store. It could then calculate the number of seats available by replaying these events.
This approach can be more scalable due to the immutability of events. The system only needs to
be able to read data from the event store, or append data to the event store. Event information
about bookings and cancellations is never modified.
The following diagram illustrates how the seat reservation subsystem of the conference management
system might be implemented using event sourcing.
1. The user interface issues a command to reserve seats for two attendees. The command is
handled by a separate command handler.
2. An aggregate containing information about all reservations for the conference is constructed
by querying the events that describe bookings and cancellations. This aggregate is called
SeatAvailability, and is contained within a domain model that exposes methods for querying
and modifying the data in the aggregate.
Some optimisations to consider are using snapshots (so that you don't need to query and replay
the full list of events to obtain the current state of the aggregate), and maintaining a cached
copy of the aggregate in memory.
3. The command handler invokes a method exposed by the domain model to make the
reservations.
4. The SeatAvailability aggregate records an event containing the number of seats that were
reserved. The next time the aggregate applies events, all the reservations will be used
to compute how many seats remain.
5. The system appends the new event to the list of events in the event store.
As well as providing more scope for scalability, using an event store also provides a complete history,
or audit trail, of the bookings and cancellations for a conference. The events in the event store are
the accurate record. There is no need to persist aggregates in any other way because the system
can easily replay the events and restore the state to any point in time.
You can find more information about this example in Introducing Event Sourcing.
Related patterns and guidance
• Materialised View Pattern. The data store used in a system based on event sourcing is typically
not well suited to efficient querying. Instead, a common approach is to generate prepopulated
views of the data at regular intervals, or when the data changes. Shows how this can be done.
• Compensating Transaction Pattern. The existing data in an event sourcing store is not updated,
instead new entries are added that transition the state of entities to the new values. To reverse
a change, compensating entries are used because it isn't possible to simply reverse the previous
change. Describes how to undo the work that was performed by a previous operation.
• Data Consistency Primer. When using event sourcing with a separate read store or materialised
views, the read data won't be immediately consistent, instead it'll be only eventually consistent.
Summarises the issues surrounding maintaining consistency over distributed data.
• Data Partitioning Guidance. Data is often partitioned when using event sourcing to improve
scalability, reduce contention and optimise performance. Describes how to divide data into
discrete partitions, and the issues that can arise.
External Configuration Store pattern
It's challenging to manage changes to local configurations across multiple running instances
of the application, especially in a cloud-hosted scenario. It can result in instances using different
configuration settings while the update is being deployed.
Solution
Store the configuration information in external storage, and provide an interface that can be used
to quickly and efficiently read and update configuration settings. The type of external store depends
on the hosting and runtime environment of the application. In a cloud-hosted scenario it's typically
a cloud-based storage service, but could be a hosted database or other system.
The backing store you choose for configuration information should have an interface that provides
consistent and easy-to-use access. It should expose the information in a correctly typed and
structured format. The implementation might also need to authorise user access in order to protect
configuration data, and be flexible enough to allow storage of multiple versions of the configuration
(such as development, staging or production, including multiple release versions of each one).
Many built-in configuration systems read the data when the application starts up, and cache the data
in memory to provide fast access and minimise the impact on application performance. Depending
on the type of backing store used, and the latency of this store, it might be helpful to implement a
caching mechanism within the external configuration store. For more information, see the Caching
Guidance. The figure illustrates an overview of the External Configuration Store pattern with optional
local cache.
Consider the following points when deciding how to implement this pattern:
Choose a backing store that offers acceptable performance, high availability, robustness and can
be backed up as part of the application maintenance and administration process. In a cloud-hosted
application, using a cloud storage mechanism is usually a good choice to meet these requirements.
Design the schema of the backing store to allow flexibility in the types of information it can hold.
Ensure that it provides for all configuration requirements such as typed data, collections of settings,
multiple versions of settings and any other features that the applications using it require. The schema
should be easy to extend to support additional settings as requirements change.
Consider the physical capabilities of the backing store, how it relates to the way configuration
information is stored and the effects on performance. For example, storing an XML document
containing configuration information will require either the configuration interface or the application
to parse the document in order to read individual settings. It'll make updating a setting more
complicated, though caching the settings can help to offset slower read performance.
Consider how the configuration interface will permit control of the scope and inheritance
of configuration settings. For example, it might be a requirement to scope configuration settings at
the organisation, application and the machine level. It might need to support delegation of control
over access to different scopes, and to prevent or allow individual applications to override settings.
Ensure that the configuration interface can expose the configuration data in the required formats
such as typed values, collections, key/value pairs or property bags.
Consider how the configuration store interface will behave when settings contain errors, or don't exist
in the backing store. It might be appropriate to return default settings and log errors. Also consider
aspects such as the case sensitivity of configuration setting keys or names, the storage and handling
of binary data and the ways that null or empty values are handled.
Consider how to protect the configuration data to allow access to only the appropriate users and
applications. This is likely a feature of the configuration store interface, but it's also necessary
to ensure that the data in the backing store can't be accessed directly without the appropriate
permission. Ensure strict separation between the permissions required to read and to write
configuration data. Also consider whether you need to encrypt some or all of the configuration
settings, and how this'll be implemented in the configuration store interface.
Centrally stored configurations, which change application behaviour during runtime, are critically
important and should be deployed, updated and managed using the same mechanisms as deploying
application code. For example, changes that can affect more than one application must be carried
out using a full test and staged deployment approach to ensure that the change is appropriate for all
applications that use this configuration. If an administrator edits a setting to update one application,
it could adversely impact other applications that use the same setting.
This pattern is useful for:
• Configuration settings that are shared between multiple applications and application instances,
or where a standard configuration must be enforced across multiple applications and application
instances.
• A standard configuration system that doesn't support all of the required configuration settings,
such as storing images or complex data types.
• As a complementary store for some of the settings for applications, perhaps allowing
applications to override some or all of the centrally-stored settings.
Example
In a Microsoft Azure hosted application, a typical choice for storing configuration information
externally is to use Azure Storage. This is resilient, offers high performance, and is replicated three
times with automatic failover to offer high availability. Azure Table storage provides a key/value store
with the ability to use a flexible schema for the values. Azure Blob storage provides a hierarchical,
container-based store that can hold any type of data in individually named blobs.
The following example shows how a configuration store can be implemented over Blob storage
to store and expose configuration information. The BlobSettingsStore class abstracts Blob storage
for holding configuration information, and implements the ISettingsStore interface shown in the
following code.
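A minimal sketch of such an interface, based on the members used later in this example, might be
(the exact shape in the sample may differ):
public interface ISettingsStore
{
    // Returns a version token that changes whenever the stored settings change.
    // The BlobSettingsStore implementation uses the blob's ETag for this.
    Task<string> GetVersionAsync();

    // Retrieves all of the configuration settings from the backing store.
    Task<Dictionary<string, string>> FindAllAsync();

    // (Members for updating settings are elided.)
}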
This interface defines methods for retrieving and updating configuration settings held in the
configuration store, and includes a version number that can be used to detect whether any
configuration settings have been modified recently. The BlobSettingsStore class uses the ETag
property of the blob to implement versioning. The ETag property is updated automatically each
time the blob is written.
By design, this simple solution exposes all configuration settings as string values rather than
typed values.
The GetSettings method invokes the CheckForConfigurationChanges method to detect whether the
configuration information in blob storage has changed. It does this by examining the version number
and comparing it with the current version number held by the ExternalConfigurationManager object.
If one or more changes have occurred, the Changed event is raised and the configuration settings
cached in the Dictionary object are refreshed. This is an application of the Cache-Aside pattern.
The following code sample shows how the Changed event, the GetSettings method, and the
CheckForConfigurationChanges method are implemented:
string value;
try
{
    // Read the requested value from the settings cache under the read lock.
    this.settingsCacheLock.EnterReadLock();
    this.settingsCache.TryGetValue(key, out value);
    return value;
}
finally
{
    this.settingsCacheLock.ExitReadLock();
}
...
private void CheckForConfigurationChanges()
{
    try
    {
        this.settingsCacheLock.EnterWriteLock();

        // It is assumed that updates are infrequent. Retrieve the latest
        // version number from the settings store (details elided).
        ...
        // If the versions are the same, nothing has changed in the configuration.
        if (this.currentVersion == latestVersion) return;

        // Get the latest settings from the settings store and publish changes.
        // This blocks synchronously because the method itself isn't async.
        var latestSettings = this.settings.FindAllAsync().Result;
        if (this.settingsCache != null)
        {
            // Notify subscribers of each setting that has changed.
            latestSettings.Except(this.settingsCache).ToList()
                .ForEach(kv => this.changed.OnNext(kv));
        }
        this.settingsCache = latestSettings;
        this.currentVersion = latestVersion;
    }
    finally
    {
        this.settingsCacheLock.ExitWriteLock();
    }
}
The ExternalConfigurationManager class also provides a property named Environment. This property
supports varying configurations for an application running in different environments, such as staging
and production.
An ExternalConfigurationManager object can also query the BlobSettingsStore object periodically for
any changes. In the following code, the StartMonitor method calls CheckForConfigurationChanges
at an interval to detect any changes and raise the Changed event, as described earlier.
/// <summary>
/// Start the background monitoring for configuration changes in the central store
/// </summary>
public void StartMonitor()
{
    if (this.IsMonitoring)
        return;

    try
    {
        this.timerSemaphore.Wait();
        // Start the monitoring task (details elided).
        ...
    }
    finally
    {
        this.timerSemaphore.Release();
    }
}

/// <summary>
/// Loop that monitors for configuration changes
/// </summary>
/// <returns></returns>
public async Task ConfigChangeMonitor()
{
    while (!cts.Token.IsCancellationRequested)
    {
        // Periodically check for changes (details elided).
        ...
    }
}

/// <summary>
/// Stop monitoring for configuration changes
/// </summary>
public void StopMonitor()
{
    try
    {
        this.timerSemaphore.Wait();
        // Cancel and wait for the monitoring loop (details elided).
        ...
        this.monitoringTask = null;
    }
    finally
    {
        this.timerSemaphore.Release();
    }
}
The following code is taken from the WorkerRole class in the ExternalConfigurationStore.Cloud
project. It shows how the application uses the ExternalConfiguration class to read a setting.
// Get a setting.
var setting = ExternalConfiguration.Instance.GetAppSetting("setting1");
Trace.TraceInformation("Worker Role: Get setting1, value: " + setting);
this.completeEvent.WaitOne();
}
The following code, also from the WorkerRole class, shows how the application subscribes
to configuration events.
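A minimal sketch of such a subscription, assuming the Changed member is exposed as a Reactive
Extensions IObservable of key/value settings (as the OnNext call shown earlier suggests), might be:
// Subscribe to configuration-change notifications raised by the manager.
ExternalConfiguration.Instance.Changed.Subscribe(
    setting => Trace.TraceInformation(
        "Setting changed: {0} = {1}", setting.Key, setting.Value));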
Federated Identity pattern
Requiring users to have separate credentials, and each application to maintain its own user
store, can:
• Expose security vulnerabilities. When a user leaves the company the account must immediately
be deprovisioned. It's easy to overlook this in large organisations.
• Complicate user management. Administrators must manage credentials for all of the users,
and perform additional tasks such as providing password reminders.
Users typically prefer to use the same credentials for all these applications.
Solution
Implement an authentication mechanism that can use federated identity, delegating user
authentication to a trusted identity provider.
The trusted identity providers include corporate directories, on-premises federation services,
other security token services (STS) provided by business partners, or social identity providers that
can authenticate users who have, for example, a Microsoft, Google, Yahoo! or Facebook account.
The figure illustrates the Federated Identity pattern when a client application needs to access
a service that requires authentication. The authentication is performed by an IdP that works in
concert with an STS. The IdP issues security tokens that provide information about the authenticated
user. This information, referred to as claims, includes the user's identity, and might also include other
information such as role membership and more granular access rights.
This model is often called claims-based access control. Applications and services authorise
access to features and functionality based on the claims contained in the token. The service that
requires authentication must trust the IdP. The client application contacts the IdP that performs
the authentication. If the authentication is successful, the IdP returns a token containing the claims
that identify the user to the STS (note that the IdP and STS can be the same service). The STS can
transform and augment the claims in the token based on predefined rules, before returning it
to the client. The client application can then pass this token to the service as proof of its identity.
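As a small illustration of checking a claim in .NET, a service might authorise a feature along the
following lines (a sketch only; the role name is an assumption):
using System.Security.Claims;
using System.Threading;
...
// After the token has been validated, the claims are available on the principal.
var principal = Thread.CurrentPrincipal as ClaimsPrincipal;

if (principal != null &&
    principal.HasClaim(ClaimTypes.Role, "OrderAdministrator"))
{
    // Allow access to the order-administration features.
}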
There might be additional STSs in the chain of trust. For example, in the scenario described later,
an on-premises STS trusts another STS that is responsible for accessing an identity provider to
authenticate the user. This approach is common in enterprise scenarios where there's an on-premises
STS and directory.
Federated identity also has the major advantage that management of the identity and credentials is
the responsibility of the identity provider. The application or service doesn't need to provide identity
management features. In addition, in corporate scenarios, the corporate directory doesn't need
to know about the user if it trusts the identity provider. This removes all the administrative overhead
of managing the user identity within the directory.
• Authentication tools make it possible to configure access control based on role claims contained
in the authentication token. This is often referred to as role-based access control (RBAC),
and it can allow a more granular level of control over access to features and resources.
• Unlike a corporate directory, claims-based authentication using social identity providers doesn't
usually provide information about the authenticated user other than an email address, and
perhaps a name. Some social identity providers, such as a Microsoft account, provide only
a unique identifier. The application usually needs to maintain some information on registered
users, and be able to match this information to the identifier contained in the claims in the
token. Typically this is done through registration when the user first accesses the application,
and information is then injected into the token as additional claims after each authentication.
• If there's more than one identity provider configured for the STS, it must detect which identity
provider the user should be redirected to for authentication. This process is called home realm
discovery. The STS might be able to do this automatically based on an email address or user
name that the user provides, a subdomain of the application that the user is accessing, the
user's IP address scope or on the contents of a cookie stored in the user's browser. For example,
if the user entered an email address in the Microsoft domain, such as user@live.com, the STS
will redirect the user to the Microsoft account sign-in page. On later visits, the STS could use
a cookie to indicate that the last sign in was with a Microsoft account. If automatic discovery
can't determine the home realm, the STS will display a home realm discovery page that lists
the trusted identity providers, and the user must select the one they want to use.
This pattern is useful for scenarios such as:
• Federated identity with multiple partners. In this scenario you need to authenticate both
corporate employees and business partners who don't have accounts in the corporate directory.
This is common in business-to-business applications, applications that integrate with third-party
services and where companies with different IT systems have merged or shared resources.
This pattern might not be useful in the following situations:
• The application was originally built using a different authentication mechanism, perhaps with
custom user stores, or doesn't have the capability to handle the negotiation standards used
by claims-based technologies. Retrofitting claims-based authentication and access control
into existing applications can be complex, and probably not cost effective.
Example
An organisation hosts a multi-tenant Software as a Service (SaaS) application in Microsoft Azure.
The application includes a website that tenants can use to manage the application for their own
users. The application allows tenants to access the website by using a federated identity that is
generated by Active Directory Federation Services (ADFS) when a user is authenticated by that
organisation's own Active Directory.
Tenants won't need to remember separate credentials to access the application, and an administrator
at the tenant's company can configure in its own ADFS the list of users that can access the
application.
Related guidance
• Microsoft Azure Active Directory
• Active Directory Domain Services
• Active Directory Federation Services
• Identity management for multi-tenant applications in Microsoft Azure
• Multi-tenant Applications in Azure
Gatekeeper pattern
Protect applications and services by using a dedicated host instance that acts as a broker between
clients and the application or service, validates and sanitises requests and passes requests and data
between them. This can provide an additional layer of security, and limit the attack surface of the
system.
If a malicious user is able to compromise the system and gain access to the application's hosting
environment, the security mechanisms it uses such as credentials and storage keys, and the services
and data it accesses, are exposed. As a result, the malicious user can gain unrestrained access
to sensitive information and other services.
Solution
To minimise the risk of clients gaining access to sensitive information and services, decouple hosts
or tasks that expose public endpoints from the code that processes requests and accesses storage.
You can achieve this by using a façade or a dedicated task that interacts with clients and then hands
off the request – perhaps through a decoupled interface – to the hosts or tasks that'll handle the
request. The figure provides a high-level overview of this pattern.
• Controlled validation. The gatekeeper validates all requests, and rejects those that don't meet
validation requirements.
• Limited risk and exposure. The gatekeeper doesn't have access to the credentials or keys used
by the trusted host to access storage and services. If the gatekeeper is compromised, the attacker
doesn't get access to these credentials or keys.
• Appropriate security. The gatekeeper runs in a limited privilege mode, while the rest of the
application runs in the full trust mode required to access storage and services. If the gatekeeper
is compromised, it can't directly access the application services or data.
This pattern acts like a firewall in a typical network topology. It allows the gatekeeper to examine
requests and make a decision about whether to pass the request on to the trusted host (sometimes
called the keymaster) that performs the required tasks. This decision typically requires the gatekeeper
to validate and sanitise the request content before passing it on to the trusted host.
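For illustration only, the hand-off might look something like the following sketch, in which the
RequestValidator and workQueue helpers are assumptions; the point is that the gatekeeper validates
and sanitises, then forwards the request over a decoupled channel without ever holding the trusted
host's credentials:
// Runs in the public-facing, limited-privilege gatekeeper instance.
public async Task HandleClientRequestAsync(string payload)
{
    // Reject malformed or suspicious requests before they reach the trusted host.
    if (!RequestValidator.IsWellFormed(payload))
    {
        throw new InvalidOperationException("Request rejected by gatekeeper.");
    }

    // Forward the sanitised request to the trusted host over a queue. The
    // gatekeeper itself holds no storage keys or service credentials.
    await workQueue.EnqueueAsync(RequestValidator.Sanitise(payload));
}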
• The gatekeeper must run in a limited privilege mode. Typically this means running the
gatekeeper and the trusted host in separate hosted services or virtual machines.
• The gatekeeper shouldn't perform any processing related to the application or services, or
access any data. Its function is purely to validate and sanitise requests. The trusted hosts might
need to perform additional validation of requests, but the core validation should be performed
by the gatekeeper.
• Use a secure communication channel (HTTPS, SSL or TLS) between the gatekeeper and the
trusted hosts or tasks where this is possible. However, some hosting environments don't support
HTTPS on internal endpoints.
Adding the extra layer to the application to implement the gatekeeper pattern is likely to have
some impact on performance due to the additional processing and network communication
it requires.
This pattern is useful for:
• Distributed applications where it's necessary to perform request validation separately from
the main tasks, or to centralise this validation to simplify maintenance and administration.
Example
In a cloud-hosted scenario, this pattern can be implemented by decoupling the gatekeeper role
or virtual machine from the trusted roles and services in an application. Do this by using an internal
endpoint, a queue or storage as an intermediate communication mechanism. The figure illustrates
using an internal endpoint.
Related patterns
The Valet Key pattern might also be relevant when implementing the Gatekeeper pattern. When
communicating between the Gatekeeper and trusted roles it's good practice to enhance security
by using keys or tokens that limit permissions for accessing resources.
Gateway Aggregation pattern
A client that must call several backend services to perform a single operation can incur
significant overhead from the multiple round trips involved.
In the following diagram, the client sends requests to each service (1,2,3). Each service processes
the request and sends the response back to the application (4,5,6). Over a mobile network with
typically high latency, using individual requests in this manner is inefficient and could result in broken
connectivity or incomplete requests. While each request may be done in parallel, the application
must send, wait and process data for each request, all on separate connections, increasing the
chance of failure.
Solution
Use a gateway to reduce chattiness between the client and the services. The gateway receives client
requests, dispatches requests to the various backend systems and then aggregates the results and
sends them back to the requesting client.
This pattern can reduce the number of requests that the application makes to backend services,
and improve application performance over high-latency networks.
In the following diagram, the application sends a request to the gateway (1). The request contains
a package of additional requests. The gateway decomposes these and processes each request
by sending it to the relevant service (2). Each service returns a response to the gateway (3). The
gateway combines the responses from each service and sends the response to the application (4).
The application makes a single request and receives only a single response from the gateway.
• The gateway should be located near the backend services to reduce latency as much as possible.
• The gateway service may introduce a single point of failure. Ensure the gateway is properly
designed to meet your application's availability requirements.
• The gateway may introduce a bottleneck. Ensure the gateway has adequate performance
to handle load and can be scaled to meet your anticipated growth.
• Perform load testing against the gateway to ensure that you don't introduce cascading failures
for services.
• Implement a resilient design, using techniques such as bulkheads, circuit breaking, retry and
timeouts.
• If one or more service calls takes too long, it may be acceptable to timeout and return a partial
set of data. Consider how your application will handle this scenario.
• Use asynchronous I/O to ensure that a delay at the backend doesn't cause performance issues
in the application.
• Implement distributed tracing using correlation IDs to track each individual call.
• Instead of building aggregation into the gateway, consider placing an aggregation service behind
the gateway. Request aggregation will likely have different resource requirements than other
services in the gateway and may impact the gateway's routing and offloading functionality.
This pattern may not be suitable when:
• The client or application is located near the backend services and latency is not a significant
factor.
Example
The following example illustrates how to create a simple gateway aggregation service in NGINX,
using Lua.
worker_processes 4;

events {
    worker_connections 1024;
}

http {
    server {
        listen 80;

        location = /batch {
            content_by_lua '
                ngx.req.read_body()

                -- Read the batch of sub-requests from the JSON request body.
                local cjson = require "cjson"
                local batch = cjson.decode(ngx.req.get_body_data())["batch"]

                -- Build the table of sub-requests to dispatch.
                local requests = {}
                for i, item in ipairs(batch) do
                    table.insert(requests, {item.relative_url, { method = ngx.HTTP_GET }})
                end

                -- Dispatch the sub-requests in parallel and collect the results.
                local results = {}
                local resps = { ngx.location.capture_multi(requests) }
                for i, res in ipairs(resps) do
                    table.insert(results, { status = res.status, body = cjson.decode(res.body), header = res.header })
                end

                -- Return the aggregated responses as a single JSON document.
                ngx.say(cjson.encode({results = results}))
            ';
        }

        location = /service1 {
            default_type application/json;
            echo '{"attr1":"val1"}';
        }

        location = /service2 {
            default_type application/json;
            echo '{"attr2":"val2"}';
        }
    }
}
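With this configuration, a client makes a single POST request to /batch with a body that lists the
sub-requests to aggregate, matching the relative_url field that the Lua code reads, for example:
{ "batch": [ { "relative_url": "/service1" }, { "relative_url": "/service2" } ] }
The gateway fans these out to /service1 and /service2 in parallel and returns the combined results
in a single JSON response.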
Gateway Offloading pattern
Properly handling security issues (token validation, encryption, SSL certificate management)
and other complex tasks can require team members to have highly specialised skills. For example,
a certificate needed by an application must be configured and deployed on all application
instances. With each new deployment, the certificate must be managed to ensure that it does not
expire. Any common certificate that is due to expire must be updated, tested and verified on every
application deployment.
Solution
Offload some features into an API gateway, particularly cross-cutting concerns such as certificate
management, authentication, SSL termination, monitoring, protocol translation or throttling.
The following diagram shows an API gateway that terminates inbound SSL connections. It requests
data on behalf of the original requestor from any HTTP server upstream of the API gateway.
• Allows dedicated teams to implement features that require specialised expertise, such as security.
This allows your core team to focus on the application functionality, leaving these specialised
but cross-cutting concerns to the relevant experts.
• Provides some consistency for request and response logging and monitoring. Even if a service
is not correctly instrumented, the gateway can be configured to ensure a minimum level
of monitoring and logging.
• Ensure the gateway is designed for the capacity and scaling requirements of your application
and endpoints. Make sure that the gateway does not become a bottleneck for the application
and is sufficiently scalable.
• Only offload features that are used by the entire application, such as security or data transfer.
• If you need to track transactions, consider generating correlation IDs for logging purposes.
• A feature that is common across application deployments that may have different resource
requirements, such as memory resources, storage capacity or network connections.
Example
Using Nginx as the SSL offload appliance, the following configuration terminates an inbound
SSL connection and distributes the connection to one of three upstream HTTP servers.
upstream iis {
server 10.3.0.10 max_fails=3 fail_timeout=15s;
server 10.3.0.20 max_fails=3 fail_timeout=15s;
server 10.3.0.30 max_fails=3 fail_timeout=15s;
}
server {
listen 443;
ssl on;
ssl_certificate /etc/nginx/ssl/domain.cer;
ssl_certificate_key /etc/nginx/ssl/domain.key;
location / {
set $targ iis;
proxy_pass http://$targ;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $host;
}
}
Related guidance
• Backends for Frontends pattern
• Gateway Aggregation pattern
• Gateway Routing pattern
Gateway Routing pattern
Route requests to multiple services using a single endpoint.
With this pattern, the client application only needs to know about and communicate with
a single endpoint. If a service is consolidated or decomposed, the client does not necessarily
require updating. It can continue making requests to the gateway, and only the routing changes.
A gateway also lets you abstract backend services from the clients, allowing you to keep client calls
simple while enabling changes in the backend services behind the gateway. Client calls can be routed
to whatever service or services need to handle the expected client behaviour, allowing you to add,
split and reorganise services behind the gateway without changing the client.
This pattern can also help with deployment, by allowing you to manage how updates are rolled
out to users. When a new version of your service is deployed, it can be deployed in parallel with
the existing version. Routing lets you control what version of the service is presented to the clients,
giving you the flexibility to use various release strategies, whether incremental, parallel or complete
rollouts of updates. Any issues discovered after the new service is deployed can be quickly reverted
by making a configuration change at the gateway, without affecting clients.
• The gateway service may introduce a single point of failure. Ensure that it is properly designed
to meet your availability requirements. Consider resilience and fault tolerance capabilities when
implementing.
• The gateway service may introduce a bottleneck. Ensure that the gateway has adequate
performance to handle load and can easily scale in line with your growth expectations.
• You need to route requests from externally addressable endpoints to internal virtual endpoints,
such as exposing ports on a VM to cluster virtual IP addresses.
This pattern may not be suitable when you have a simple application that uses only one
or two services.
Example
Using Nginx as the router, the following is a simple example configuration file for a server that
routes requests for applications residing on different virtual directories to different machines
at the back end.
server {
listen 80;
server_name domain.com;
location /app1 {
proxy_pass http://10.0.3.10:80;
}
location /app2 {
proxy_pass http://10.0.3.20:80;
}
location /app3 {
proxy_pass http://10.0.3.30:80;
}
}
Related guidance
• Backends for Frontends pattern
• Gateway Aggregation pattern
• Gateway Offloading pattern
Health Endpoint Monitoring pattern
There are many factors that affect cloud-hosted applications such as network latency, the
performance and availability of the underlying compute and storage systems and the network
bandwidth between them. The service can fail entirely or partially due to any of these factors.
Therefore, you must verify at regular intervals that the service is performing correctly to ensure
the required level of availability, which might be part of your service level agreement (SLA).
Solution
Implement health monitoring by sending requests to an endpoint on the application. The application
should perform the necessary checks and return an indication of its status.
Health monitoring typically combines two factors:
• The checks (if any) performed by the application or service in response to the request to the
health verification endpoint.
• Analysis of the results by the tool or framework that performs the health verification check.
The response code indicates the status of the application and, optionally, any components or services
it uses. The latency or response time check is performed by the monitoring tool or framework.
The figure provides an overview of the pattern.
Other checks performed by the application itself might include:
• Checking other resources or services located in the application, or located elsewhere but
used by the application.
Services and tools are available that monitor web applications by submitting a request to
a configurable set of endpoints, and evaluating the results against a set of configurable rules.
It's relatively easy to create a service endpoint whose sole purpose is to perform some functional
tests on the system.
Typical checks that the monitoring tools and frameworks can perform include:
• Validating the response code. For example, an HTTP response of 200 (OK) indicates that the
application responded without error. The monitoring system might also check for other response
codes to give more comprehensive results.
• Checking the content of the response to detect errors, even when a 200 (OK) status code is
returned. This can detect errors that affect only a section of the returned web page or service
response. For example, checking the title of a page or looking for a specific phrase that indicates
the correct page was returned.
• Measuring the response time, which indicates a combination of the network latency and the time
that the application took to execute the request. An increasing value can indicate an emerging
problem with the application or network.
• Checking resources or services located outside the application, such as a content delivery
network used by the application to deliver content from global caches.
• Measuring the response time of a DNS lookup for the URL of the application to measure
DNS latency and DNS failures.
• Validating the URL returned by the DNS lookup to ensure correct entries. This can help to avoid
malicious request redirection through a successful attack on the DNS server.
It's also useful, where possible, to run these checks from different on-premises or hosted locations
to measure and compare response times. Ideally you should monitor applications from locations that
are close to customers to get an accurate view of the performance from each location. In addition
to providing a more robust checking mechanism, the results can help you decide on the deployment
location for the application – and whether to deploy it in more than one datacentre.
Tests should also be run against all the service instances that customers use to ensure the application
is working correctly for all customers. For example, if customer storage is spread across more than
one storage account, the monitoring process should check all of these.
The number of endpoints to expose for an application. One approach is to expose at least one
endpoint for the core services that the application uses and another for lower priority services,
allowing different levels of importance to be assigned to each monitoring result. Also consider
exposing more endpoints, such as one for each core service, for additional monitoring granularity.
For example, a health verification check might check the database, storage and an external
geocoding service that an application uses, with each requiring a different level of uptime and
response time. The application could still be healthy if the geocoding service, or some other
background task, is not available for a few minutes.
Whether to use the same endpoint for monitoring as is used for general access, but to a specific
path designed for health verification checks, for example, /HealthCheck/{GUID}/ on the general
access endpoint. This allows some functional tests in the application to be run by the monitoring
tools, such as adding a new user registration, signing in and placing a test order, while also verifying
that the general access endpoint is available.
The type of information to collect in the service in response to monitoring requests, and how
to return this information. Most existing tools and frameworks look only at the HTTP status code
that the endpoint returns. To return and validate additional information, you might have to create
a custom monitoring utility or service.
How much information to collect. Performing excessive processing during the check can overload
the application and impact other users. The time it takes might exceed the timeout of the monitoring
system so it marks the application as unavailable. Most applications include instrumentation such
as error handlers and performance counters that log performance and detailed error information;
this might be sufficient instead of returning additional information from a health verification check.
Caching the endpoint status. It could be expensive to run the health check too frequently. If the
health status is reported through a dashboard, for example, you don't want every request from the
dashboard to trigger a health check. Instead, periodically check the system health and cache the
status. Expose an endpoint that returns the cached status.
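A rough sketch of this approach in C# (the names are assumptions, and the real checks would
replace RunCoreServiceChecksAsync):
// Refreshed periodically by a background timer rather than on every request.
private static volatile int cachedStatusCode = (int)HttpStatusCode.OK;

public static async Task RefreshHealthStatusAsync()
{
    bool healthy = await RunCoreServiceChecksAsync();   // the expensive checks
    cachedStatusCode = (int)(healthy ? HttpStatusCode.OK
                                     : HttpStatusCode.InternalServerError);
}

// The monitoring endpoint simply returns the cached result.
public ActionResult CachedHealthCheck()
{
    return new HttpStatusCodeResult(cachedStatusCode);
}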
How to configure security for the monitoring endpoints to protect them from public access, which
might expose the application to malicious attacks, risk the exposure of sensitive information or
attract denial of service (DoS) attacks. Typically this should be done in the application configuration
so that it can be updated easily without restarting the application. Consider using one or more
of the following techniques:
• Secure the endpoint by requiring authentication. You can do this by using an authentication
security key in the request header or by passing credentials with the request, provided that
the monitoring service or tool supports authentication.
• Use an obscure or hidden endpoint. For example, expose the endpoint on a different
IP address to that used by the default application URL, configure the endpoint on
a standard HTTP port and/or use a complex path to the test page. You can usually
specify additional endpoint addresses and ports in the application configuration, and
add entries for these endpoints to the DNS server if required to avoid having to specify
the IP address directly.
• DoS attacks are likely to have less impact on a separate endpoint that performs basic
functional tests without compromising the operation of the application. Ideally, avoid
using a test that might expose sensitive information. If you must return information
that might be useful to an attacker, consider how you'll protect the endpoint and the
data from unauthorised access. In this case just relying on obscurity isn't enough. You
should also consider using an HTTPS connection and encrypting any sensitive data,
although this will increase the load on the server.
• How to access an endpoint that's secured using authentication. Not all tools and frameworks can
be configured to include credentials with the health verification request. For example, Microsoft
Azure built-in health verification features can't provide authentication credentials. Some third-
party alternatives are Pingdom, Panopta, NewRelic and Statuscake.
• How to ensure that the monitoring agent is performing correctly. One approach is to expose an
endpoint that simply returns a value from the application configuration or a random value that
can be used to test the agent.
Also ensure that the monitoring system performs checks on itself, such as a self-test and built-in test,
to avoid it issuing false positive results.
This pattern is useful for:
• Monitoring middle-tier or shared services to detect and isolate a failure that could disrupt other
applications.
Example
The following code examples, taken from the HealthCheckController class (a sample that
demonstrates this pattern is available on GitHub), demonstrates exposing an endpoint for performing
a range of health checks.
The CoreServices method, shown below in C#, performs a series of checks on services used in the
application. If all of the tests run without error, the method returns a 200 (OK) status code. If any of
the tests raises an exception, the method returns a 500 (Internal Error) status code. The method could
optionally return additional information when an error occurs, if the monitoring tool or framework
is able to make use of it.
public ActionResult CoreServices()
{
    try
    {
        // Run a simple check on each core service that the application uses (elided).
        ...
    }
    catch (Exception)
    {
        // This can optionally return different status codes based on the exception.
        // Optionally it could return more details about the exception.
        // The additional information could be used by administrators who access the
        // endpoint with a browser, or using a ping utility that can display the
        // additional information.
        return new HttpStatusCodeResult((int)HttpStatusCode.InternalServerError);
    }
    return new HttpStatusCodeResult((int)HttpStatusCode.OK);
}
The ObscurePath method shows how you can read a path from the application configuration and use
it as the endpoint for tests. This example, in C#, also shows how you can accept an ID as a parameter
and use it to check for valid requests.
// The obscure path can be set through configuration to hide the endpoint.
var hiddenPathKey = CloudConfigurationManager.GetSetting("Test.ObscurePath");
// If the value passed does not match that in configuration, return 404 (Not Found).
if (!string.Equals(id, hiddenPathKey))
{
return new HttpStatusCodeResult((int)HttpStatusCode.NotFound);
}
The TestResponseFromConfig method shows how you can expose an endpoint that performs
a check for a specified configuration setting value.
int returnStatusCode;
...
To perform the health verification checks you can:
• Use a third-party service or a framework such as Microsoft System Centre Operations Manager.
• Create a custom utility or a service that runs on your own or on a hosted server.
Even though Azure provides a reasonably comprehensive set of monitoring options, you can use
additional services and tools to provide extra information. Azure Management Services provides
a built-in monitoring mechanism for alert rules. The alerts section of the management services page
in the Azure portal allows you to configure up to ten alert rules per subscription for your services.
These rules specify a condition and a threshold value for a service such as CPU load, or the number
of requests or errors per second, and the service can automatically send email notifications to
addresses you define in each rule.
The conditions you can monitor vary depending on the hosting mechanism you choose for your
application (such as WebSites, Cloud Services, Virtual Machines or Mobile Services), but all of these
include the ability to create an alert rule that uses a web endpoint you specify in the settings for
your service. This endpoint should respond in a timely way so that the alert system can detect that
the application is operating correctly.
If you host your application in Azure Cloud Services web and worker roles or Virtual Machines, you
can take advantage of one of the built-in services in Azure called Traffic Manager. Traffic Manager
is a routing and load-balancing service that can distribute requests to specific instances of your
Cloud Services hosted application based on a range of rules and settings.
In addition to routing requests, Traffic Manager pings a URL, port and relative path that you specify
on a regular basis to determine which instances of the application defined in its rules are active and
are responding to requests. If it detects a status code 200 (OK), it marks the application as available.
Any other status code causes Traffic Manager to mark the application as offline. You can view the
status in the Traffic Manager console, and configure the rule to reroute requests to other instances
of the application that are responding.
However, Traffic Manager will only wait ten seconds to receive a response from the monitoring URL.
Therefore, you should ensure that your health verification code executes in this time, allowing for
network latency for the round trip from Traffic Manager to your application and back again.
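For illustration only, an endpoint suited to this kind of probe might do no more than return a success
code. The controller and action names in this sketch are hypothetical, not part of the sample.
using System.Net;
using System.Web.Mvc;

public class MonitoringController : Controller
{
    // Return 200 (OK) immediately. Any checks performed here must be fast
    // enough to complete, including network latency, within the ten-second
    // window that Traffic Manager allows for a response.
    public ActionResult HealthProbe()
    {
        return new HttpStatusCodeResult((int)HttpStatusCode.OK);
    }
}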
More information about using Traffic Manager to monitor your applications is available on the
Azure website. Traffic Manager is also discussed in the Multiple Datacentre Deployment Guidance.
Index Table pattern
Many data stores organise the data for a collection of entities by using the primary key, and an
application can use this key to locate and retrieve data. For example, a Customers table might use
the Customer ID as the primary key. While the primary key is valuable for queries that fetch data
based on the value of this key, an
application might not be able to use the primary key if it needs to retrieve data based on some other
field. In the customers example, an application can't use the Customer ID primary key to retrieve
customers if it queries data solely by referencing the value of some other attribute, such as the town
in which the customer is located. To perform a query such as this, the application might have to fetch
and examine every customer record, which could be a slow process.
You can create as many secondary indexes as you need to support the different queries that your
application performs. For example, in a Customers table in a relational database where the Customer
ID is the primary key, it's beneficial to add a secondary index over the town field if the application
frequently looks up customers by the town where they reside.
However, although secondary indexes are common in relational systems, most NoSQL data stores
used by cloud applications don't provide an equivalent feature.
Solution
If the data store doesn't support secondary indexes, you can emulate them manually by creating
your own index tables. An index table organises the data by a specified key. Three strategies are
commonly used for structuring an index table, depending on the number of secondary indexes
that are required and the nature of the queries that an application performs.
The first strategy is to duplicate the data in each index table but organise it by different keys
(complete denormalisation). The next figure shows index tables that organise the same customer
information by Town and LastName.
This strategy is appropriate if the data is relatively static compared to the number of times it's
queried using each key. If the data is more dynamic, the processing overhead of maintaining each
index table becomes too large for this approach to be useful. Also, if the volume of data is very large,
the amount of space required to store the duplicate data is significant.
The second strategy is to create normalised index tables organised by different keys and reference
the original data by using the primary key rather than duplicating it, as shown in the following figure.
The original data is called a fact table.
The third strategy is to create partially normalised index tables organised by different keys that
duplicate frequently retrieved fields. Reference the fact table to access less frequently accessed fields.
The next figure shows how commonly accessed data is duplicated in each index table.
With this strategy, you can strike a balance between the first two approaches. The data for common
queries can be retrieved quickly by using a single lookup, while the space and maintenance overhead
isn't as significant as duplicating the entire data set.
If an application frequently queries data by specifying a combination of values (for example, "Find all
customers that live in Redmond and that have a last name of Smith"), you could implement the keys
to the items in the index table as a concatenation of the Town attribute and the LastName attribute.
The next figure shows an index table based on composite keys. The keys are sorted by Town,
and then by LastName for records that have the same value for Town.
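As a minimal sketch of this approach (the helper method here is illustrative, not part of a sample),
a composite key can be built by concatenating the two attribute values:
// Build a composite key of the form "Town:LastName" so that index entries
// sort first by Town, and then by LastName within each town.
public static string BuildCompositeKey(string town, string lastName)
{
    // Use a separator that can't appear in the data, otherwise different
    // attribute values could produce identical keys.
    return string.Format("{0}:{1}", town, lastName);
}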
You should consider the following points when deciding how to implement this pattern:
• Implementing an index table as a normalised structure that references the original data requires
an application to perform two lookup operations to find data. The first operation searches the
index table to retrieve the primary key, and the second uses the primary key to fetch the data.
• If a system incorporates a number of index tables over very large data sets, it can be difficult
to maintain consistency between index tables and the original data. It might be possible to
design the application around the eventual consistency model. For example, to insert, update
or delete data, an application could post a message to a queue and let a separate task perform
the operation and maintain the index tables that reference this data asynchronously. For more
information about implementing eventual consistency, see the Data Consistency Primer.
• Microsoft Azure storage tables support transactional updates for changes made to data held in the
same partition (referred to as entity group transactions). If you can store the data for a fact table and
one or more index tables in the same partition, you can use this feature to help ensure consistency.
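For illustration, a sketch using the Azure storage client library might insert a fact entity and an index
entry in a single entity group transaction. The entity variables and the table reference are assumed
to exist:
using Microsoft.WindowsAzure.Storage.Table;

// Both entities must share the same PartitionKey, otherwise the entity
// group transaction will be rejected.
var batch = new TableBatchOperation();
batch.Insert(factEntity);
batch.Insert(indexEntity);

// All operations in the batch succeed or fail together, which keeps the
// fact table and the index table consistent with each other.
table.ExecuteBatch(batch);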
This pattern might not be suitable in the following situations:
• A field selected as the secondary key for an index table is nondiscriminating and can only have
a small set of values (for example, gender).
• The balance of the data values for a field selected as the secondary key for an index table is
highly skewed. For example, if 90% of the records contain the same value in a field, then creating
and maintaining an index table to look up data based on this field might create more overhead
than scanning sequentially through the data. However, if queries very frequently target values
that lie in the remaining 10%, this index can be useful. You should understand the queries that
your application is performing, and how frequently they're performed.
Example
Azure storage tables provide a highly scalable key/value data store for applications running in the
cloud. Applications store and retrieve data values by specifying a key. The data values can contain
multiple fields, but the structure of a data item is opaque to table storage, which simply handles
a data item as an array of bytes.
Azure storage tables also support sharding. The sharding key includes two elements, a partition
key and a row key. Items that have the same partition key are stored in the same partition (shard),
and the items are stored in row key order within a shard. Table storage is optimised for performing
queries that fetch data falling within a contiguous range of row key values within a partition. If you're
building cloud applications that store information in Azure tables, you should structure your data
with this feature in mind.
For example, consider an application that stores information about movies in Azure table storage,
using the genre of each movie as the partition key and the movie name as the row key. Queries that
retrieve all of the movies in a given genre are then efficient, because the data is held in a single
partition. This approach is less effective if the application also needs to query movies by starring actor. In this
case, you can create a separate Azure table that acts as an index table. The partition key is the actor
and the row key is the movie name. The data for each actor will be stored in separate partitions.
If a movie stars more than one actor, the same movie will occur in multiple partitions.
You can duplicate the movie data in the values held by each partition by adopting the first approach
described in the Solution section above. However, it's likely that each movie will be replicated several
times (once for each actor), so it might be more efficient to partially denormalise the data to support
the most common queries (such as the names of the other actors) and enable an application
to retrieve any remaining details by including the partition key necessary to find the complete
information in the genre partitions. This approach is described by the third option in the Solution
section. The next figure shows this approach.
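In outline, an entry in the actor index table might be defined as follows. This entity type is
illustrative rather than part of the sample:
using Microsoft.WindowsAzure.Storage.Table;

// An index entry that supports lookups by actor. The partition key is the
// actor's name and the row key is the movie name, so all of the movies
// starring a given actor are held in the same partition.
public class ActorIndexEntity : TableEntity
{
    public ActorIndexEntity() { }

    public ActorIndexEntity(string actorName, string movieName)
    {
        this.PartitionKey = actorName;
        this.RowKey = movieName;
    }

    // Frequently accessed fields are duplicated in the index entry, while
    // the genre partition key references the fact data so that any
    // remaining details can be retrieved when required.
    public string OtherActors { get; set; }
    public string GenrePartitionKey { get; set; }
}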
The following patterns and guidance might also be relevant when implementing this pattern:
• Sharding pattern. The Index Table pattern is frequently used in conjunction with data
partitioned by using shards. The Sharding pattern provides more information on how to divide
a data store into a set of shards.
• Materialised View pattern. Instead of indexing data to support queries that summarise data,
it might be more appropriate to create a materialised view of the data. The Materialised View
pattern describes how to support efficient summary queries by generating prepopulated views
over data.
Leader Election pattern
In a distributed application, the instances of a task often need to coordinate their actions.
For example:
• In a cloud-based system that implements horizontal scaling, multiple instances of the same task
could be running at the same time with each instance serving a different user. If these instances
write to a shared resource, it's necessary to coordinate their actions to prevent each instance
from overwriting the changes made by the others.
• If the tasks are performing individual elements of a complex calculation in parallel, the results
need to be aggregated when they all complete.
The task instances are all peers, so there isn't a natural leader that can act as the coordinator
or aggregator.
Solution
A single task instance should be elected to act as the leader, and this instance should coordinate
the actions of the other subordinate task instances. If all of the task instances are running the same
code, they are each capable of acting as the leader. Therefore, the election process must be managed
carefully to prevent two or more instances taking over the leader role at the same time.
The system must provide a robust mechanism for selecting the leader. This method has to cope with
events such as network outages or process failures. In many solutions, the subordinate task instances
monitor the leader through some type of heartbeat method, or by polling. If the designated leader
terminates unexpectedly, or a network failure makes the leader unavailable to the subordinate task
instances, it's necessary for them to elect a new leader.
There are several strategies for electing a leader among a set of tasks in a distributed environment,
including:
• Selecting the task instance with the lowest-ranked instance or process ID (a minimal sketch
of this strategy follows this list).
• Racing to acquire a shared, distributed mutex. The first task instance that acquires the mutex
is the leader. However, the system must ensure that, if the leader terminates or becomes
disconnected from the rest of the system, the mutex is released to allow another task instance
to become the leader.
• Implementing one of the common leader election algorithms such as the Bully Algorithm or the
Ring Algorithm. These algorithms assume that each candidate in the election has a unique ID,
and that it can communicate with the other candidates reliably.
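As a minimal sketch of the first strategy (the set of instance IDs is assumed to be available to every
instance), each instance can apply the same deterministic rule and reach the same conclusion
without any further communication:
using System;
using System.Collections.Generic;
using System.Linq;

// Every instance evaluates the same rule over the same set of IDs, so all
// of the instances agree on which of them is the leader.
public static bool IsLeader(string myInstanceId, IEnumerable<string> allInstanceIds)
{
    var leaderId = allInstanceIds.OrderBy(id => id, StringComparer.Ordinal).First();
    return string.Equals(myInstanceId, leaderId, StringComparison.Ordinal);
}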
You should consider the following points when deciding how to implement this pattern:
• In a system that implements horizontal autoscaling, the leader could be terminated if the system
scales back and shuts down some of the computing resources.
• Using a shared, distributed mutex introduces a dependency on the external service that provides
the mutex. The service constitutes a single point of failure. If it becomes unavailable for any
reason, the system won’t be able to elect a leader.
• Using a single dedicated process as the leader is a straightforward approach. However, if the
process fails there could be a significant delay while it's restarted. The resulting latency can
affect the performance and response times of other processes if they're waiting for the leader
to coordinate an operation.
• Implementing one of the algorithms manually provides the greatest flexibility for tuning and
optimising the code.
Avoid making the leader a bottleneck in the system. The purpose of the leader is to coordinate
the work of the subordinate tasks, and it doesn't necessarily have to participate in this work itself,
although any instance that isn't elected as the leader should be able to do so.
This pattern might not be useful if:
• The coordination between tasks can be achieved using a more lightweight method. For example,
if several task instances simply need coordinated access to a shared resource, a better solution
is to use optimistic or pessimistic locking to control access.
• A third-party solution is more appropriate. For example, the Microsoft Azure HDInsight service
(based on Apache Hadoop) uses the services provided by Apache Zookeeper to coordinate the
map and reduce tasks that collect and summarise data.
Example
The DistributedMutex project in the LeaderElection solution (a sample that demonstrates this pattern
is available on GitHub) shows how to use a lease on an Azure Storage blob to provide a mechanism
for implementing a shared, distributed mutex. This mutex can be used to elect a leader among a
group of role instances in an Azure cloud service. The first role instance to acquire the lease is elected
the leader, and remains the leader until it releases the lease or isn't able to renew the lease. Other
role instances can continue to monitor the blob lease in case the leader is no longer available.
To avoid a faulted role instance retaining the lease indefinitely, specify a lifetime for the lease. When
this expires, the lease becomes available. However, while a role instance holds the lease it can request
that the lease is renewed, and it'll be granted the lease for a further period of time. The role instance
can continually repeat this process if it wants to retain the lease. For more information on how
to lease a blob, see Lease Blob (REST API).
The following code extract from the sample shows how the leader task is started once the role
instance has acquired the lease.
if (!string.IsNullOrEmpty(leaseId))
{
// Create a new linked cancellation token source so that if either the
// original token is cancelled or the lease can’t be renewed, the
// leader task can be cancelled.
using (var leaseCts =
CancellationTokenSource.CreateLinkedTokenSource(new[] { token }))
{
// Run the leader task.
var leaderTask = this.taskToRunWhenLeaseAquired.Invoke(leaseCts.Token);
...
}
}
}
...
}
The task started by the leader also runs asynchronously. While this task is running, the
RunTaskWhenBlobLeaseAquired method shown in the following code sample periodically attempts
to renew the lease. This helps to ensure that the role instance remains the leader. In the sample
solution, the delay between renewal requests is less than the time specified for the duration of the
lease in order to prevent another role instance from being elected the leader. If the renewal fails
for any reason, the task is cancelled.
If the lease fails to be renewed or the task is cancelled (possibly as a result of the role instance
shutting down), the lease is released. At this point, this or another role instance might be elected
as the leader. The code extract below shows this part of the process.
// When any task completes (either the leader task itself or when it
// couldn’t renew the lease) then cancel the other task.
await CancelAllWhenAnyCompletes(leaderTask, renewLeaseTask, leaseCts);
}
}
}
}
...
}
The following code example shows how to use the BlobDistributedMutex class in a worker role.
This code acquires a lease over a blob named MyLeaderCoordinatorTask in the leases container
in development storage, and specifies that the code defined in the MyLeaderCoordinatorTask
method should run if the role instance is elected the leader.
Note the following points about this solution:
• If the task being performed by the leader stalls, the leader might continue to renew the lease,
preventing any other role instance from acquiring the lease and taking over the leader role
in order to coordinate tasks. In the real world, the health of the leader should be checked at
frequent intervals.
• The election process is nondeterministic. You can't make any assumptions about which role
instance will acquire the blob lease and become the leader.
• The blob used as the target of the blob lease shouldn't be used for any other purpose. If a role
instance attempts to store data in this blob, this data won't be accessible unless the role instance
is the leader and holds the blob lease.
The following patterns and guidance might also be relevant when implementing this pattern:
• Compute Partitioning Guidance. This guidance describes how to allocate tasks to hosts in a
cloud service in a way that helps to minimise running costs while maintaining the scalability,
performance, availability and security of the service.
Materialised View pattern
Data is often stored in a format that suits how it's written, managed and kept consistent, rather
than how it's read. However, this can have a negative effect on queries. When a query only needs
a subset of the data from some entities, such as a summary of orders for several customers without
all of the order details, it must extract all of the data for the relevant entities in order to obtain the
required information.
Solution
To support efficient querying, a common solution is to generate, in advance, a view that materialises
the data in a format suited to the required results set. The Materialised View pattern describes
generating prepopulated views of data in environments where the source data isn't in a suitable
format for querying, where generating a suitable query is difficult, or where query performance
is poor due to the nature of the data or the data store.
These materialised views, which only contain data required by a query, allow applications to
quickly obtain the information that they need. In addition to joining tables or combining data
entities, materialised views can include the current values of calculated columns or data items, the
results of combining values or executing transformations on the data items and values specified as
part of the query. A materialised view can even be optimised for just a single query.
A key point is that a materialised view and the data it contains are completely disposable, because
they can be entirely rebuilt from the source data stores. A materialised view is never updated directly
by an application, so in effect it's a specialised cache.
When the source data for the view changes, the view must be updated to include the new
information. You can schedule this to happen automatically, or when the system detects a change
to the original data. In some cases it might be necessary to regenerate the view manually. The figure
shows an example of how the Materialised View pattern might be used.
In some systems, such as when using the Event Sourcing pattern to maintain a store of only the
events that modified the data, materialised views are necessary. Prepopulating views by examining
all events to determine the current state might be the only way to obtain information from the
event store. If you're not using Event Sourcing, you need to consider whether a materialised view is
helpful or not. Materialised views tend to be specifically tailored to one, or a small number of queries.
If many queries are used, materialised views can result in unacceptable storage capacity requirements
and storage cost.
Consider the impact on data consistency when generating the view, and when updating the view
if this occurs on a schedule. If the source data is changing at the point when the view is generated,
the copy of the data in the view won't be fully consistent with the original data.
Consider where you'll store the view. The view doesn't have to be located in the same store
or partition as the original data. It can be a subset from a few different partitions combined.
A view can be rebuilt if lost. Because of that, if the view is transient and is only used to improve
query performance by reflecting the current state of the data, or to improve scalability, it can be
stored in a cache or in a less reliable location.
When defining a materialised view, maximise its value by adding data items or columns to it based
on computation or transformation of existing data items, on values passed in the query or on
combinations of these values when appropriate.
Where the storage mechanism supports it, consider indexing the materialised view to further increase
performance. Most relational databases support indexing for views, as do big data solutions based
on Apache Hadoop.
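In outline, regenerating a simple summary view might resemble the following sketch. The order
source, the view store and the CategorySummary type are hypothetical, not taken from a sample:
using System.Linq;

// Rebuild the view from scratch. Because a materialised view is
// disposable, it can be regenerated entirely from the source data.
var summaries = orders
    .GroupBy(o => o.Category)
    .Select(g => new CategorySummary
    {
        Category = g.Key,
        TotalValue = g.Sum(o => o.Total),
        OrderCount = g.Count()
    });

foreach (var summary in summaries)
{
    // Write each precomputed row to the store that holds the view.
    viewStore.Upsert(summary);
}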
This pattern is useful for:
• Creating temporary views that can dramatically improve query performance, or can act directly
as source views or data transfer objects for the UI, for reporting or for display.
• Supporting occasionally connected or disconnected scenarios where connection to the data store
isn't always available. The view can be cached locally in this case.
• Simplifying queries and exposing data for experimentation in a way that doesn't require
knowledge of the source data format. For example, by joining different tables in one or more
databases, or one or more domains in NoSQL stores, and then formatting the data to fit its
eventual use.
• Providing access to specific subsets of the source data that, for security or privacy reasons,
shouldn't be generally accessible, open to modification, or fully exposed to users.
• Bridging different data stores, to take advantage of their individual capabilities. For example,
using a cloud store that's efficient for writing as the reference data store, and a relational
database that offers good query and read performance to hold the materialised views.
This pattern isn't useful in the following situations:
• The source data changes very quickly, or can be accessed without using a view. In these cases,
you should avoid the processing overhead of creating views.
• Consistency is a high priority. The views might not always be fully consistent with the
original data.
Example
Creating this materialised view requires complex queries. However, by exposing the query result as
a materialised view, users can easily obtain the results and use them directly or incorporate them in
another query. The view is likely to be used in a reporting system or dashboard, and can be updated
on a scheduled basis such as weekly.
Although this example utilises Azure table storage, many relational database management systems
also provide native support for materialised views.
The following patterns and guidance might also be relevant when implementing this pattern:
• Command and Query Responsibility Segregation (CQRS) pattern. Use to update the information
in a materialised view by responding to events that occur when the underlying data
values change.
• Event Sourcing pattern. Use in conjunction with the CQRS pattern to maintain the information
in a materialised view. When the data values a materialised view is based on are changed,
the system can raise events that describe these changes and save them in an event store.
Pipes and Filters pattern
An application might be required to perform a variety of processing tasks on the information that
it receives. A straightforward but inflexible approach is to implement this processing as a single,
monolithic module. The figure illustrates the issues with processing data using the monolithic
approach. An application receives and processes data from two sources. The data from each source
is processed by a separate module that performs a series of tasks to transform this data, before
passing the result to the business logic of the application.
However, the processing tasks performed by each module, or the deployment requirements for each
task, could change as business requirements are updated. Some tasks might be compute intensive
and could benefit from running on powerful hardware, while others might not require such expensive
resources. Also, additional processing might be required in the future, or the order in which the
tasks performed by the processing could change. A solution is required that addresses these issues,
and increases the possibilities for code reuse.
Solution
Break down the processing required for each stream into a set of separate components (or filters),
each performing a single task. By standardising the format of the data that each component receives
and sends, these filters can be combined together into a pipeline. This helps to avoid duplicating
code, and makes it easy to remove, replace or integrate additional components if the processing
requirements change. The next figure shows a solution implemented using pipes and filters.
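In outline, and assuming a hypothetical Message type, filters that accept and emit the same
message format can be combined as shown in the following sketch:
using System.Threading.Tasks;

// The standardised format that every filter receives and sends.
public class Message
{
    public string Body { get; set; }
}

// A filter accepts a message and returns a transformed message in the
// same format, so filters can be combined in any order.
public delegate Task<Message> Filter(Message input);

// Pass the message through each filter in turn. Filters can be removed,
// replaced or added without changing the other components.
public static async Task<Message> RunPipelineAsync(Message input, params Filter[] filters)
{
    var current = input;
    foreach (var filter in filters)
    {
        current = await filter(current);
    }
    return current;
}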
The time it takes to process a single request depends on the speed of the slowest filter in the
pipeline. One or more filters could be a bottleneck, especially if a large number of requests appear
in a stream from a particular data source. A key advantage of the pipeline structure is that it provides
opportunities for running parallel instances of slow filters, enabling the system to spread the load
and improve throughput.
The filters that make up a pipeline can run on different machines, enabling them to be scaled
independently and take advantage of the elasticity that many cloud environments provide. A filter
that is computationally intensive can run on high performance hardware, while other less demanding
filters can be hosted on less expensive commodity hardware. The filters don't even have to be
in the same data centre or geographical location, which allows each element in a pipeline to run in
an environment that is close to the resources it requires. The next figure shows an example applied
to the pipeline for the data from Source 1.
Another benefit is the resilience that this model can provide. If a filter fails or the machine it's running
on is no longer available, the pipeline can reschedule the work that the filter was performing and
direct this work to another instance of the component. Failure of a single filter doesn't necessarily
result in failure of the entire pipeline.
Using the Pipes and Filters pattern in conjunction with the Compensating Transaction pattern
is an alternative approach to implementing distributed transactions. A distributed transaction
can be broken down into separate, compensable tasks, each of which can be implemented by using
a filter that also implements the Compensating Transaction pattern. The filters in a pipeline can
be implemented as separate hosted tasks running close to the data that they maintain.
You should consider the following points when deciding how to implement this pattern:
• Complexity. The increased flexibility that this pattern provides can also introduce complexity,
especially if the filters in a pipeline are distributed across different servers.
• Reliability. Use an infrastructure that ensures that data flowing between filters in a pipeline
won't be lost.
• Idempotency. If a filter in a pipeline fails after receiving a message and the work is rescheduled
to another instance of the filter, part of the work might have already been completed. If this work
updates some aspect of the global state (such as information stored in a database), the same
update could be repeated. A similar issue might occur if a filter fails after posting its results to the
next filter in the pipeline, but before indicating that it's completed its work successfully. In these
cases, the same work could be repeated by another instance of the filter, causing the same
results to be posted twice. This could result in subsequent filters in the pipeline processing the
same data twice. Therefore filters in a pipeline should be designed to be idempotent. For more
information see Idempotency Patterns on Jonathan Oliver's blog.
• Repeated messages. If a filter in a pipeline fails after posting a message to the next stage of the
pipeline, another instance of the filter might be run, and it'll post a copy of the same message
to the pipeline. This could cause two instances of the same message to be passed to the next
filter. To avoid this, the pipeline should detect and eliminate duplicate messages.
• Context and state. In a pipeline, each filter essentially runs in isolation and shouldn't make
any assumptions about how it was invoked. This means that each filter should be provided
with sufficient context to perform its work. This context could include a large amount of state
information.
Use this pattern when:
• The system can benefit from distributing the processing for steps across different servers.
• A reliable solution is required that minimises the effects of failure in a step while data is being
processed.
This pattern might not be useful when:
• The amount of context or state information required by a step makes this approach inefficient.
It might be possible to persist state information to a database instead, but don't use this strategy
if the additional load on the database causes excessive contention.
Example
You can use a sequence of message queues to provide the infrastructure required to implement
a pipeline. An initial message queue receives unprocessed messages. A component implemented
as a filter task listens for a message on this queue, performs its work and then posts the transformed
message to the next queue in the sequence. Another filter task can listen for messages on this queue,
process them, post the results to another queue and so on until the fully transformed data appears
in the final message in the queue. The next figure illustrates implementing a pipeline using message
queues.
The ServiceBusPipeFilter class is defined in the PipesAndFilters.Shared project available from GitHub.
...
// In the Start method, create the inbound and outbound queue clients.
this.inQueue = QueueClient.CreateFromConnectionString(...);
...
}
// In the OnPipeFilterMessagesAsync method, register a handler that runs
// the filter for each message that arrives on the inbound queue.
this.inQueue.OnMessageAsync(
  async (msg) =>
  {
    ...
    // Process the filter and send the output to the
    // next queue in the pipeline.
    var outMessage = await asyncFilterTask(msg);
    // Note: There's a chance that the same message could be sent twice
    // or that a message gets processed by an upstream or downstream
    // filter at the same time.
    // This would happen in a situation where processing of a message was
    // completed, it was sent to the next pipe/queue, and then failed
    // to complete when using the PeekLock method.
    // Idempotent message processing and concurrency should be considered
    // in a real-world implementation.
  },
  options);
}
// In the Close method, close the inbound queue client.
this.inQueue.Close();
...
}
The sample solution implements filters in a set of worker roles. Each worker role can be scaled
independently, depending on the complexity of the business processing that it performs or the
resources required for processing. Additionally, multiple instances of each worker role can be run
in parallel to improve throughput.
The following code shows an Azure worker role named PipeFilterARoleEntry, defined in the
PipeFilterA project in the sample solution.
// In the OnStart method, the filter is created and started.
this.pipeFilterA.Start();
...
}
// In the Run method, the message handler marks each message as processed
// by this filter before returning it for posting to the output queue.
newMsg.Properties.Add(Constants.FilterAMessageKey, "Complete");
return newMsg;
});
...
}
...
}
This role contains a ServiceBusPipeFilter object. The OnStart method in the role connects to the
queues for receiving input messages and posting output messages (the names of the queues are
defined in the Constants class). The Run method invokes the OnPipeFilterMessagesAsync method
to perform some processing on each message that’s received (in this example, the processing
is simulated by waiting for a short period of time). When processing is complete, a new message
is constructed containing the results (in this case, the input message has a custom property added),
and this message is posted to the output queue.
The sample solution also provides two additional roles named InitialSenderRoleEntry
(in the InitialSender project) and FinalReceiverRoleEntry (in the FinalReceiver project).
The InitialSenderRoleEntry role provides the initial message in the pipeline. The OnStart method
connects to a single queue, and the Run method posts a message to this queue. This queue is the
input queue used by the PipeFilterARoleEntry role, so posting a message to it causes the message
to be received and processed by the PipeFilterARoleEntry role. The processed message then passes
through the PipeFilterBRoleEntry role.
The input queue for the FinalReceiverRoleEntry role is the output queue for the PipeFilterBRoleEntry
role. The Run method in the FinalReceiverRoleEntry role, shown below, receives the message and
performs some final processing. Then it writes the values of the custom properties added by
the filters in the pipeline to the trace output.
return null;
});
...
}
...
}
The following patterns and guidance might also be relevant when implementing this pattern:
• Competing Consumers pattern. A pipeline can contain multiple instances of one or more filters.
This approach is useful for running parallel instances of slow filters, enabling the system to
spread the load and improve throughput. Each instance of a filter competes for input with the
other instances, although two instances of a filter shouldn't be able to process the same data.
The Competing Consumers pattern provides an explanation of this approach.
Priority Queue pattern
Applications often delegate tasks to other services by sending requests through a message queue.
It can be necessary to prioritise these requests, so that those with a higher priority are received
and processed more quickly than those of a lower priority.
Solution
A queue is usually a first-in, first-out (FIFO) structure and consumers typically receive messages in
the same order that they were posted to the queue. However, some message queues support priority
messaging. The application posting a message can assign a priority and the messages in the queue
are automatically reordered so that those with a higher priority will be received before those with
a lower priority. The figure illustrates a queue with priority messaging.
In systems that don't support priority-based message queues, an alternative solution is to maintain
a separate queue for each priority. The application is responsible for posting messages to the
appropriate queue. Each queue can have a separate pool of consumers. Higher priority queues
can have a larger pool of consumers running on faster hardware than lower priority queues.
The next figure illustrates using separate message queues for each priority.
A variation on this strategy is to have a single pool of consumers that check for messages on
high priority queues first, and only then start to fetch messages from lower priority queues. There
are some semantic differences between a solution that uses a single pool of consumer processes
(either with a single queue that supports messages with different priorities or with multiple queues
that each handle messages of a single priority), and a solution that uses multiple queues with
a separate pool for each queue.
In the single pool approach, higher priority messages are always received and processed before lower
priority messages. In theory, messages that have a very low priority could be continually superseded
and might never be processed. In the multiple pool approach, lower priority messages will always
be processed, just not as quickly as those of a higher priority (depending on the relative size of the
pools and the resources that they have available).
You should consider the following points when deciding how to implement this pattern:
• The multiple message queue approach can help maximise application performance and
scalability by partitioning messages based on processing requirements. For example, vital
tasks can be prioritised to be handled by receivers that run immediately while less important
background tasks can be handled by receivers that are scheduled to run at less busy periods.
Decide if all high priority items must be processed before any lower priority items. If the messages
are being processed by a single pool of consumers, you have to provide a mechanism that can
preempt and suspend a task that's handling a low priority message if a higher priority message
becomes available.
In the multiple queue approach, when using a single pool of consumer processes that listen
on all queues rather than a dedicated consumer pool for each queue, the consumer must apply
an algorithm that ensures it always services messages from higher priority queues before those
from lower priority queues.
Monitor the processing speed on high and low priority queues to ensure that messages in these
queues are processed at the expected rates.
If you need to guarantee that low priority messages will be processed, it's necessary to implement
the multiple message queue approach with multiple pools of consumers. Alternatively, in a queue
that supports message prioritisation, it's possible to dynamically increase the priority of a queued
message as it ages. However, this approach depends on the message queue providing this feature.
Using a separate queue for each message priority works best for systems that have a small number
of well-defined priorities.
Message priorities can be determined logically by the system. For example, rather than having
explicit high and low priority messages, they could be designated as "fee paying customer", or
"non-fee paying customer". Depending on your business model, your system can allocate more
resources to processing messages from fee paying customers than non-fee paying ones.
There might be a financial and processing cost associated with checking a queue for a message
(some commercial messaging systems charge a small fee each time a message is posted or retrieved,
and each time a queue is queried for messages). This cost increases when checking multiple queues.
It's possible to dynamically adjust the size of a pool of consumers based on the length of the queue
that the pool is servicing. For more information, see the Autoscaling Guidance.
Example
Microsoft Azure doesn't provide a queuing mechanism that natively supports automatic prioritisation
of messages through sorting. However, it does provide Azure Service Bus topics and subscriptions,
which support a queuing mechanism with message filtering, together with a wide range of flexible
capabilities that make them ideal for use in most priority queue implementations.
An Azure solution can implement a Service Bus topic that an application can post messages to, in the
same way as a queue. Messages can contain metadata in the form of application-defined custom
properties. Service Bus subscriptions can be associated with the topic, and these subscriptions
can filter messages based on their properties. When an application sends a message to a topic, the
message is directed to the appropriate subscription where it can be read by a consumer. Consumer
processes can retrieve messages from a subscription using the same semantics as a message queue
(a subscription is a logical queue). The following figure illustrates implementing a priority queue with
Azure Service Bus topics and subscriptions.
Note that there's nothing special about the designation of high and low priority messages
in this example. They're simply labels specified as properties in each message, and are used to
direct messages to a specific subscription. If additional priorities are required, it's relatively easy
to create further subscriptions and pools of consumer processes to handle these priorities.
The Run method in the PriorityWorkerRole class arranges for the virtual ProcessMessage method
(also defined in the PriorityWorkerRole class) to be run for each message received on the queue.
The QueueManager class, defined in the PriorityQueue.Shared project, provides helper methods
for using Azure Service Bus queues. The PriorityQueue.High and PriorityQueue.Low worker roles
both override the default functionality of the ProcessMessage method; the full listings of these
methods are included in the sample available on GitHub.
When an application posts messages to the topic associated with the subscriptions used by
the PriorityQueue.High and PriorityQueue.Low worker roles, it specifies the priority by using the
Priority custom property, as shown in the following code example. This code (implemented in the
WorkerRole class in the PriorityQueue.Sender project) uses the SendBatchAsync helper method
of the QueueManager class to post messages to a topic in batches.
// Send a batch of low priority messages.
this.queueManager.SendBatchAsync(lowMessages).Wait();
...
// Send a batch of high priority messages.
this.queueManager.SendBatchAsync(highMessages).Wait();
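For illustration, constructing one of these messages might resemble the following sketch. The
property name mirrors the description above; the message body and the remaining batch logic
are omitted:
using Microsoft.ServiceBus.Messaging;

// Label the message as high priority by setting a custom property.
// A filter on the corresponding Service Bus subscription examines this
// property and directs the message to the appropriate logical queue.
var message = new BrokeredMessage();
message.Properties["Priority"] = "High";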
The following patterns and guidance might also be relevant when implementing this pattern:
• Asynchronous Messaging Primer. A consumer service that processes a request might need
to send a reply to the instance of the application that posted the request. Provides information
on the strategies that you can use to implement request/response messaging.
• Competing Consumers pattern. To increase the throughput of the queues, it’s possible to have
multiple consumers that listen on the same queue, and process the tasks in parallel. These
consumers will compete for messages, but only one should be able to process each message.
Provides more information on the benefits and tradeoffs of implementing this approach.
• Autoscaling Guidance. It might be possible to scale the size of the pool of consumer processes
handling a queue depending on the length of the queue. This strategy can help to improve
performance, especially for pools handling high priority messages.
Queue-Based Load Levelling pattern
Many solutions in the cloud involve running tasks that invoke services. A service could be part of
the same solution as the tasks that use it, or it could be a third-party
service providing access to frequently used resources such as a cache or a storage service.
If the same service is used by a number of tasks running concurrently, it can be difficult to
predict the volume of requests to the service at any time.
A service might experience peaks in demand that cause it to overload and be unable to respond
to requests in a timely manner. Flooding a service with a large number of concurrent requests
can also result in the service failing if it's unable to handle the contention these requests cause.
Solution
Refactor the solution and introduce a queue between the task and the service. The task and the
service run asynchronously. The task posts a message containing the data required by the service to
a queue. The queue acts as a buffer, storing the message until it's retrieved by the service. The service
retrieves the messages from the queue and processes them. Requests from a number of tasks, which
can be generated at a highly variable rate, can be passed to the service through the same message
queue. This figure shows using a queue to level the load on a service.
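As a sketch of the sending side (the queue client and the request payload are assumed to exist),
a task might post its request as follows:
using Microsoft.WindowsAzure.Storage.Queue;

// Post the request and continue without waiting for the service.
// The queue buffers the message until the service is ready to retrieve
// and process it, smoothing out peaks in demand.
CloudQueue queue = queueClient.GetQueueReference("service-requests");
queue.CreateIfNotExists();
queue.AddMessage(new CloudQueueMessage(requestData));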
This pattern provides the following benefits:
• It can help to maximise availability because delays arising in services won't have an immediate
and direct impact on the application, which can continue to post messages to the queue even
when the service isn't available or isn't currently processing messages.
• It can help to maximise scalability because both the number of queues and the number
of services can be varied to meet demand.
• It can help to control costs because the number of service instances deployed only have
to be adequate to meet average load rather than the peak load.
Some services implement throttling when demand reaches a threshold beyond which the system
could fail. Throttling can reduce the functionality available. You can implement load levelling with
these services to ensure that this threshold isn't reached.
You should consider the following points when deciding how to implement this pattern:
• Message queues are a one-way communication mechanism. If a task expects a reply from
a service, it might be necessary to implement a mechanism that the service can use to send
a response. For more information, see the Asynchronous Messaging Primer.
• Be careful if you apply autoscaling to services that are listening for requests on the queue.
This can result in increased contention for any resources that these services share and diminish
the effectiveness of using the queue to level the load.
This pattern isn't useful if the application expects a response from the service with minimal latency.
Example
A Microsoft Azure web role stores data using a separate storage service. If a large number of
instances of the web role run concurrently, it's possible that the storage service will be unable to
respond to requests quickly enough to prevent these requests from timing out or failing. This figure
highlights a service being overwhelmed by a large number of concurrent requests from instances
of a web role. To level the load, the web role instances can post their requests to a queue, and the
storage service can retrieve and process the requests at its own pace.
The following patterns and guidance might also be relevant when implementing this pattern:
• Competing Consumers pattern. It might be possible to run multiple instances of a service, each
acting as a message consumer from the load-levelling queue. You can use this approach to adjust
the rate at which messages are received and passed to a service.
• Throttling pattern. A simple way to implement throttling with a service is to use queue-based
load levelling and route all requests to a service through a message queue. The service can
process requests at a rate that ensures that resources required by the service aren't exhausted,
and to reduce the amount of contention that could occur.
• Queue Service Concepts. Information about choosing a messaging and queuing mechanism
in Azure applications.
Retry pattern
Enable an application to handle transient failures when it tries to connect to a service or network
resource, by transparently retrying a failed operation. This can improve the stability of the application.
An application that communicates with elements running in the cloud must be sensitive to the
transient faults that can occur in this environment, such as the momentary loss of network
connectivity, the temporary unavailability of a service, or timeouts that arise when a service is busy.
These faults are typically self-correcting, and if the action that triggered a fault is repeated after
a suitable delay it's likely to be successful. For example, a database service that's processing a large
number of concurrent requests can implement a throttling strategy that temporarily rejects any
further requests until its workload has eased. An application trying to access the database might
fail to connect, but if it tries again after a delay it might succeed.
Solution
In the cloud, transient faults aren't uncommon and an application should be designed to handle
them elegantly and transparently. This minimises the effects faults can have on the business tasks
the application is performing.
If an application detects a failure when it tries to send a request to a remote service, it can handle
the failure using the following strategies:
• Cancel. If the fault indicates that the failure isn't transient, or that the operation is unlikely
to be successful if repeated, the application should cancel the operation and report an exception.
For example, an authentication failure caused by providing invalid credentials is not likely
to succeed no matter how many times it's attempted.
• Retry immediately. If the specific fault reported is unusual or rare, it might have been caused
by exceptional circumstances, such as a network packet becoming corrupted in transit. In this
case, the application could retry the failing request immediately, because the same failure
is unlikely to be repeated.
• Retry after delay. If the fault is caused by one of the more commonplace connectivity or
busy failures, the network or service might need a short period while the connectivity issues
are corrected or the backlog of work is cleared. The application should wait for a suitable time
before retrying the request.
For the more common transient failures, the period between retries should be chosen to spread
requests from multiple instances of the application as evenly as possible. This reduces the chance
of a busy service continuing to be overloaded. If many instances of an application are continually
overwhelming a service with retry requests, it'll take the service longer to recover.
If the request still fails, the application can wait and make another attempt. If necessary, this process
can be repeated with increasing delays between retry attempts, until some maximum number of
requests have been attempted. The delay can be increased incrementally or exponentially, depending
on the type of failure and the probability that it'll be corrected during this time.
The following diagram illustrates invoking an operation in a hosted service using this pattern. If the
request is unsuccessful after a predefined number of attempts, the application should treat the fault
as an exception and handle it accordingly.
The application should wrap all attempts to access a remote service in code that implements a retry
policy matching one of the strategies listed above. Requests sent to different services can be subject
to different policies. Some vendors provide libraries that implement retry policies, where the application
can specify the maximum number of retries, the time between retry attempts and other parameters.
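As an illustration only, a minimal retry helper with an exponentially increasing delay might look like
the following sketch. The IsTransient method shown here is a placeholder; a real implementation
would examine the exception types returned by the service being called:
using System;
using System.Threading.Tasks;

public static async Task<T> InvokeWithRetryAsync<T>(
    Func<Task<T>> operation, int maxAttempts, TimeSpan initialDelay)
{
    var delay = initialDelay;
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (Exception ex)
        {
            // Rethrow if the fault isn't transient or if the retry budget
            // is exhausted, so the caller can treat it as an exception.
            if (!IsTransient(ex) || attempt == maxAttempts)
            {
                throw;
            }
        }

        // Wait before retrying, then double the delay so that retries
        // from multiple instances spread out over time.
        await Task.Delay(delay);
        delay = TimeSpan.FromTicks(delay.Ticks * 2);
    }
}

// Placeholder transient-fault detection.
private static bool IsTransient(Exception ex)
{
    return ex is TimeoutException;
}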
An application should log the details of faults and failing operations. This information is useful to
operators. If a service is frequently unavailable or busy, it’s often because the service has exhausted
its resources. You can reduce the frequency of these faults by scaling out the service. For example,
if a database service is continually overloaded, it might be beneficial to partition the database and
spread the load across multiple servers.
The retry policy should be tuned to match the business requirements of the application and the
nature of the failure. For some noncritical operations, it's better to fail quickly rather than retry several
times and impact the throughput of the application. For example, in an interactive web application
accessing a remote service, it's better to fail after a smaller number of retries with only a short
delay between retry attempts, and display a suitable message to the user (for example, "please try
again later"). For a batch application, it might be more appropriate to increase the number of retry
attempts with an exponentially increasing delay between attempts.
An aggressive retry policy with minimal delay between attempts, and a large number of retries, could
further degrade a busy service that's running close to or at capacity. This retry policy could also affect
the responsiveness of the application if it's continually trying to perform a failing operation.
If a request still fails after a significant number of retries, it's better for the application to prevent
further requests going to the same resource for a period, and simply report a failure immediately.
When the period expires, the application can tentatively allow one or more requests through to see
whether they're successful. For more details of this strategy, see the Circuit Breaker pattern.
Consider whether the operation is idempotent. If so, it's inherently safe to retry. Otherwise, retries
could cause the operation to be executed more than once, with unintended side effects. For example,
a service might receive the request, process the request successfully, but fail to send a response. At that
point, the retry logic might re-send the request, assuming that the first request wasn't received.
A request to a service can fail for a variety of reasons raising different exceptions depending on
the nature of the failure. Some exceptions indicate a failure that can be resolved quickly, while others
indicate that the failure is longer lasting. It's useful for the retry policy to adjust the time between
retry attempts based on the type of the exception.
Consider how retrying an operation that's part of a transaction will affect the overall transaction
consistency. Fine tune the retry policy for transactional operations to maximise the chance of success
and reduce the need to undo all the transaction steps.
Ensure that all retry code is fully tested against a variety of failure conditions. Check that it doesn't
severely impact the performance or reliability of the application, cause excessive load on services
and resources or generate race conditions or bottlenecks.
Implement retry logic only where the full context of a failing operation is understood. For example,
if a task that contains a retry policy invokes another task that also contains a retry policy, this extra
layer of retries can add long delays to the processing. It might be better to configure the lower-level
task to fail fast and report the reason for the failure back to the task that invoked it. This higher-level
task can then handle the failure based on its own policy.
It's important to log all connectivity failures that cause a retry so that underlying problems with the
application, services, or resources can be identified.
Scheduler Agent Supervisor pattern
An application can perform tasks that comprise a number of steps, some of which might invoke
remote services or access remote resources. Whenever possible, the application should ensure that
the task runs to completion and resolve any
failures that might occur when accessing remote services or resources. Failures can occur for many
reasons. For example, the network might be down, communications could be interrupted, a remote
service might be unresponsive or in an unstable state, or a remote resource might be temporarily
inaccessible, perhaps due to resource constraints. In many cases the failures will be transient and
can be handled by using the Retry pattern.
If the application detects a more permanent fault it can't easily recover from, it must be able
to restore the system to a consistent state and ensure integrity of the entire operation.
Solution
The Scheduler Agent Supervisor pattern defines the following actors. These actors orchestrate
the steps to be performed as part of the overall task.
• The Scheduler arranges for the steps that make up the task to be executed and orchestrates
their operation. These steps can be combined into a pipeline or workflow. The Scheduler is
responsible for ensuring that the steps in this workflow are performed in the right order. As each
step is performed, the Scheduler records the state of the workflow, such as "step not yet started",
"step running" or "step completed". The state information should also include an upper limit
of the time allowed for the step to finish, called the complete-by time. If a step requires access
to a remote service or resource, the Scheduler invokes the appropriate Agent, passing it the
details of the work to be performed. The Scheduler typically communicates with an Agent using
asynchronous request/response messaging. This can be implemented using queues, although
other distributed messaging technologies could be used instead.
• The Scheduler performs a similar function to the Process Manager in the Process Manager
pattern. The actual workflow is typically defined and implemented by a workflow engine that's
controlled by the Scheduler. This approach decouples the business logic in the workflow from
the Scheduler.
• The Supervisor monitors the status of the steps in the task being performed by the Scheduler.
It runs periodically (the frequency will be system specific), and examines the status of steps
maintained by the Scheduler. If it detects any that have timed out or failed, it arranges for the
appropriate Agent to recover the step or execute the appropriate remedial action (this might
involve modifying the status of a step). Note that the recovery or remedial actions are
implemented by the Scheduler and Agents. The Supervisor should simply request that these
actions be performed.
The Scheduler, Agent and Supervisor are logical components and their physical implementation
depends on the technology being used. For example, several logical agents might be implemented
as part of a single web service.
The Scheduler maintains information about the progress of the task and the state of each step in
a durable data store, called the state store. The Supervisor can use this information to help determine
whether a step has failed. The figure illustrates the relationship between the Scheduler, the Agents,
the Supervisor and the state store.
When the application is ready to run a task, it submits a request to the Scheduler. The Scheduler
records initial state information about the task and its steps (for example, step not yet started)
in the state store and then starts performing the operations defined by the workflow. As the
Scheduler starts each step, it updates the information about the state of that step in the state
store (for example, step running).
If a step references a remote service or resource, the Scheduler sends a message to the appropriate
Agent. The message contains the information that the Agent needs to pass to the service or
access the resource, in addition to the complete-by time for the operation. If the Agent completes
its operation successfully, it returns a response to the Scheduler. The Scheduler can then update
the state information in the state store (for example, step completed) and perform the next step.
This process continues until the entire task is complete.
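In outline, the record that the Scheduler maintains for each step might resemble the following
sketch. These types are illustrative; the exact schema is application specific:
using System;

// The possible states of a step, as recorded in the state store.
public enum StepState
{
    NotStarted,
    Running,
    Completed
}

public class StepRecord
{
    public Guid TaskId { get; set; }
    public int StepNumber { get; set; }
    public StepState State { get; set; }

    // The upper limit on the time allowed for the step to finish. The
    // Supervisor treats a Running step whose complete-by time has passed
    // as timed out or failed.
    public DateTimeOffset CompleteBy { get; set; }
}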
An Agent can implement any retry logic that's necessary to perform its work. However, if the Agent
doesn't complete its work before the complete-by period expires, the Scheduler will assume that the
operation has failed. In this case, the Agent should stop its work and not try to return anything to
the Scheduler (not even an error message), or try any form of recovery. The reason for this restriction
is that, after a step has timed out or failed, another instance of the Agent might be scheduled to run
the failing step (this process is described later).
If the Agent fails, the Scheduler won't receive a response. The pattern doesn't make a distinction
between a step that has timed out and one that has genuinely failed.
If a step times out or fails, the state store will contain a record that indicates that the step is running,
but the complete-by time will have passed. The Supervisor looks for steps like this and tries to
recover them. One possible strategy is for the Supervisor to update the complete-by value to extend
the time available to complete the step, and then send a message to the Scheduler identifying the
step that has timed out. The Scheduler can then try to repeat this step. However, this design requires
the tasks to be idempotent.
The Supervisor might need to prevent the same step from being retried if it continually fails or
times out. To do this, the Supervisor could maintain a retry count for each step, along with the state
information, in the state store. If this count exceeds a predefined threshold the Supervisor can adopt
a strategy of waiting for an extended period before notifying the Scheduler that it should retry the
step, in the expectation that the fault will be resolved during this period. Alternatively, the Supervisor
can send a message to the Scheduler to request the entire task be undone by implementing
a Compensating Transaction pattern. This approach will depend on the Scheduler and Agents
providing the information necessary to implement the compensating operations for each step
that completed successfully.
It isn't the purpose of the Supervisor to monitor the Scheduler and Agents, and restart them if they
fail. This aspect of the system should be handled by the infrastructure these components are running
in. Similarly, the Supervisor shouldn't have knowledge of the actual business operations being
performed by the tasks the Scheduler is running (including how to compensate should these tasks fail).
This is the purpose of the workflow logic implemented by the Scheduler. The sole responsibility of the
Supervisor is to determine whether a step has failed and arrange either for it to be repeated or for
the entire task containing the failed step to be undone.
The key advantage of this pattern is that the system is resilient in the event of unexpected temporary
or unrecoverable failures. The system can be constructed to be self-healing. For example, if an
Agent or the Scheduler fails, a new one can be started and the Supervisor can arrange for a task
to be resumed. If the Supervisor fails, another instance can be started and can take over from
where the failure occurred. If the Supervisor is scheduled to run periodically, a new instance can be
automatically started after a predefined interval. The state store can be replicated to reach an even
greater degree of resilience.
Consider the following points when deciding how to implement this pattern:
• The recovery/retry logic implemented by the Scheduler is complex and dependent on state
information held in the state store. It might also be necessary to record the information required
to implement a compensating transaction in a durable data store.
• How often the Supervisor runs will be important. It should run often enough to prevent any
failed steps from blocking an application for an extended period, but it shouldn't run so often
that it becomes an overhead.
• The steps performed by an Agent could be run more than once. The logic that implements
these steps should be idempotent.
This pattern might not be suitable for tasks that don't invoke remote services or access remote
resources.
Example
A web application that implements an ecommerce system has been deployed on Microsoft Azure.
Users can run this application to browse the available products and to place orders. The user
interface runs as a web role, and the order processing elements of the application are implemented
as a set of worker roles. Part of the order processing logic involves accessing a remote service, and
this aspect of the system could be prone to transient or more long-lasting faults. For this reason, the
designers used the Scheduler Agent Supervisor pattern to implement the order processing elements
of the system.
The state information that the submission process creates for the order includes:
• OrderID. The ID of the order in the orders database.
• LockedBy. The instance ID of the worker role handling the order. There might be multiple current
instances of the worker role running the Scheduler, but each order should only be handled
by a single instance.
• CompleteBy. The time by which the order should be processed.
• ProcessState. The current state of the task handling the order. The possible states are:
• Pending. The order has been created but processing hasn't yet been started.
• Processing. The order is currently being processed.
• Processed. The order has been processed successfully.
• Error. The order processing has failed.
• FailureCount. The number of times that processing has been tried for the order.
In this state information, the OrderID field is copied from the order ID of the new order. The LockedBy
and CompleteBy fields are set to null, the ProcessState field is set to Pending and the FailureCount
field is set to 0.
In this example, the order handling logic is relatively simple and only has a single step that invokes
a remote service. In a more complex multistep scenario, the submission process would likely involve
several steps, and so several records would be created in the state store – each one describing the
state of an individual step.
The Scheduler also runs as part of a worker role and implements the business logic that handles the
order. An instance of the Scheduler polling for new orders examines the state store for records where
the LockedBy field is null and the ProcessState field is Pending. When the Scheduler finds a new
order, it immediately populates the LockedBy field with its own instance ID, sets the CompleteBy
field to an appropriate time, and sets the ProcessState field to Processing. The code is designed to
be exclusive and atomic to ensure that two concurrent instances of the Scheduler can't try to handle
the same order simultaneously.
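One way to make the claim exclusive and atomic is a conditional UPDATE against the state store,
which succeeds only if no other instance has locked the order first. The sketch below assumes the
state store is a SQL database with an OrderState table matching the fields described earlier; the
sample itself could equally use another store, such as Azure table storage.

using System;
using System.Data.SqlClient;

public static class SchedulerStore
{
    // Sketch: atomically claim a pending order. Returns true if this Scheduler
    // instance acquired the order, false if another instance got there first.
    public static bool TryClaimOrder(string connectionString, Guid orderId,
                                     string instanceId, TimeSpan processingWindow)
    {
        const string sql =
            @"UPDATE OrderState
              SET LockedBy = @lockedBy,
                  CompleteBy = @completeBy,
                  ProcessState = 'Processing'
              WHERE OrderID = @orderId
                AND LockedBy IS NULL
                AND ProcessState = 'Pending'";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@lockedBy", instanceId);
            command.Parameters.AddWithValue("@completeBy",
                DateTime.UtcNow.Add(processingWindow));
            command.Parameters.AddWithValue("@orderId", orderId);

            connection.Open();

            // Exactly one row is updated only if the order was still unclaimed.
            return command.ExecuteNonQuery() == 1;
        }
    }
}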
The Scheduler then runs the business workflow to process the order asynchronously, passing it the
value in the OrderID field from the state store. The workflow handling the order retrieves the details
of the order from the orders database and performs its work. When a step in the order processing
workflow needs to invoke the remote service, it uses an Agent. The workflow step communicates
with the Agent using a pair of Azure Service Bus message queues acting as a request/response
channel. The figure shows a high level view of the solution.
If the complete-by time expires before the Agent receives a response from the remote service,
the Agent simply halts its processing and terminates handling the order. Similarly, if the workflow
handling the order exceeds the complete-by time, it also terminates. In both cases, the state of the
order in the state store remains set to processing, but the complete-by time indicates that the time
for processing the order has passed and the process is deemed to have failed. Note that if the Agent
that's accessing the remote service, or the workflow that's handling the order (or both) terminate
unexpectedly, the information in the state store will again remain set to processing and eventually
will have an expired complete-by value.
The Supervisor periodically examines the state store looking for orders with an expired complete-
by value. If the Supervisor finds a record, it increments the FailureCount field. If the failure count
value is below a specified threshold value, the Supervisor resets the LockedBy field to null, updates
the CompleteBy field with a new expiry time and sets the ProcessState field to pending. An instance
of the Scheduler can pick up this order and perform its processing as before. If the failure count
value exceeds a specified threshold, the reason for the failure is assumed to be nontransient.
The Supervisor sets the status of the order to error and raises an event that alerts an operator.
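The Supervisor's sweep can be expressed against the same assumed OrderState table. The
five-minute extension and the failure threshold here are illustrative choices, not part of the sample.

using System;
using System.Data.SqlClient;

public static class SupervisorSweep
{
    // Sketch: recover orders whose complete-by time has expired. Orders below
    // the failure threshold are released for retry; the rest are marked Error.
    public static void RecoverExpiredOrders(string connectionString, int maxFailures)
    {
        const string sql =
            @"UPDATE OrderState
              SET FailureCount = FailureCount + 1,
                  LockedBy     = CASE WHEN FailureCount + 1 < @maxFailures
                                      THEN NULL ELSE LockedBy END,
                  ProcessState = CASE WHEN FailureCount + 1 < @maxFailures
                                      THEN 'Pending' ELSE 'Error' END,
                  CompleteBy   = CASE WHEN FailureCount + 1 < @maxFailures
                                      THEN DATEADD(minute, 5, SYSUTCDATETIME())
                                      ELSE CompleteBy END
              WHERE ProcessState = 'Processing'
                AND CompleteBy < SYSUTCDATETIME()";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@maxFailures", maxFailures);
            connection.Open();
            command.ExecuteNonQuery();
        }

        // Orders now in the Error state should also raise an event that alerts
        // an operator; that notification step is omitted from this sketch.
    }
}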
In this example, the Supervisor is implemented in a separate worker role. You can use a variety of
strategies to arrange for the Supervisor task to be run, including using the Azure Scheduler service
(not to be confused with the Scheduler component in this pattern). For more information about the
Azure Scheduler service, visit the Scheduler page.
Although it isn't shown in this example, the Scheduler might need to keep the application that
submitted the order informed about the progress and status of the order. The application and the
Scheduler are isolated from each other to eliminate any dependencies between them. The application
has no knowledge of which instance of the Scheduler is handling the order, and the Scheduler
is unaware of which specific application instance posted the order.
To allow the order status to be reported, the application could use its own private response queue.
The details of this response queue would be included as part of the request sent to the submission
process, which would include this information in the state store. The Scheduler would then post
messages to this queue indicating the status of the order (request received, order completed, order
failed and so on). It should include the order ID in these messages so they can be correlated with
the original request by the application.
The following patterns and guidance might also be relevant when implementing this pattern:
• Retry pattern. An Agent can use this pattern to transparently retry an operation that accesses
a remote service or resource that has previously failed. Use when the expectation is that the
cause of the failure is transient and can be corrected.
• Circuit Breaker pattern. An Agent can use this pattern to handle faults that take a variable amount
of time to correct when connecting to a remote service or resource.
• Asynchronous Messaging Primer. The components in the Scheduler Agent Supervisor pattern
typically run decoupled from each other and communicate asynchronously. Describes some
of the approaches that can be used to implement asynchronous communication based on
message queues.
• Reference 6: A Saga on Sagas. An example showing how the CQRS pattern uses a process
manager (part of the CQRS Journey guidance).
Sharding pattern
Divide a data store into a set of horizontal partitions or shards. This can improve scalability
when storing and accessing large volumes of data.
A data store hosted by a single server might be subject to the following limitations:
• Storage space. A data store for a large-scale cloud application is expected to contain a huge
volume of data that could increase significantly over time. A server typically provides only a finite
amount of disk storage, but you can replace existing disks with larger ones, or add further disks
to a machine as data volumes grow. However, the system will eventually reach a limit where
it isn't possible to easily increase the storage capacity on a given server.
• Network bandwidth. Ultimately, the performance of a data store running on a single server
is governed by the rate the server can receive requests and send replies. It's possible that the
volume of network traffic might exceed the capacity of the network used to connect to the
server, resulting in failed requests.
• Geography. It might be necessary to store data generated by specific users in the same region
as those users for legal, compliance or performance reasons, or to reduce latency of data access.
If the users are dispersed across different countries or regions, it might not be possible to store
the entire data for the application in a single data store.
Scaling vertically by adding more disk capacity, processing power, memory and network connections
can postpone the effects of some of these limitations, but it's likely to only be a temporary solution.
A commercial cloud application capable of supporting large numbers of users and high volumes of
data must be able to scale almost indefinitely, so vertical scaling isn't necessarily the best solution.
Dividing the data store into horizontal partitions, or shards, addresses these limitations and offers
the following advantages:
• A system can use off-the-shelf hardware rather than specialised and expensive computers
for each storage node.
• You can reduce contention and improve performance by balancing the workload across shards.
• In the cloud, shards can be located physically close to the users that'll access the data.
When dividing a data store up into shards, decide which data should be placed in each shard. A shard
typically contains items that fall within a specified range determined by one or more attributes of the
data. These attributes form the shard key (sometimes referred to as the partition key). The shard key
should be static. It shouldn't be based on data that might change.
Sharding physically organises the data. When an application stores and retrieves data, the sharding
logic directs the application to the appropriate shard. This sharding logic can be implemented
as part of the data access code in the application, or it could be implemented by the data storage
system if it transparently supports sharding.
Abstracting the physical location of the data in the sharding logic provides a high level of control
over which shards contain which data. It also enables data to migrate between shards without
reworking the business logic of an application if the data in the shards needs to be redistributed later
(for example, if the shards become unbalanced). The tradeoff is the additional data access overhead
required in determining the location of each data item as it's retrieved.
To ensure optimal performance and scalability, it's important to split the data in a way that's
appropriate for the types of queries that the application performs. In many cases, it's unlikely that the
sharding scheme will exactly match the requirements of every query. For example, in a multi-tenant
system an application might need to retrieve tenant data using the tenant ID, but it might also need
to look up this data based on some other attribute such as the tenant's name or location. To handle
these situations, implement a sharding strategy with a shard key that supports the most commonly
performed queries.
If queries regularly retrieve data using a combination of attribute values, you can likely define
a composite shard key by linking attributes together. Alternatively, use a pattern such as Index Table
to provide fast lookup to data based on attributes that aren't covered by the shard key.
The Lookup strategy. In this strategy the sharding logic implements a map that routes a request
for data to the shard that contains that data using the shard key. In a multi-tenant application all the
data for a tenant might be stored together in a shard using the tenant ID as the shard key. Multiple
tenants might share the same shard, but the data for a single tenant won't be spread across multiple
shards. The figure illustrates sharding tenant data based on tenant IDs.
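At its simplest, the lookup map is a dictionary from shard key to connection string. The tenant IDs
and connection strings in the following sketch are placeholders; in production the map would itself
be held in a durable, cacheable store rather than in code.

using System.Collections.Generic;

public class ShardMap
{
    // Sketch of the Lookup strategy: route a tenant ID (the shard key) to the
    // shard that holds that tenant's data.
    private readonly Dictionary<int, string> tenantToShard =
        new Dictionary<int, string>
        {
            { 1, "Server=shard1;Database=Tenants" },
            { 2, "Server=shard1;Database=Tenants" },  // tenants can share a shard
            { 3, "Server=shard2;Database=Tenants" }
        };

    public string GetConnectionString(int tenantId) => tenantToShard[tenantId];
}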
The mapping between the shard key and the physical storage can be based on physical shards where
each shard key maps to a physical partition. Alternatively, a more flexible technique for rebalancing
shards is virtual partitioning, where shard keys map to the same number of virtual shards, which in
turn map to fewer physical partitions. In this approach, an application locates data using a shard key
that refers to a virtual shard, and the system transparently maps virtual shards to physical partitions.
The mapping between a virtual shard and a physical partition can change without requiring the
application code be modified to use a different set of shard keys.
The Range strategy. This strategy groups related items together in the same shard, and orders them
by shard key – the shard keys are sequential. It's useful for applications that frequently retrieve sets
of items using range queries (queries that return a set of data items for a shard key that falls within a
given range). For example, if an application regularly needs to find all orders placed in a given month,
this data can be retrieved more quickly if all orders for a month are stored in date and time order in
the same shard. If each order was stored in a different shard, they'd have to be fetched individually
by performing a large number of point queries (queries that return a single data item). The next
figure illustrates storing sequential sets (ranges) of data in shards.
The Hash strategy. The purpose of this strategy is to reduce the chance of hotspots (shards
that receive a disproportionate amount of load). It distributes the data across the shards in a way
that achieves a balance between the size of each shard and the average load that each shard will
encounter. The sharding logic computes the shard to store an item in based on a hash of one or more
attributes of the data. The chosen hashing function should distribute data evenly across the shards,
possibly by introducing some random element into the computation. The next figure illustrates
sharding tenant data based on a hash of tenant IDs.
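In its simplest form, the Hash strategy maps a key to one of a fixed number of shards by hashing
it and taking the remainder. The following sketch uses an MD5-based hash for illustration because,
unlike string.GetHashCode in .NET, its value is stable across processes.

using System;
using System.Security.Cryptography;
using System.Text;

public static class HashSharding
{
    // Sketch of the Hash strategy: map a tenant ID to one of shardCount shards
    // using a deterministic hash.
    public static int GetShardIndex(string tenantId, int shardCount)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(tenantId));

            // Mask the sign bit so the value is non-negative before the modulo.
            int value = BitConverter.ToInt32(hash, 0) & int.MaxValue;
            return value % shardCount;
        }
    }
}

A plain modulo scheme like this makes rebalancing expensive when the shard count changes;
consistent hashing is a common refinement.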
The three sharding strategies have the following advantages and considerations:
• Lookup. This offers more control over the way that shards are configured and used. Using virtual
shards reduces the impact when rebalancing data because new physical partitions can be added
to even out the workload. The mapping between a virtual shard and the physical partitions that
implement the shard can be modified without affecting application code that uses a shard key to
store and retrieve data. Looking up shard locations can impose an additional overhead.
• Range. This is easy to implement and works well with range queries because they can often fetch
multiple data items from a single shard in a single operation. This strategy offers easier data
management. For example, if users in the same region are in the same shard, updates can be
scheduled in each time zone based on the local load and demand pattern. However, this strategy
doesn't provide optimal balancing between shards. Rebalancing shards is difficult and might not
resolve the problem of uneven load if the majority of activity is for adjacent shard keys.
• Hash. This strategy offers a better chance of more even data and load distribution. Request
routing can be accomplished directly by using the hash function. There's no need to maintain
a map. Note that computing the hash might impose an additional overhead. Also, rebalancing
shards is difficult.
Most common sharding systems implement one of the approaches described above, but you should
also consider the business requirements of your applications and their patterns of data usage. For
example, in a multi-tenant application:
• You can shard data based on workload. You could segregate the data for highly volatile tenants
in separate shards. The speed of data access for other tenants might be improved as a result.
• You can shard data based on the location of tenants. You can take the data for tenants in a
specific geographic region offline for backup and maintenance during off-peak hours in that
region, while the data for tenants in other regions remains online and accessible during their
business hours.
• High-value tenants could be assigned their own private, high performing, lightly loaded shards,
whereas lower-value tenants might be expected to share more densely-packed, busy shards.
• The data for tenants that need a high degree of data isolation and privacy can be stored
on a completely separate server.
The Lookup strategy permits scaling and data movement operations to be carried out at the user
level, either online or offline. The technique is to suspend some or all user activity (perhaps during
off-peak periods), move the data to the new virtual partition or physical shard, change the mappings,
invalidate or refresh any caches that hold this data and then allow user activity to resume. Often
this type of operation can be centrally managed. The Lookup strategy requires state to be highly
cacheable and replica friendly.
The Range strategy imposes some limitations on scaling and data movement operations, which must
typically be carried out when a part or all of the data store is offline because the data must be split
and merged across the shards. Moving the data to rebalance shards might not resolve the problem
of uneven load if the majority of activity is for adjacent shard keys or data identifiers that are within
the same range. The Range strategy might also require some state to be maintained in order to map
ranges to the physical partitions.
The Hash strategy makes scaling and data movement operations more complex because the
partition keys are hashes of the shard keys or data identifiers. The new location of each shard must
be determined from the hash function, or the function modified to provide the correct mappings.
However, the Hash strategy doesn't require maintenance of state.
Consider the following points when deciding how to implement this pattern:
• Keep shards balanced so they all handle a similar volume of I/O. As data is inserted and deleted,
it's necessary to periodically rebalance the shards to guarantee an even distribution and to
reduce the chance of hotspots. Rebalancing can be an expensive operation. To reduce the
necessity of rebalancing, plan for growth by ensuring that each shard contains sufficient free
space to handle the expected volume of changes. You should also develop strategies and scripts
you can use to quickly rebalance shards if this becomes necessary.
• Use stable data for the shard key. If the shard key changes, the corresponding data item might
have to move between shards, increasing the amount of work performed by update operations.
For this reason, avoid basing the shard key on potentially volatile information. Instead, look for
attributes that are invariant or that naturally form a key.
• Ensure that shard keys are unique. For example, avoid using autoincrementing fields as the
shard key. In some systems, autoincremented fields can't be coordinated across shards, possibly
resulting in items in different shards having the same shard key.
• Autoincremented values in other fields that are not shard keys can also cause problems.
For example, if you use autoincremented fields to generate unique IDs, then two
different items located in different shards might be assigned the same ID.
• Queries that access only a single shard are more efficient than those that retrieve data from
multiple shards, so avoid implementing a sharding system that results in applications performing
large numbers of queries that join data held in different shards. Remember that a single shard
can contain the data for multiple types of entities. Consider denormalising your data to keep
related entities that are commonly queried together (such as the details of customers and the
orders that they have placed) in the same shard to reduce the number of separate reads that
an application performs.
• If an entity in one shard references an entity stored in another shard, include the
shard key for the second entity as part of the schema for the first entity. This can
help to improve the performance of queries that reference related data across shards.
• If an application must perform queries that retrieve data from multiple shards, it might be
possible to fetch this data by using parallel tasks. Examples include fan-out queries, where data
from multiple shards is retrieved in parallel and then aggregated into a single result. However,
this approach inevitably adds some complexity to the data access logic of a solution.
• For many applications, creating a larger number of small shards can be more efficient than
having a small number of large shards because they can offer increased opportunities for load
balancing. This can also be useful if you anticipate the need to migrate shards from one physical
location to another. Moving a small shard is quicker than moving a large one.
• Make sure that the resources available to each shard storage node are sufficient to handle the
scalability requirements in terms of data size and throughput. For more information, see the
section "Designing Partitions for Scalability" in the Data Partitioning Guidance.
• Consider replicating reference data to all shards. If an operation that retrieves data from a
shard also references static or slow-moving data as part of the same query, add this data to the
shard. The application can then fetch all of the data for the query easily, without having to make
an additional round trip to a separate data store.
• If reference data held in multiple shards changes, the system must synchronise these
changes across all shards. The system can experience a degree of inconsistency while
this synchronisation occurs. If you do this, you should design your applications to be
able to handle it.
• It can be difficult to maintain referential integrity and consistency between shards, so you should
minimise operations that affect data in multiple shards. If an application must modify data across
shards, evaluate whether complete data consistency is actually required. Instead, a common
approach in the cloud is to implement eventual consistency. The data in each partition is updated
separately, and the application logic must take responsibility for ensuring that the updates all
complete successfully, as well as handling the inconsistencies that can arise from querying data
while an eventually consistent operation is running. For more information about implementing
eventual consistency, see the Data Consistency Primer.
• Configuring and managing a large number of shards can be a challenge. Tasks such as
monitoring, backing up, checking for consistency and logging or auditing must be accomplished
on multiple shards and servers, possibly held in multiple locations. These tasks are likely to
be implemented using scripts or other automation solutions, but that might not completely
eliminate the additional administrative requirements.
• Shards can be geolocated so that the data that they contain is close to the instances of an
application that use it. This approach can considerably improve performance, but requires
additional consideration for tasks that must access multiple shards in different locations.
The primary focus of sharding is to improve the performance and scalability of a system, but as
a by-product it can also improve availability due to how the data is divided into separate partitions.
A failure in one partition doesn't necessarily prevent an application from accessing data held in
other partitions, and an operator can perform maintenance or recovery of one or more partitions
without making the entire data for an application inaccessible. For more information, see the Data
Partitioning Guidance.
Example
The following example in C# uses a set of SQL Server databases acting as shards. Each database
holds a subset of the data used by an application. The application retrieves data that's distributed
across the shards using its own sharding logic (this is an example of a fan-out query). The details
of the data that's located in each shard are returned by a method called GetShards. This method
returns an enumerable list of ShardInformation objects, where the ShardInformation type contains
an identifier for each shard and the SQL Server connection string that an application should use
to connect to the shard (the connection strings aren't shown in the code example).
The code below shows how the application uses the list of ShardInformation objects to perform
a query that fetches data from each shard in parallel. The details of the query aren't shown, but
in this example the data that's retrieved contains a string that could hold information such as the
name of a customer if the shards contain the details of customers. The results are aggregated into
a ConcurrentBag collection for processing by the application.
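The fan-out code might look like the following sketch. The ShardInformation type and GetShards
method follow the description above; the query parameter and the use of Parallel.ForEach to run
the per-shard queries concurrently are illustrative assumptions.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

public class ShardInformation
{
    public int Id { get; set; }
    public string ConnectionString { get; set; }
}

public static class FanOutQuery
{
    // Sketch: query every shard in parallel and aggregate the results.
    public static ConcurrentBag<string> QueryAllShards(
        IEnumerable<ShardInformation> shards, string query)
    {
        var results = new ConcurrentBag<string>();

        Parallel.ForEach(shards, shard =>
        {
            using (var connection = new SqlConnection(shard.ConnectionString))
            using (var command = new SqlCommand(query, connection))
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // For example, a customer name held in this shard.
                        results.Add(reader.GetString(0));
                    }
                }
            }
        });

        return results;
    }
}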
• Data Partitioning Guidance. Sharding a data store can introduce a range of additional issues.
Describes these issues in relation to partitioning data stores in the cloud to improve scalability,
reduce contention and optimise performance.
• Index Table pattern. Sometimes it isn't possible to completely support queries just through the
design of the shard key. Enables an application to quickly retrieve data from a large data store
by specifying a key other than the shard key.
• Materialised View pattern. To maintain the performance of some query operations, it's useful to
create materialised views that aggregate and summarise data, especially if this summary data
is based on information that's distributed across shards. Describes how to generate and populate
these views.
• Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes on Dare
Obasanjo's blog.
Sidecar pattern
Deploy components of an application into a separate process or container to provide isolation
and encapsulation.
This pattern is named Sidecar because it resembles a sidecar attached to a motorcycle. In the
pattern, the sidecar is attached to a parent application and provides supporting features for the
application. The sidecar also shares the same lifecycle as the parent application, being created and
retired alongside the parent. The sidecar pattern is sometimes referred to as the sidekick pattern and
is a decomposition pattern.
Applications and services often require related functionality, such as monitoring, logging,
configuration and networking services. These peripheral tasks can be implemented as separate
components. If they are tightly integrated into the application, they can run in the same process as the application,
making efficient use of shared resources. However, this also means they are not well isolated, and
an outage in one of these components can affect other components or the entire application. Also,
they usually need to be implemented using the same language as the parent application. As a result,
the component and the application have close interdependence on each other.
If the application is decomposed into services, then each service can be built using different
languages and technologies. While this gives more flexibility, it means that each component has its
own dependencies and requires language-specific libraries to access the underlying platform and
any resources shared with the parent application. In addition, deploying these features as separate
services can add latency to the application. Managing the code and dependencies for these
language-specific interfaces can also add considerable complexity, especially for hosting, deployment
and management.
Solution
Co-locate a cohesive set of tasks with the primary application, but place them inside their own
process or container, providing a homogeneous interface for platform services across languages.
• The sidecar can access the same resources as the primary application. For example, a sidecar
can monitor system resources used by both the sidecar and the primary application.
• Because of its proximity to the primary application, there's no significant latency when
communicating between them.
• Even for applications that don't provide an extensibility mechanism, you can use a sidecar
to extend functionality by attaching it as its own process in the same host or sub-container as the
primary application.
The sidecar pattern is often used with containers and referred to as a sidecar container or sidekick
container.
• Before putting functionality into a sidecar, consider whether it would work better as a separate
service or a more traditional daemon.
• Also consider whether the functionality could be implemented as a library or using a traditional
extension mechanism. Language-specific libraries may have a deeper level of integration and less
network overhead.
Use this pattern when:
• You need fine-grained control over resource limits for a particular resource or component. For example,
you may want to restrict the amount of memory a specific component uses. You can deploy the
component as a sidecar and manage memory usage independently of the main application.
This pattern may not be suitable:
• For small applications where the resource cost of deploying a sidecar service for each instance
is not worth the advantage of isolation.
• When the service needs to scale differently than or independently from the main applications.
If so, it may be better to deploy the feature as a separate service.
Example
The sidecar pattern is applicable to many scenarios. Some common examples:
• Infrastructure API. The infrastructure development team creates a service that's deployed
alongside each application, instead of a language-specific client library to access the
infrastructure. The service is loaded as a sidecar and provides a common layer for infrastructure
services, including logging, environment data, configuration store, discovery, health checks and
watchdog services. The sidecar also monitors the parent application's host environment and
process (or container) and logs the information to a centralised service.
• Manage NGINX/HAProxy. Deploy NGINX with a sidecar service that monitors environment state,
then updates the NGINX configuration file and recycles the process when a change in state is
needed.
• Ambassador sidecar. Deploy an ambassador service as a sidecar. The application calls through
the ambassador, which handles request logging, routing, circuit breaking and other connectivity
related features.
• Offload proxy. Place an NGINX proxy in front of a node.js service instance, to handle serving
static file content for the service.
Related guidance
• Ambassador pattern
Static Content Hosting pattern
Deploy static content to a cloud-based storage service that can deliver it directly to the client.
This can reduce the need for potentially expensive compute instances.
Although web servers are well tuned to optimise requests through efficient dynamic page code
execution and output caching, they still have to handle requests to download static content.
This consumes processing cycles that could often be put to better use.
Solution
In most cloud hosting environments it's possible to minimise the need for compute instances (for
example, use a smaller instance or fewer instances), by locating some of an application's resources
and static pages in a storage service. The cost for cloud-hosted storage is typically much less than
for compute instances.
When hosting some parts of an application in a storage service, the main considerations are related
to deployment of the application and to securing resources that aren't intended to be available
to anonymous users.
• For maximum performance and availability, consider using a content delivery network (CDN)
to cache the contents of the storage container in multiple datacentres around the world.
However, you'll likely have to pay for using the CDN.
Storage accounts are often geo-replicated by default to provide resilience against events that
might affect a datacentre. This means that the IP address might change, but the URL will remain
the same.
• When some content is located in a storage account and other content is in a hosted compute
instance it becomes more challenging to deploy an application and to update it. You might have
to perform separate deployments, and version the application and content to manage it more
easily – especially when the static content includes script files or UI components. However, if only
static resources have to be updated, they can simply be uploaded to the storage account without
needing to redeploy the application package.
• The storage containers must be configured for public read access, but it's vital to ensure that
they aren't configured for public write access to prevent users being able to upload content.
Consider using a valet key or token to control access to resources that shouldn't be available
anonymously – see the Valet Key pattern for more information.
This pattern is useful for:
• Minimising the hosting cost for websites that consist of only static content and resources.
Depending on the capabilities of the hosting provider's storage system, it might be possible
to entirely host a fully static website in a storage account.
• Exposing static resources and content for applications running in other hosting environments
or on-premises servers.
• Locating content in more than one geographical area using a content delivery network that
caches the contents of the storage account in multiple datacentres around the world.
• Monitoring costs and bandwidth usage. Using a separate storage account for some or all of
the static content allows the costs to be more easily separated from hosting and runtime costs.
This pattern might not be useful in the following situations:
• The volume of static content is very small. The overhead of retrieving this content from separate
storage can outweigh the cost benefit of separating it out from the compute resource.
Example
Static content located in Azure Blob storage can be accessed directly by a web browser. Azure
provides an HTTP-based interface over storage that can be publicly exposed to clients. For example,
content in an Azure Blob storage container is exposed using a URL with the following form:
http://[storage-account-name].blob.core.windows.net/[container-name]/[file-name]
When uploading the content it's necessary to create one or more blob containers to hold the files
and documents. Note that the default permission for a new container is Private, and you must
change this to Public to allow clients to access the contents. If it's necessary to protect the content
from anonymous access, you can implement the Valet Key pattern so users must present a valid
token to download the resources.
Blob Service Concepts has information about blob storage, and the ways that you can access and
use it.
The links in the pages delivered to the client must specify the full URL of the blob container and
resource. For example, a page that contains a link to an image in a public container might contain
the following HTML.
<img src="http://mystorageaccount.blob.core.windows.net/myresources/image1.png"
alt="My image" />
If the resources are protected by using a valet key, such as an Azure shared access signature,
this signature must be included in the URLs in the links.
A solution named StaticContentHosting that demonstrates using external storage for static resources
is available from GitHub. The StaticContentHosting.Cloud project contains configuration files
that specify the storage account and container that holds the static content.
<Setting name="StaticContent.StorageConnectionString"
value="UseDevelopmentStorage=true" />
<Setting name="StaticContent.Container" value="static-content" />
The Settings class in the file Settings.cs of the StaticContentHosting.Web project contains methods
to extract these values and build a string value containing the cloud storage account container URL.
An HTML helper extension can then combine this base URL with an application-relative content path,
returning the result through url.Content, as in the sketch below.
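A minimal sketch of such a helper follows. The StaticContentBaseUrl stub stands in for the value
that the sample's Settings class builds from the configuration settings shown above.

using System.Web.Mvc;

// Stub standing in for the sample's Settings class.
public static class Settings
{
    public static string StaticContentBaseUrl =>
        "http://mystorageaccount.blob.core.windows.net/static-content";
}

public static class StaticContentUrlHtmlHelper
{
    // Rewrite an application-relative path (~/images/pic.png) so that it
    // points at the configured blob container instead of the web server.
    public static string StaticContentUrl(this HtmlHelper helper, string contentPath)
    {
        if (contentPath.StartsWith("~"))
        {
            contentPath = contentPath.Substring(1);
        }

        contentPath = string.Format("{0}/{1}",
            Settings.StaticContentBaseUrl.TrimEnd('/'),
            contentPath.TrimStart('/'));

        var url = new UrlHelper(helper.ViewContext.RequestContext);
        return url.Content(contentPath);
    }
}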
The file Index.cshtml in the Views\Home folder contains an image element that uses the
StaticContentUrl method to create the URL for its src attribute.
• Valet Key pattern. If the target resources aren’t supposed to be available to anonymous users
it’s necessary to implement security over the store that holds the static content. Describes how
to use a token or key that provides clients with restricted direct access to a specific resource
or service such as a cloud-hosted storage service.
• An efficient way of deploying a static web site on Azure on the Infosys blog.
Strangler pattern
Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new
applications and services. As features from the legacy system are replaced, the new system eventually
replaces all of the old system’s features, strangling the old system and allowing you to decommission it.
Completely replacing a complex system can be a huge undertaking. Often, you will need a gradual
migration to a new system, while keeping the old system to handle features that haven't been
migrated yet. However, running two separate versions of an application means that clients have to
know where particular features are located. Every time a feature or service is migrated, clients need
to be updated to point to the new location.
Solution
Incrementally replace specific pieces of functionality with new applications and services. Create
a façade that intercepts requests going to the backend legacy system. The façade routes these
requests either to the legacy application or the new services. Existing features can be migrated to
the new system gradually, and consumers can continue using the same interface, unaware that any
migration has taken place.
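As an illustration, the façade's routing decision can be driven by a table of migrated path prefixes.
The prefixes and host names below are placeholders, and a production façade would usually be an
off-the-shelf reverse proxy or gateway rather than custom code.

using System.Collections.Generic;
using System.Linq;

public static class StranglerFacade
{
    // Features already migrated to the new system, keyed by path prefix. The
    // table grows as migration proceeds; clients keep calling the façade.
    private static readonly Dictionary<string, string> migratedRoutes =
        new Dictionary<string, string>
        {
            { "/orders", "https://new-system.example.com" },
            { "/catalogue", "https://new-system.example.com" }
        };

    private const string LegacyHost = "https://legacy.example.com";

    // Sketch: decide which backend should receive an incoming request.
    public static string ResolveBackend(string requestPath)
    {
        var match = migratedRoutes.Keys
            .FirstOrDefault(prefix => requestPath.StartsWith(prefix));

        return match != null
            ? migratedRoutes[match] + requestPath
            : LegacyHost + requestPath;
    }
}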
Consider the following points when deciding how to implement this pattern:
• Structure new applications and services in a way that they can easily be intercepted and replaced
in future strangler migrations.
• At some point, when the migration is complete, the strangler façade will either go away or evolve
into an adaptor for legacy clients.
• Make sure the façade doesn't become a single point of failure or a performance bottleneck.
Use this pattern when gradually migrating a back-end application to a new architecture.
Throttling pattern
Control the consumption of resources used by an instance of an application, an individual tenant
or an entire service. This can allow the system to continue to function and meet service level
agreements, even when an increase in demand places an extreme load on resources.
There are many strategies available for handling varying load in the cloud, depending on the business
goals for the application. One strategy is to use autoscaling to match the provisioned resources
to the user needs at any given time. This has the potential to consistently meet user demand,
while optimising running costs. However, while autoscaling can trigger the provisioning of additional
resources, this provisioning isn't immediate. If demand grows quickly, there can be a window
of time where there's a resource deficit.
Solution
An alternative strategy to autoscaling is to allow applications to use resources only up to a limit,
and then throttle them when this limit is reached. The system should monitor how it's using resources
so that, when usage exceeds the threshold, it can throttle requests from one or more users. This will
enable the system to continue functioning and meet any service level agreements (SLAs) that are
in place. For more information on monitoring resource usage, see the Instrumentation and Telemetry
Guidance.
The system could use several throttling strategies, including:
• Rejecting requests from an individual user who's already accessed system APIs more than n times
per second over a given period of time. This requires the system to meter the use of resources
for each tenant or user running an application. For more information, see the Service Metering
Guidance.
The figure shows an area graph for resource use (a combination of memory, CPU, bandwidth and
other factors) against time for applications that are making use of three features. A feature is an
area of functionality, such as a component that performs a specific set of tasks, a piece of code that
performs a complex calculation or an element that provides a service such as an in-memory cache.
These features are labelled A, B and C.
The area immediately below the line for a feature indicates the resources that are used by
applications when they invoke this feature. For example, the area below the line for Feature A
shows the resources used by applications that are making use of Feature A, and the area between
the lines for Feature A and Feature B indicates the resources used by applications invoking Feature B.
Aggregating the areas for each feature shows the total resource use of the system.
The previous figure illustrates the effects of deferring operations. Just prior to time T1, the total
resources allocated to all applications using these features reach a threshold (the limit of resource
use). At this point, the applications are in danger of exhausting the resources available. In this system,
Feature B is less critical than Feature A or Feature C, so it's temporarily disabled and the resources
it uses are released. Features A and C can continue running using the available resources, and
Feature B can be re-enabled once overall demand falls sufficiently.
The autoscaling and throttling approaches can also be combined to help keep the applications
responsive and within SLAs. If the demand is expected to remain high, throttling provides
a temporary solution while the system scales out. At this point, the full functionality of the
system can be restored.
The next figure shows an area graph of the overall resource use by all applications running
in a system against time, and illustrates how throttling can be combined with autoscaling.
At time T1, the threshold specifying the soft limit of resource use is reached. At this point, the system
can start to scale out. However, if the new resources don't become available quickly enough, then
the existing resources might be exhausted and the system could fail. To prevent this from occurring,
the system is temporarily throttled, as described earlier. When autoscaling has completed and
the additional resources are available, throttling can be relaxed.
Consider the following points when deciding how to implement this pattern:
• Throttling an application, and the strategy to use, is an architectural decision that impacts the
entire design of a system. Throttling should be considered early in the application design process
because it isn't easy to add once a system has been implemented.
• Throttling must be performed quickly. The system must be capable of detecting an increase in
activity and react accordingly. The system must also be able to revert to its original state quickly
after the load has eased. This requires that the appropriate performance data is continually
captured and monitored.
• If a service needs to temporarily deny a user request, it should return a specific error code
so the client application understands that the reason for the refusal to perform an operation
is due to throttling. The client application can wait for a period before retrying the request.
• Throttling can be used as a temporary measure while a system autoscales. In some cases it's
better to simply throttle, rather than to scale, if a burst in activity is sudden and isn't expected
to be long lived because scaling can add considerably to running costs.
• If throttling is being used as a temporary measure while a system autoscales, and if resource
demands grow very quickly, the system might not be able to continue functioning – even
when operating in a throttled mode. If this isn't acceptable, consider maintaining larger capacity
reserves and configuring more aggressive autoscaling.
Use this pattern:
• To help cost-optimise a system by limiting the maximum resource levels needed to keep
it functioning.
Example
The final figure illustrates how throttling can be implemented in a multi-tenant system. Users from
each of the tenant organisations access a cloud-hosted application where they fill in and submit
surveys. The application contains instrumentation that monitors the rate at which these users are
submitting requests to the application.
In order to prevent the users from one tenant affecting the responsiveness and availability of the
application for all other users, a limit is applied to the number of requests per second the users
from any one tenant can submit. The application blocks requests that exceed this limit.
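A sketch of such a limit follows, using a fixed one-second window per tenant. The in-memory
counter is an illustrative simplification; with multiple application instances, the counters would
need to live in a shared store.

using System;
using System.Collections.Concurrent;

public class TenantRateLimiter
{
    private readonly int maxRequestsPerSecond;
    private readonly ConcurrentDictionary<string, (long Window, int Count)> counters =
        new ConcurrentDictionary<string, (long, int)>();

    public TenantRateLimiter(int maxRequestsPerSecond)
    {
        this.maxRequestsPerSecond = maxRequestsPerSecond;
    }

    // Returns false when the tenant has exceeded its quota for the current
    // one-second window. The caller should reject the request with an error
    // code that tells the client it's being throttled (for example, HTTP 429).
    public bool TryAcquire(string tenantId)
    {
        long window = DateTime.UtcNow.Ticks / TimeSpan.TicksPerSecond;

        var entry = counters.AddOrUpdate(
            tenantId,
            _ => (window, 1),
            (_, current) => current.Window == window
                ? (current.Window, current.Count + 1)  // same window: count up
                : (window, 1));                        // new window: start over

        return entry.Count <= maxRequestsPerSecond;
    }
}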
• Service Metering Guidance. Describes how to meter the use of services in order to gain an
understanding of how they are used. This information can be useful in determining how to
throttle a service.
• Autoscaling Guidance. Throttling can be used as an interim measure while a system autoscales,
or to remove the need for a system to autoscale. Contains information on autoscaling strategies.
• Queue-based Load Levelling pattern. Queue-based load levelling is a commonly used mechanism
for implementing throttling. A queue can act as a buffer that helps to even out the rate at which
requests sent by an application are delivered to a service.
• Priority Queue pattern. A system can use priority queuing as part of its throttling strategy
to maintain performance for critical or higher value applications, while reducing the
performance of less important applications.
Valet Key pattern
Use a token that provides clients with restricted, time-limited direct access to a specific resource
or service, in order to offload data transfer operations from the application code.
Client applications frequently need to upload data to, or download data from, storage that the
application controls. Routing every transfer through the application consumes server resources, but
handing clients the credentials for the store isn't a realistic approach in distributed systems that
need to serve untrusted clients. Instead, applications must be able to securely control access to data
in a granular way, while still reducing the load on the server by setting up the connection and then
allowing the client to communicate directly with the data store to perform the required read or
write operations.
Solution
You need to resolve the problem of controlling access to a data store where the store can't manage
authentication and authorisation of clients. One typical solution is to restrict access to the data
store's public connection and provide the client with a key or token that the data store can validate.
This key or token is usually referred to as a valet key. It provides time-limited access to specific
resources and allows only predefined operations such as reading and writing to storage or queues,
or uploading and downloading in a web browser. Applications can create and issue valet keys to
client devices and web browsers quickly and easily, allowing clients to perform the required operations
without requiring the application to directly handle the data transfer. This removes the processing
overhead, and the impact on performance and scalability, from the application and the server.
The client uses this token to access a specific resource in the data store for only a specific period,
and with specific restrictions on access permissions, as shown in the figure. After the specified period,
the key becomes invalid and won't allow access to the resource.
The key can also be invalidated by the application. This is a useful approach if the client notifies the
server that the data transfer operation is complete. The server can then invalidate that key to prevent
further use.
Using this pattern can simplify managing access to resources because there's no requirement
to create and authenticate a user, grant permissions and then remove the user again. It also makes
it easy to limit the location, the permission and the validity period – all by simply generating a key
at runtime. The important factors are to limit the validity period, and especially the location of the
resource, as tightly as possible so that the recipient can only use it for the intended purpose.
Consider the following points when deciding how to implement this pattern:
Manage the validity status and period of the key. If leaked or compromised, the key effectively
unlocks the target item and makes it available for malicious use during the validity period. A key can
usually be revoked or disabled, depending on how it was issued. Server-side policies can be changed,
or the server key it was signed with can be invalidated. Specify a short validity period to minimise
the risk of allowing unauthorised operations to take place against the data store. However, if the
validity period is too short, the client might not be able to complete the operation before the key
expires. Allow authorised users to renew the key before the validity period expires if multiple accesses
to the protected resource are required.
Control the level of access the key will provide. Typically, the key should allow the user to only
perform the actions necessary to complete the operation, such as read-only access if the client
shouldn't be able to upload data to the data store. For file uploads, it's common to specify a key that
provides write-only permission, as well as the location and the validity period. It's critical to accurately
specify the resource or the set of resources to which the key applies.
Consider how to control users' behaviour. Implementing this pattern means some loss of control
over the resources users are granted access to. The level of control that can be exerted is limited
by the capabilities of the policies and permissions available for the service or the target data store.
For example, it's usually not possible to create a key that limits the size of the data to be written
to storage, or the number of times the key can be used to access a file. This can result in huge
unexpected costs for data transfer, even when used by the intended client, and might be caused
by an error in the code that causes repeated upload or download. To limit the number of times
a file can be uploaded, where possible, force the client to notify the application when one operation
has completed. For example, some data stores raise events the application code can use to monitor
operations and control user behaviour. However, it's hard to enforce quotas for individual users
in a multi-tenant scenario where the same key is used by all the users from one tenant.
Validate, and optionally sanitise, all uploaded data. A malicious user that gains access to the key
could upload data designed to compromise the system. Alternatively, authorised users might upload
data that's invalid and, when processed, could result in an error or system failure. To protect against
this, ensure that all uploaded data is validated and checked for malicious content before use.
Deliver the key securely. It can be embedded in a URL that the user activates in a web page,
or it can be used in a server redirection operation so that the download occurs automatically.
Always use HTTPS to deliver the key over a secure channel.
Protect sensitive data in transit. Sensitive data delivered through the application will usually take
place using SSL or TLS, and this should be enforced for clients accessing the data store directly.
• The flexibility of key policies that can be generated might be limited. For example, some
mechanisms only allow the use of a timed expiry period. Others aren't able to specify a sufficient
granularity of read/write permissions.
• If the start time for the key or token validity period is specified, ensure that it's a little earlier than
the current server time to allow for client clocks that might be slightly out of synchronisation.
The default, if not specified, is usually the current server time.
• The URL containing the key will be recorded in server log files. While the key will typically have
expired before the log files are used for analysis, ensure that you limit access to them. If log
data is transmitted to a monitoring system or stored in another location, consider implementing
a delay to prevent leakage of keys until after their validity period has expired.
• If the client code runs in a web browser, the browser might need to support cross-origin resource
sharing (CORS) to enable code that executes within the web browser to access data in a different
domain from the one that served the page. Some older browsers and some data stores don't
support CORS, and code that runs in these browsers might not be able to use a valet key to provide
access to data in a different domain, such as a cloud storage account.
This pattern is useful in the following situations:
• To minimise operational cost. Enabling direct access to stores and queues is resource and cost
efficient, can result in fewer network round trips, and might allow for a reduction in the number
of compute resources required.
• When clients regularly upload or download data, particularly where there's a large volume
or when each operation involves large files.
• When data is stored in a remote data store or a different datacentre. If the application was
required to act as a gatekeeper, there might be a charge for the additional bandwidth of
transferring the data between datacentres, or across public or private networks between
the client and the application, and then between the application and the data store.
This pattern might not be suitable in the following situations:
• If the application must perform some task on the data before it's stored or before it's sent to the
client. For example, if the application needs to perform validation, log access success or execute
a transformation on the data. However, some data stores and clients are able to negotiate and
carry out simple transformations such as compression and decompression (for example, a web
browser can usually handle GZip formats).
• If the design of an existing application makes it difficult to incorporate the pattern. Using this
pattern typically requires a different architectural approach for delivering and receiving data.
Example
Azure supports shared access signatures on Azure Storage for granular access control to data in
blobs, tables and queues, and for Service Bus queues and topics. A shared access signature token
can be configured to provide specific access rights such as read, write, update and delete to a specific
table, a key range within a table, a queue, a blob or a blob container. The validity can be a specified
time period or with no time limit.
Azure shared access signatures also support server-stored access policies that can be associated
with a specific resource such as a table or blob. This feature provides additional control and flexibility
compared to application-generated shared access signature tokens, and should be used whenever
possible. Settings defined in a server-stored policy can be changed and are reflected in the token
without requiring a new token to be issued, but settings defined in the token can't be changed
without issuing a new token. This approach also makes it possible to revoke a valid shared access
signature token before it's expired.
For more information see Introducing Table SAS (Shared Access Signature), Queue SAS and update
to Blob SAS and Using Shared Access Signatures on MSDN.
The following code shows how to create a shared access signature token that's valid for five
minutes. The GetSharedAccessReferenceForUpload method returns a shared access signature token
that can be used to upload a file to Azure Blob Storage.
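In the sample, this logic lives in the ValuesController Web API class. The following sketch of the
method uses the classic Azure Storage client library, with the container and blob names supplied
as placeholder parameters rather than the sample's exact signature.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class ValetKeyIssuer
{
    // Sketch: issue a write-only shared access signature for a single blob,
    // valid for five minutes.
    public static string GetSharedAccessReferenceForUpload(
        CloudStorageAccount account, string containerName, string blobName)
    {
        var blobClient = account.CreateCloudBlobClient();
        var container = blobClient.GetContainerReference(containerName);
        var blob = container.GetBlockBlobReference(blobName);

        var policy = new SharedAccessBlobPolicy
        {
            // Allow the client to upload only; no read or delete access.
            Permissions = SharedAccessBlobPermissions.Write,

            // Specify a start time five minutes earlier to allow for client clock skew.
            SharedAccessStartTime = DateTime.UtcNow.AddMinutes(-5),

            // The token expires five minutes from now.
            SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(5)
        };

        // Return the blob URI plus the signature query string.
        return blob.Uri + blob.GetSharedAccessSignature(policy);
    }
}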
The complete sample is available in the ValetKey solution available for download from
GitHub. The ValetKey.Web project in this solution contains a web application that includes the
ValuesController class described above. A sample client application that uses this web application to
retrieve a shared access signature key and upload a file to blob storage is available in the
ValetKey.Client project.
• Gatekeeper Pattern. This pattern can be used in conjunction with the Valet Key pattern to
protect applications and services by using a dedicated host instance that acts as a broker
between clients and the application or service. The gatekeeper validates and sanitises requests,
and passes requests and data between the client and the application. Can provide an additional
layer of security, and reduce the attack surface of the system.
• Static Content Hosting Pattern. Describes how to deploy static resources to a cloud-based
storage service that can deliver these resources directly to the client to reduce the requirement
for expensive compute instances. Where the resources aren't intended to be publicly available,
the Valet Key pattern can be used to secure them.
• Introducing Table SAS (Shared Access Signature), Queue SAS and update to Blob SAS
Design review checklists
DevOps
Use this checklist as a starting point to assess your DevOps culture and process.
Culture
Ensure business alignment across organisations and teams
• Ensure that the business, development and operations teams are all aligned.
Do proactive planning.
• Proactively plan for failure.
• Have processes in place to quickly identify issues when they occur.
• Escalate to the correct team members to fix and confirm resolution.
Document operations
• Document all tools, processes and automated tasks with the same level of quality as your
product code.
• Document the current design and architecture of any systems you support, along with
recovery processes and other maintenance procedures.
• Focus on the steps you actually perform, not theoretically optimal processes.
• Regularly review and update the documentation.
• For code, make sure that meaningful comments are included, especially in public APIs,
and use tools to automatically generate code documentation.
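For example, XML documentation comments in C# can be processed by the compiler and documentation tools to generate reference documentation automatically; the method below is purely illustrative.

/// <summary>
/// Calculates the total price of an order, including tax.
/// </summary>
/// <param name="subtotal">The pre-tax order amount.</param>
/// <param name="taxRate">The tax rate as a fraction, for example 0.20.</param>
/// <returns>The total amount payable.</returns>
public decimal CalculateTotal(decimal subtotal, decimal taxRate)
{
    return subtotal * (1 + taxRate);
}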
Share knowledge
• Ensure the documentation is organised and easily discoverable.
• Use brown bags (informal presentations), videos or newsletters to share knowledge.
Development
Provide developers with production-like environments
• Keep development and test environments as close to the production environment
as possible.
• Make sure that test data is consistent with the data used in production, even if it's sample
data and not real production data (for privacy or compliance reasons).
• Plan to generate and anonymise sample test data.
Ensure that all authorised team members can provision infrastructure and deploy the
application
• Anyone with the right permissions should be able to create or deploy production-like
resources without going to the operations team.
Automate testing
• Automate common testing tasks and integrate the tests into your build processes.
• Integrated UI tests should also be performed by an automated tool.
Test in production
• Have tests in place to ensure that deployed code works as expected.
• For deployments that are infrequently updated, schedule production testing as a regular
part of maintenance.
Release
Automate deployments
• Automate deploying the application to test, staging and production environments.
• Automate all deployments and have systems in place to detect any problems during
rollout.
• Have a mitigation process for preserving the existing code and data in production
before the update replaces them in all production instances.
• Have an automated way to roll forward fixes or roll back changes.
Monitoring
Make systems observable
• Set up external health endpoints to monitor status and ensure that applications
are coded to instrument the operations metrics.
• Use a common and consistent schema that lets you correlate events across systems.
Management
Automate operations tasks
• Automate repetitive operations processes whenever possible to ensure consistent execution
and quality.
• Code that implements the automation should be versioned in source control.
As with any other code, automation tools must be tested.
Availability
• Deploy all components, services, resources and compute instances as multiple instances.
• Design the application to be configurable to use multiple instances.
• Design the application to automatically detect failures and redirect requests to non-failed
instances.
• Make message consumers and the operations they carry out idempotent.
• Detect duplicated messages or use an optimistic approach to handling conflicts.
• Use asynchronous messaging where the sender does not block waiting for a reply.
• Use a messaging system that provides high availability and guarantees at-least-once
semantics.
• Make message processing idempotent (see the previous item and the sketch at the end of this section).
• Design applications to handle varying workloads, such as peaks first thing in the morning
or when a new product is released on an ecommerce site.
• Use auto-scaling where possible.
• Queue requests to services and degrade gracefully when queues are near to full capacity.
• Ensure that there is sufficient performance and capacity available under non-burst
conditions to drain the queues and handle outstanding requests. For more information,
see the Queue-Based Load Levelling Pattern.
• Deploy at least two instances of each role in the service. This enables one instance to be unavailable while the other remains active.
• Host vital business applications in more than one region to provide maximum availability.
• Automate deployment using tested and proven mechanisms such as scripts and
Resource Manager templates.
• Use automated techniques to perform all application updates.
• Fully test your automated processes to ensure there are no errors.
• Use security restrictions on automation tools.
• Carefully define and enforce deployment policies.
• Azure App Service supports swapping between staging and production environments
without application downtime.
• If you prefer to stage on-premises, or deploy different versions of the application
concurrently and gradually migrate users, you may not be able to use a VIP Swap operation.
• The configuration settings for an Azure application or service can be changed without
requiring the role to be restarted.
• Design an application to accept changes to configuration settings without requiring
a restart of the whole application.
• Specify how many upgrade domains should be created for a service when the service
is deployed.
Note
Roles are also distributed across fault domains, each of which is reasonably independent from
other fault domains in terms of server rack, power and cooling provision, in order to minimise
the chance of a failure affecting all role instances. This distribution occurs automatically and
you cannot control it.
• Place two or more virtual machines in the same availability set to guarantee that they will
not be deployed to the same fault domain.
• To maximise availability, create multiple instances of each critical virtual machine used
by your system and place these instances in the same availability set.
• If you are running multiple virtual machines that serve different purposes, create a separate availability set for each role or tier.
• Add two or more instances of each role to its availability set.
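The following is a minimal sketch of the idempotent message handling described above: the consumer records each message ID it has processed, so a duplicate delivery is detected and skipped. The in-memory store is an assumption for brevity; a production system would use a durable store such as a database table.

using System;
using System.Collections.Concurrent;

// Minimal sketch of an idempotent message consumer: a store of processed
// message IDs lets the consumer detect and skip duplicate deliveries.
class IdempotentConsumer
{
    private readonly ConcurrentDictionary<Guid, bool> processed =
        new ConcurrentDictionary<Guid, bool>();

    public void Handle(Guid messageId, Action operation)
    {
        // TryAdd returns false if the ID was already recorded, so a redelivered
        // message is acknowledged without re-running the work.
        if (!processed.TryAdd(messageId, true))
        {
            return; // duplicate delivery; the operation has already run
        }
        operation();
    }

    static void Main()
    {
        var consumer = new IdempotentConsumer();
        var id = Guid.NewGuid();
        consumer.Handle(id, () => Console.WriteLine("Processed once"));
        consumer.Handle(id, () => Console.WriteLine("Never printed"));
    }
}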
Data management
Geo-replicate databases
For more information, see How to distribute data globally with Azure Cosmos DB.
• Ensure backup and restore meets the Recovery Point Objective (RPO).
• Regularly and automatically back up data that is not preserved elsewhere.
• Verify you can reliably restore both the data and the application itself should a failure occur.
• Secure the backup process to protect the data in transit and in storage.
Enable the high availability option to maintain a secondary copy of an Azure Redis cache
• When using Azure Redis Cache, choose the standard or premium tier to maintain a
secondary copy of the contents. For more information, see Create a cache in Azure Redis
Cache.
• Ensure that the timeouts you apply are appropriate for each service or resource as well as
the client that is accessing them.
• It may be appropriate to allow a longer timeout for a particular instance of a client, depending
on the context and other actions that the client is performing.
• Design a retry strategy for access to all services and resources that do not inherently support
automatic connection retry.
• Use a strategy that includes an increasing delay between retries as the number of failures increases; a sketch of this appears at the end of this list.
• Design applications to take advantage of multiple instances without affecting operation and
existing connections where possible.
• Use multiple instances and distribute requests between them and detect and avoid sending
requests to failed instances, in order to maximise availability.
• If you temporarily store writes in blob storage while SQL Database is unavailable, provide a facility to replay those writes to SQL Database when the service becomes available.
• Detect failures and redirect requests to other services that can offer alternative functionality,
or to back up instances that can maintain core operations while the primary service is offline.
• For failures that are likely but have not yet occurred, provide sufficient data to enable
operations staff to determine the cause, mitigate the situation and ensure that the system
remains available.
• For failures that have already occurred, the application should return an error message
to the user but attempt to continue running with reduced functionality.
• In all cases, the monitoring system should capture comprehensive details to enable quick
recovery, and to modify the system to prevent the situation from arising again.
• Implement probes or check functions that are executed regularly from outside
the application.
• Test failover and fallback systems before they are required to compensate for a live
problem at runtime.
• Create an accepted, fully-tested plan for recovery from any type of failure that may affect
system availability.
• Choose a multi-site disaster recovery architecture for any mission-critical applications.
• Identify a specific owner of the disaster recovery plan, including automation and testing.
• Ensure the plan is well-documented, and automate the process as much as possible.
• Establish a backup strategy for all reference and transactional data, and test the restoration
of these backups regularly.
• Train operations staff to execute the plan and perform regular disaster simulations
to validate and improve the plan.
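The retry items above can be illustrated with a minimal sketch of exponential backoff. The operation, attempt limit and initial delay are placeholders; in practice you might use a library such as Polly or the retry policies built into many Azure client libraries.

using System;
using System.Threading.Tasks;

// Minimal sketch of retry with exponential backoff: the delay doubles after
// each failed attempt, up to a fixed number of attempts.
class RetryExample
{
    static async Task<T> RetryWithBackoffAsync<T>(Func<Task<T>> operation, int maxAttempts = 5)
    {
        var delay = TimeSpan.FromSeconds(1);
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Assumed transient failure: wait, then double the delay.
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }

    static async Task Main()
    {
        var result = await RetryWithBackoffAsync(async () =>
        {
            await Task.Delay(10); // stand-in for a call to a remote service
            return 42;
        });
        Console.WriteLine(result);
    }
}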
Scalability
• Design your applications to react to variable load by increasing and decreasing the number of instances of roles, queues and other services they use.
• Implement configuration or auto-detection of instances as they are added and removed, so that code in the application can perform the necessary routing.
Scale as a unit
• Where possible, ensure that the application does not require affinity. Requests can then be routed to any instance, and the number of instances is irrelevant.
• Where there are many background tasks, or the tasks require considerable time or resources, spread the work across multiple compute units (such as worker roles or background jobs).
Resiliency
Requirements
Define your customer's availability requirements
• Get agreement from your customer for the availability targets of each piece of your application. For more information, see Defining your resilience requirements.
Application Design
Perform a failure mode analysis (FMA) for your application
• Provision more than one medium or larger Application Gateway instance to guarantee
availability of the service under the terms of the SLA.
Use Azure Traffic Manager to route your application's traffic to different regions
Configure and test health probes for your load balancers and traffic managers
• Ensure that your health logic checks the critical parts of the system and responds
appropriately to health probes.
• For a Traffic Manager probe, your health endpoint should check critical dependencies that
are deployed within the same region, and whose failure should trigger a failover to another
region.
• For a load balancer, the health endpoint should report the health of the VM.
• Don't include other tiers or external services. For guidance on implementing health monitoring in your application, see the Health Endpoint Monitoring Pattern and the sketch that follows this list.
• Design each part of your application to allow for asynchronous operations whenever possible.
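To make the health probe items above concrete, here is a minimal sketch of a health endpoint using ASP.NET Web API, in the style of the ValuesController mentioned earlier. The dependency checks are hypothetical placeholders for whatever is critical in your application.

using System.Net;
using System.Net.Http;
using System.Web.Http;

// Minimal sketch of a health probe endpoint: return 200 only when the
// critical dependencies checked here are healthy, otherwise 503.
public class HealthCheckController : ApiController
{
    [HttpGet]
    [Route("api/healthcheck")]
    public HttpResponseMessage Get()
    {
        bool healthy = CheckDatabase() && CheckStorage();
        return Request.CreateResponse(
            healthy ? HttpStatusCode.OK : HttpStatusCode.ServiceUnavailable);
    }

    private bool CheckDatabase() { return true; } // replace with a real dependency check
    private bool CheckStorage() { return true; }  // replace with a real dependency check
}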
Data management
• Evaluate the replication methods for each type of data storage in Azure, including
Azure Storage Replication and SQL Database Active Geo-Replication to ensure that
your application's data requirements are satisfied.
Ensure that no single user account has access to both production and backup data
• Design your application so that any given user account has write access to either production data or backup data, but never to both.
Document and test your failover and failback process
• Regularly test the documented steps to verify that an operator following them can
successfully fail over and fail back the data source.
• Regularly verify that your backup data is what you expect by running a script to validate
data integrity, schema and queries.
• Log and report any inconsistencies so the backup service can be repaired.
Security
Implement the principle of least privilege for access to the application's resources
• The default for access to the application's resources should be as restrictive as possible.
• Grant higher level permissions on an approval basis.
• Verify least privilege permissions for other resources that have their own permissions
systems such as SQL Server.
Testing
• Ensure that your application's dependent services fail over and fail back in the correct order.
Run tests in production using both synthetic and real user data
• Clearly define and document your release process, and ensure it's available to the entire
operations team.
• Use the blue/green or canary release deployment technique to deploy your application
to production.
• Design a rollback process to go back to a last known good version and minimise downtime.
Operations
Measure remote call statistics and make the information available to the application team
• Summarise remote call metrics such as latency, throughput and errors at the 99th and 95th percentiles.
• Perform statistical analysis on the metrics to uncover errors that occur within each
percentile.
• Identify the key performance indicators of your application's health, such as transient
exceptions and remote call latency, and set appropriate threshold values for each of them.
• Send an alert to operations when the threshold value is reached.
• Set these thresholds at levels that identify issues before they become critical and require
a recovery response.
Ensure that more than one person on the team is trained to monitor the application
and perform any manual recovery steps
• Train multiple individuals on detection and recovery, and make sure that at least one of them is active at any time.
Ensure that your application does not run up against Azure subscription limits
• If your application requirements exceed Azure subscription limits, create another Azure
subscription and provision sufficient resources there.
Ensure that your application does not run up against per-service limits
• Scale up (for example, choosing another pricing tier) or scale out (adding new instances)
to avoid per-service limits.
Design your application's storage requirements to fall within Azure storage scalability and
performance targets
• If your workload fluctuates over time, use Azure VM scale sets to automatically scale the
number of VM instances.
• If your application uses Azure SQL Database, ensure that you have selected the appropriate
service tier.
• Include a process for contacting support and escalating issues as part of your application's
resilience from the outset.
Ensure that your application doesn't use more than the maximum number of storage
accounts per subscription
• If your application requires more than 200 storage accounts, you will have to create
a new subscription and create additional storage accounts there.
Ensure that your application doesn't exceed the scalability targets for virtual machine
disks
• If your application exceeds the scalability targets for virtual machine disks, provision
additional storage accounts and create the virtual machine disks there.
Telemetry
• Capture robust telemetry information while the application is running in the production
environment.
• Ensure that your logging system correlates calls across service boundaries so that you can
track the request throughout your application.
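As a minimal illustration of cross-service correlation, the sketch below generates a correlation ID and forwards it on an outgoing HTTP call. The header name x-correlation-id is a common convention assumed here, not a standard.

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal sketch: attach a correlation ID header to outgoing requests so that
// log entries from different services can be joined on the same ID.
class CorrelationExample
{
    static async Task Main()
    {
        string correlationId = Guid.NewGuid().ToString();
        Console.WriteLine($"[{correlationId}] request received");

        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("x-correlation-id", correlationId);
            // Downstream services read the header and include the same ID in their logs.
            var response = await client.GetAsync("https://example.com/");
            Console.WriteLine($"[{correlationId}] downstream call returned {(int)response.StatusCode}");
        }
    }
}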
Azure Resources
App Services
• Select a tier and instance size that meet your performance requirements under typical load,
and then scale out the instances to handle changes in traffic volume.
• If your solution has both a web front-end and a web API, consider decomposing them
into separate App Service apps.
• If you don't need that level of scalability at first, you can deploy the apps into the same
plan, and move them into separate plans later.
Avoid using the App Service backup feature to back up Azure SQL databases
• Create a deployment slot for staging. Deploy application updates to the staging slot,
and verify the deployment before swapping it into production.
• When you deploy an update to production, move the previous production deployment into the last-known-good (LKG) slot.
• Don't use the same storage account for logs and application data.
Monitor performance
• Use a performance monitoring service such as Application Insights to monitor application performance and behaviour under load.
Application Gateway
• Deploy Application Gateway with at least two instances. In order to qualify for the SLA,
you must provision two or more medium or larger instances.
Azure Search
• Use at least two replicas for read high-availability, or three for read-write high-availability.
• If the data source is geo-replicated, point each indexer of each regional Azure Search service
to its local data source replica.
• For large datasets stored in Azure SQL Database, point all indexers at the primary replica instead. After a failover, point the Azure Search indexers at the new primary replica.
• If the data source is not geo-replicated, point multiple indexers at the same data source,
so that Azure Search services in multiple regions continuously and independently index
from the data source.
Cosmos DB
• Replicate the database across regions.
SQL Database
• If your primary database fails, or simply needs to be taken offline, perform a manual
failover to the secondary database.
Use sharding
• If you are already using Azure Backup to back up your VMs, consider using Azure Backup
for SQL Server workloads using DPM.
• Otherwise, use SQL Server Managed Backup to Microsoft Azure.
Traffic Manager
• After a Traffic Manager failover, perform manual failback, rather than automatically
failing back.
• Before failing back, verify that all application subsystems are healthy.
• Create a custom endpoint that reports on the overall health of the application.
• Don't report errors for non-critical services, however.
Virtual Machines
• Put multiple VMs in an availability set or VM scale set, with a load balancer in front.
• When you add a new VM to an existing availability set, make sure that you create a NIC
for the VM, and add the NIC to the back-end address pool on the load balancer.
• In an N-tier application, don't put VMs from different tiers into the same availability set.
• To get the redundancy benefit of fault domains (FDs) and update domains (UDs), every VM in the availability set must be able to
handle the same client requests.
• When moving an existing workload to Azure, start with the VM size that's the closest match
to your on-premises servers.
• Then measure the performance of your actual workload with respect to CPU, memory and
disk IOPS, and adjust the size if needed.
• If you need multiple NICs, be aware of the NIC limit for each size.
• Block access from malicious users, or allow access only from users who have privilege
to access the application.
• For an HTTP probe, use a custom endpoint that reports the overall health of the application,
including all critical dependencies.
• Don't block traffic to or from the IP address used by the health probes in any firewall policies or network security group (NSG) rules.
Summary
In this guide you have learned how to choose the right architecture style
for your application, choose the most appropriate compute and data
store technologies, and apply design principles and pillars when building
your applications.
In the future, new trends, user demands and capabilities will continue to create even more
opportunities to enhance your architectures. To stay ahead of the game, we encourage you to keep up to date with the latest resources and guidance.
Chapter 9: Azure reference architectures
Our reference architectures are arranged by scenario, with related
architectures grouped together. Each architecture includes recommended
practices, along with considerations for scalability, availability,
manageability and security. Most also include a deployable solution.
• If you are likely to synchronize more than 100,000 objects from your local directory, use a production version of SQL Server, and use SQL clustering to achieve high
availability.
• Protect on-premises applications that can be accessed externally. Use the Azure AD Application Proxy to provide controlled access to on-premises web applications for
external users.
• Actively monitor Azure AD for signs of suspicious activity.
• Use conditional access control to deny authentication requests from unexpected sources.
Azure
Extend Active Directory Domain Services (AD DS) to Azure
This architecture extends an on-premises Active Directory environment to Azure using Active Directory Domain Services (AD DS). This architecture can reduce the latency caused by sending authentication and local authorization requests from the cloud back to AD DS running on-premises. Consider this option if you need to use AD DS features that are not currently implemented by Azure AD.
Architecture Components
On-premises network. The on-premises network includes local Active Directory servers that can perform authentication and authorization for components located on-premises.
Active Directory servers. These are domain controllers implementing directory services (AD DS) running as VMs in the cloud. These servers can provide authentication of components running in your Azure virtual network.
Active Directory subnet. The AD DS servers are hosted in a separate subnet. Network security group (NSG) rules protect the AD DS servers. User defined routes (UDRs) handle routing for on-premises traffic that passes to Azure. Traffic to and from the Active Directory servers does not pass through the network virtual appliances (NVAs) used in this scenario.
(Diagram: the on-premises network connects through a gateway subnet and private DMZ to an Azure virtual network; the AD DS servers for contoso.com run in an availability set in their own subnet, protected by NSGs and NVAs.)
Recommendations
• Deploy at least two VMs running AD DS as domain controllers and add them to an availability set.
• For VM size, use the on-premises AD DS machines as a starting point, and pick the closest Azure VM sizes.
• Create a separate virtual data disk for storing the database, logs, and SYSVOL for Active Directory.
• Configure the VM network interface (NIC) for each AD DS server with a static private IP address for full domain name service (DNS) support.
• Monitor the resources of the domain controller VMs as well as the AD DS Services and create a plan to quickly correct any problems.
• Perform regular AD DS backups. Don't simply copy the VHD files of domain controllers, because the AD DS database file on the VHD may not be in a consistent state
when it's copied.
• Do not shut down a domain controller VM using the Azure portal. Instead, shut down and restart it from the guest operating system.
• Use either BitLocker or Azure disk encryption to encrypt the disk hosting the AD DS database.
Azure
Create an Active Directory Domain Services (AD DS) resource forest in Azure
This architecture shows an AD DS forest in Azure with a one-way trust relationship with an on-premises AD domain. The forest in Azure contains a domain that does not exist on-premises. This architecture maintains security separation for objects and identities held in the cloud, while allowing on-premises identities to access your applications running in Azure.
Architecture Components
On-premises network. The on-premises network contains its own Active Directory forest and domains.
Active Directory servers. These are domain controllers implementing domain services running as VMs in the cloud. These servers host a forest containing one or more domains, separate from those located on-premises.
One-way trust relationship. The example in the diagram shows a one-way trust from the domain in Azure to the on-premises domain. This relationship enables on-premises users to access resources in the domain in Azure, but not the other way around. It is possible to create a two-way trust if cloud users also require access to on-premises resources.
Active Directory subnet. The AD DS servers are hosted in a separate subnet. Network security group (NSG) rules protect the AD DS servers and provide a firewall against traffic from unexpected sources.
Azure gateway. The Azure gateway provides a connection between the on-premises network and the Azure VNet. This can be a VPN connection or Azure ExpressRoute. For more information, see Implementing a secure hybrid network architecture in Azure.
(Diagram: the on-premises network connects through a gateway subnet to an Azure virtual network containing private and public DMZs, a web tier, a management subnet with a jump box, and an AD DS subnet (contoso.com) with a one-way trust to the on-premises domain.)
Recommendations
• Provision at least two domain controllers for each domain. This enables automatic replication between servers.
• Create an availability set for the VMs acting as Active Directory servers handling each domain. Put at least two servers in this availability set.
• Consider designating one or more servers in each domain as standby operations masters in case connectivity to a server acting as a flexible single master operation (FSMO) role holder fails.
Azure
Active Directory Federation Services (AD FS)
This architecture extends an on-premises network to Azure and uses Active Directory Federation Services (AD FS) to perform federated authentication and authorization. AD FS can be hosted on-premises, but for applications running in Azure, it may be more efficient to replicate AD FS in the cloud. Use this architecture to authenticate users from partner organizations, allow users to authenticate from outside of the organizational firewall, or allow users to connect from authorized mobile devices.
Architecture Components
AD DS subnet. The AD DS servers are contained in their own subnet with network security group (NSG) rules acting as a firewall.
AD DS servers. Domain controllers running as VMs in Azure. These servers provide authentication of local identities within the domain.
AD FS subnet. The AD FS servers are located in their own subnet, with NSG rules acting as a firewall.
AD FS servers. The AD FS servers provide federated authorization and authentication. In this architecture, they perform the following tasks:
• Receiving security tokens containing claims made by a partner federation server on behalf of a partner user. AD FS verifies that the tokens are valid before passing the claims to the web application running in Azure to authorize requests. The web application running in Azure is the relying party. The partner federation server must issue claims that are understood by the web application. The partner federation servers are referred to as account partners, because they submit access requests on behalf of authenticated accounts in the partner organization. The AD FS servers are called resource partners because they provide access to resources (the web application).
• Authenticating and authorizing incoming requests from external users running a web browser or device that needs access to web applications, by using AD DS and the Active Directory Device Registration Service.
The AD FS servers are configured as a farm accessed through an Azure load balancer. This implementation improves availability and scalability. The AD FS servers are not exposed directly to the Internet. All Internet traffic is filtered through AD FS web application proxy servers and a DMZ (also referred to as a perimeter network).
AD FS proxy subnet. The AD FS proxy servers can be contained within their own subnet, with NSG rules providing protection. The servers in this subnet are exposed to the Internet through a set of network virtual appliances that provide a firewall between your Azure virtual network and the Internet.
AD FS web application proxy (WAP) servers. These VMs act as AD FS servers for incoming requests from partner organizations and external devices. The WAP servers act as a filter, shielding the AD FS servers from direct access from the Internet. As with the AD FS servers, deploying the WAP servers in a farm with load balancing gives you greater availability and scalability than deploying a collection of stand-alone servers.
Partner organization. A partner organization running a web application that requests access to a web application running in Azure. The federation server at the partner organization authenticates requests locally, and submits security tokens containing claims to AD FS running in Azure. AD FS in Azure validates the security tokens, and if valid can pass the claims to the web application running in Azure to authorize them.
(Diagram: the on-premises network and a partner network connect to an Azure virtual network containing gateway and management subnets, private and public DMZs, a web tier, an AD DS subnet and an AD FS subnet behind an internal load balancer, with WAP servers in the public DMZ.)
Recommendations
• For VM size, use the on-premises AD FS machines as a starting point, and pick the closest Azure VM sizes.
• Create separate Azure availability sets for the AD FS and WAP VMs, with at least two update domains and two fault domains.
• Place AD FS servers and WAP servers in separate subnets with their own firewalls. Use NSG rules to define firewall rules.
• Configure the network interface for each of the VMs hosting AD FS and WAP servers with static private IP addresses.
• Prevent direct exposure of the AD FS servers to the Internet.
• Do not join the WAP servers to the domain.
Hybrid Network
These reference architectures show proven practices for creating a robust
network connection between an on-premises network and Azure.
• ExpressRoute with VPN failover: use this architecture when you have hybrid applications that need the higher bandwidth of ExpressRoute and require highly available network connectivity.
Internal load balancer. Network traffic from the VPN gateway is routed to the cloud application through an internal load balancer. The load balancer is located in the front-end subnet of the application.
Cloud application. The application hosted in Azure. It might include multiple tiers, with multiple subnets connected through Azure load balancers. For more information about the application infrastructure, see Running Windows VM workloads and Running Linux VM workloads.
Recommendations
• If you need to ensure that the on-premises network remains available to the Azure VPN gateway, implement a failover cluster for the on-premises VPN gateway.
• If your organization has multiple on-premises sites, create multi-site connections to one or more Azure VNets. This approach requires dynamic (route-based) routing, so make sure that the on-premises VPN gateway supports this feature.
• Generate a different shared key for each VPN gateway. Use a strong shared key to help resist brute-force attacks.
• If you need higher bandwidth than a VPN connection supports, consider using an Azure ExpressRoute connection instead.
Azure
Connect an on-premises network to Azure using ExpressRoute
This architecture extends an on-premises network to Azure, using Azure ExpressRoute. ExpressRoute connections use a private, dedicated connection through a third-party connectivity provider. The private connection extends your on-premises network into Azure.
Architecture Components
On-premises network. A private local-area network running within an organization.
ExpressRoute circuit. A layer 2 or layer 3 circuit supplied by the connectivity provider that joins the on-premises network with Azure through the edge routers. The circuit uses the hardware infrastructure managed by the connectivity provider.
Local edge routers. Routers that connect the on-premises network to the circuit managed by the provider. Depending on how your connection is provisioned, you may need to provide the public IP addresses used by the routers.
Microsoft edge routers. Two routers in an active-active highly available configuration. These routers enable a connectivity provider to connect their circuits directly to their datacenter. Depending on how your connection is provisioned, you may need to provide the public IP addresses used by the routers.
Azure virtual networks (VNets). Each VNet resides in a single Azure region, and can host multiple application tiers. Application tiers can be segmented using subnets in each VNet.
Azure public services. Azure services that can be used within a hybrid application. These services are also available over the Internet, but accessing them using an ExpressRoute circuit provides low latency and more predictable performance, because traffic does not go through the Internet. Connections are performed using public peering, with addresses that are either owned by your organization or supplied by your connectivity provider.
Office 365 services. The publicly available Office 365 applications and services provided by Microsoft. Connections are performed using Microsoft peering, with addresses that are either owned by your organization or supplied by your connectivity provider. You can also connect directly to Microsoft CRM Online through Microsoft peering.
Connectivity providers (not shown). Companies that provide a connection either using layer 2 or layer 3 connectivity between your datacenter and an Azure datacenter.
(Diagram: the on-premises corporate network connects through local edge routers and an ExpressRoute circuit to Microsoft edge routers, with private peering to Azure virtual networks, public peering to Azure public services and Microsoft peering to Office 365 services.)
Recommendations
• Ensure that your organization has met the ExpressRoute prerequisite requirements for connecting to Azure. See ExpressRoute prerequisites & checklist.
• Create an Azure VNet with an address space large enough for all of your required resources, with room for growth in case more VMs are needed in the future. The address space of the VNet must not overlap with the on-premises network.
• The virtual network gateway requires a subnet named GatewaySubnet. Do not deploy any VMs to the gateway subnet. Also, do not assign an NSG to this subnet, as it will cause the gateway to stop functioning.
• Although some providers allow you to change your bandwidth, make sure you pick an initial bandwidth that surpasses your needs and provides room for growth.
• Consider the following options for high availability:
If you're using a layer 2 connection, deploy redundant routers in your on-premises network in an active-active configuration. Connect the primary circuit to one router, and the secondary circuit to the other.
If you're using a layer 3 connection, verify that it provides redundant BGP sessions that handle availability for you.
Connect the VNet to multiple ExpressRoute circuits, supplied by different service providers. This strategy provides additional high-availability and disaster recovery capabilities.
Configure a site-to-site VPN as a failover path for ExpressRoute. This option only applies to private peering. For Azure and Office 365 services, the Internet is the only failover path.
Azure
Connect an on-premises network to Azure using ExpressRoute with VPN failover
This architecture extends an on-premises network to Azure by using ExpressRoute, with a site-to-site virtual private network (VPN) as a failover connection. Traffic flows between the on-premises network and the Azure VNet through an ExpressRoute connection. If there is a loss of connectivity in the ExpressRoute circuit, traffic is routed through an IPSec VPN tunnel.
Architecture Components
On-premises network. A private local-area network running within an organization.
VPN appliance. A device or service that provides external connectivity to the on-premises network. The VPN appliance may be a hardware device, or it can be a software solution such as the Routing and Remote Access Service (RRAS) in Windows Server 2012. For a list of supported VPN appliances and information on configuring selected VPN appliances for connecting to Azure, see About VPN devices for Site-to-Site VPN Gateway connections.
ExpressRoute circuit. A layer 2 or layer 3 circuit supplied by the connectivity provider that joins the on-premises network with Azure through the edge routers. The circuit uses the hardware infrastructure managed by the connectivity provider.
ExpressRoute virtual network gateway. The ExpressRoute virtual network gateway enables the VNet to connect to the ExpressRoute circuit used for connectivity with your on-premises network.
VPN virtual network gateway. The VPN virtual network gateway enables the VNet to connect to the VPN appliance in the on-premises network. The VPN virtual network gateway is configured to accept requests from the on-premises network only through the VPN appliance. For more information, see Connect an on-premises network to a Microsoft Azure virtual network.
VPN connection. The connection has properties that specify the connection type (IPSec) and the key shared with the on-premises VPN appliance to encrypt traffic.
Azure Virtual Network (VNet). Each VNet resides in a single Azure region, and can host multiple application tiers. Application tiers can be segmented using subnets in each VNet.
Gateway subnet. The virtual network gateways are held in the same subnet.
Cloud application. The application hosted in Azure. It might include multiple tiers, with multiple subnets connected through Azure load balancers. For more information about the application infrastructure, see Running Windows VM workloads and Running Linux VM workloads.
(Diagram: the on-premises network connects through local edge routers and an ExpressRoute circuit, and separately through a VPN appliance, to the ExpressRoute and VPN gateways in the gateway subnet of an Azure VNet hosting web, business and data tiers plus a management subnet with a jump box.)
Recommendations
• The recommendations from the previous two architectures apply to this architecture.
• After you establish the virtual network gateway connections, test the environment. First make sure you can connect from your on-premises network to your Azure VNet. This connection will use ExpressRoute. Then contact your provider to stop ExpressRoute connectivity for testing, and verify that you can still connect using the VPN connection.
Azure
Implement a hub-spoke network topology in Azure
In this architecture, the hub is an Azure virtual network (VNet) that acts as a central point of connectivity to your on-premises network. The spokes are VNets that peer with the hub, and can be used to isolate workloads. Reasons to consider this architecture:
• Reduce costs by centralizing shared services such as network virtual appliances (NVAs) and DNS servers.
• Overcome subscription limits by peering VNets from different subscriptions to the central hub.
• Separate concerns between central IT (SecOps, InfraOps) and workloads (DevOps).
Architecture Components
On-premises network. A private local-area network running within an organization.
VPN device. A device or service that provides external connectivity to the on-premises network. The VPN device may be a hardware device, or a software solution such as the Routing and Remote Access Service (RRAS) in Windows Server 2012. For a list of supported VPN appliances and information on configuring selected VPN appliances for connecting to Azure, see About VPN devices for Site-to-Site VPN Gateway connections.
VPN virtual network gateway or ExpressRoute gateway. The virtual network gateway enables the VNet to connect to the VPN device, or ExpressRoute circuit, used for connectivity with your on-premises network. For more information, see Connect an on-premises network to a Microsoft Azure virtual network.
Note: the deployment scripts for this reference architecture use a VPN gateway for connectivity, and a VNet in Azure to simulate your on-premises network.
Hub VNet. Azure VNet used as the hub in the hub-spoke topology. The hub is the central point of connectivity to your on-premises network, and a place to host services that can be consumed by the different workloads hosted in the spoke VNets.
Gateway subnet. The virtual network gateways are held in the same subnet.
Shared services subnet. A subnet in the hub VNet used to host services that can be shared among all spokes, such as DNS or AD DS.
Spoke VNets. One or more Azure VNets that are used as spokes in the hub-spoke topology. Spokes can be used to isolate workloads in their own VNets, managed separately from other spokes. Each workload might include multiple tiers, with multiple subnets connected through Azure load balancers. For more information about the application infrastructure, see Running Windows VM workloads and Running Linux VM workloads.
VNet peering. Two VNets in the same Azure region can be connected using a peering connection. Peering connections are non-transitive, low latency connections between VNets. Once peered, the VNets exchange traffic by using the Azure backbone, without the need for a router. In a hub-spoke network topology, you use VNet peering to connect the hub to each spoke.
(Diagram: an on-premises network with backend systems, DNS services and a VPN device connects through a gateway subnet to a hub VNet containing perimeter services, a cloud access point and a shared services subnet, with peering connections to Spoke1 and Spoke2 VNets hosting web tiers.)
Recommendations
• The hub VNet, and each spoke VNet, can be implemented in different resource groups, and even different subscriptions, as long as they belong to the same Azure Active Directory (Azure AD) tenant in the same Azure region. This allows for decentralized management of each workload, while sharing services maintained in the hub VNet.
• A hub-spoke topology can be used without a gateway, if you don't need connectivity with your on-premises network.
• If you require connectivity between spokes, consider implementing an NVA for routing in the hub, and using user defined routes (UDRs) in the spokes to forward traffic to the hub.
• Make sure to consider the limitation on the number of VNet peerings per VNet in Azure. If you need more spokes than this limit, consider creating a hub-spoke-hub-spoke topology, where the first level of spokes also act as hubs.
• Consider what services are shared in the hub, to ensure the hub scales to the number of spokes.
Network DMZ
These reference architectures show proven practices for
creating a network DMZ that protects the boundary between an
Azure virtual network and an on-premises network or the Internet.
Network virtual appliance (NVA). An NVA is a VM that provides network functionality, such as allowing or denying access as a firewall, optimizing wide area network (WAN) operations (including network compression), custom routing, or other network functionality.
Web tier, business tier and data tier subnets. Subnets hosting the VMs and services that implement an example 3-tier application running in the cloud. See Running Windows VMs for an N-tier architecture on Azure for more information.
User defined routes. User defined routes define the flow of IP traffic within Azure VNets. NOTE: depending on the requirements of your VPN connection, you can configure Border Gateway Protocol (BGP) routes instead of using UDRs to implement the forwarding rules that direct traffic back through the on-premises network.
Management subnet. This subnet contains VMs that implement management and monitoring capabilities for the components running in the VNet.
(Diagram: a management subnet 10.0.0.128/25 with a jump box and internal load balancer alongside web tier 10.0.1.0/24, business tier 10.0.2.0/24 and data tier 10.0.3.0/24 subnets, each in an availability set and protected by NSGs.)
Recommendations
• Use Role-Based Access Control (RBAC) to manage the resources in your application.
• On-premises traffic passes to the VNet through a virtual network gateway. We recommend an Azure VPN gateway or an Azure ExpressRoute gateway.
• Create a network security group (NSG) for the inbound NVA subnet, and only allow traffic originating from the on-premises network.
• Create NSGs for each subnet to provide a second level of protection in case of an incorrectly configured or disabled NVA.
• Force-tunnel all outbound Internet traffic through your on-premises network using the site-to-site VPN tunnel, and route to the Internet using network address
translation (NAT).
• Don't completely block Internet traffic from the application tiers, as this will prevent these tiers from using Azure PaaS services that rely on public IP addresses, such as VM
diagnostics logging.
• Perform all application and resource monitoring through the jumpbox in the management subnet.
Azure
DMZ between Azure and the Internet
This architecture implements a DMZ (also called a perimeter network) that accepts Internet traffic to an Azure virtual network. It also includes a DMZ that handles traffic from an on-premises network. Network virtual appliances (NVAs) implement security functionality such as firewalls and packet inspection.
Architecture Components
Public IP address (PIP). The IP address of the public endpoint. External users connected to the Internet can access the system through this address.
Network virtual appliance (NVA). This architecture includes a separate pool of NVAs for traffic originating on the Internet.
Azure load balancer. All incoming requests from the Internet pass through the load balancer and are distributed to the NVAs in the public DMZ.
Public DMZ inbound subnet. This subnet accepts requests from the Azure load balancer. Incoming requests are passed to one of the NVAs in the public DMZ.
Public DMZ outbound subnet. Requests that are approved by the NVA pass through this subnet to the internal load balancer for the web tier.
(Diagram: Internet requests arrive at a public IP and Azure load balancer, pass through the public DMZ inbound and outbound subnets to the web tier 10.0.1.0/24, business tier 10.0.2.0/24 and data tier 10.0.3.0/24; on-premises traffic enters through the gateway subnet 10.0.255.224/27 and a private DMZ; a management subnet 10.0.0.128/25 hosts a jump box.)
Recommendations
• Use one set of NVAs for traffic originating on the Internet, and another for
traffic originating on-premises.
• Include a layer-7 NVA to terminate application connections at the NVA level
and maintain compatibility with the backend tiers.
• For scalability and availability, deploy the public DMZ NVAs in an availability
set with a load balancer to distribute requests across the NVAs.
• Even if your architecture initially requires a single NVA, we recommend
putting a load balancer in front of the public DMZ from the beginning. That
makes it easier to scale to multiple NVAs in the future.
• Log all incoming requests on all ports. Regularly audit the logs.
Managed web
application
These reference architectures show proven practices for web applications
that use Azure App Service and other managed services in Azure.
deployments into a separate App Service plan to isolate them from the production version.
• Store configuration settings as app settings. Define the app settings in your Resource Manager templates, or using PowerShell. Never check passwords, access keys, or
connection strings into source control. Instead, pass these as parameters to a deployment script that stores these values as app settings.
• Consider using App Service authentication to implement authentication with an identity provider such as Azure Active Directory, Facebook, Google, or Twitter.
• Use SQL Database point-in-time restore to recover from human error by returning the database to an earlier point in time. Use geo-restore to recover from a service
outage by restoring a database from a geo-redundant backup.
Azure
Improved scalability in a web application
This architecture builds on the one shown in "Basic web application" and adds elements to improve scalability and performance.
Architecture Components
Resource group. A resource group is a logical container for Azure resources.
Web app and API app. A typical modern application might include both a website and one or more RESTful web APIs. A web API might be consumed by browser clients through AJAX, by native client applications, or by server-side applications. For considerations on designing web APIs, see API design guidance.
WebJob. Use Azure WebJobs to run long-running tasks in the background. WebJobs can run on a schedule, continuously, or in response to a trigger, such as putting a message on a queue. A WebJob runs as a background process in the context of an App Service app.
Queue. In the architecture shown here, the application queues background tasks by putting a message onto an Azure Queue storage queue. The message triggers a function in the WebJob. Alternatively, you can use Service Bus queues. For a comparison, see Azure Queues and Service Bus queues - compared and contrasted.
Cache. Store semi-static data in Azure Redis Cache.
CDN. Use Azure Content Delivery Network (CDN) to cache publicly available content for lower latency and faster delivery of content.
Data storage. Use Azure SQL Database for relational data. For non-relational data, consider a NoSQL store, such as Cosmos DB.
Azure Search. Use Azure Search to add search functionality such as search suggestions, fuzzy search, and language-specific search. Azure Search is typically used in conjunction with another data store, especially if the primary data store requires strict consistency. In this approach, store authoritative data in the other data store and the search index in Azure Search. Azure Search can also be used to consolidate a single search index from multiple data stores.
Email/SMS. Use a third-party service such as SendGrid or Twilio to send email or SMS messages instead of building this functionality directly into the application.
(Diagram: a web front end with Azure Active Directory authentication, an App Service plan hosting a web app, API app and WebJob, Redis Cache, a CDN with edge servers, storage accounts for logs, queues, static content and blobs, SQL Database, Document DB and Azure Search.)
Recommendations
• Use Azure WebJobs to run long-running tasks in the background. WebJobs can run on a schedule, continuously, or in response to a trigger, such as putting a message on a queue. (A brief WebJob sketch follows this architecture.)
• Consider deploying resource intensive WebJobs to a separate App Service plan. This provides dedicated instances for the WebJob.
• Use Azure Redis Cache to cache semi-static transaction data, session state, and HTML output.
• Use Azure CDN to cache static content. The main benefit of a CDN is to reduce latency for users, because content is cached at an edge server that is geographically close to the user. CDN can also reduce load on the application, because that traffic is not being handled by the application.
• Choose a data store based on application requirements. Depending on the scenario, you might use multiple stores. For more guidance, see Choose the right data store.
• If you are using Azure SQL Database, consider using elastic pools. Elastic pools enable you to manage and scale multiple databases that have varying and unpredictable usage demands.
• Also consider using Elastic Database tools to shard the database. Sharding allows you to scale out the database horizontally.
• Use Transparent Data Encryption to encrypt data at rest in Azure SQL Database.
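The queue-triggered WebJob described above can be illustrated with a short sketch based on the Azure WebJobs SDK of that era; the queue name and function are illustrative, not part of the reference architecture.

using System.IO;
using Microsoft.Azure.WebJobs;

// Minimal sketch of a queue-triggered WebJob function.
// The queue name "background-tasks" is illustrative.
public class Functions
{
    // Runs whenever a new message appears on the named Azure storage queue.
    public static void ProcessQueueMessage(
        [QueueTrigger("background-tasks")] string message,
        TextWriter log)
    {
        log.WriteLine("Processing background task: " + message);
    }
}

public class Program
{
    public static void Main()
    {
        // The JobHost discovers the functions in this assembly and listens for triggers.
        var host = new JobHost();
        host.RunAndBlock();
    }
}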
Azure
Run a web application in multiple regions
This architecture shows a web application running on Azure App Service in two regions to achieve high availability. If an outage occurs in the primary region, the application can fail over to the secondary region.
Architecture Components
Primary and secondary regions. This architecture uses two regions to achieve higher availability. The application is deployed to each region. During normal operations, network traffic is routed to the primary region. If the primary region becomes unavailable, traffic is routed to the secondary region.
Azure Traffic Manager. Traffic Manager routes incoming requests to the primary region. If the application running in that region becomes unavailable, Traffic Manager fails over to the secondary region.
Geo-replication of SQL Database and Cosmos DB.
(Diagram: Internet traffic enters Traffic Manager, which routes to an active deployment in region 1 and a standby deployment in region 2; each region contains an App Service plan with a web app, API app and WebJob, Redis Cache, storage accounts and Azure Search, with geo-replication of SQL Database and Cosmos DB to a read-only replica in region 2.)
A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use Traffic Manager to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.
There are several general approaches to achieving high availability across regions:
• Active/passive with hot standby. Traffic goes to one region, while the other waits on hot standby. Hot standby means the VMs in the secondary region are allocated and running at all times.
• Active/passive with cold standby. Traffic goes to one region, while the other waits on cold standby. Cold standby means the VMs in the secondary region are not allocated until needed for failover. This approach costs less to run, but will generally take longer to come online during a failure.
• Active/active. Both regions are active, and requests are load balanced between them. If one region becomes unavailable, it is taken out of rotation.
This reference architecture focuses on active/passive with hot standby, using Traffic Manager for failover.
Recommendations
• Each Azure region is paired with another region within the same geography. In general, choose regions from the same regional pair. If there is a broad outage, recovery of at least one region out of every pair is prioritized.
• Configure Traffic Manager to use priority routing, which routes all requests to the primary region. If the primary region becomes unreachable, Traffic Manager automatically fails over to the secondary region.
• Traffic Manager uses an HTTP (or HTTPS) probe to monitor the availability of each region. Create a health probe endpoint that reports the overall health of the application.
• Traffic Manager is a possible failure point in the system. Review the Traffic Manager SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a failback.
• For Azure SQL Database, use Active Geo-Replication to create a readable secondary replica in a different region. Fail over to a secondary database if your primary database fails or needs to be taken offline.
• Cosmos DB also supports geo-replication across regions. One region is designated as writable and the others are read-only replicas. If there is a regional outage, you can fail over by selecting another region to be the write region.
• For Azure Storage, use read-access geo-redundant storage (RA-GRS).
Running Linux
VM workloads
These reference architectures show proven practices for running Linux
VMs in Azure.
• Reserve a static IP address if you need a fixed IP address that won't change — for example, if you need to create an A record in DNS, or need the IP address to be added
to a safe list.
• For higher availability, deploy multiple VMs behind a load balancer. See the load-balanced VMs reference architecture, described below.
• Use Azure Security Center to get a central view of the security state of your Azure resources. Security Center monitors potential security issues and provides a comprehensive picture of the security health of your deployment.
• Consider Azure Disk Encryption if you need to encrypt the OS and data disks.
Diagnostics. Diagnostic logging is crucial for managing and troubleshooting the VM.
Run load-balanced VMs for scalability and availability

This architecture shows several Linux virtual machines (VMs) running behind a load balancer, to improve availability and scalability. This architecture can be used for any stateless workload, such as a web server, and is a building block for deploying N-tier applications.

[Figure: a virtual network with a subnet containing an availability set (or VM scale set) of VMs, each with VHDs in a storage account; an Azure load balancer with a public IP address in front of the VMs; and a storage account for diagnostics logs.]

Architecture Components

Availability Set
The availability set contains the VMs, making the VMs eligible for the availability service level agreement (SLA) for Azure VMs. For the SLA to apply, the availability set must include a minimum of two VMs. Availability sets are implicit in scale sets. If you create VMs outside a scale set, you need to create the availability set independently.

Virtual Network (VNet) and Subnet
Every VM in Azure is deployed into a VNet that is further divided into subnets.

Azure Load Balancer
The load balancer distributes incoming Internet requests to the VM instances. The load balancer includes some related resources:

Public IP Address
A public IP address is needed for the load balancer to receive Internet traffic.

Front-end Configuration
Associates the public IP address with the load balancer.

Back-end Address Pool
Contains the network interfaces (NICs) for the VMs that will receive the incoming traffic.

Load Balancer Rules
Used to distribute network traffic among all the VMs in the back-end address pool.

Network Address Translation (NAT) Rules
Used to route traffic to a specific VM. For example, to enable secure shell (SSH) access to the VMs, create a separate NAT rule for each VM.

Storage
If you are not using managed disks, storage accounts hold the VM images and other file-related resources, such as VM diagnostic data captured by Azure.

VM Scale Set
A VM scale set is a set of identical VMs used to host a workload. Scale sets allow the number of VMs to be scaled in or out manually, or based on predefined rules.

Recommendations
• Consider using a VM scale set if you need to quickly scale out VMs, or need to autoscale. If you don't use a scale set, place the VMs in an availability set.
• Use managed disks, which do not require a storage account. You simply specify the size and type of disk and it is deployed in a highly available way.
• Place the VMs within the same subnet. Do not expose the VMs directly to the Internet, but instead give each VM a private IP address. Clients connect using the public IP address of the load balancer.
• For incoming Internet traffic, the load balancer rules define which traffic can reach the back end. However, load balancer rules don't support IP whitelisting, so if you want
to add certain public IP addresses to a whitelist, add an NSG to the subnet.
• The load balancer uses health probes to monitor the availability of VM instances. If your VMs run an HTTP server, create an HTTP probe. Otherwise create a TCP probe.
• Virtual networks are a traffic isolation boundary in Azure. VMs in one VNet cannot communicate directly to VMs in a different VNet. VMs within the same VNet can
communicate, unless you create network security groups (NSGs) to restrict traffic.
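The scale set and health probe recommendations above can be sketched with the Azure CLI as follows; the resource names are hypothetical, and the probe assumes the VMs serve HTTP on port 80.

    # Create a scale set of two Linux VMs behind a new load balancer.
    az vmss create --resource-group myRG --name myScaleSet \
      --image UbuntuLTS --instance-count 2 \
      --admin-username azureuser --generate-ssh-keys \
      --upgrade-policy-mode automatic \
      --lb myLB --backend-pool-name myBackendPool

    # HTTP health probe, since these VMs run an HTTP server.
    az network lb probe create --resource-group myRG --lb-name myLB \
      --name http-health --protocol Http --port 80 --path /

    # Load-balancing rule that distributes port 80 traffic across the pool.
    az network lb rule create --resource-group myRG --lb-name myLB \
      --name http-rule --protocol Tcp --frontend-port 80 --backend-port 80 \
      --probe-name http-health --backend-pool-name myBackendPool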
Run Linux VMs for an N-tier application

This architecture shows how to deploy Linux virtual machines (VMs) to run an N-tier application in Azure. For the data tier, this architecture shows Apache Cassandra, which provides replication and failover. However, you could easily replace Cassandra in this architecture with another database, such as SQL Server.

[Figure: a virtual network with web, business, and data tier subnets, each containing an availability set of VMs protected by an NSG; a Cassandra cluster in the data tier; an Azure load balancer in front of the web tier; and a management subnet with a DevOps jumpbox.]

Architecture Components

Availability Sets
Create an availability set for each tier, and provision at least two VMs in each tier. This makes the VMs eligible for a higher service level agreement (SLA) for VMs.

Subnets
Create a separate subnet for each tier. Specify the address range and subnet mask using CIDR notation.

Load Balancers
Use an Internet-facing load balancer to distribute incoming Internet traffic to the web tier, and an internal load balancer to distribute network traffic from the web tier to the business tier.

Jumpbox
Also called a bastion host. A secure VM on the network that administrators use to connect to the other VMs. The jumpbox has an NSG that allows remote traffic only from public IP addresses on a safe list. The NSG should permit secure shell (SSH) traffic.

Monitoring
Monitoring software such as Nagios, Zabbix, or Icinga can give you insight into response time, VM uptime, and the overall health of your system. Install the monitoring software on a VM that's placed in a separate management subnet.

NSGs
Use network security groups (NSGs) to restrict network traffic within the VNet. For example, in the 3-tier architecture shown here, the database tier does not accept traffic from the web front end, only from the business tier and the management subnet.

Apache Cassandra Database
Provides high availability at the data tier, by enabling replication and failover.
Recommendations
• When you create the VNet, determine how many IP addresses your resources in each subnet require.
• Choose an address range that does not overlap with your on-premises network, in case you need to set up a gateway between the VNet and your on-premises network later.
• Design subnets with functionality and security requirements in mind. All VMs within the same tier or role should go into the same subnet, which can be a security boundary.
• Use NSG rules to restrict traffic between tiers. For example, in the 3-tier architecture shown above, the web tier should not communicate directly with the database tier.
• Do not allow SSH access from the public Internet to the VMs that run the application workload. Instead, all SSH access to these VMs must come through the jumpbox.
• The load balancers distribute network traffic to the web and business tiers. Scale horizontally by adding new VM instances. Note that you can scale the web and business
tiers independently, based on load.
• At the database tier, having multiple VMs does not automatically translate into a highly available database. For a relational database, you will typically need to use replication
and failover to achieve high availability.
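As an illustration of the NSG recommendation above, the following Azure CLI sketch allows Cassandra traffic (CQL port 9042) into the database subnet from the business tier only, and denies traffic from the web tier. The subnet address ranges, rule names, and NSG name are hypothetical placeholders.

    # Allow the business tier subnet to reach Cassandra's CQL port.
    az network nsg rule create --resource-group myRG --nsg-name database-nsg \
      --name allow-business-tier --priority 100 --direction Inbound \
      --access Allow --protocol Tcp \
      --source-address-prefixes 10.0.2.0/24 --destination-port-ranges 9042

    # Deny all inbound traffic from the web tier subnet.
    az network nsg rule create --resource-group myRG --nsg-name database-nsg \
      --name deny-web-tier --priority 200 --direction Inbound \
      --access Deny --protocol '*' \
      --source-address-prefixes 10.0.1.0/24 --destination-port-ranges '*'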
Run Linux VMs in multiple regions for high availability

This architecture shows an N-tier application deployed in two Azure regions. This architecture can provide higher availability than a single region. If an outage occurs in the primary region, the application can fail over to the secondary region. However, you must consider issues such as data replication and managing failover.

[Figure: the web, business, and Cassandra tiers duplicated across two regions; each region has its own VNet and jumpbox.]

Architecture Components

Primary and Secondary Regions
Use two regions to achieve higher availability. One is the primary region. The other region is for failover.

Azure Traffic Manager
Traffic Manager routes incoming requests to one of the regions. During normal operations, it routes requests to the primary region. If that region becomes unavailable, Traffic Manager fails over to the secondary region. For more information, see the section Traffic Manager configuration.

Resource Groups
Create separate resource groups for the primary region, the secondary region, and for Traffic Manager. This gives you the flexibility to manage each region as a single collection of resources. For example, you could redeploy one region, without taking down the other one. Link the resource groups, so that you can run a query to list all the resources for the application.

VNets
Create a separate VNet for each region. Make sure the address spaces do not overlap.

Apache Cassandra
Deploy Cassandra in data centers across Azure regions for high availability. Within each region, nodes are configured in rack-aware mode with fault and upgrade domains, for resiliency inside the region.
Recommendations
• Each Azure region is paired with another region within the same geography. In general, choose regions from the same regional pair. If there is a broad outage, recovery of at
least one region out of every pair is prioritized.
• Configure Traffic Manager to use priority routing, which routes all requests to the primary region. If the primary region becomes unreachable, Traffic Manager automatically
fails over to the secondary region.
• If Traffic Manager fails over, we recommend performing a manual failback rather than implementing an automatic failback. Verify that all application subsystems are healthy
before failing back.
• Traffic Manager uses an HTTP (or HTTPS) probe to monitor the availability of each region. Create a health probe endpoint that reports the overall health of the application.
• Traffic Manager is a possible failure point in the system. Review the Traffic Manager SLA, and determine whether using Traffic Manager alone meets your business
requirements for high availability. If not, consider adding another traffic management solution as a failback.
• For the data tier, this architecture shows Apache Cassandra for data replication and failover. Other database systems have similar functionality.
• When you update your deployment, update one region at a time to reduce the chance of a global failure from an incorrect configuration or an error in the application.
• Test the resiliency of the system to failures. Measure the recovery times and verify they meet your business requirements.
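One way to rehearse the failover and manual-failback recommendations above is to toggle the Traffic Manager endpoints with the Azure CLI, as in this hypothetical sketch (the profile and endpoint names are placeholders):

    # Simulate a primary-region outage: traffic shifts to the secondary region.
    az network traffic-manager endpoint update \
      --resource-group myRG --profile-name myTrafficManager \
      --name primary --type azureEndpoints --endpoint-status Disabled

    # Manual failback after verifying all subsystems are healthy.
    az network traffic-manager endpoint update \
      --resource-group myRG --profile-name myTrafficManager \
      --name primary --type azureEndpoints --endpoint-status Enabled

Measuring how long requests take to shift between regions during this exercise gives you recovery times to compare against your business requirements.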
Running
Windows VM
workloads
These reference architectures show proven practices for running
Windows VMs in Azure.
[Figure: a single Windows VM with a public IP address, OS and data disk VHDs in a storage account, a temporary disk on a physical SSD on the host, and a storage account for diagnostics logs.]

Temporary Disk
The temporary disk is stored on a physical SSD on the host machine. It may be deleted during reboots and other VM lifecycle events. Use this disk only for temporary data, such as page or swap files.

Data Disk
A data disk is a persistent VHD used for application data. Data disks are stored in Azure Storage, like the OS disk.

Virtual Network (VNet) and Subnet
Every VM in Azure is deployed into a VNet that is further divided into subnets.

Public IP Address
A public IP address is needed to communicate with the VM, for example over remote desktop (RDP).

Network Interface (NIC)
The NIC enables the VM to communicate with the virtual network.

Network Security Group (NSG)
The NSG is used to allow/deny network traffic to the subnet. You can associate an NSG with an individual NIC or with a subnet. If you associate it with a subnet, the NSG rules apply to all VMs in that subnet.

Diagnostics
Diagnostic logging is crucial for managing and troubleshooting the VM.

Recommendations
• For best disk I/O performance, we recommend Premium Storage, which stores data on solid-state drives (SSDs).
• Use managed disks, which do not require a storage account. You simply specify the size and type of disk and it is deployed in a highly available way.
• Attach a data disk for persistent storage of application data.
• Enable monitoring and diagnostics, including health metrics, diagnostics infrastructure logs, and boot diagnostics.
• Add an NSG to the subnet to allow/deny network traffic to the subnet. To enable remote desktop (RDP), add a rule to the NSG that allows inbound traffic to TCP port 3389.
• Reserve a static IP address if you need a fixed IP address that won't change, for example if you need to create an A record in DNS, or need the IP address to be added to a safe list.
• For higher availability, deploy multiple VMs behind a load balancer. See the load-balanced VMs reference architecture below.
• Use Azure Security Center to get a central view of the security state of your Azure resources. Security Center monitors potential security issues and provides a comprehensive picture of the security health of your deployment.
• Consider Azure Disk Encryption if you need to encrypt the OS and data disks.
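Two of the recommendations above, the RDP rule and the persistent data disk, can be sketched with the Azure CLI as follows; all names and the disk size are hypothetical.

    # NSG rule allowing inbound RDP on TCP port 3389.
    az network nsg rule create --resource-group myRG --nsg-name myNSG \
      --name allow-rdp --priority 100 --direction Inbound \
      --access Allow --protocol Tcp --destination-port-ranges 3389

    # Attach a new managed data disk for persistent application data.
    az vm disk attach --resource-group myRG --vm-name myWindowsVM \
      --name myDataDisk --new --size-gb 128 --sku Premium_LRS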
Run load-balanced VMs for scalability and availability

This architecture shows several Windows virtual machines (VMs) running behind a load balancer, to improve availability and scalability. This architecture can be used for any stateless workload, such as a web server, and is a building block for deploying N-tier applications.

[Figure: a virtual network with a subnet containing an availability set (or VM scale set) of VMs, each with VHDs in a storage account; an Azure load balancer with a public IP address in front of the VMs; and a storage account for diagnostics logs.]

Architecture Components

Resource Group
Resource groups are used to group resources so they can be managed by lifetime, owner, and other criteria.

Virtual Network (VNet) and Subnet
Every VM in Azure is deployed into a VNet that is further divided into subnets.

Azure Load Balancer
The load balancer distributes incoming Internet requests to the VM instances. The load balancer includes some related resources:

Public IP Address
A public IP address is needed for the load balancer to receive Internet traffic.

Front-end Configuration
Associates the public IP address with the load balancer.

Back-end Address Pool
Contains the network interfaces (NICs) for the VMs that will receive the incoming traffic.

Load Balancer Rules
Used to distribute network traffic among all the VMs in the back-end address pool.

VM Scale Set
A VM scale set is a set of identical VMs used to host a workload. Scale sets allow the number of VMs to be scaled in or out manually, or based on predefined rules.

Availability Set
The availability set contains the VMs, making the VMs eligible for the availability service level agreement (SLA) for Azure VMs. For the SLA to apply, the availability set must include a minimum of two VMs. Availability sets are implicit in scale sets. If you create VMs outside a scale set, you need to create the availability set independently.

Storage
If you are not using managed disks, storage accounts hold the VM images and other file-related resources, such as VM diagnostic data captured by Azure.

Recommendations
• Consider using a VM scale set if you need to quickly scale out VMs, or need to autoscale. If you don't use a scale set, place the VMs in an availability set.
• Use managed disks, which do not require a storage account. You simply specify the size and type of disk and it is deployed in a highly available way.
• Place the VMs within the same subnet. Do not expose the VMs directly to the Internet, but instead give each VM a private IP address. Clients connect using the public IP address of the load balancer.
• For incoming Internet traffic, the load balancer rules define which traffic can reach the back end. However, load balancer rules don't support IP whitelisting, so if you want
to add certain public IP addresses to a whitelist, add an NSG to the subnet.
• The load balancer uses health probes to monitor the availability of VM instances. If your VMs run an HTTP server, create an HTTP probe. Otherwise create a TCP probe.
• Virtual networks are a traffic isolation boundary in Azure. VMs in one VNet cannot communicate directly to VMs in a different VNet. VMs within the same VNet can
communicate, unless you create network security groups (NSGs) to restrict traffic.
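The health probe recommendation above can be illustrated with the Azure CLI; the names below are placeholders, and the /healthz path assumes the VMs expose an HTTP health endpoint.

    # HTTP probe for VMs that run an HTTP server...
    az network lb probe create --resource-group myRG --lb-name myLB \
      --name http-health --protocol Http --port 80 --path /healthz

    # ...or a TCP probe for workloads that do not speak HTTP.
    az network lb probe create --resource-group myRG --lb-name myLB \
      --name tcp-health --protocol Tcp --port 443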
Run Windows VMs for an N-tier application

This architecture shows how to deploy Windows virtual machines (VMs) to run an N-tier application in Azure. For the data tier, this architecture uses SQL Server Always On Availability Groups, which provide replication and failover.

[Figure: a virtual network with web, business, and data tier subnets, each with an NSG and an availability set; SQL Server primary and secondary replicas with a file share witness; a management subnet with a DevOps jumpbox; and an Active Directory subnet with AD DS servers.]

Architecture Components

Availability Sets
Create an availability set for each tier, and provision at least two VMs in each tier. This makes the VMs eligible for a higher service level agreement (SLA) for VMs.

Subnets
Create a separate subnet for each tier. Specify the address range and subnet mask using CIDR notation.

Load Balancers
Use an Internet-facing load balancer to distribute incoming Internet traffic to the web tier, and an internal load balancer to distribute network traffic from the web tier to the business tier.

Jumpbox
Also called a bastion host. A secure VM on the network that administrators use to connect to the other VMs. The jumpbox has an NSG that allows remote traffic only from public IP addresses on a safe list. The NSG should permit remote desktop (RDP) traffic.

Monitoring
Monitoring software such as Nagios, Zabbix, or Icinga can give you insight into response time, VM uptime, and the overall health of your system. Install the monitoring software on a VM that's placed in a separate management subnet.

NSGs
Use network security groups (NSGs) to restrict network traffic within the VNet. For example, in the 3-tier architecture shown here, the database tier does not accept traffic from the web front end, only from the business tier and the management subnet.

SQL Server Always On Availability Group
Provides high availability at the data tier, by enabling replication and failover.

Active Directory Domain Services (AD DS) Servers
Prior to Windows Server 2016, SQL Server Always On Availability Groups must be joined to a domain. This is because Availability Groups depend on Windows Server Failover Cluster (WSFC) technology. Windows Server 2016 introduces the ability to create a Failover Cluster without Active Directory, in which case the AD DS servers are not required for this architecture. For more information, see What's new in Failover Clustering in Windows Server 2016.

Recommendations
• When you create the VNet, determine how many IP addresses your resources in each subnet require.
• Choose an address range that does not overlap with your on-premises network, in case you need to set up a gateway between the VNet and your on-premises network later.
• Design subnets with functionality and security requirements in mind. All VMs within the same tier or role should go into the same subnet, which can be a security boundary.
• Use NSG rules to restrict traffic between tiers. For example, in the 3-tier architecture shown above, the web tier should not communicate directly with the database tier.
• Do not allow remote desktop (RDP) access from the public Internet to the VMs that run the application workload. Instead, all RDP access to these VMs must come through the jumpbox.
• The load balancers distribute network traffic to the web and business tiers. Scale horizontally by adding new VM instances. Note that you can scale the web and business
tiers independently, based on load.
• We recommend Always On Availability Groups for SQL Server high availability. When a SQL client tries to connect, the load balancer routes the connection request to the
primary replica. If there is a failover to another replica, the load balancer automatically starts routing requests to the new primary replica.
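To make the listener routing concrete, this hedged Azure CLI sketch configures an internal load balancer rule for the availability group listener. Floating IP is enabled so connections reach the primary replica directly, and the probe port (59999 here, a common convention) must match the port configured on the Windows Server Failover Cluster; all names are hypothetical.

    # Health probe that the cluster answers on the configured probe port.
    az network lb probe create --resource-group myRG --lb-name sql-ilb \
      --name sql-probe --protocol Tcp --port 59999

    # Rule that forwards SQL traffic (port 1433) to the primary replica.
    az network lb rule create --resource-group myRG --lb-name sql-ilb \
      --name sql-listener --protocol Tcp \
      --frontend-port 1433 --backend-port 1433 \
      --probe-name sql-probe --backend-pool-name sql-backend \
      --floating-ip true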
Run Windows VMs in multiple regions for high availability

This architecture shows an N-tier application deployed in two Azure regions. This architecture can provide higher availability than a single region. If an outage occurs in the primary region, the application can fail over to the secondary region. However, you must consider issues such as data replication and managing failover.

[Figure: web and business tiers and a SQL Server Always On availability group deployed in two regions; each region has its own jumpbox, Active Directory servers, and a gateway subnet with a VPN gateway.]

Architecture Components

Primary and Secondary Regions
Use two regions to achieve higher availability. One is the primary region. The other region is for failover.

Azure Traffic Manager
Traffic Manager routes incoming requests to one of the regions. During normal operations, it routes requests to the primary region. If that region becomes unavailable, Traffic Manager fails over to the secondary region. For more information, see the section Traffic Manager configuration.

Resource Groups
Create separate resource groups for the primary region, the secondary region, and for Traffic Manager. This gives you the flexibility to manage each region as a single collection of resources. For example, you could redeploy one region, without taking down the other one. Link the resource groups, so that you can run a query to list all the resources for the application.

VNets
Create a separate VNet for each region. Make sure the address spaces do not overlap.

SQL Server Always On Availability Group
If you are using SQL Server, we recommend SQL Always On Availability Groups for high availability. Create a single availability group that includes the SQL Server instances in both regions.

Note
Also consider Azure SQL Database, which provides a relational database as a cloud service. With SQL Database, you don't need to configure an availability group or manage failover.

VPN Gateways
Create a VPN gateway in each VNet, and configure a VNet-to-VNet connection, to enable network traffic between the two VNets. This is required for the SQL Always On Availability Group.

Recommendations
• Each Azure region is paired with another region within the same geography. In general, choose regions from the same regional pair. If there is a broad outage, recovery of at least one region out of every pair is prioritized.
• Configure Traffic Manager to use priority routing, which routes all requests to the primary region. If the primary region becomes unreachable, Traffic Manager automatically fails over to the secondary region.
• If Traffic Manager fails over, we recommend performing a manual failback rather than implementing an automatic failback. Verify that all application subsystems are healthy before failing back.
• Traffic Manager uses an HTTP (or HTTPS) probe to monitor the availability of each region. Create a health probe endpoint that reports the overall health of the application.
• Traffic Manager is a possible failure point in the system. Review the Traffic Manager SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If not,
consider adding another traffic management solution as a failback.
• Create a SQL Server Always On Availability Group that includes the SQL Server instances in both the primary and secondary regions. Configure the replicas in the secondary region to use asynchronous
commit, for performance reasons.
• If all of the SQL Server database replicas in the primary region fail, you can manually fail over the availability group. With forced failover, there is a risk of data loss. Once the primary region is back online,
take a snapshot of the database and use tablediff to find the differences.
• When you update your deployment, update one region at a time to reduce the chance of a global failure from an incorrect configuration or an error in the application.
• Test the resiliency of the system to failures. Measure the recovery times and verify they meet your business requirements.
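The VNet-to-VNet connection described above can be sketched with the Azure CLI. Gateway creation is a long-running operation, and the connection must be created in both directions with the same shared key. All resource names are hypothetical, the public IP addresses are assumed to exist already, and if the gateways live in different resource groups, pass the full resource ID to --vnet-gateway2.

    # One VPN gateway per region (each takes a while to provision).
    az network vnet-gateway create --resource-group rg-primary --name gw-primary \
      --vnet vnet-primary --public-ip-address gw-primary-ip \
      --gateway-type Vpn --vpn-type RouteBased --sku VpnGw1

    az network vnet-gateway create --resource-group rg-secondary --name gw-secondary \
      --vnet vnet-secondary --public-ip-address gw-secondary-ip \
      --gateway-type Vpn --vpn-type RouteBased --sku VpnGw1

    # Connect the gateways in both directions with a shared key.
    az network vpn-connection create --resource-group rg-primary \
      --name primary-to-secondary --vnet-gateway1 gw-primary \
      --vnet-gateway2 "$GW_SECONDARY_ID" --shared-key "$SHARED_KEY"

    az network vpn-connection create --resource-group rg-secondary \
      --name secondary-to-primary --vnet-gateway1 gw-secondary \
      --vnet-gateway2 "$GW_PRIMARY_ID" --shared-key "$SHARED_KEY"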