As organisations begin deploying autonomous AI agents into real operational roles, one question matters more than almost any other:
What happens when something goes wrong?
Not just bugs or bad prompts, but outages, regional failures, or service disruptions.
This is where BCDR (Business Continuity and Disaster Recovery) becomes critical, especially in the context of Azure AI Foundry Agent Services.
First: Why BCDR Matters More for Agents Than Traditional Apps
Traditional applications can often tolerate downtime:
- A system is unavailable
- Users wait
- Work resumes later
Agents are different.
Agents:
- Hold state
- Track context
- Make decisions over time
- Operate autonomously
- Interact with other systems continuously
If an agent loses its state, it doesn’t just “pause”. It forgets what it was doing, why, and where it was up to.
That’s not just an outage. That’s operational risk.
The Core Design Principle in Azure AI Foundry
Azure AI Foundry Agent Services deliberately separate:
- The agent runtime (Microsoft-managed)
- The agent state (customer-managed)
This is an important architectural choice.
Instead of storing agent memory and operational context inside a shared platform service, you provide and control the data store that holds the agent’s state.
That data store is Azure Cosmos DB.
What “BCDR for Agents” Actually Means
In practical terms:
- You provision your own single-tenant Azure Cosmos DB account
- All agent state is stored there
- You control Backup policies, Recovery options, Regional replication & Retention
- The agent service connects to that database at runtime
This design puts resilience and recoverability in your hands, not behind a black box.
What Happens During a Regional Outage?
This is the key scenario leaders care about.
If the primary Azure region becomes unavailable:
- Azure AI Foundry Agent Service automatically connects to
- The same Cosmos DB account
- In the secondary region
Because Cosmos DB preserves:
- Full history
- State transitions
- Agent memory
…the agent can continue operating with minimal disruption.
No rebuilding. No rehydrating context. No starting from scratch.
From a business perspective, this means:
- Lower downtime
- Lower operational risk
- Higher confidence in automation-led processes
What Is Azure Cosmos DB?
Azure Cosmos DB is Microsoft’s globally distributed, multi-model database designed for applications that need high availability, low latency, and elastic scale. Unlike traditional databases that are tied to a single region or server, Cosmos DB can automatically replicate data across multiple Azure regions and handle failover without manual intervention. From a business perspective, it’s built for always-on systems (the kind that can’t afford to lose state, context, or history when something fails). This makes it a natural fit for autonomous agents, where preserving memory and operational continuity is critical.
Why Cosmos DB Is Central to Agent Resilience
Cosmos DB isn’t just a technical dependency, it’s the source of truth for agent behaviour.
It stores things like:
- Conversation state
- Task progress
- Decision context
- Execution history
- Memory across interactions
By using Cosmos DB’s native capabilities, you gain:
- Multi-region replication
- Automatic failover
- Point-in-time restore
- Configurable backup policies
In other words, BCDR for agents is really about BCDR for agent memory.
Data Residency and Regional Control
A common concern when discussing BCDR for AI agents is where data physically lives. In Azure AI Foundry, agent services do not automatically move or replicate your data across geographies. All agent state is stored in your customer-provisioned Azure Cosmos DB account, and you explicitly control which Azure regions are used. If your organisation requires data to remain on Australian shores, you can configure Cosmos DB to operate solely within Australian regions. Cross-region or cross-geography replication is a deliberate design decision, not a platform default. This means data residency, sovereignty, and regulatory obligations remain fully under your control, with resilience aligned to your organisation’s risk appetite.
When data residency is restricted to Australia only, resilience is still achieved within the Australian geography, not outside it. Azure Cosmos DB can be configured to use multiple Australian regions (for example, Australia East and Australia Southeast), allowing data to be synchronously or asynchronously replicated while remaining entirely onshore. If one Australian region becomes unavailable, Cosmos DB can fail over to the secondary Australian region, preserving agent state and history without breaching data residency requirements. The trade-off is that resilience is limited to in-country regional availability rather than global redundancy, a decision that should be consciously aligned to business continuity objectives and regulatory constraints.
Your Responsibilities as a Customer (And Why That’s a Good Thing)
This model does introduce responsibility, but it’s intentional.
As an Azure Standard customer, you are expected to:
- Provision and maintain the Cosmos DB account
- Configure backup and recovery policies
- Decide replication and retention strategies
- Align resilience with business criticality
This allows you to:
- Match resilience to workload importance
- Meet regulatory and compliance needs
- Integrate with existing BCDR strategies
- Avoid “one size fits all” platform assumptions
For many enterprises, this is a feature, not a burden.
Networking, Isolation, and Enterprise Readiness
Azure AI Foundry also supports:
- Using your own Azure resources
- Virtual network integration
- Controlled network access patterns
This matters for:
- Data sovereignty
- Security posture
- Regulated environments
- Zero Trust architectures
BCDR isn’t just about recovery, it’s about operating safely and predictably at scale.
The Bigger Picture for Leaders
BCDR for agents signals a broader shift in how AI platforms are being designed:
- Less magic
- More architectural clarity
- Clear ownership boundaries
- Enterprise-grade responsibility models
As agents move from experimentation into production, resilience becomes a leadership concern, not a technical footnote.
If an agent is trusted to act autonomously, it must also be trusted to recover safely when things fail.
Azure AI Foundry’s approach makes that trust explicit, and manageable.
If you’re exploring autonomous agents in your organisation, BCDR isn’t something to “add later”.
It’s part of deciding whether an agent is ready for real work.
And that’s a much more interesting conversation than prompts and demos.
Modern Applications and Power Platform Solutions Architect at Velrada.
Technical Consultant Helping organizations unlock the full potential of their Microsoft efficiency tools.
Feel free to share your thoughts or connect with me to discuss AI or Microsoft efficiencies.


Leave a Reply