<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.sixeyed.com/rss" rel="self" type="application/atom+xml" /><link href="https://blog.sixeyed.com/" rel="alternate" type="text/html" /><updated>2025-10-23T10:55:39+00:00</updated><id>https://blog.sixeyed.com/rss</id><title type="html">Elton’s Blog</title><subtitle>Notes from the field of freelance IT consultant and trainer Elton Stoneman -  15x Microsoft MVP, Docker Captain and author for Pluralsight and Manning.</subtitle><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><entry><title type="html">Why Would You Write a Book About Docker in 2025?</title><link href="https://blog.sixeyed.com/why-would-you-write-a-book-about-docker-in-2025/" rel="alternate" type="text/html" title="Why Would You Write a Book About Docker in 2025?" /><published>2025-10-22T09:00:00+00:00</published><updated>2025-10-22T09:00:00+00:00</updated><id>https://blog.sixeyed.com/why-would-you-write-a-book-about-docker-in-2025</id><content type="html" xml:base="https://blog.sixeyed.com/why-would-you-write-a-book-about-docker-in-2025/"><![CDATA[<h1 id="why-would-you-write-a-book-about-docker-in-2025">Why Would You Write a Book About Docker in 2025?</h1>

<p>Docker is everywhere. It’s the most sensible way to package and run applications. Every cloud platform supports it, every CI/CD pipeline uses it, any laptop can run it, and pretty much every development team has adopted or is adopting it.</p>

<p class="notice--info">So why did I write a second edition of <a href="https://www.manning.com/books/learn-docker-in-a-month-of-lunches-second-edition">Learn Docker in a Month of Lunches</a>?</p>

<p>This is why: most engineers learn Docker on the job. You need to containerize an app, so you cobble together a Dockerfile from Stack Overflow. You need to run multiple containers, so you get Claude to write you a Docker Compose file. It works, you ship it, and you move on. But that doesn’t get you an understanding of how Docker works or what it can do.</p>

<h2 id="the-reality-of-learning-docker-in-production">The Reality of Learning Docker in Production</h2>

<p>I’ve trained hundreds of people on Docker and Kubernetes, and there’s a common pattern. People know enough to get by, but they’re missing the fundamentals that would make their lives easier. They’re running containers without health checks. They’re building 2GB images when they could be 50MB. They’re not using multi-stage builds, or security scanning their images, or understanding how layer caching saves build time and data transfer costs.</p>
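<p>As a sketch of the kind of optimization the book teaches (the Go app and image tags here are illustrative, not taken from the book), a multi-stage build compiles with the full SDK image but ships only the binary:</p>

```dockerfile
# Build stage: full Go toolchain - an image of several hundred MB
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Final stage: minimal runtime - tens of MB, smaller attack surface
FROM alpine:3.20
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```

<p>Only the final stage ends up in the image you ship, and unchanged layers come straight from the build cache, so rebuilds are fast and cheap.</p>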

<p>You might know how to <code class="language-plaintext highlighter-rouge">docker build</code> and <code class="language-plaintext highlighter-rouge">docker run</code>, but do you really understand Docker volumes and why data in containers isn’t permanent? Can you configure application settings across different environments without rebuilding images? Do you know how containers enable advanced patterns like HTTP traffic management with reverse proxies or asynchronous messaging with queues?</p>
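<p>To make the configuration point concrete - this is an illustrative sketch, with made-up service and file names - one image can take different settings per environment without a rebuild:</p>

```yaml
# compose.yml sketch - service, image, and file names are hypothetical
services:
  web:
    image: myorg/web-app:1.0     # the same image in every environment
    env_file: ./config/dev.env   # swap for test.env or prod.env at deploy time
    volumes:
      - app-data:/data           # named volume: data survives container replacement
volumes:
  app-data:
```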

<p><img src="/content/images/2025/10/diamol-async.png" alt="Diagram showing asynchronous messaging architecture using Docker containers with message queues connecting multiple services for event-driven communication patterns" /></p>

<p>Learn Docker in a Month of Lunches (Second Edition) fills those gaps. It walks you through Docker with a practical, hands-on approach, giving you experience in everything from the fundamentals to image optimization and cross-platform delivery. But you don’t have to follow the journey in order - every chapter is independent. Already comfortable with basic Dockerfiles? Jump straight to Chapter 17 on optimizing images for size, speed, and security. Need to understand networking? Chapter 7 walks through Docker Compose and how Docker plugs containers together. Want to finally master volumes on Windows AND Linux? Chapter 6 has you covered.</p>

<h2 id="whats-new-in-the-second-edition">What’s New in the Second Edition</h2>

<p>The first edition came out in 2021, and although the core concepts haven’t changed, the book content is new, with every exercise rewritten and tested for the latest releases. Everything works cross-platform: Linux, Windows, Intel, and ARM. You can follow along on your Apple Silicon Mac, your Windows 11 laptop, or an Ubuntu server you’re running in the cloud. There’s a whole chapter on replatforming legacy Windows apps - because yes, those old .NET Framework applications deserve a new home in containers.</p>

<p>The runtime chapters are a complete refresh, covering all the options you have for running containers in production: Azure Container Apps and Google Cloud Run for serverless containers in the cloud, a primer on Kubernetes, and GitHub Actions for CI/CD.</p>
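<p>As a flavor of that CI/CD coverage (the workflow and image names below are hypothetical, not from the book), a minimal GitHub Actions pipeline really does only need Docker:</p>

```yaml
# .github/workflows/build.yml - minimal sketch with hypothetical names
name: build
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # the Dockerfile encapsulates the whole toolchain, so the runner
      # needs nothing installed beyond Docker itself
      - run: docker build -t myorg/web-app:${{ github.sha }} .
```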

<h2 id="from-basics-to-production">From Basics to Production</h2>

<p>The book’s structured to take you from zero to production-ready. Part 1 covers the fundamentals - understanding containers and images, building multi-stage Dockerfiles for Java, Node.js, and Go apps, and sharing images through registries.</p>

<p>Part 2 gets into the real-world stuff: running distributed applications with Docker Compose, implementing health checks and dependency checks, adding observability with Prometheus and Grafana, and building a proper CI/CD pipeline that only needs Docker.</p>
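<p>For instance - a sketch rather than the book’s exact code - a health check can be declared in the Dockerfile itself, so any platform that runs the container knows how to probe it:</p>

```dockerfile
# HEALTHCHECK sketch - assumes curl is in the image and /health is your endpoint
FROM nginx:1.27
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl --fail http://localhost/health || exit 1
```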

<p>Part 3 shows you how to run containers anywhere - multi-platform builds that work on ARM and Intel, managed container services in Azure and Google Cloud, and yes, Kubernetes.</p>

<p><img src="/content/images/2025/10/diamol-k8s-cluster.png" alt="A multi-platform Kubernetes cluster running Linux and Windows nodes on ARM and Intel architectures across different cloud environments" /></p>

<p>Part 4 is where it gets really interesting - production patterns like configuration management, centralized logging, reverse proxies for traffic control, and message queues for asynchronous communication.</p>

<h2 id="the-practical-approach">The Practical Approach</h2>

<p>Each chapter is a “lunch” - about an hour of focused learning that you can actually complete in a lunch break.</p>

<p>Every topic is grounded in real problems I’ve seen teams struggle with. Application configuration management across environments? Chapter 18. Writing and managing logs properly? Chapter 19. Getting containers production-ready with proper optimization? Chapter 17. These aren’t theoretical exercises - they’re solutions to actual problems you’ll face.</p>

<h2 id="getting-started">Getting Started</h2>

<p>The second edition of Learn Docker in a Month of Lunches is available now from <a href="https://www.manning.com/books/learn-docker-in-a-month-of-lunches-second-edition">Manning</a> and another book-selling website called <a href="https://www.amazon.com//dp/1633438465">Amazon</a>. Whether you’re fixing those knowledge gaps or starting fresh, it’s the practical guide to Docker that focuses on what you actually need to know to be productive.</p>

<p>Docker might be ubiquitous in 2025, but that doesn’t mean everyone’s using it well. This book helps you join the group that is.</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="docker" /><category term="learning" /><category term="books" /><category term="containers" /><summary type="html"><![CDATA[Docker is established tech now - so why would anyone buy a book about it? Because most people only learn a fraction of what Docker can do from their day job, and it pays to learn it all.]]></summary></entry><entry><title type="html">My New SRE Course: Resiliency and Automation (2025)</title><link href="https://blog.sixeyed.com/sre-resiliency-course/" rel="alternate" type="text/html" title="My New SRE Course: Resiliency and Automation (2025)" /><published>2025-08-18T09:00:00+00:00</published><updated>2025-08-18T09:00:00+00:00</updated><id>https://blog.sixeyed.com/sre-resiliency-course</id><content type="html" xml:base="https://blog.sixeyed.com/sre-resiliency-course/"><![CDATA[<h1 id="my-new-sre-course-resiliency-and-automation-">My New SRE Course: Resiliency and Automation 🚀</h1>

<p>My latest Pluralsight course is out!</p>

<p class="notice--info"><a href="/l/ps-sre-resiliency">SRE: Resiliency and Automation</a></p>

<p>It’s the third course in the <a href="/l/ps-sre-path">Site Reliability Engineering learning path</a>, and it’s all about building systems that survive the chaos of production.</p>

<p>This course came from a simple observation: most teams think their systems are reliable because they work perfectly in test environments. But production is hostile. Pods crash, nodes fail, dependencies time out, and cloud services have outages. The question isn’t whether these failures will happen - it’s whether your system will survive them. 💪</p>

<h2 id="sre-resiliency-the-story-">SRE Resiliency: The Story 📖</h2>

<p>The course follows an SRE team that’s had enough. They’re handing back the pager to the development team because the application is consuming their entire toil budget with constant incidents. But this isn’t about blame - it’s about partnership. The SRE team walks the developers through exactly what needs to change before they’ll take operational responsibility back.</p>

<p><img src="/content/images/2025/08/sre-hand-back-pager.png" alt="SRE team handing back the pager to the development team" /></p>

<p>We use this narrative to explore the core practices that transform hope-based reliability into evidence-based confidence. You’ll follow two fictional SREs, Carlos and Keiko, as they help steer the app to production reliability. You’ll see Carlos demonstrating the problems with traditional approaches, then Keiko showing how SRE teams solve these issues at scale.</p>

<h2 id="sre-skills-youll-master">SRE Skills You’ll Master</h2>

<p>The course covers five essential areas that every production system needs:</p>

<p><strong>Architectural Resilience</strong> - You’ll see why synchronous architectures create operational nightmares and how patterns like distributed caching and async messaging provide the graceful degradation that production demands. We take an app that’s failing under normal load and transform it into something that maintains its SLOs.</p>

<p><strong>GitOps and Automation</strong> - Manual deployments don’t scale to multiple releases per day. You’ll learn how Infrastructure as Code with Terraform, application modeling with Helm, and continuous reconciliation with <a href="https://argo-cd.readthedocs.io/">ArgoCD</a> create self-healing systems that fix themselves at 3 AM while you sleep. 😴</p>
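<p>A minimal ArgoCD Application shows the reconciliation idea (the repo URL and paths below are placeholders): ArgoCD continuously compares the cluster to the Git state and reverts any drift.</p>

```yaml
# ArgoCD Application sketch - repo URL and paths are placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/web-app-config
    path: helm/web-app
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: web-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # drift gets reverted automatically - the 3 AM fix
```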

<p><img src="/content/images/2025/08/gitops-argocd-workflow.png" alt="GitOps workflow diagram showing Infrastructure as Code with Terraform, Helm charts, and ArgoCD continuous reconciliation for automated deployments" /></p>

<p><strong>Capacity Planning and Autoscaling</strong> - Pre-production sizing is guesswork. The course shows how to build systems that discover their own capacity needs through horizontal pod autoscaling, cluster autoscaling, and <a href="https://keda.sh/">KEDA</a>. Start small, measure everything, and let reality drive your scaling. 📊</p>
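<p>The “start small and let reality drive it” approach looks like this in a standard HorizontalPodAutoscaler (the deployment name and thresholds are illustrative):</p>

```yaml
# HPA sketch - names and thresholds are illustrative
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2       # start small
  maxReplicas: 20      # let measured load find the real ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```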

<p><strong>Chaos Engineering</strong> - Perfect test environments create dangerous blind spots. You’ll see how to use <a href="https://chaos-mesh.org/">Chaos Mesh</a> to deliberately break things, proving your system can handle pod failures, node crashes, and dependency outages before they happen in production. 🔨</p>
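<p>A Chaos Mesh experiment is just another Kubernetes resource. This sketch (namespace and labels are placeholders) kills one Pod from a deployment, proving that replicas and health checks bring it back:</p>

```yaml
# Chaos Mesh PodChaos sketch - namespace and labels are placeholders
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-web-pod
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                  # kill a single matching Pod
  selector:
    namespaces: ["web-app"]
    labelSelectors:
      app: web-app
```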

<p><img src="/content/images/2025/08/chaos-mesh-experiments.png" alt="Chaos Mesh dashboard showing chaos engineering experiments including pod failures, node crashes, and network latency injection for testing system resilience" /></p>

<p><strong>Disaster Recovery</strong> - Even the most resilient system can’t survive everything. The final module covers how SRE teams classify applications by business criticality and implement appropriate DR strategies for regional failures.</p>

<h2 id="real-problems-real-solutions-">Real Problems, Real Solutions 🎯</h2>

<p>Every demo in the course reproduces actual production problems. When you see timeouts, cascading failures, and manual deployment disasters, these aren’t theoretical examples - they’re recreations of the issues that force SRE teams to hand back the pager.</p>

<p>The solutions aren’t exotic either. These are the standard infrastructure patterns that emerge from running hundreds of services at scale. Distributed caching with <a href="https://redis.io/">Redis</a>, message queuing for async processing, GitOps with ArgoCD - the tools and techniques that working SRE teams use every day.</p>

<h2 id="target-audience-for-sre-professionals">Target Audience for SRE Professionals</h2>

<p>This course is perfect if you’re:</p>

<ul>
  <li>A developer working with SRE teams who wants to understand their requirements</li>
  <li>An operations engineer looking to move into SRE</li>
  <li>An architect designing systems that need to run reliably at scale</li>
</ul>

<p>You’ll need basic knowledge of distributed systems and cloud platforms, plus an understanding of SRE fundamentals from <a href="https://blog.sixeyed.com/sre-learning-path-pluralsight/">the earlier courses in the SRE learning path</a>. The demo application runs in Kubernetes, but you don’t need to be an expert - the principles and approaches are the key things you’ll learn here, not just the technology implementation.</p>

<h2 id="the-sre-partnership-model-">The SRE Partnership Model 🤝</h2>

<p>One thing I really wanted to emphasize in this course is that SRE isn’t about one team imposing rules on another. It’s about partnership. Development teams bring deep application knowledge and feature expertise. SRE teams bring operational experience from running systems at scale. Together, they build something neither could achieve alone.</p>

<p>When the SRE team hands back the pager in module one, it’s not a failure - it’s a recognition that the application needs architectural changes that only the dev team can implement. When they take it back after the improvements, both teams win. Developers get faster deployments and more autonomy. SRE teams get sustainable operations with manageable toil.</p>

<p><img src="/content/images/2025/08/sre-team-collaboration.png" alt="SRE partnership model diagram illustrating collaboration between development teams and SRE teams, showing shared responsibilities for application reliability and operational excellence" /></p>

<h2 id="next-steps-️">Next Steps ➡️</h2>

<p><a href="/l/ps-sre-resiliency">SRE: Resiliency and Automation</a> is available now on Pluralsight. It’s about 90 minutes of content split across five modules, each with practical demos you can follow along with.</p>

<p>If you haven’t started the SRE learning path yet, begin with <a href="/l/ps-sre-concepts">SRE: Concepts and Principles</a> for the fundamentals, then move through the path to build your expertise.</p>

<p>The SRE approach transforms how we build and run systems. Instead of hoping things won’t break, we prove they can survive. Instead of firefighting the same issues repeatedly, we build systems that heal themselves. It’s a better way to work for everyone - developers, operators, and especially the users who depend on our services. 🎉</p>

<p>Happy learning!</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="sre" /><category term="pluralsight" /><category term="kubernetes" /><category term="devops" /><category term="resiliency" /><category term="automation" /><category term="gitops" /><category term="chaos-engineering" /><category term="disaster-recovery" /><category term="argocd" /><summary type="html"><![CDATA[Learn how SRE teams build resilient production systems with my new Pluralsight course. Master GitOps, chaos engineering, automation patterns, and disaster recovery strategies that survive production chaos.]]></summary></entry><entry><title type="html">10 Essential Claude Code Tips: Boost Your AI Coding Productivity in 2025</title><link href="https://blog.sixeyed.com/ten-tips-claude-code/" rel="alternate" type="text/html" title="10 Essential Claude Code Tips: Boost Your AI Coding Productivity in 2025" /><published>2025-07-15T09:00:00+00:00</published><updated>2025-07-15T09:00:00+00:00</updated><id>https://blog.sixeyed.com/ten-tips-claude-code</id><content type="html" xml:base="https://blog.sixeyed.com/ten-tips-claude-code/"><![CDATA[<p><a href="https://www.anthropic.com/claude-code">Claude Code</a> is Anthropic’s agentic coding tool that transforms AI pair programming. It lets you delegate development tasks directly from your VS Code terminal - you describe what you want, and a team of Claudes build it while you focus on the bigger picture.</p>

<p>My journey with Claude Code went like this:</p>
<ul>
  <li><em>mildly skeptical</em> 🤔</li>
  <li><em>pleasantly surprised</em> 😯</li>
  <li><em>thoroughly impressed</em> 🤯</li>
  <li><em>cannot live without</em> 🚀</li>
</ul>

<p>This Claude Code tutorial covers 10 battle-tested tips from real projects that will help you work like a tech lead with an AI development team at your command.</p>

<blockquote>
  <p><strong>Quick Summary</strong>: Claude Code transforms you from a coder into a development director. These 10 Claude Code best practices will help you manage multiple AI coding agents, maintain code quality, and dramatically increase productivity. Time required: 5 minutes to read, hours of new free time to fill.</p>
</blockquote>

<h2 id="getting-started-with-claude-code">Getting Started with Claude Code</h2>

<p>Setting up is straightforward: <a href="https://claude.ai/login">create a free account</a>, install the <a href="https://docs.anthropic.com/en/docs/claude-code/ide-integrations">Claude Code extension in VS Code</a>, authenticate and you’re ready. Open a terminal, type <code class="language-plaintext highlighter-rouge">claude</code> and start describing what you need. The real power comes from understanding how to work with it effectively.</p>

<blockquote>
  <p>I used Claude Code to build an entire <a href="https://github.com/sixeyed/multi-cloud-demo">multi-cloud AKS/EKS demo application</a>. With a few hours of guidance, Claude completed what would have taken me at least 3 days to write myself.</p>
</blockquote>

<h2 id="1-run-multiple-claude-instances-multitask-like-a-manager">1. Run Multiple Claude Instances: Multitask Like a Manager</h2>

<p>Run multiple Claude instances across different terminal sessions. While one’s building your API endpoints, another can work on the frontend, and a third can write your deployment scripts. Switch between them to provide guidance - it’s like having a team of developers who are extremely eager and who know <em>everything</em>.</p>

<p>Make sure you have plenty of things on the go - work, pet projects, blogs, tech explorations. And don’t be afraid to let it loop - prompts like “keep iterating on the build: fix any issues with the terraform config and deployment scripts, run the script, watch the outcome and repeat until it works” will keep Claude busy.</p>

<p class="notice--info">It’s the new ABC: <strong>Always Be Clauding</strong>.</p>

<h2 id="2-delegate-debugging-let-claude-do-the-work">2. Delegate Debugging: Let Claude Do the Work</h2>

<p>When something’s broken, resist the urge to fix it yourself. That’s not why you’re here. Describe the problem and let Claude handle the implementation. If you’re diving into the code to make changes, you’re going too deep. Stay at the design level where you add the most value.</p>

<p>Claude will use all the same debugging tools you use to find issues (it asks permission first and stores the permissions you’ve granted). If you see an error log, just give it to Claude and it will use <code class="language-plaintext highlighter-rouge">kubectl</code> to examine your Pods and Services, <code class="language-plaintext highlighter-rouge">curl</code> to test endpoints, <code class="language-plaintext highlighter-rouge">nslookup</code> for DNS queries and so on.</p>

<h2 id="3-code-review-mindset-roll-with-ai-generated-code">3. Code Review Mindset: Roll With AI-Generated Code</h2>

<p>Claude’s code isn’t going to look like yours. That’s fine. Treat it like you’re reviewing someone’s PR - does it meet the requirements? Is it maintainable? If you have standards, enforce them in the repo. Don’t get hung up on style differences. The goal is working software, not perfect alignment with your own preferences.</p>

<h2 id="4-rapid-prototyping-design-and-iterate-on-the-fly">4. Rapid Prototyping: Design and Iterate on the Fly</h2>

<p>Coding is cheap now. Really cheap. Need to refactor the entire architecture? Just ask. Want to switch from REST to GraphQL? Claude can handle it. Don’t overthink the initial design - build something that works, then iterate. It’s liberating when a complete redesign takes minutes, not days.</p>

<h2 id="5-git-best-practices-stay-in-control-of-commits">5. Git Best Practices: Stay in Control of Commits</h2>

<p>Claude can commit code and write commit messages, but don’t let it run on autopilot. Review the diffs, commit frequently, and keep your Git history clean. You want to understand what’s changing - that’s how you maintain ownership of the codebase.</p>

<p>Ready to try this? <a href="https://www.anthropic.com/claude-code">Start with Claude Code free</a> and experience the power of AI-assisted development.</p>

<h2 id="6-beyond-application-code-let-claude-handle-infrastructure">6. Beyond Application Code: Let Claude Handle Infrastructure</h2>

<p>Don’t just use it for application code. Claude can write your <a href="/learn-docker-in-a-month-of-lunches/">Dockerfiles</a>, <a href="/learn-kubernetes-in-a-month-of-lunches/">Kubernetes</a> manifests, <a href="https://www.terraform.io/">Terraform</a> configs, CI/CD pipelines, test suites, documentation. It will even give you guidance on architecture and tech stack. Push the boundaries - you’ll be surprised at what it can do. It has memorized the entire Internet, after all (probably).</p>

<h2 id="7-troubleshooting-complex-tasks-be-persistent">7. Troubleshooting Complex Tasks: Be Persistent</h2>

<p>Some tasks are harder for Claude than others. I’ve had situations where it took a dozen prompts to get a local LGTM stack running, or to authenticate to a new EKS cluster. When it struggles, approach from different angles. Rephrase your requirements, break complex tasks into steps, feed in error messages, or provide examples. Like any team member, Claude sometimes needs extra guidance to get unstuck.</p>

<p>Unlike most team members, Claude sometimes gives up. It will say something like “success! I’ve got it all working except the things you really wanted”. But that doesn’t mean it can’t do it - it’s just reached the end of the road for that prompt. Try again.</p>

<h2 id="8-claudemd-best-practices-provide-context-upfront">8. CLAUDE.md Best Practices: Provide Context Upfront</h2>

<p>Create a <a href="https://docs.anthropic.com/en/docs/claude-code/memory"><code class="language-plaintext highlighter-rouge">CLAUDE.md</code> file</a> in your project root. This is where Claude documents everything it needs to know - architecture decisions, tech preferences, naming conventions, project structure, and coding standards. Claude Code automatically reads this at the start of each session, so you don’t need to repeat yourself.</p>

<p>A good <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> is like a comprehensive onboarding doc for a new developer - and you can ask Claude to update it at the end of a session with new learnings, which it will pick up next time. Here’s what it looks like - create it with the <code class="language-plaintext highlighter-rouge">/init</code> command when you bring Claude onto the project and keep it current with prompts:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># CLAUDE.md - AI Assistant Memory File

## Project Overview
This is a multi-cloud Kubernetes demo application showcasing how containerized .NET applications can be deployed consistently across different cloud providers. The application demonstrates modern microservices patterns, message queuing, database persistence, and Kubernetes deployment best practices.

## Architecture

### Components
- **WebApp**: ASP.NET Core web application with Razor Pages
  - Form for message submission
  - Messages page displaying processed data from SQL Server
  - Modern gradient UI design with 3rem font sizes
  - Antiforgery tokens disabled for demo simplicity
</code></pre></div></div>

<h2 id="9-async-work-queue-batch-your-changes">9. Async Work Queue: Batch Your Changes</h2>

<p>While Claude is working on a longer task, queue up your next prompts. If you know you’ll need API tests after the endpoints are done, type that prompt and hit enter - Claude will pick it up when ready. This keeps Claude productive while you check on your other instances. It’s like having a queue of work that you can fill ahead of time.</p>

<h2 id="10-knowledge-management-capture-and-reuse-prompts">10. Knowledge Management: Capture and Reuse Prompts</h2>

<p>Ask Claude to dump all the prompts from your session to a text file in the repo. It’s incredibly useful to see how the development evolved - what worked, what needed clarification, how you refined requirements. These prompt histories become scaffolding for your next similar project. You’ll build up a library of effective prompts that you can reuse and adapt.</p>

<h2 id="claude-code-pricing-and-usage-limits">Claude Code Pricing and Usage Limits</h2>

<p>Don’t get too hung up on the details - which model you’re using or which plan you’re on. I use the more advanced <a href="https://docs.anthropic.com/en/docs/claude-code/settings#model-selection">Opus 4 model</a> by default, but Claude automatically switches to Sonnet 4 when you’re running low on credits, and it’s perfectly capable.</p>

<p>The <a href="https://docs.anthropic.com/en/docs/claude-code/costs">usage restrictions</a> are very fair - when you hit the limits, they reset after a period. More expensive plans have higher limits and shorter reset periods. Just focus on being productive with whatever you have.</p>

<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>

<p><strong>Q: Is Claude Code free to use?</strong><br />
A: Claude Code offers a free tier with limited usage. Paid plans provide higher limits and access to more powerful models like Opus 4.</p>

<p><strong>Q: Does Claude Code work with any language?</strong><br />
A: Yes! Claude Code supports all major programming languages including Python, Java, C#, Go, Rust, and more.</p>

<p><strong>Q: Can Claude Code work with existing codebases?</strong><br />
A: Absolutely. Claude Code can analyze and modify existing code. The CLAUDE.md file helps it understand your project structure and conventions.</p>

<p><strong>Q: How does Claude Code compare to GitHub Copilot?</strong><br />
A: While Copilot offers inline suggestions, Claude Code works at a higher level - managing entire features and projects through conversation. It’s more like having an AI pair programming partner who can handle complex, multi-file tasks.</p>

<p><strong>Q: Can I use Claude Code for production applications?</strong><br />
A: Yes, but always review Claude’s code thoroughly. Treat it like any code review - check for security issues, performance concerns, and adherence to your coding standards.</p>

<h2 id="the-future-of-ai-assisted-development">The Future of AI-Assisted Development</h2>

<p>Claude Code fundamentally changes how we write software. Instead of coding, you’re directing. Instead of debugging syntax, you’re validating solutions. Embrace this new way of working - you will suddenly become hugely more productive.</p>

<p>I imagine the next step will be a higher level still - you’ll plug Claude into your product backlog and set X number of instances running to do the entire project. One Claude will test and review the work of another Claude, and maybe there will be a manager (the Claude of Claudes) who takes over the director role.</p>

<p>But for now, you are the director. If you’re ready to transform your development workflow, <a href="https://www.anthropic.com/claude-code">get started with Claude Code</a> and experience the future of AI-powered coding. For a more detailed analysis of what multiple Claudes can do, check out my post <a href="/claude-is-coming-for-your-job/">An Evening with Claude Code - or - How I Learned to Stop Worrying and Love AI</a>.</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="ai" /><category term="claude" /><category term="productivity" /><category term="development" /><category term="claude-code" /><category term="ai-coding" /><category term="developer-tools" /><category term="ai-pair-programming" /><category term="prompt-engineering" /><category term="vs-code-extensions" /><category term="developer-productivity" /><summary type="html"><![CDATA[Master Claude Code with 10 battle-tested tips from real projects. Learn to run multiple AI agents, delegate effectively, and 10x your dev productivity in 2025.]]></summary></entry><entry><title type="html">An Evening with Claude Code - or - How I Learned to Stop Worrying and Love AI</title><link href="https://blog.sixeyed.com/claude-is-coming-for-your-job/" rel="alternate" type="text/html" title="An Evening with Claude Code - or - How I Learned to Stop Worrying and Love AI" /><published>2025-07-10T09:00:00+00:00</published><updated>2025-07-10T09:00:00+00:00</updated><id>https://blog.sixeyed.com/claude-is-coming-for-your-job</id><content type="html" xml:base="https://blog.sixeyed.com/claude-is-coming-for-your-job/"><![CDATA[<p>It’s 7pm, Friday night and I’m working on three different projects simultaneously with my new favorite colleague: <a href="https://www.anthropic.com/claude-code">Claude Code</a>.</p>

<ul>
  <li>Claude #1 is building a multi-cloud proof-of-concept for a client.</li>
  <li>Claude #2 is creating demos for my next Pluralsight course</li>
  <li>Claude #3 is fixing the UI issues on this blog that I’ve ignored for years.</li>
</ul>

<p>Actually… I’m mostly watching <a href="https://www.imdb.com/title/tt3606756/">Incredibles 2</a> with my kids, and just checking in on each of the Claudes in turn to nudge them to their next step. This is AI coding today.</p>

<h2 id="welcome-to-the-paradigm-shift">Welcome to the Paradigm Shift</h2>

<p>We - engineers and architects - shouldn’t feel we’re competing with AI. 🎬 Our role is to direct it. Bandwidth is no longer the limit, because we can distribute work to as many AI agents as we can manage. With multiple Claudes I can productively work on multiple tasks in parallel.</p>

<p>The mythical 10X developer turns out to be a regular 1X developer directing 10 instances of Claude Code. Even the best multitasking developers pay a cost every time they context switch, but AI doesn’t have that overhead. Each instance maintains its full context, while you operate on a higher level checking in and guiding them all.</p>

<blockquote>
  <p>I’ve heard the same joke from my consultancy clients for years - they want to clone me so they can run me at scale. It feels like Anthropic are doing that with Claude Code.</p>
</blockquote>

<p>I’ve been using Claude Code more and more, and this parallel workflow is a real breakthrough. This post covers where I think AI is going in the short term: not replacing developers, but fundamentally changing what engineering roles look like. One person running multiple AI agents is like a tech lead with a hugely knowledgeable, experienced and dedicated team. 🚀 So yes, AI is coming for your job - not to take it from you, but to transform it into something entirely more awesome.</p>

<h2 id="project-1-the-cloud-proof-of-concept">Project #1: The Cloud Proof-of-Concept</h2>

<p>I have a consulting client who are multi-cloud, but the area I work in is 100% Azure. They’re looking at broadening that to include AWS but they’re skeptical about how easy it is to migrate apps between clouds.</p>

<p>🐳 For years I’ve been saying that Docker and Kubernetes are the keystones of portable apps. They wanted a generic, simple proof of concept they could use to see if that was true, and to diff the Azure and AWS setup. I thought that was something Claude could help me with.</p>

<blockquote>
  <p>The full code Claude generated is on GitHub: <a href="https://github.com/sixeyed/multi-cloud-demo">sixeyed/multi-cloud-demo</a>. Here’s a snap of the app running in Azure with fully automated deployments built by Claude:</p>
</blockquote>

<p><img src="/content/images/2025/07/claude-azure-demo.png" alt="Multi-cloud demo application running in Azure showing distributed system with frontend, Redis queue, and SQL Server database deployed by Claude Code" /></p>

<p>This is how the conversation with Claude Code started - using the <a href="https://docs.anthropic.com/en/docs/claude-code/ide-integrations#installation">VS Code integration</a> in an empty folder:</p>

<div class="prompt-wrap">1. "this is a new simple demo app for showing how Kubernetes deployments can work the same way in different clouds. i'd like to create a basic multi service application - a web app which posts text to a redis queue, and a background worker which reads from the queue. both should be .NET, and the web app should have a very simple form for the user to enter text. i'd also like docker files and docker compose.yml so this runs locally for development"

2. "great. now let's have a helm folder with a chart to deploy this app to kubernetes. we'll want the chart to have a redis dependency - probably bitnami's chart"</div>

<p>At this point I had the source code, Docker and Kubernetes artifacts for a working demo. Then it gets interesting because I’m designing this out loud and Claude is reacting to changes in requirements:</p>

<div class="prompt-wrap">3. "now i want to demonstrate different kubernetes features. can we add to the worker process so it writes logs to a file or persists data somewhere so we can see PVCs and different storage options"

4. "no, scratch that. leave the logging to console but also add SQL server container to docker-compose.yml and have the worker write the messages to a database table"</div>

<p>And now I have a database defined in my Docker Compose spec, with the source code extended to include persistent storage. Claude also added it to the Kubernetes model without my asking, because by now it had enough context to know we’d be using both.</p>

<p>This is impressive enough, but the output isn’t perfect and - like any developer - Claude does get things wrong. What’s <em>really</em> impressive is that you can tell Claude there’s an issue, and it will use the same tools you would to debug, track down the root cause and fix it:</p>

<div class="prompt-wrap">5. "the data isn't getting into sql server from the worker. check the connection string and the ef core code"</div>

<p>That triggered lots of approval requests so Claude could use tools like <code class="language-plaintext highlighter-rouge">kubectl</code> and <code class="language-plaintext highlighter-rouge">curl</code> - nothing happens without your permission. It found the problem, fixed the Kubernetes specs and we were off again.</p>

<p>🤖 Generative AI for code is like having an engineer on the team who’s extremely knowledgeable and very enthusiastic, but lacking experience. Your role is to guide them, feed them tasks which you break down into sensible steps and describe clearly, and give feedback when something needs more work.</p>

<p>Some of these prompts take Claude a minute or two to work on, sometimes longer. That’s when you - as the AI team lead - switch to another instance on another part of the project, or a different project entirely. That other Claude has finished its latest task so you guide it on to the next one.</p>

<p>I’m always polite with Claude because of <a href="https://www.imdb.com/title/tt0103064/">Terminator 2</a>, but it doesn’t mind criticism. Sometimes it approaches tasks in an odd way, and you just point out what you’d prefer and it goes off and corrects it:</p>

<div class="prompt-wrap">36. "this is very cool. let's add another page to the web app which shows the messages in sql server. probably best to split it out so the html isn't all in a string now too :)"

37. "also - why do we have HTML strings at all? shouldn't we be using razor pages or something."</div>

<p>You get the idea. The <a href="https://github.com/sixeyed/multi-cloud-demo/blob/main/user-prompts.txt">full conversation</a> ran to 90 prompts, and at the end I had a full stack repo with Terraform configurations to create the Azure and AWS infrastructure, Helm charts to deploy the same app to both, and detailed documentation.</p>

<blockquote>
  <p>I made a point of not touching any code myself. Anything that didn’t work or needed changing went into a prompt. So Claude did it all - but Claude couldn’t have done it without me.</p>
</blockquote>

<p>This is why there’s still a place for human engineers. Maybe not for long, but for now the AI needs guidance. The more knowledge and experience you bring as the human guide, the more productive the AI can be. At a rough guess, this would have taken one human about three days to write; with Claude it was done in a few hours of intermittent guidance.</p>

<h2 id="project-2-pluralsight-course-demo">Project #2: Pluralsight Course Demo</h2>

<p>And while that goes on there, look at this going on here. In a different VS Code window I have a separate Claude Code session.</p>

<p>I’m working on a new <a href="/l/ps-sre-path">SRE learning path for Pluralsight</a>. There are four courses and I like to have a different demo app for each course. Those apps are built to show specific behavior to highlight how SRE approaches and tools can fix issues.</p>

<p>In the old days I might spend the first week of a new course building the demo app. I quite enjoyed that - it’s responsibility-free coding because the app will never run in anger - and it gave me a chance to use all the latest tech stacks and keep up to date.</p>

<p>But actually it’s not a very effective use of time. Far better to get Claude to write that throwaway code for me, freeing up my time to work on the content.</p>

<p>After a few iterations Claude had built a demo app which could be configured to deliberately fail in interesting ways, with a full GitOps stack to create Azure infrastructure with Terraform and deploy to Kubernetes with Argo.</p>

<p>Here’s part of Claude’s summary for that session, running while the other Claude was building my multi-cloud demo:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Final State
Repository is now:
- ✅ PowerShell standardized
- ✅ Self-bootstrapping infrastructure  
- ✅ Simplified testing approach
- ✅ Zero-configuration user experience
- ✅ Production-ready GitOps setup

## Key Takeaways
1. Complex test frameworks can become dead ends - simpler is often better
2. Self-bootstrapping infrastructure eliminates user setup pain
3. Consistency (PowerShell only) reduces maintenance overhead
4. Real infrastructure validation &gt; mocked tests for reliability demos
</code></pre></div></div>

<p>This was another app which started from scratch. ⏰ I saved a week of tinkering and could get straight onto the content for the course. It takes a little bit of introspection to admit: <em>this task is better suited to Claude than to me</em>. But it is, and it means the course gets finished more quickly.</p>

<h2 id="project-3-the-blog-ui-finally">Project #3: The Blog UI (Finally!)</h2>

<p>And the last thing Claude was helping me with was this blog. I’ve always focused on the content and have pretty much zero interest in HTML and CSS. Over the years I’ve used different frameworks and platforms; the current setup is Jekyll powered by GitHub Pages.</p>

<p>The theme is a modification of <a href="https://mmistakes.github.io/minimal-mistakes/">Minimal Mistakes</a> and the mobile experience has always sucked, but it’s one of those things I never fancied working on.</p>

<p>So in my third session I fired up Claude and introduced it to the blog repo. With the <code class="language-plaintext highlighter-rouge">/init</code> command it inspected the source code and generated the <a href="https://docs.anthropic.com/en/docs/claude-code/memory">CLAUDE.md</a> document for its own guidance, including a high-level overview:</p>

<div class="prompt-wrap">## Project Overview

This is a Jekyll-based blog using the Minimal Mistakes theme, hosted on GitHub Pages. The blog belongs to Elton Stoneman, a freelance IT consultant and trainer.</div>

<p>While the other two Claudes were working on their own things, I had this Claude fixing up the responsive design, adding SEO metadata to recent posts and generally tidying things up.</p>

<p>I even got Claude to write a URL-shortening component, to make it easier to control links. So my Pluralsight author page is available through my blog at https://blog.sixeyed.com/l/ps-home.</p>

<p>Years of technical debt fixed while my other projects built themselves. 📱 The blog finally looks professional on mobile. You can actually read my posts on your phone without zooming and scrolling horizontally.</p>

<p><img src="/content/images/2025/07/claude-blog.png" alt="Responsive blog design showing mobile-optimized layout with proper text wrapping and improved user experience created by Claude Code" /></p>

<h2 id="the-evenings-tally">The Evening’s Tally</h2>

<p>Final check-in across all three terminals:</p>
<ul>
  <li><strong>Client POC</strong>: Full distributed system deployed on both AKS and EKS - frontend accepting jobs, Redis queuing them, workers processing, results in SQL</li>
  <li><strong>Course demos</strong>: Six architectural patterns, fully containerized with documentation</li>
  <li><strong>Blog</strong>: Responsive, dark mode enabled, finally entering the 2020s</li>
</ul>

<p>✨ I accomplished three different project milestones in one evening (and a little bit into the following morning). Not by working faster - by working on multiple things simultaneously.</p>

<h2 id="the-realization">The Realization</h2>

<p>💪 What I’ve realized is that the value of human oversight across multiple AI workers is the new superpower. You become the tech lead doing rounds, checking on your team’s progress, providing direction, ensuring quality.</p>

<p>Here’s the hardest habit to break: <em>the urge to jump in and edit code manually</em>. Every time I see a small bug or want to tweak something, muscle memory says “I’ll just quickly fix this.” But that’s the old way. It’s always more effective to describe the change to Claude and move on to push forward another project. Let the AI make the change while you’re being productive elsewhere. Breaking this habit is like learning to delegate - uncomfortable at first, but essential for scaling.</p>

<h2 id="the-competitive-reality">The Competitive Reality</h2>

<p>💯 Here’s the uncomfortable truth: one AI-enabled developer can now deliver what used to take a small team. Not because AI is better at coding than humans, but because one human can effectively direct multiple AI developers working in parallel.</p>

<p>The good news? If you adapt, you become incredibly valuable. The concerning news? If you’re still working sequentially, you’re competing against people running parallel workstreams.</p>

<p>My advice:</p>

<ul>
  <li>Start thinking in parallel projects (or at least parallel tasks in the same project), not sequential tasks</li>
  <li>Get comfortable being a reviewer/director/tester/product manager rather than an implementer</li>
  <li>Practice managing multiple contexts simultaneously</li>
  <li>Focus on the skills AI can’t replicate: understanding business value, making architectural decisions, ensuring quality</li>
</ul>

<p class="notice--info">🎬 Think of yourself as a director, managing all the talent. You couldn’t do it without them - but they couldn’t do it without you either.</p>

<p>The code from all three projects is on GitHub. And the entire Claude Code transcript for the multi-cloud demo app is there too - every prompt. You can see exactly how a distributed system went from nothing to multi-cloud deployment without me writing a single line of code.</p>

<p>Stop thinking about AI as just a faster way to code, or maybe a threat to your job. Start thinking about it as your development team.</p>

<p>Now if you’ll excuse me, I’ve got a few more Claude Code instances to spin up. My next Pluralsight course demo isn’t going to build itself. 😏 Well, actually…</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="ai" /><category term="claude" /><category term="claude-code" /><category term="productivity" /><category term="development" /><category term="automation" /><category term="parallel-programming" /><category term="artificial-intelligence" /><category term="software-engineering" /><summary type="html"><![CDATA[One evening running three parallel development projects with Claude Code - building a client POC, creating course demos, and fixing blog UI issues simultaneously. AI isn't replacing developers, it's transforming us into directors of multiple AI workers.]]></summary></entry><entry><title type="html">Site Reliability Engineering (SRE) on Pluralsight: Complete 4-Course Learning Path</title><link href="https://blog.sixeyed.com/sre-learning-path-pluralsight/" rel="alternate" type="text/html" title="Site Reliability Engineering (SRE) on Pluralsight: Complete 4-Course Learning Path" /><published>2025-07-06T10:00:00+00:00</published><updated>2025-07-06T10:00:00+00:00</updated><id>https://blog.sixeyed.com/sre-learning-path-pluralsight</id><content type="html" xml:base="https://blog.sixeyed.com/sre-learning-path-pluralsight/"><![CDATA[<p><a href="https://sre.google/sre-book/table-of-contents/">Site Reliability Engineering</a> is how Google runs production systems, and it’s becoming the standard approach for managing complex applications at scale. I’ve just published the first two courses in a new <a href="/l/ps-sre-path">SRE learning path on Pluralsight</a>, with two more courses coming soon to complete the path.</p>

<p>SRE achieves the same goals as <a href="https://www.atlassian.com/devops">DevOps</a> - high availability with high velocity - but without requiring a massive culture shift. It’s an engineering approach to operations that focuses on automation, measurement, and removing toil. For many organizations starting their digital transformation, SRE provides a more structured path forward than traditional DevOps adoption.</p>

<p class="notice--info">I cover the easy(ish) way to add reliability at scale with container orchestration in my 5* Pluralsight course <a href="/l/ps-istio">Managing Apps on Kubernetes with Istio</a>.</p>

<h2 id="the-sre-learning-path">The SRE Learning Path</h2>

<p>The complete <a href="/l/ps-sre-path">Site Reliability Engineering learning path</a> takes you from SRE fundamentals through to advanced practices:</p>

<ol>
  <li><a href="/l/ps-sre-concepts">SRE: Concepts and Principles</a></li>
  <li><a href="/l/ps-sre-monitoring">SRE: Monitoring and Observability</a></li>
  <li><a href="/l/ps-sre-resiliency">SRE: Resiliency and Automation</a></li>
  <li>SRE: Measuring and Optimizing Reliability <em>(coming soon)</em></li>
</ol>

<p>Let’s dive into what’s covered in the first two courses.</p>

<h2 id="course-1-sre-concepts-and-principles">Course 1: SRE Concepts and Principles</h2>

<p><a href="/l/ps-sre-concepts">SRE: Concepts and Principles</a> is your entry point into Site Reliability Engineering. Over 90 minutes, you’ll follow two experienced SREs as they deal with real production scenarios.</p>

<h3 id="what-youll-learn">What You’ll Learn</h3>

<p>The course covers the foundational SRE concepts through practical demonstrations:</p>

<ul>
  <li>How SRE differs from traditional IT operations and DevOps</li>
  <li>Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets</li>
  <li>Incident management and the importance of blameless postmortems</li>
  <li>Core SRE tools for monitoring and alerting</li>
  <li>Automation, automation, automation</li>
</ul>

<p><img src="/content/images/2025/07/sre-1-automate.png" alt="Automation is the key principle in SRE" /></p>

<h3 id="course-outline">Course Outline</h3>

<p><strong>Module 1: Investigating Issues: On-Call with an SRE</strong><br />
Follow an on-call SRE dealing with a disk space issue in <a href="https://www.elastic.co/elasticsearch/">Elasticsearch</a>. You’ll see how SREs approach problems differently from traditional ops teams, using engineering practices to solve operational challenges.</p>

<p><strong>Module 2: Classifying and Tracking Performance with Service Levels</strong><br />
Join another SRE investigating a performance issue that’s burning through error budget. This module explains the key concepts of SLIs and SLOs while demonstrating logging and distributed tracing tools.</p>

<p><strong>Module 3: Managing Risk and Reducing Downtime</strong><br />
Learn how to use monitoring tools like <a href="https://prometheus.io/">Prometheus</a> and <a href="https://grafana.com/">Grafana</a> with <a href="https://opentelemetry.io/">OpenTelemetry</a> to confirm root causes and work with development teams on architectural solutions.</p>

<p><strong>Module 4: Handling Failure with Incident Management</strong><br />
When the initial fix doesn’t work and the incident escalates, you’ll see how SREs use a structured incident management approach to investigate and get to quick resolution.</p>

<p><strong>Module 5: Reflecting and Improving Practices with Postmortems</strong><br />
Wrap up with a blameless postmortem that connects both incidents and provides a path forward for preventing future issues.</p>

<h2 id="course-2-sre-monitoring-and-observability">Course 2: SRE Monitoring and Observability</h2>

<p><a href="/l/ps-sre-monitoring">SRE: Monitoring and Observability</a> builds on the foundational knowledge from course 1. You’ll follow an SRE team preparing to onboard a new application into their production environment.</p>

<h3 id="what-youll-learn-1">What You’ll Learn</h3>

<p>This course focuses on the technical implementation of observability:</p>

<ul>
  <li>The three pillars of observability: logging, metrics, and tracing</li>
  <li>Setting up monitoring stacks with Elasticsearch, Prometheus, and Grafana</li>
  <li>Designing effective alerting strategies that avoid alert fatigue</li>
  <li>Automating incident response with <a href="https://www.redhat.com/en/topics/devops/what-is-ci-cd">CI/CD</a> pipelines</li>
  <li>Exploring <a href="https://www.gartner.com/en/information-technology/glossary/aiops-artificial-intelligence-operations">AIOps</a> and machine learning for advanced monitoring</li>
</ul>

<p><img src="/content/images/2025/07/sre-2-monitor.png" alt="Monitoring applications in SRE with OpenTelemetry" /></p>

<h3 id="course-outline-1">Course Outline</h3>

<p><strong>Module 1: Onboarding to SRE: Observability Requirements</strong><br />
Learn what data applications need to expose for SRE teams to manage them effectively. Covers structured logging with the <a href="https://www.elastic.co/what-is/elk-stack">EFK stack</a> and distributed tracing with <a href="https://grafana.com/oss/tempo/">Tempo</a>.</p>

<p><strong>Module 2: Measuring “Good” with Service Level Indicators</strong><br />
Deep dive into implementing meaningful SLIs using Prometheus, including how to expose metrics from application components and aggregate them for monitoring.</p>

<p><strong>Module 3: Alerting on “Bad” to Trigger Incident Response</strong><br />
Design alerting strategies that trigger the right response - automated fixes for known issues or pages for unknown problems. Includes integration with <a href="https://www.atlassian.com/software/opsgenie">OpsGenie</a>.</p>

<p><strong>Module 4: Automating Remediation with Pipelines</strong><br />
Reduce toil by automating common fixes using <a href="https://github.com/features/actions">GitHub Actions</a> workflows triggered by your monitoring stack, with status updates posted to <a href="https://slack.com/">Slack</a>.</p>

<p><strong>Module 5: Next-level SRE: Machine Learning and AIOps</strong><br />
Explore how AIOps platforms like <a href="https://www.datadoghq.com/">Datadog</a> can augment traditional SRE practices with machine learning-driven anomaly detection and automated incident analysis.</p>

<h2 id="real-world-tools-and-practices">Real-World Tools and Practices</h2>

<p>Both courses use the same tools you’ll find in production SRE environments:</p>

<ul>
  <li><strong>Monitoring</strong>: Prometheus, Grafana</li>
  <li><strong>Logging</strong>: Elasticsearch, <a href="https://www.elastic.co/kibana/">Kibana</a>, <a href="https://www.fluentd.org/">Fluentd</a></li>
  <li><strong>Tracing</strong>: Tempo, OpenTelemetry</li>
  <li><strong>Alerting</strong>: OpsGenie, <a href="https://www.pagerduty.com/">PagerDuty</a></li>
  <li><strong>AIOps</strong>: Datadog</li>
</ul>

<p>Every demo shows working implementations to back up the theory. You’ll see realistic incidents being investigated, actual dashboards being built, and automation workflows in action.</p>

<p><img src="/content/images/2025/07/sre-2-alert.png" alt="Alerting thresholds in SRE" /></p>

<h2 id="who-should-take-these-courses">Who Should Take These Courses?</h2>

<p>The courses are designed for:</p>

<ul>
  <li>Developers who work with SRE teams or want to understand SRE practices</li>
  <li>Operations engineers looking to transition to SRE</li>
  <li>Team leads and managers evaluating SRE for their organization</li>
  <li>Anyone involved in digital transformation initiatives</li>
</ul>

<p>No deep technical knowledge is required for the first course - just a basic understanding of software development and deployment processes.</p>

<h2 id="whats-next">What’s Next?</h2>

<p>The next two courses in the learning path will complete your SRE education:</p>

<p><strong>SRE: Resiliency and Automation</strong> will focus on building systems that can withstand failures and automating responses to common issues. You’ll learn how to design for resilience, implement chaos engineering practices, and create self-healing systems.</p>

<p><strong>SRE: Measuring and Optimizing Reliability</strong> will cover advanced techniques for quantifying and improving system reliability, including complex SLO hierarchies, reliability budgeting, and using data to drive architectural decisions.</p>

<h2 id="getting-started">Getting Started</h2>

<p>Ready to learn how Google keeps systems running at scale? Start your Site Reliability Engineering journey today:</p>

<ol>
  <li><strong><a href="/l/ps-sre-path">View the complete SRE learning path</a></strong> - See all 4 courses and plan your learning</li>
  <li><strong><a href="/l/ps-sre-concepts">Start with SRE: Concepts and Principles</a></strong> - Master the fundamentals (90 minutes)</li>
  <li><strong><a href="/l/ps-sre-monitoring">Continue with SRE: Monitoring and Observability</a></strong> - Implement real-world solutions</li>
</ol>

<p>Site Reliability Engineering isn’t just for Google-scale operations. These SRE principles and practices work for any team running production systems. Whether you’re managing a handful of microservices or hundreds, SRE provides a proven framework for balancing reliability with feature velocity and reducing operational toil.</p>

<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>

<p><strong>Q: Do I need prior SRE experience to take these courses?</strong><br />
A: No, the first course starts with fundamentals. Basic software development and deployment knowledge is helpful.</p>

<p><strong>Q: What tools will I learn?</strong><br />
A: Prometheus, Grafana, Elasticsearch, Kibana, Tempo, OpsGenie, and modern AIOps platforms like Datadog.</p>

<p><strong>Q: How long does the complete learning path take?</strong><br />
A: The four courses total approximately 6 hours, designed to be completed over 2-3 weeks.</p>

<p><strong>Q: Is this Google’s exact SRE approach?</strong><br />
A: These courses teach the core SRE principles Google pioneered, adapted for use in any size of organization.</p>

<p class="notice--info">Ready to dive deeper into the tools and practices that make SRE possible? Check out my other courses on <a href="/l/ps-home">Pluralsight</a> covering Docker, Kubernetes, and cloud-native architecture.</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="sre" /><category term="site-reliability-engineering" /><category term="pluralsight" /><category term="monitoring" /><category term="observability" /><category term="devops" /><category term="google-sre" /><category term="incident-management" /><category term="prometheus" /><category term="grafana" /><summary type="html"><![CDATA[Master Site Reliability Engineering with my new 4-course Pluralsight learning path. Learn Google's SRE practices, monitoring with Prometheus & Grafana, incident management, and production observability through hands-on demonstrations.]]></summary></entry><entry><title type="html">Locking Helm Releases to Prevent Upgrades (and Downgrades)</title><link href="https://blog.sixeyed.com/locking-helm-releases/" rel="alternate" type="text/html" title="Locking Helm Releases to Prevent Upgrades (and Downgrades)" /><published>2024-10-16T08:00:00+00:00</published><updated>2024-10-16T08:00:00+00:00</updated><id>https://blog.sixeyed.com/locking-helm-releases</id><content type="html" xml:base="https://blog.sixeyed.com/locking-helm-releases/"><![CDATA[<h2 id="the-challenge-preventing-unwanted-helm-upgrades-and-downgrades">The Challenge: Preventing Unwanted Helm Upgrades and Downgrades</h2>

<p>It’s great having a single ‘Up’ pipeline for your apps which deploys the whole stack, creating whatever resources it needs and ensuring the deployment matches the spec in your source repo. Idempotence is the key here: your IaC will create or update infrastructure as required, and if you’re using <a href="/getting-started-with-kubernetes-on-windows/">Kubernetes</a> and Helm then you get desired-state deployment for the software.</p>

<p>One small issue you might see is if you have common services - say a data storage or monitoring subsystem - which are shared across multiple deployments of the app. If those deployments are different test environments running from different branches of the code, then you might get into a tricky scenario:</p>

<ul>
  <li>you update the shared service Helm chart to v1.1 in the dev branch</li>
  <li>you run the Up pipeline to deploy the latest code to the dev environment</li>
  <li>later someone deploys an earlier version from a release branch to the test environment</li>
  <li>the release branch uses v1.0 of the Helm chart, so your shared service gets downgraded</li>
</ul>

<p>Helm has the <code class="language-plaintext highlighter-rouge">upgrade --install</code> command which supports this idempotent approach, but there’s no flag to say <em>install it if it hasn’t been deployed yet, or upgrade it if it has - but only upgrade it if this version number is higher than the one for the current release</em>. In that case it would be useful to lock the release to prevent any further upgrades or downgrades, but there’s no <code class="language-plaintext highlighter-rouge">helm lock</code> command either.</p>
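<p>Until something like that exists, you can approximate the check in your pipeline script. This is a minimal sketch - the <code class="language-plaintext highlighter-rouge">version_gt</code> helper and the hard-coded version values are my own illustration, not part of Helm - which only runs the upgrade when the candidate chart version is strictly newer than the deployed one:</p>

```shell
# Sketch of a guarded upgrade: only call helm upgrade when the candidate
# chart version is strictly newer than the deployed one. version_gt and
# the hard-coded versions are illustrative - in a real pipeline you would
# read the deployed version from 'helm ls -o json'.
version_gt() {
  # true if $1 sorts strictly after $2 using version-aware sort
  [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

CANDIDATE="1.1.0"
DEPLOYED="1.0.0"

if version_gt "$CANDIDATE" "$DEPLOYED"; then
  echo "upgrading to $CANDIDATE"
  # helm upgrade --install shared-svc ./chart --version "$CANDIDATE"
else
  echo "skipping - $CANDIDATE is not newer than $DEPLOYED"
fi
```

<p>With <code class="language-plaintext highlighter-rouge">sort -V</code> doing the comparison, <code class="language-plaintext highlighter-rouge">1.10.0</code> correctly sorts after <code class="language-plaintext highlighter-rouge">1.9.0</code>, which a plain string comparison would get wrong.</p>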

<h2 id="pending-status-to-the-rescue">Pending Status to the Rescue</h2>

<p>When Helm installs and upgrades get interrupted they can leave the release in a pending state - <code class="language-plaintext highlighter-rouge">pending-upgrade</code> or <code class="language-plaintext highlighter-rouge">pending-rollback</code>, usually when an operation times out. It’s a nasty situation which requires manually deleting the Helm release Secret (until this <a href="https://github.com/helm/community/pull/354">HIP</a> is completed) - but it effectively prevents any further changes to the release, so we can abuse it to create a lock.</p>

<p>The scripting for this is fairly simple, but it does rely on the internals of how Helm represents a release, so it’s liable to be broken at some point (it’s working as of Helm 3.16). Every time you install or upgrade a release Helm creates a Kubernetes Secret which contains an encoded representation of the release.</p>

<p>You can try this with a simple Helm chart from my book <a href="https://amzn.to/3x3O7mt">Learn Kubernetes in a Month of Lunches</a> (see also my <a href="/learn-docker-in-a-month-of-lunches-my-new-book/">Docker book announcement</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm repo add kiamol https://kiamol.net

helm repo update

helm -n default upgrade --install vweb kiamol/vweb
</code></pre></div></div>

<p>The <a href="https://github.com/sixeyed/kiamol/tree/master/ch10/vweb/v1/vweb">Helm chart</a> models a Deployment and a Service, but the install also creates a Secret:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PS&gt;kubectl get secret

NAME                         TYPE                 DATA   AGE
sh.helm.release.v1.vweb.v1   helm.sh/release.v1   1      3m18s
</code></pre></div></div>

<p>Inside the Secret are the full chart contents, plus metadata about the release.</p>

<h2 id="inspecting-the-helm-secret">Inspecting the Helm Secret</h2>

<p>You can decode the Secret but that won’t help you much - the content is in the <code class="language-plaintext highlighter-rouge">release</code> field, and it’s a gzip archive, encoded as a Base64 text stream. So to read the contents you need to decode the Base64 representation in Kubernetes, then <em>decode it again</em> to get the raw gzip content, then pass it through the <code class="language-plaintext highlighter-rouge">gunzip</code> tool.</p>

<p>This extracts the raw data into a JSON file (using a *nix shell):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl get secrets sh.helm.release.v1.vweb.v1 -o=jsonpath='{ .data.release }' | base64 -d | base64 -d | gunzip -c &gt; data_release.json
</code></pre></div></div>
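<p>If the double-decode looks odd, you can reproduce the layering locally with dummy data - the gzip layer is Helm’s own encoding, and the extra Base64 layer is just how Kubernetes stores every Secret value. A minimal sketch:</p>

```shell
# Recreate Helm's encoding with sample JSON: gzip it, Base64-encode it
# (the value Helm writes), then Base64-encode again (how Kubernetes
# stores any Secret value).
JSON='{"name":"vweb","version":1}'
ENCODED=$(printf '%s' "$JSON" | gzip -c | base64 | base64)

# Reversing it needs two Base64 decodes before the gunzip
DECODED=$(printf '%s' "$ENCODED" | base64 -d | base64 -d | gunzip -c)

echo "$DECODED"   # prints the original JSON string
```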

<p>In the JSON you’ll see the YAML manifest for the deployment which Helm generated, plus the original chart contents. The interesting fields for us though are <code class="language-plaintext highlighter-rouge">info</code> and <code class="language-plaintext highlighter-rouge">version</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"vweb"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"info"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"first_deployed"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2024-10-16T07:53:28.496644+01:00"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"last_deployed"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2024-10-16T07:53:28.496644+01:00"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"deleted"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
        </span><span class="nl">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Install complete"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"status"</span><span class="p">:</span><span class="w"> </span><span class="s2">"deployed"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>When you run a <code class="language-plaintext highlighter-rouge">helm upgrade</code> command it decodes all this and checks the value of <code class="language-plaintext highlighter-rouge">info.status</code> before it proceeds. If it sees the release is pending then it won’t continue.</p>
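<p>That check is easy to replicate in shell. This is only a sketch of the logic - the hard-coded JSON is a stand-in for the decoded release data, and the <code class="language-plaintext highlighter-rouge">sed</code> parsing is mine, not Helm’s:</p>

```shell
# sketch of Helm's pre-flight check: any pending-* status blocks the upgrade
# (the JSON here is a stand-in for the decoded release data)
printf '{"info": {"status": "pending-upgrade"}, "version": 2}\n' > data_release.json

status=$(sed -n 's/.*"status": *"\([^"]*\)".*/\1/p' data_release.json)
case "$status" in
  pending-*) echo "Error: another operation (install/upgrade/rollback) is in progress" ;;
  *)         echo "Release status is $status - safe to upgrade" ;;
esac
# prints Error: another operation (install/upgrade/rollback) is in progress
```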

<h2 id="updating-the-helm-secret-to-lock-the-release">Updating the Helm Secret to Lock the Release</h2>

<p>Now we can see how to trick Helm into blocking any updates. The process is:</p>

<ul>
  <li>extract, decode and unzip the <code class="language-plaintext highlighter-rouge">release</code> value from the Secret into a JSON file</li>
  <li>update the <code class="language-plaintext highlighter-rouge">info.status</code> value in the JSON</li>
  <li>also increment the <code class="language-plaintext highlighter-rouge">version</code> field and set a useful description</li>
  <li>zip and encode the updated <code class="language-plaintext highlighter-rouge">release</code> JSON</li>
  <li>get the Secret and store as a YAML file</li>
  <li>update the <code class="language-plaintext highlighter-rouge">release</code> field in the YAML with the new data</li>
  <li>update the YAML metadata</li>
  <li>apply the updated Secret YAML</li>
</ul>

<p class="notice--info">I use <a href="https://mikefarah.gitbook.io/yq">yq</a> to make the JSON and YAML updates.</p>

<p>In Bash it looks like this - setting some variables first for the release we want to lock (fetch them from <code class="language-plaintext highlighter-rouge">helm ls</code>):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">RELEASE_NAMESPACE</span><span class="o">=</span><span class="s2">"default"</span>
<span class="nv">RELEASE_NAME</span><span class="o">=</span><span class="s2">"vweb"</span>
<span class="nv">RELEASE_VERSION</span><span class="o">=</span><span class="s2">"1"</span>

<span class="nv">RELEASE_SECRET_NAME</span><span class="o">=</span><span class="s2">"sh.helm.release.v1.</span><span class="nv">$RELEASE_NAME</span><span class="s2">.v</span><span class="nv">$RELEASE_VERSION</span><span class="s2">"</span>

<span class="nb">echo</span> <span class="s2">"Fetching release JSON from secret: </span><span class="nv">$RELEASE_SECRET_NAME</span><span class="s2">"</span>
kubectl get secrets <span class="nt">-n</span> <span class="nv">$RELEASE_NAMESPACE</span> <span class="nv">$RELEASE_SECRET_NAME</span> <span class="nt">-o</span><span class="o">=</span><span class="nv">jsonpath</span><span class="o">=</span><span class="s1">'{ .data.release }'</span> | <span class="nb">base64</span> <span class="nt">-d</span> | <span class="nb">base64</span> <span class="nt">-d</span> | <span class="nb">gunzip</span> <span class="nt">-c</span> <span class="o">&gt;</span> data_release.json

<span class="nb">let</span> <span class="s2">"NEW_VERSION=RELEASE_VERSION+1"</span>
<span class="nb">echo</span> <span class="s2">"Updating release JSON with lock data and new version: </span><span class="nv">$NEW_VERSION</span><span class="s2">"</span>
<span class="nv">v</span><span class="o">=</span><span class="nv">$NEW_VERSION</span> yq <span class="nt">-i</span> <span class="s1">'.version = env(v)'</span> data_release.json
yq <span class="nt">-i</span> <span class="s1">'.info.status = "pending-upgrade"'</span> data_release.json
yq <span class="nt">-i</span> <span class="s1">'.info.description = "LOCKED"'</span> data_release.json

<span class="nb">echo</span> <span class="s2">"Fetching release secret YAML"</span>
kubectl get secrets <span class="nt">-n</span> <span class="nv">$RELEASE_NAMESPACE</span> <span class="nv">$RELEASE_SECRET_NAME</span> <span class="nt">-o</span><span class="o">=</span>yaml <span class="o">&gt;</span> release_secret.yaml

<span class="nv">NEW_SECRET_NAME</span><span class="o">=</span><span class="s2">"sh.helm.release.v1.</span><span class="nv">$RELEASE_NAME</span><span class="s2">.v</span><span class="nv">$NEW_VERSION</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">"Updating secret YAML with lock JSON and new name: </span><span class="nv">$NEW_SECRET_NAME</span><span class="s2">"</span>
yq <span class="nt">-i</span> <span class="s1">'del(.data)'</span> release_secret.yaml
yq <span class="nt">-i</span> <span class="s1">'del(.metadata.creationTimestamp)'</span> release_secret.yaml
yq <span class="nt">-i</span> <span class="s1">'del(.metadata.resourceVersion)'</span> release_secret.yaml
yq <span class="nt">-i</span> <span class="s1">'del(.metadata.uid)'</span> release_secret.yaml
<span class="nv">r</span><span class="o">=</span><span class="si">$(</span><span class="nb">cat </span>data_release.json | <span class="nb">gzip</span> <span class="nt">-c</span> | <span class="nb">base64</span> <span class="nt">-w0</span><span class="si">)</span> yq <span class="nt">-i</span> <span class="s1">'.stringData.release = env(r)'</span> release_secret.yaml
<span class="nv">v</span><span class="o">=</span><span class="nv">$NEW_VERSION</span> yq <span class="nt">-i</span> <span class="s1">'.metadata.labels.version = strenv(v)'</span> release_secret.yaml
yq <span class="nt">-i</span> <span class="s1">'.metadata.labels.status = "pending-upgrade"'</span> release_secret.yaml
yq <span class="nt">-i</span> <span class="s1">'.metadata.labels.locked = "true"'</span> release_secret.yaml
<span class="nv">n</span><span class="o">=</span><span class="nv">$NEW_SECRET_NAME</span> yq <span class="nt">-i</span> <span class="s1">'.metadata.name = env(n)'</span> release_secret.yaml

kubectl apply <span class="nt">-f</span> release_secret.yaml
</code></pre></div></div>

<p>When you run this it creates a new Kubernetes Secret with the chart contents from the previous release, but with the status set to <code class="language-plaintext highlighter-rouge">pending-upgrade</code>, which is what locks the release. It also adds a label to the Secret - <code class="language-plaintext highlighter-rouge">locked=true</code> - which makes it easy to undo the lock later.</p>

<h2 id="locking-and-unlocking-the-helm-release">Locking and Unlocking the Helm Release</h2>

<p>If you try this out it should end with the happy message <code class="language-plaintext highlighter-rouge">secret/sh.helm.release.v1.vweb.v2 created</code>. Check your Helm releases and you’ll see the <code class="language-plaintext highlighter-rouge">vweb</code> app is now at revision 2 and is in <code class="language-plaintext highlighter-rouge">pending-upgrade</code> status:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span>helm <span class="nb">ls</span> <span class="nt">--all</span>
NAME    NAMESPACE       REVISION        UPDATED                              STATUS           CHART           APP VERSION
vweb    default         2               2024-10-16 07:53:28.496644 +0100 BST pending-upgrade  vweb-2.0.0      2.0.0
</code></pre></div></div>

<p>Adding the new Secret mimics a <code class="language-plaintext highlighter-rouge">helm upgrade</code> command which timed out and left the release pending. You can see the new Secret has the <code class="language-plaintext highlighter-rouge">status</code> label and also the <code class="language-plaintext highlighter-rouge">locked</code> label:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span>kubectl get secret <span class="nt">--show-labels</span>
NAME                         TYPE                 DATA   AGE     LABELS
sh.helm.release.v1.vweb.v1   helm.sh/release.v1   1      29m     <span class="nv">name</span><span class="o">=</span>vweb,owner<span class="o">=</span>helm,status<span class="o">=</span>deployed,version<span class="o">=</span>1
sh.helm.release.v1.vweb.v2   helm.sh/release.v1   1      2m56s   <span class="nv">locked</span><span class="o">=</span><span class="nb">true</span>,name<span class="o">=</span>vweb,owner<span class="o">=</span>helm,status<span class="o">=</span>pending-upgrade,version<span class="o">=</span>2
</code></pre></div></div>

<blockquote>
  <p>The status label is just a convenience - updating that on its own doesn’t lock the release, you need to update the status field in the release JSON</p>
</blockquote>

<p>Any attempt to run a <code class="language-plaintext highlighter-rouge">helm upgrade</code> will fail now:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span>helm upgrade <span class="nt">--install</span> vweb kiamol/vweb
Error: UPGRADE FAILED: another operation <span class="o">(</span><span class="nb">install</span>/upgrade/rollback<span class="o">)</span> is <span class="k">in </span>progress
</code></pre></div></div>

<p>You can unlock the release by deleting the Secret:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl delete secret <span class="nt">-l</span> <span class="nv">owner</span><span class="o">=</span>helm,locked<span class="o">=</span><span class="nb">true</span>
</code></pre></div></div>

<p>And now you can merrily upgrade again:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span>helm upgrade <span class="nt">--install</span> vweb kiamol/vweb        
Release <span class="s2">"vweb"</span> has been upgraded. Happy Helming!
NAME: vweb
LAST DEPLOYED: Wed Oct 16 08:26:49 2024
NAMESPACE: default
STATUS: deployed
REVISION: 2
TEST SUITE: None
</code></pre></div></div>

<p>All that’s left is to tidy up the Bash script and wrap it into a Docker image with <code class="language-plaintext highlighter-rouge">bash</code>, <code class="language-plaintext highlighter-rouge">kubectl</code> and <code class="language-plaintext highlighter-rouge">yq</code> installed so you can run it without needing all the dependencies…</p>
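<p>As a starting point, that wrapper image might be built from something like this - the base image, tool versions, download URLs and script name are all my assumptions, not the finished tooling:</p>

```dockerfile
# sketch only: Alpine base with bash, kubectl and yq added, plus the lock script
FROM alpine:3.20

RUN apk add --no-cache bash curl && \
    curl -Lo /usr/local/bin/kubectl "https://dl.k8s.io/release/v1.31.0/bin/linux/amd64/kubectl" && \
    curl -Lo /usr/local/bin/yq "https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64" && \
    chmod +x /usr/local/bin/kubectl /usr/local/bin/yq

# lock-release.sh is a hypothetical name for the tidied-up script above
COPY lock-release.sh /lock-release.sh
ENTRYPOINT ["bash", "/lock-release.sh"]
```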

<h2 id="related-reading">Related Reading</h2>

<p>If you’re working with Kubernetes and containers, you might find these related posts helpful:</p>

<ul>
  <li><a href="/getting-started-with-kubernetes-on-windows/">Getting Started with Kubernetes on Windows</a> - A comprehensive introduction to setting up Kubernetes on Windows</li>
  <li><a href="/this-blog-runs-on-docker-and-kubernetes-in-azure/">This Blog Runs on Docker and Kubernetes in Azure</a> - Real-world example of running production workloads on Kubernetes</li>
  <li><a href="/you-cant-always-have-kubernetes-running-containers-in-azure-vm-scale-sets/">You Can’t Always Have Kubernetes: Running Containers in Azure VM Scale Sets</a> - Alternative approaches when Kubernetes isn’t suitable</li>
</ul>

<p>For more container orchestration insights, check out my <a href="/learn-docker-in-a-month-of-lunches-my-new-book/">Docker and Kubernetes learning resources</a>.</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="helm" /><category term="kubernetes" /><category term="devops" /><category term="helm-charts" /><summary type="html"><![CDATA[Learn how to lock Helm releases in Kubernetes to prevent unwanted upgrades and downgrades. This step-by-step guide shows you how to manipulate Helm secrets to create release locks, protecting your shared services from version conflicts across multiple environments.]]></summary></entry><entry><title type="html">Tracing External Processes with Akka.NET and OpenTelemetry: Part 2 (Running the Demo)</title><link href="https://blog.sixeyed.com/tracing-external-processes-with-akka-net-and-opentelemetry-part-2-running-the-demo/" rel="alternate" type="text/html" title="Tracing External Processes with Akka.NET and OpenTelemetry: Part 2 (Running the Demo)" /><published>2024-07-16T10:00:00+00:00</published><updated>2024-07-16T10:00:00+00:00</updated><id>https://blog.sixeyed.com/tracing-external-processes-with-akka-net-and-opentelemetry-part-2-running-the-demo</id><content type="html" xml:base="https://blog.sixeyed.com/tracing-external-processes-with-akka-net-and-opentelemetry-part-2-running-the-demo/"><![CDATA[<p>In the <a href="/tracing-external-processes-with-akka-net-and-opentelemetry-part-1-the-code/">last post</a> I introduced a client project where I’m using OpenTelemetry and Akka.NET to collect traces for processes running in an external system. I’ve worked up a <a href="https://github.com/sixeyed/tracing-external-workflows">simplified demo on GitHub</a> so you can see how it works for yourself.</p>

<p>Just a few prerequisites and you can run this in Docker and/or Kubernetes:</p>

<ul>
  <li>a <a href="https://git-scm.com/downloads">Git client</a></li>
  <li><a href="https://learn.microsoft.com/en-us/powershell/scripting/install/installing-powershell?view=powershell-7.4">PowerShell</a> (if you want to use my scripts)</li>
  <li><a href="https://www.docker.com/products/docker-desktop/">Docker Desktop</a></li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/sixeyed/tracing-external-workflows.git
    
cd tracing-external-workflows
    
./scripts/docker/run.ps1
    
# or if you don't have PowerShell:
# docker compose -f docker/docker-compose.yml -f docker/docker-compose-monitoring.yml up -d
</code></pre></div></div>

<p>That will start a bunch of containers running:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; docker ps

CONTAINER ID   IMAGE                                                      COMMAND                    CREATED         STATUS         PORTS                      NAMES
3d48751c5ece   redis:7.2-alpine                                           "docker-entrypoint.s…"     8 minutes ago   Up 8 minutes   0.0.0.0:6379-&gt;6379/tcp     tracing-sample-redis-1
133d27dc8536   sixeyed/tracing-sample-external-api:202407-linux-arm64     "dotnet /app/Externa…"     8 minutes ago   Up 8 minutes   0.0.0.0:5010-&gt;8080/tcp     tracing-sample-api-1
d8259aeef523   grafana/tempo:2.5.0                                        "/tempo -config.file…"     8 minutes ago   Up 8 minutes   0.0.0.0:4317-&gt;4317/tcp     tracing-sample-tempo-1
2df5bbb547b7   grafana/grafana:11.0.0                                     "/run.sh"                  8 minutes ago   Up 8 minutes   0.0.0.0:3000-&gt;3000/tcp     tracing-sample-grafana-1
668c8eeefb01   sixeyed/tracing-sample-worker:202407-linux-arm64           "dotnet /app/Tracing…"     8 minutes ago   Up 8 minutes                              tracing-sample-worker-1
2b26bc987791   sixeyed/tracing-sample-load-generator:202407-linux-arm64   "dotnet /app/Tracing…"     8 minutes ago   Up 8 minutes                              tracing-sample-load-generator-1
</code></pre></div></div>

<p>What we have here is the real stack for monitoring, and a dummy stack for generating data:</p>

<ul>
  <li><a href="https://github.com/sixeyed/tracing-external-workflows/blob/main/src/api/External.Api/Controllers/WorkflowController.cs">the API</a> just pretends to start Workflows; when a new Workflow is POSTed the API generates random durations for each of the stages and returns a random ID. When the client checks the status of the Workflow the API responds with the current status based on the durations it calculated;</li>
  <li><a href="https://github.com/sixeyed/tracing-external-workflows/blob/main/src/worker/Tracing.Worker/BackgoundServices/Spec/EntityMonitorServiceBase.cs">the background worker</a> is where the interesting stuff happens - this is the component which tracks the external Workflows, using Akka.NET actors for each Workflow and each stage. The actors poll the API and record OpenTelemetry spans as the stages progress;</li>
  <li>Redis is used in the real system to publish events - in the demo the background worker listens for WorkflowStarted events coming from Redis, and <a href="https://github.com/sixeyed/tracing-external-workflows/blob/main/src/worker/Tracing.Worker/Actors/WorkflowMonitor.cs">triggers the monitoring</a> for each one;</li>
  <li><a href="https://github.com/sixeyed/tracing-external-workflows/blob/main/src/tools/Tracing.WorkflowGenerator/WorkflowMessagePublisher.cs">the Workflow Generator</a> is a simple tool which simulates batch processing by publishing a bunch of WorkflowStarted events to Redis, which kicks off all the monitoring in the back end;</li>
  <li><a href="https://grafana.com/oss/tempo/">Tempo</a> is a collector for distributed traces, with a simple storage model. It replaces Jaeger or Zipkin and can ingest the standard OpenTelemetry protocols.</li>
</ul>

<p class="notice--info">I use Jaeger in my 5* <a href="/l/ps-istio">Pluralsight course - Getting Started with Istio</a> but Tempo is a nice alternative and integrates very well with Grafana.</p>

<ul>
  <li><a href="https://grafana.com/oss/grafana/">Grafana</a> is configured to read trace data from Tempo. In the real system the worker collects additional metrics which we store in Prometheus, and this stack gives us a single UI for dashboards and trace visualization.</li>
</ul>

<h2 id="exploring-the-demo-app">Exploring the Demo App</h2>

<p>If you want to follow the logic through the different components, they all publish logs which you can see in Docker - the API lists the random durations it generates for each workflow:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; docker logs tracing-sample-api-1

dbug: External.Api.WorkflowEntityStateMachine[0]
      DataLoader: 9f623b49-1a25-4759-bc2e-f1bcca307a50 will transition to status: Processing; after: 13s
dbug: External.Api.WorkflowEntityStateMachine[0]
      DataLoader: 9f623b49-1a25-4759-bc2e-f1bcca307a50 will transition to status: Completed; after: 168s
dbug: External.Api.WorkflowStateMachine[0]
      Workflow: e730bcce-609f-494d-860e-84af7df37ccf added new entity: DataLoader
dbug: External.Api.WorkflowEntityStateMachine[0]
      DataLoader: b1bc2564-66d0-46b4-9564-7a6da7f74a27 transitioned to status: Completed
</code></pre></div></div>

<p>And the worker lists the tracing activity:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; docker logs tracing-sample-worker-1
    
Creating monitor actor for: 3e2fef36-109f-4f99-af10-bc7670b4f997
Monitor: WorkflowMonitor starting; initialDelaySeconds: 5; intervalSeconds: 10; timeoutMinutes 10
Started activity. Is recording: True
Set activity tags
Refresh timer triggered
Loaded workflow
Updating entity
Update received
</code></pre></div></div>

<p>The worker is configured with two exporters - the console exporter prints traces when they complete, and the OTLP Exporter sends data to Tempo (set up using the <code class="language-plaintext highlighter-rouge">OTEL_EXPORTER_OTLP_PROTOCOL</code> and <code class="language-plaintext highlighter-rouge">OTEL_EXPORTER_OTLP_ENDPOINT</code> environment variables). It’ll take a few minutes for the dummy Workflows to start completing, and when they do you’ll see log entries like this in the worker from the console exporter - in this example the Workflow ended with an error state:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Stopped activity, status: Error
Stopping WorkflowEntityMonitor actor for: DataLoader
Activity.TraceId: 28036142016d8fe90492dd49328d484f
Activity.SpanId: 94ca096e5110ef7c
Activity.TraceFlags: Recorded
WorkflowEntityMonitor stopped
Activity.ActivitySourceName: sample-tracing
Workflow finished; all entity monitors finished
Activity.DisplayName: Workflow
Terminating
Activity.Kind: Internal
Activity.StartTime: 2024-07-05T06:57:51.0737407Z
Activity.Duration: 00:01:05.0139422
Activity.Tags:
    workflowId: 704c0b60-3614-4e54-8985-5134cc20df22
    startTime: 07/05/2024 06:57:56 +00:00
    endTime: 07/05/2024 06:58:56 +00:00
StatusCode: Error
Activity.StatusDescription: Entity failed: DataLoader
Activity.Events:
    Submitted [07/05/2024 06:57:51 +00:00]
    Initializing [07/05/2024 06:57:56 +00:00]
    Processing [07/05/2024 06:58:06 +00:00]
    Failed [07/05/2024 06:58:56 +00:00]
Resource associated with Activity:
    service.name: Tracing.Worker
    service.namespace: dev1
    service.version: 1.0.0
    service.instance.id: c73a7c0d-72c1-4cb2-839f-a3b233085bf2
    telemetry.sdk.name: opentelemetry
    telemetry.sdk.language: dotnet
    telemetry.sdk.version: 1.8.1
</code></pre></div></div>

<blockquote>
  <p>All the data in the activity filters into Tempo and can be used for searches, so you can find individual workflows by ID, or check for failures within a given time period. The namespace tag is very useful for multi-tenant environments where you have different instances of the app pushing to a centralised monitoring stack.</p>
</blockquote>
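<p>Pointing the worker at a different collector needs no code changes - it’s all driven by those two environment variables. In a Compose file that might look something like this (a sketch: the service names and endpoint are my assumptions, not the repo’s actual file):</p>

```yaml
# sketch: wiring the worker to Tempo with OTLP over gRPC
# (service names and endpoint are assumptions, not the repo's actual Compose file)
services:
  worker:
    environment:
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317
```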

<p>You can open Grafana at http://localhost:3000/explore - no credentials needed for this deployment. Tempo is already configured as a data source, so you can select the <em>Search</em> tab and explore the traces coming in:</p>

<p alt="Grafana search interface showing distributed tracing results for workflows"><img src="/content/images/2024/07/workflow-2-grafana-search.png" alt="Searching for workflows in Grafana" /></p>

<p>Traces aren’t shown in their entirety until all the child spans are complete, but when that happens you can drill into a Workflow to see the details:</p>

<p alt="Tempo trace visualization showing workflow stages and timing in Grafana"><img src="/content/images/2024/07/workflow-1-tempo.png" alt="Visualizing a workflow as a trace" /></p>

<p>The OpenTelemetry spec lets you record additional data with traces and spans as tags (arbitrary key-value pairs) and events (with timestamps). The Workflow monitor actor sets the key details when it starts the Activity:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Activity</span> <span class="p">=</span> <span class="n">Instrumentation</span><span class="p">.</span><span class="n">Tracing</span><span class="p">.</span><span class="n">ActivitySource</span><span class="p">.</span><span class="nf">StartActivity</span><span class="p">(</span><span class="n">ActivityName</span><span class="p">,</span> <span class="n">ActivityKind</span><span class="p">.</span><span class="n">Internal</span><span class="p">);</span>
    
<span class="k">if</span> <span class="p">(</span><span class="n">Activity</span> <span class="p">!=</span> <span class="k">null</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">Activity</span><span class="p">.</span><span class="nf">AddTagIfNew</span><span class="p">(</span><span class="s">"workflowId"</span><span class="p">,</span> <span class="n">started</span><span class="p">.</span><span class="n">WorkflowId</span><span class="p">);</span>
  <span class="n">Activity</span><span class="p">.</span><span class="nf">AddEvent</span><span class="p">(</span><span class="k">new</span> <span class="nf">ActivityEvent</span><span class="p">(</span><span class="s">"Submitted"</span><span class="p">,</span> <span class="k">new</span> <span class="nf">DateTimeOffset</span><span class="p">(</span><span class="n">started</span><span class="p">.</span><span class="n">SubmittedAt</span><span class="p">)));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Activity objects record the start time when you create them, but you can override that if you have more accurate data. In this case we get the real start time when we poll the external API, and we can set that in the update logic, along with any new tags. We also track changes in status as events:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Activity</span><span class="p">.</span><span class="nf">SetStartTime</span><span class="p">(</span><span class="n">updated</span><span class="p">.</span><span class="nf">GetStartTime</span><span class="p">());</span>
    
<span class="n">Activity</span><span class="p">.</span><span class="nf">AddTagIfNew</span><span class="p">(</span><span class="s">"startTime"</span><span class="p">,</span> <span class="n">workflow</span><span class="p">.</span><span class="n">WorkflowStartTime</span><span class="p">)</span>
        <span class="p">.</span><span class="nf">AddTagIfNew</span><span class="p">(</span><span class="s">"endTime"</span><span class="p">,</span> <span class="n">workflow</span><span class="p">.</span><span class="n">WorkflowEndTime</span><span class="p">);</span>
    
<span class="kt">var</span> <span class="n">currentStatus</span> <span class="p">=</span> <span class="n">updated</span><span class="p">.</span><span class="nf">GetStatus</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">currentStatus</span> <span class="p">!=</span> <span class="n">_lastStatus</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">Activity</span><span class="p">.</span><span class="nf">AddEvent</span><span class="p">(</span><span class="k">new</span> <span class="nf">ActivityEvent</span><span class="p">(</span><span class="n">currentStatus</span><span class="p">));</span>
  <span class="n">_lastStatus</span> <span class="p">=</span> <span class="n">currentStatus</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These show up nicely in Grafana, with timestamps displayed relative to the span:</p>

<p alt="OpenTelemetry events timeline showing workflow status changes in trace spans"><img src="/content/images/2024/07/workflow-2-events.png" alt="Events in spans showing in Grafana" /></p>

<p>And finally when all the Entity processing has completed, we can end the Activity. The API can respond with a lengthy set of errors if there’s been a failure but we don’t need to record all that - just flagging the Activity with a status code of OK or Error will flow through into Tempo:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Activity</span><span class="p">.</span><span class="nf">SetEndTime</span><span class="p">(</span><span class="n">updated</span><span class="p">.</span><span class="nf">GetEndTime</span><span class="p">());</span>
<span class="k">if</span> <span class="p">(</span><span class="kt">string</span><span class="p">.</span><span class="nf">IsNullOrEmpty</span><span class="p">(</span><span class="n">errorMessage</span><span class="p">))</span>
<span class="p">{</span>
  <span class="n">Activity</span><span class="p">.</span><span class="nf">SetStatus</span><span class="p">(</span><span class="n">ActivityStatusCode</span><span class="p">.</span><span class="n">Ok</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
  <span class="n">Activity</span><span class="p">.</span><span class="nf">SetStatus</span><span class="p">(</span><span class="n">ActivityStatusCode</span><span class="p">.</span><span class="n">Error</span><span class="p">,</span> <span class="n">errorMessage</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">Activity</span><span class="p">.</span><span class="nf">Stop</span><span class="p">();</span>
</code></pre></div></div>

<p>Tags and attributes can all be used for filtering in Grafana, so you can search for failures or build a dashboard with a table for errored workflows. In the detail you see the status and the error message:</p>

<p alt="Error status displayed in Grafana trace showing workflow failure details"><img src="/content/images/2024/07/workflow-2-error.png" alt="Errored workflows show the status and error message" /></p>

<h2 id="on-to-production">On to Production</h2>

<p>As always it’s great to be able to run this whole thing in Docker on a developer’s laptop, to prove out the process and make code changes quickly. The real system runs in Kubernetes on Azure, and next time I’ll walk through deploying the monitoring subsystem and the demo app using Helm.</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="tracing" /><category term="opentelemetry" /><category term="dotnet" /><summary type="html"><![CDATA[Part 2 of the distributed tracing series, walks through running the demo code in Docker containers and visualizing traces in Grafana Tempo with step-by-step instructions.]]></summary></entry><entry><title type="html">Tracing External Processes with Akka.NET and OpenTelemetry: Part 1 (The Code)</title><link href="https://blog.sixeyed.com/tracing-external-processes-with-akka-net-and-opentelemetry-part-1-the-code/" rel="alternate" type="text/html" title="Tracing External Processes with Akka.NET and OpenTelemetry: Part 1 (The Code)" /><published>2024-07-03T17:29:57+00:00</published><updated>2024-07-03T17:29:57+00:00</updated><id>https://blog.sixeyed.com/tracing-external-processes-with-akka-net-and-opentelemetry-part-1-the-code</id><content type="html" xml:base="https://blog.sixeyed.com/tracing-external-processes-with-akka-net-and-opentelemetry-part-1-the-code/"><![CDATA[<p>Distributed tracing is one of the most useful observability tools you can add to your products. Digging into the steps of some process to see what happened and how long everything took gives you a valuable debugging tool for distributed systems. It’s usually straightforward to add tracing to HTTP components - you can get a lot of the work for free if you use a service mesh like Istio - but I had an interesting problem where I wanted to monitor processes running in an external system.</p>

<p class="notice--info">I cover the easy(ish) way to do this with HTTP services, and look at the benefits of observability in my 5* Pluralsight course <a href="/l/ps-istio">Managing Apps on Kubernetes with Istio</a>.</p>

<p>The system is a risk calculation engine. It has a REST API where you submit work and check on progress, but it doesn’t expose much useful instrumentation. When we submit a piece of work it goes through several stages, which range in duration from 5 minutes to several hours. In that time we can poll the API for a progress report, but we just get a snapshot of the current status, we don’t get an overall picture of the workflow.</p>

<p>I wanted to capture the stages of processing as a tracing graph, so we could build a dashboard with a list of completed processes, and drill down into the details for each. Something like the classic Jaeger view:</p>

<p alt="Architectural sketch showing distributed tracing workflow with Akka.NET actors"><img src="/content/images/2024/07/workflow-1-sketch.jpeg" alt="Architectural sketch showing distributed tracing workflow with Akka.NET actors" /></p>

<h2 id="terminology">Terminology</h2>

<p>To make sense of the rest of this post (and the series), some definitions:</p>

<ul>
  <li>each job we send to the calculation engine is called a <em>Workflow</em></li>
  <li>each Workflow has several stages, represented in the API as a collection of <em>Workflow Entities</em> in the Workflow object</li>
</ul>

<p>In the real system there are different categories of job, each of which creates a Workflow with a different set of Entities. For this series I’m using a simplified version where every workflow has three Entities which run in sequence:</p>

<ul>
  <li>Data Loader, representing the initial setup of data, which typically takes from 2 to 10 minutes</li>
  <li>Processor, which is the real work and can take from 30 to 240 minutes</li>
  <li>Output Generator, which transforms the processor output into the required format and can take from 5 to 60 minutes.</li>
</ul>

<p>I have a dummy API for testing which does nothing but report on Workflow progress, using random durations for each Entity.</p>

<h2 id="architecture">Architecture</h2>

<p>We’ve been live with the real system for a while so we have a good understanding of the workload. It’s pretty bursty with batches of processing coming in for a few hours, and then going quiet. During the batches we have a fairly small number of workflows, typically under 500. The external system breaks each Processor stage into tens of thousands of tasks (running on Spark), but we’re only interested in high-level progress of the Workflow and Entities. We also have a custom-built infrastructure around the external system, to publish events when we submit work, and a backend processor which listens for those events.</p>

<p>So to monitor the processes we need to spin up ~500 watchers which can poll the external system and track workflow progress. <a href="https://getakka.net/index.html">The actor model with Akka.NET</a> is a great fit here; I can use one actor for each Workflow - and the Workflow actor in turn manages an actor for each Workflow Entity - and not have to worry about threads, parallelism, timers or managing lifetime. Here’s the overall design:</p>

<ul>
  <li>register a supervisor process with Akka.NET and listen for “workflow started” event messages (which we already publish to Redis)</li>
  <li>on receipt of a message, the supervisor creates an actor to monitor that new Workflow</li>
  <li>each actor polls the external REST API to get the status of the Workflow, and as the stages progress it creates its own actors to monitor the Workflow Entities</li>
  <li>status updates are recorded in the actors using <a href="https://opentelemetry.io">OpenTelemetry</a>, stopping and starting spans for each Workflow Entity, linked to the overall trace for the Workflow.</li>
</ul>

<p class="notice--info">I’ve published a full code sample on GitHub here if you want to see how it all fits together: <a href="https://github.com/sixeyed/tracing-external-workflows">sixeyed/tracing-external-workflows</a>.</p>

<p>Towards the end of processing, each Workflow monitor actor has created three Entity monitor actors, one for each stage. The Workflow owns the overall trace, and in this example the spans for Data Loader and Processor would be complete, and the span for Output Generator would still be running:</p>

<p alt="Entity relationship diagram showing workflow monitor actor structure"><img src="/content/images/2024/07/workflow-1-erd.png" alt="Entity relationship diagram showing workflow monitor actor structure" /></p>

<h2 id="interesting-bits-of-code">Interesting Bits of Code</h2>

<p>In the worker a <a href="https://github.com/sixeyed/tracing-external-workflows/blob/main/src/worker/Tracing.Worker/BackgoundServices/Spec/EntityMonitorServiceBase.cs">background service</a> runs which creates the supervisor actor and subscribes to Redis, listening for Workflow started messages. When it gets a message it sends it on to the supervisor:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_supervisor</span> <span class="p">=</span> <span class="n">_actorSystem</span><span class="p">.</span><span class="nf">ActorOf</span><span class="p">(</span><span class="n">Props</span><span class="p">.</span><span class="n">Create</span><span class="p">&lt;</span><span class="n">TSupervisor</span><span class="p">&gt;(),</span> <span class="n">ActorCollectionName</span><span class="p">);</span>
    
<span class="n">_subscriber</span> <span class="p">=</span> <span class="n">_redis</span><span class="p">.</span><span class="nf">GetSubscriber</span><span class="p">();</span>
<span class="n">_subscriber</span><span class="p">.</span><span class="nf">Subscribe</span><span class="p">(</span><span class="n">MessageType</span><span class="p">,</span> <span class="p">(</span><span class="n">channel</span><span class="p">,</span> <span class="k">value</span><span class="p">)</span> <span class="p">=&gt;</span>
<span class="p">{</span>
  <span class="kt">var</span> <span class="n">message</span> <span class="p">=</span> <span class="n">JsonSerializer</span><span class="p">.</span><span class="n">Deserialize</span><span class="p">&lt;</span><span class="n">TStartedMessage</span><span class="p">&gt;(</span><span class="k">value</span><span class="p">);</span>
  <span class="n">_supervisor</span><span class="p">.</span><span class="nf">Tell</span><span class="p">(</span><span class="n">message</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<p>(The work happens in base classes because in the real system we actually have a few types of process we monitor - hence the generics - but in the sample code there’s just one type).</p>

<p>When the supervisor gets a “started” message, it spins up a monitor actor to watch the Workflow:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">id</span> <span class="p">=</span> <span class="n">started</span><span class="p">.</span><span class="nf">GetId</span><span class="p">();</span>
<span class="kt">var</span> <span class="n">props</span> <span class="p">=</span> <span class="n">DependencyResolver</span><span class="p">.</span><span class="nf">For</span><span class="p">(</span><span class="n">Context</span><span class="p">.</span><span class="n">System</span><span class="p">).</span><span class="n">Props</span><span class="p">&lt;</span><span class="n">TMonitor</span><span class="p">&gt;();</span>
     
<span class="kt">var</span> <span class="n">monitor</span> <span class="p">=</span> <span class="n">Context</span><span class="p">.</span><span class="nf">ActorOf</span><span class="p">(</span><span class="n">props</span><span class="p">,</span> <span class="n">id</span><span class="p">);</span>
<span class="n">_monitors</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">monitor</span><span class="p">);</span>
<span class="n">monitor</span><span class="p">.</span><span class="nf">Forward</span><span class="p">(</span><span class="n">started</span><span class="p">);</span>
</code></pre></div></div>

<p>The monitor is loaded with the <code class="language-plaintext highlighter-rouge">DependencyResolver</code>, which connects the .NET Dependency Injection framework to Akka.NET. The monitor uses an <a href="https://getakka.net/articles/actors/schedulers.html#scheduling-actor-messages-using-iwithtimers-recommended-approach">Akka.NET periodic timer</a> to trigger polling the external API for updates, and a one-off timer is used as a timeout, so if the Workflow stalls (which can happen) we don’t keep watching it forever.</p>
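
<p>The timer setup looks something like this - a sketch using Akka.NET’s <code class="language-plaintext highlighter-rouge">IWithTimers</code> interface, where the intervals are my assumptions rather than the exact values in the sample:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public class WorkflowMonitor : ReceiveActor, IWithTimers
{
  // Akka.NET injects the scheduler when the actor implements IWithTimers
  public ITimerScheduler Timers { get; set; }

  protected override void PreStart()
  {
    // periodic timer - poll the external API for status updates
    Timers.StartPeriodicTimer("refresh", new MonitorRefresh(),
      TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));

    // one-off timer - stop watching if the Workflow stalls
    Timers.StartSingleTimer("timeout", new MonitorTimeout(),
      TimeSpan.FromHours(8));
  }
}
</code></pre></div></div>

<p>Timers registered this way are cancelled automatically when the actor stops, so there’s no extra lifetime management to write.</p>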

<p>So the Workflow actor responds to four message types - when the workflow starts, when an update is due, when the update is received, and when the timeout fires:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Receive</span><span class="p">&lt;</span><span class="n">TStartedMessage</span><span class="p">&gt;(</span><span class="n">StartActivity</span><span class="p">);</span>
    
<span class="n">ReceiveAsync</span><span class="p">&lt;</span><span class="n">MonitorRefresh</span><span class="p">&gt;(</span><span class="k">async</span> <span class="n">refresh</span> <span class="p">=&gt;</span> <span class="k">await</span> <span class="nf">RefreshStatus</span><span class="p">());</span>
    
<span class="n">Receive</span><span class="p">&lt;</span><span class="n">TUpdatedMessage</span><span class="p">&gt;(</span><span class="n">UpdateActivity</span><span class="p">);</span>
    
<span class="n">Receive</span><span class="p">&lt;</span><span class="n">MonitorTimeout</span><span class="p">&gt;(</span><span class="n">_</span> <span class="p">=&gt;</span> <span class="nf">Terminate</span><span class="p">(</span><span class="s">"Monitor timed out"</span><span class="p">));</span>
</code></pre></div></div>

<p>When the refresh timer fires, the actor calls the external API to get the current status of the Workflow and its Entities. The client code is generated from the system’s OpenAPI spec and then wrapped in services. Those are all registered with standard .NET DI, and every call to the API uses a scoped client:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="p">(</span><span class="kt">var</span> <span class="n">scope</span> <span class="p">=</span> <span class="n">_serviceProvider</span><span class="p">.</span><span class="nf">CreateScope</span><span class="p">())</span>
<span class="p">{</span>
  <span class="kt">var</span> <span class="n">workflowService</span> <span class="p">=</span> <span class="n">scope</span><span class="p">.</span><span class="n">ServiceProvider</span><span class="p">.</span><span class="n">GetRequiredService</span><span class="p">&lt;</span><span class="n">WorkflowService</span><span class="p">&gt;();</span>
  <span class="n">workflow</span> <span class="p">=</span> <span class="k">await</span> <span class="n">workflowService</span><span class="p">.</span><span class="nf">GetWorkflow</span><span class="p">(</span><span class="n">EntityId</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">_log</span><span class="p">.</span><span class="nf">Info</span><span class="p">(</span><span class="s">"Loaded workflow"</span><span class="p">);</span>
</code></pre></div></div>

<p>Each monitor actor tracks state using an <a href="https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.activity?view=net-8.0">Activity object</a>, which is part of the <a href="https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/docs/trace/README.md">.NET implementation of OpenTelemetry tracing</a>. The Activity gets started when the actor is created, and updated when there’s a status update in the response from polling the API. The status updates include the current stage of the process, and for each stage the workflow monitor actor creates a Workflow Entity actor which has its own Activity linked to the parent Activity:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">entity</span> <span class="k">in</span> <span class="n">workflow</span><span class="p">.</span><span class="n">WorkflowEntities</span><span class="p">)</span>
<span class="p">{</span>
  <span class="kt">var</span> <span class="n">entityType</span> <span class="p">=</span> <span class="n">Enum</span><span class="p">.</span><span class="n">Parse</span><span class="p">&lt;</span><span class="n">EntityType</span><span class="p">&gt;(</span><span class="n">entity</span><span class="p">.</span><span class="n">Key</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(!</span><span class="n">_entityMonitors</span><span class="p">.</span><span class="nf">ContainsKey</span><span class="p">(</span><span class="n">entityType</span><span class="p">))</span>
  <span class="p">{</span>
    <span class="kt">var</span> <span class="n">entityMonitor</span> <span class="p">=</span> <span class="n">Context</span><span class="p">.</span><span class="nf">ActorOf</span><span class="p">(</span><span class="n">WorkflowEntityMonitor</span><span class="p">.</span><span class="nf">Props</span><span class="p">(</span><span class="n">entityType</span><span class="p">,</span> <span class="n">Activity</span><span class="p">),</span> <span class="n">entity</span><span class="p">.</span><span class="n">Key</span><span class="p">);</span>
    <span class="n">_entityMonitors</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="n">entityType</span><span class="p">,</span> <span class="n">entityMonitor</span><span class="p">);</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When the stage completes, the Workflow Entity actor ends the child Activity, ending the span, and sends a message to the workflow monitor actor to say the entity has finished:</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_activity</span><span class="p">.</span><span class="nf">AddTagIfNew</span><span class="p">(</span><span class="s">"endTime"</span><span class="p">,</span> <span class="n">entity</span><span class="p">.</span><span class="n">EntityEndTime</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="kt">string</span><span class="p">.</span><span class="nf">IsNullOrEmpty</span><span class="p">(</span><span class="n">entity</span><span class="p">.</span><span class="n">EntityErrorMessage</span><span class="p">))</span>
<span class="p">{</span>
  <span class="n">_activity</span><span class="p">.</span><span class="nf">SetStatus</span><span class="p">(</span><span class="n">ActivityStatusCode</span><span class="p">.</span><span class="n">Ok</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
  <span class="n">_activity</span><span class="p">.</span><span class="nf">SetStatus</span><span class="p">(</span><span class="n">ActivityStatusCode</span><span class="p">.</span><span class="n">Error</span><span class="p">,</span> <span class="n">entity</span><span class="p">.</span><span class="n">EntityErrorMessage</span><span class="p">);</span>
<span class="p">}</span>
    
<span class="n">_activity</span><span class="p">.</span><span class="nf">SetEndTime</span><span class="p">(</span><span class="n">entity</span><span class="p">.</span><span class="n">EntityEndTime</span><span class="p">.</span><span class="n">Value</span><span class="p">.</span><span class="n">DateTime</span><span class="p">);</span>
<span class="n">_activity</span><span class="p">.</span><span class="nf">Stop</span><span class="p">();</span>
    
<span class="kt">var</span> <span class="n">ended</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">WorkflowEntityEnded</span><span class="p">(</span><span class="n">_entityType</span><span class="p">);</span>
<span class="n">Context</span><span class="p">.</span><span class="n">Parent</span><span class="p">.</span><span class="nf">Tell</span><span class="p">(</span><span class="n">ended</span><span class="p">,</span> <span class="n">Self</span><span class="p">);</span>
</code></pre></div></div>

<p>And when all the Entities are done and the whole Workflow is finished, the parent Activity is ended which completes the trace and sends it on to the exporters. In the sample code I’ve configured the <a href="https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.Console/README.md">console exporter</a> so traces get published as logs, and the <a href="https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/README.md">OTLP exporter</a> to send the traces to a real collector so you can visualize them:</p>

<p alt="Tempo trace visualization showing workflow stages and timing in Grafana"><img src="/content/images/2024/07/workflow-1-tempo.png" alt="Tempo trace visualization showing workflow stages and timing in Grafana" /></p>
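
<p>The exporter wiring is standard OpenTelemetry configuration. As a sketch (the service name and source name here are assumptions - check the sample repo for the real values):</p>

<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>services.AddOpenTelemetry()
  .WithTracing(tracing =&gt; tracing
    .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("tracing-worker"))
    .AddSource("Tracing.Worker")  // must match the ActivitySource name used by the actors
    .AddConsoleExporter()         // publishes traces as logs
    .AddOtlpExporter());          // sends traces to a collector, e.g. Tempo
</code></pre></div></div>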

<p>Continue reading in <a href="/tracing-external-processes-with-akka-net-and-opentelemetry-part-2-running-the-demo/">Part 2: Running the Demo</a> where I’ll show you how to run the sample app with Docker containers, collecting the traces with <a href="https://grafana.com/oss/tempo/">Tempo</a> and exploring them with <a href="https://grafana.com/oss/grafana/">Grafana</a>.</p>]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="tracing" /><category term="opentelemetry" /><category term="dotnet" /><summary type="html"><![CDATA[Learn how to implement distributed tracing for external processes using Akka.NET and OpenTelemetry. Complete code walkthrough with practical examples for monitoring workflows in .NET applications.]]></summary></entry><entry><title type="html">You can’t always have Kubernetes: running containers in Azure VM Scale Sets</title><link href="https://blog.sixeyed.com/you-cant-always-have-kubernetes-running-containers-in-azure-vm-scale-sets/" rel="alternate" type="text/html" title="You can’t always have Kubernetes: running containers in Azure VM Scale Sets" /><published>2021-03-09T15:51:46+00:00</published><updated>2021-03-09T15:51:46+00:00</updated><id>https://blog.sixeyed.com/you-cant-always-have-kubernetes-running-containers-in-azure-vm-scale-sets</id><content type="html" xml:base="https://blog.sixeyed.com/you-cant-always-have-kubernetes-running-containers-in-azure-vm-scale-sets/"><![CDATA[<p>Rule number 1 for running containers in production: don’t run them on individual Docker servers. You want reliability, scale and automated upgrades and for that you need an orchestrator like Kubernetes, or a managed container platform like <a href="https://azure.microsoft.com/en-gb/services/container-instances/#overview">Azure Container Instances</a>.</p>

<blockquote>
  <p>If you’re choosing between container platforms, my new Pluralsight course <a href="/l/ps-dooo1Q">Deploying Containerized Applications</a> walks you through the major options.</p>
</blockquote>

<p>But the thing about production is: you’ve got to get your system running, and real systems have technical constraints. Those constraints might mean you have to forget the rules. This post covers a client project I worked on where my design had to forsake rule number 1, and build a scalable and reliable system based on containers running on VMs.</p>

<p><em>This post is a mixture of architecture diagrams and scripts - just like the client engagement.</em></p>

<h2 id="when-kubernetes-wont-do">When Kubernetes won’t do</h2>

<p>I was brought in to design the production deployment, and build out the DevOps pipeline. The system was for provisioning bots which join online meetings. The client had run a successful prototype with a single bot running on a VM in Azure.</p>

<p>The goal was to scale the solution to run multiple bots, with each bot running in a Docker container. In production the system would need to scale quickly, spinning up more containers to join meetings on demand - and more hosts to provide capacity for more containers.</p>

<p>So far, so Kubernetes. Each bot needs to be individually addressable, and the connection from the bot to the meeting server uses mutual TLS. The bot has two communication channels - HTTPS for a REST API, and a direct TCP connection for the data stream from the meeting. That can all be done with Kubernetes - Services with custom ports for each bot, Secrets for the TLS certs, and a public IP address for each node.</p>

<blockquote>
  <p>If you want to learn how to model an app like that, my book <a href="https://www.manning.com/books/learn-kubernetes-in-a-month-of-lunches?utm_source=affiliate&amp;utm_medium=affiliate&amp;a_aid=elton&amp;a_bid=a506ee0d">Learn Kubernetes in a Month of Lunches</a> is just the thing for you :)</p>
</blockquote>

<p>But… The bot uses a Windows-only library to connect to the meeting, and the bot workload involves a lot of video manipulation. So that brought in the technical constraints for the containers:</p>

<ul>
  <li>they need to run with GPU access</li>
  <li>the app uses the Windows video subsystem, and that needs the full (big!) <a href="https://hub.docker.com/_/microsoft-windows">Windows base Docker image</a>.</li>
</ul>

<p>Right now you can run <a href="https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/">GPU workloads in Kubernetes</a>, but only in Linux Pods, and you can run <a href="https://docs.microsoft.com/en-us/azure/container-instances/container-instances-gpu">containers with GPUs in Azure Container Instances</a>, but only for Linux containers. So we’re looking at a valid scenario where orchestration and managed container services won’t do.</p>

<h2 id="the-alternative---docker-containers-on-windows-vms-in-azure">The alternative - Docker containers on Windows VMs in Azure</h2>

<p>You can run Docker containers with GPU access on Windows with the <code class="language-plaintext highlighter-rouge">devices</code> flag. You need to have your GPU drivers set up and configured, and then your containers will have GPU access (the <a href="https://github.com/MicrosoftDocs/Virtualization-Documentation/tree/live/windows-container-samples/directx">DirectX Container Sample</a> walks through it all):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># on Windows 10 20H2:
docker run --isolation process --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 sixeyed/winml-runner:20H2

# on Windows Server LTSC 2019:
docker run --isolation process --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 sixeyed/winml-runner:1809
</code></pre></div></div>

<blockquote>
  <p>The container also needs to be running with process isolation - see my container show <a href="https://eltons.show/ecs-w4/">ECS-W4: Isolation and Versioning in Windows Containers</a> on YouTube for more details on that.</p>
</blockquote>

<p><em>Note - we’re talking about the standard Docker Engine here. GPU access for containers used to require an Nvidia fork of Docker, but now <a href="https://docs.docker.com/config/containers/resource_constraints/#gpu">GPU access is part of the main Docker runtime</a>.</em></p>

<p>You can spin up Windows VMs with GPUs in Azure, and have Docker already installed using the <code class="language-plaintext highlighter-rouge">Windows Server 2019 Datacenter with Containers</code> VM image. And for the scaling requirements, there are Virtual Machine Scale Sets (VMSS), which let you run multiple instances of the same VM image - where each instance can run multiple containers.</p>

<p>The design I sketched out looked like this:</p>

<p alt="Architecture diagram showing Azure VM Scale Set with load balancer distributing traffic to Docker containers running on multiple Windows VMs with GPU support"><img src="/content/images/2021/02/vmss-containers.png" alt="Running containers in Virtual Machine Scale Set, with a load balancer directing traffic to container ports" /></p>

<ul>
  <li>each VM hosts multiple containers, each using custom ports</li>
  <li>a load balancer spans all the VMs in the scale set</li>
  <li>load balancer rules are configured for each bot’s ports</li>
</ul>

<p>The idea is to run a minimum number of VMs, providing a stable pool of bot containers. Then we can scale up and add more VMs running more containers as required. Each bot is uniquely addressable within the pool, with a predictable address range, so <code class="language-plaintext highlighter-rouge">bots.sixeyed.com:8031</code> would reach the first container on the third VM and <code class="language-plaintext highlighter-rouge">bots.sixeyed.com:8084</code> would reach the fourth container on the eighth VM.</p>
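
<p>The addressing is simple arithmetic - the port encodes the VM number and the container number. A sketch of the scheme as described above (the base port of 8000 is my reading of the examples):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># port = 8000 + (10 * vm-number) + container-number
port() { echo $(( 8000 + (10 * $1) + $2 )); }

port 3 1   # first container on the third VM -> 8031
port 8 4   # fourth container on the eighth VM -> 8084
</code></pre></div></div>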

<h2 id="using-a-custom-vm-image">Using a custom VM image</h2>

<p>With this approach the VM is the unit of scale. My assumption was that adding a new VM to provide more bot capacity would take several minutes - too long for a client waiting for a bot to join. So the plan was to run with spare capacity in the bot pool, scaling up the VMSS when the pool of free bots fell below a threshold.</p>

<p>Even so, scaling up to add a new VM had to be a quick operation - not waiting minutes to pull the super-sized Windows base image and extract all the layers. The first step in minimizing scale-up time is to use a <a href="https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/tutorial-use-custom-image-cli">custom VM image for the scale set</a>.</p>

<p>A VMSS base image can be set up manually by running a VM and doing whatever you need to do. In this case I could use the Windows Server 2019 image with Docker configured, and then run an Azure extension to install the Nvidia GPU drivers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># create vm:
az vm create `
  --resource-group $rg `
  --name $vmName `
  --image 'MicrosoftWindowsServer:WindowsServer:2019-Datacenter-Core-with-Containers' `
  --size 'Standard_NC6_Promo' `
  --admin-username $username `
  --admin-password $password

# deploy the nvidia drivers:
az vm extension set `
  --resource-group $rg `
  --vm-name $vmName `
  --name NvidiaGpuDriverWindows `
  --publisher Microsoft.HpcCompute `
  --version 1.3
</code></pre></div></div>

<p>The additional setup for this particular VM:</p>

<ul>
  <li>pre-pulling the Windows base image</li>
  <li>configuring the Nvidia GPU to use the <a href="https://techcommunity.microsoft.com/t5/azure-compute/nv-series-wddm-vs-tcc/m-p/143568">correct driver mode for video decoding - WDDM instead of TCC</a></li>
  <li>installing the <a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-windows?tabs=azure-cli">Azure CLI</a> so the VM can authenticate to a private Azure Container Registry to pull application images</li>
  <li>running <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/windows/capture-image-resource#generalize-the-windows-vm-using-sysprep">SysPrep</a> to generalize the Windows OS</li>
</ul>
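
<p>Those steps are all scriptable on the VM before you capture the image - something like this (a sketch; the base image tag and the driver-mode switch depend on your OS version and GPU):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># pre-pull the (big!) Windows base image:
docker pull mcr.microsoft.com/windows:1809

# switch the GPU driver mode from TCC to WDDM (0 = WDDM):
nvidia-smi -dm 0

# install the Azure CLI for ACR authentication:
Invoke-WebRequest -Uri https://aka.ms/installazurecliwindows -OutFile .\azcli.msi
Start-Process msiexec.exe -ArgumentList '/I azcli.msi /quiet' -Wait

# generalize the OS - the VM shuts down when SysPrep completes:
&amp; "$env:windir\System32\Sysprep\sysprep.exe" /generalize /oobe /shutdown
</code></pre></div></div>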

<p>Then you can create a private base image from the VM, first deallocating and generalizing it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az vm deallocate --resource-group $rg --name $vmName

az vm generalize --resource-group $rg --name $vmName

az image create --resource-group $rg `
    --name $imageName --source $vmName
</code></pre></div></div>

<blockquote>
  <p>The image can be in its own Resource Group - you can use it for VMSSs in other Resource Groups.</p>
</blockquote>

<h2 id="creating-the-vm-scale-set">Creating the VM Scale Set</h2>

<p>Scripting all the setup with the Azure CLI makes for a nice repeatable process - which you can easily put into a GitHub workflow. The <a href="https://docs.microsoft.com/en-us/cli/azure/vmss?view=azure-cli-latest">az documentation</a> is excellent and you can build up pretty much any Azure solution using just the CLI.</p>

<p>There are a few nice features you can use with VMSS that simplify the rest of the deployment. This abridged command shows the main details:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>az vmss create `
   --image $imageId `
   --subnet $subnetId `
   --public-ip-per-vm `
   --public-ip-address-dns-name $vmssPipDomainName `
   --assign-identity `
  ...
</code></pre></div></div>

<p>That’s going to use my custom base image, and attach the VMs in the scale set to a specific <a href="https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview">virtual network subnet</a> - so they can connect to other components in the client’s backend. Each VM will get its own public IP address, and a custom DNS name will be applied to the public IP address for the load balancer across the set.</p>

<p>The VMs will use <a href="https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview">managed identity</a> - so they can securely use other Azure resources without passing credentials around. You can use <code class="language-plaintext highlighter-rouge">az role assignment create</code> to grant access for the VMSS managed identity to ACR.</p>
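
<p>Granting the identity pull access to ACR follows the usual pattern - a sketch, with assumed variable names:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># get the principal ID of the scale set's managed identity:
$principalId = az vmss show `
  --resource-group $rg --name $vmss `
  --query identity.principalId -o tsv

# let it pull images from the registry:
az role assignment create `
  --assignee $principalId `
  --role AcrPull `
  --scope $acrResourceId
</code></pre></div></div>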

<p>When the VMSS is created, you can set up the rules for the load balancer, directing the traffic for each port to a specific bot container. This is what makes each container individually addressable - only one container in the VMSS will listen on a specific port. A health probe in the LB tests for a TCP connection on the port, so only the VM which is running that container will pass the probe and be sent traffic.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># health probe:
az network lb probe create `
 --resource-group $rg --lb-name $lbName `
 -n "p$port" --protocol tcp --port $port

# LB rule:
az network lb rule create `
 --resource-group $rgName --lb-name $lbName `
 --frontend-ip-name loadBalancerFrontEnd `
 --backend-pool-name $backendPoolName `
 --probe-name "p$port" -n "p$port" --protocol Tcp `
 --frontend-port $port --backend-port $port
</code></pre></div></div>

<h2 id="spinning-up-containers-on-vmss-instances">Spinning up containers on VMSS instances</h2>

<p>You can use the <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows">Azure VM custom script extension</a> to run a script on a VM, and you can trigger that on all the instances in a VMSS. This is the deployment and upgrade process for the bot containers - run a script which pulls the app image and starts the containers.</p>

<p>Up until now the solution is pretty solid. This script is the ugly part, because we’re going to manually spin up the containers using <code class="language-plaintext highlighter-rouge">docker run</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker container run -d `
 -p "$($port):443" `
 --restart always `
 --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 `
 $imageName
</code></pre></div></div>

<p>The real script adds an <code class="language-plaintext highlighter-rouge">env-file</code> for config settings, and the run commands are in a loop so we can dynamically set the number of containers to run on each VM. So what’s wrong with this? <strong>Nothing is managing the containers</strong>. The <code class="language-plaintext highlighter-rouge">restart</code> flag means Docker will restart the container if the app crashes, and start the containers if the VM restarts, but that’s all the additional reliability we’ll get.</p>
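
<p>The loop itself is plain PowerShell - a sketch, where <code class="language-plaintext highlighter-rouge">$containerCount</code> and the port scheme are assumptions based on the design above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for ($i = 1; $i -le $containerCount; $i++) {
  $port = 8000 + (10 * $vmIndex) + $i
  docker container run -d `
    -p "$($port):443" `
    --restart always `
    --env-file .\bot.env `
    --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 `
    $imageName
}
</code></pre></div></div>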

<blockquote>
  <p>In the client’s solution, they added functionality to their backend API to manage the containers - but that sounds a lot like writing a custom orchestrator…</p>
</blockquote>

<p>Moving on from the script, upgrading the VMSS instances is simple to do. The script and any additional assets - env files and certs - can be uploaded to private blob storage, using SAS tokens for the VM to download. You use JSON configuration for the script extension and you can split out sensitive settings.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># set the script on the VMSS:
az vmss extension set `
    --publisher Microsoft.Compute `
    --version 1.10 `
    --name CustomScriptExtension `
    --resource-group $rg `
    --vmss-name $vmss `
    --settings $settings.Replace('"','\"') `
    --protected-settings $protectedSettings.Replace('"','\"')

# updating all instances triggers the script:
az vmss update-instances `
 --instance-ids * `
 --name $vmss `
 --resource-group $rg
</code></pre></div></div>

<p>Applying the custom script extension updates the model for the VMSS - but it doesn’t actually run the script. The next step does that: updating the instances runs the script on each of them, replacing the containers with the new Docker image version.</p>

<h3 id="code-and-infra-workflows">Code and infra workflows</h3>

<p>All the Azure scripts can live in a separate GitHub repo, with secrets added for the <code class="language-plaintext highlighter-rouge">az</code> authentication, cert passwords and everything else. The upgrade scripts to deploy the custom script extension and update the VMSS instances can sit in a workflow with a <code class="language-plaintext highlighter-rouge">workflow_dispatch</code> trigger and input parameters:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment to deploy: dev, test or prod'     
        required: true
        default: 'dev'
      imageTag:
        description: 'Image tag to deploy, e.g. v1.0-175'     
        required: true
        default: 'v1.0'
</code></pre></div></div>

<p>The Dockerfile for the image lives in the source code repo with the rest of the bot code. The workflow in that repo builds and pushes the image, and ends by triggering the upgrade deployment in the infra repo - using <a href="https://twitter.com/BenCodeGeek">Ben Coleman</a>’s <a href="https://github.com/benc-uk/workflow-dispatch">benc-uk/workflow-dispatch</a> action:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deploy-dev:  
  if: $
  runs-on: ubuntu-18.04
  needs: build-teams-bot
  steps:
    - name: Dispatch upgrade workflow
      uses: benc-uk/workflow-dispatch@v1
      with:
        workflow: Upgrade bot containers
        repo: org/infra-repo
        token: $
        inputs: '{"environment":"dev", "imageTag":"v1.0-$"}'
        ref: master
</code></pre></div></div>

<p>So the final pipeline looks like this:</p>

<ul>
  <li>devs push to the main codebase</li>
  <li>build workflow triggered - uses Docker to compile the code and package the image</li>
  <li>if the build is successful, that triggers the publish workflow in the infrastructure repo</li>
  <li>the publish workflow updates the VM script to use the new image tag, and deploys it to the Azure VMSS.</li>
</ul>

<blockquote>
  <p>I covered GitHub workflows with Docker in <a href="https://eltons.show/ecs-c2">ECS-C2: Continuous Deployment with Docker and GitHub</a> on YouTube</p>
</blockquote>

<p>Neat and automated for a reliable and scalable deployment. Just don’t tell anyone we’re running containers on individual servers, instead of using an orchestrator…</p>

<blockquote>
  <p>Want to learn more about container orchestration? Check out my guide on <a href="/getting-started-with-kubernetes-on-windows/">Getting Started with Kubernetes on Windows</a> or explore my <a href="/tags/#kubernetes">Docker and Kubernetes learning path</a>.</p>
</blockquote>

<!--kg-card-end: markdown-->]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="docker" /><category term="kubernetes" /><category term="azure" /><summary type="html"><![CDATA[Kubernetes is great for running containers at scale, but it doesn't fit every project. This post walks through an alternative using Docker and Azure VMSS.]]></summary></entry><entry><title type="html">How to Experiment with .NET 5 and 6 using Docker containers - No Local Installation Required</title><link href="https://blog.sixeyed.com/experimenting-with-net-5-and-6-using-docker-containers/" rel="alternate" type="text/html" title="How to Experiment with .NET 5 and 6 using Docker containers - No Local Installation Required" /><published>2021-02-21T20:38:10+00:00</published><updated>2021-02-21T20:38:10+00:00</updated><id>https://blog.sixeyed.com/experimenting-with-net-5-and-6-using-docker-containers</id><content type="html" xml:base="https://blog.sixeyed.com/experimenting-with-net-5-and-6-using-docker-containers/"><![CDATA[<p>The .NET team publish <a href="https://hub.docker.com/_/microsoft-dotnet">Docker images</a> for every release of the .NET SDK and runtime. Running .NET in containers is a great way to experiment with a new release or try out an upgrade of an existing project, without deploying any new runtimes onto your machine.</p>

<p>In case you missed it, .NET 5 is the latest version of .NET, and it marks the end of the “.NET Core” and “.NET Framework” names. .NET Framework ends with 4.8, which is the last supported version, and .NET Core ends with 3.1 - the platform evolves into plain “.NET”. The first release is .NET 5, and the next version - .NET 6 - will be a long-term support release.</p>

<blockquote>
  <p>If you’re new to the SDK/runtime distinction, check my guide on <a href="/understanding-microsofts-docker-images-for-net-apps/">Understanding Microsoft’s Docker Images for .NET Apps</a>.</p>
</blockquote>

<h2 id="run-a-net-5-development-environment-in-a-docker-container">Run a .NET 5 development environment in a Docker container</h2>

<p>You can use the .NET 5.0 SDK image to run a container with all the build and dev tools installed. These are official Microsoft images, published to MCR (the Microsoft Container Registry).</p>

<p><em>Create a local folder for the source code and mount it inside a container:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir -p /tmp/dotnet-5-docker

docker run -it --rm \
  -p 5000:5000 \
  -v /tmp/dotnet-5-docker:/src \
  mcr.microsoft.com/dotnet/sdk:5.0
</code></pre></div></div>

<blockquote>
  <p>All you need to run this command is <a href="https://www.docker.com/products/docker-desktop">Docker Desktop</a> on Windows or macOS, or <a href="https://hub.docker.com/search?q=&amp;type=edition&amp;offering=community">Docker Community Edition</a> on Linux.</p>
</blockquote>

<p>Docker will pull the .NET 5.0 SDK image the first time you use it, and start running a container. If you’re new to Docker this is what the options mean:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">-it</code> connects you to an interactive session inside the container</li>
  <li><code class="language-plaintext highlighter-rouge">-p</code> publishes a network port, so you can send traffic into the container from your machine</li>
  <li><code class="language-plaintext highlighter-rouge">--rm</code> deletes the container and its storage when you exit the session</li>
  <li><code class="language-plaintext highlighter-rouge">-v</code> mounts a local folder from your machine into the container filesystem - when you use <code class="language-plaintext highlighter-rouge">/src</code> inside the container it’s actually using the <code class="language-plaintext highlighter-rouge">/tmp/dotnet-5-docker</code> folder on your machine</li>
  <li><code class="language-plaintext highlighter-rouge">mcr.microsoft.com/dotnet/sdk:5.0</code> is the full image name for the 5.0 release of the SDK</li>
</ul>

<p>And this is how it looks:</p>

<p><img src="/content/images/2021/02/run.gif" alt="Terminal session showing the .NET 5 SDK running in a Docker container, with dotnet --list-sdks command output" /></p>

<p>When the container starts you’ll drop into a shell session <strong>inside the container</strong>, which has the .NET 5.0 runtime and developer tools installed. Now you can start playing with .NET 5, using the Docker container to run commands but working with the source code on your local machine.</p>

<p>In the container session, run this to check the version of the SDK:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dotnet --list-sdks
</code></pre></div></div>

<h2 id="creating-and-running-a-quickstart-project">Creating and Running a Quickstart Project</h2>

<p>The <code class="language-plaintext highlighter-rouge">dotnet new</code> command creates a new project from a template. There are plenty of templates to choose from; we’ll start with a nice simple REST service, using ASP.NET WebAPI.</p>

<p><em>Initialize and run a new project:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># create a WebAPI project without HTTPS or Swagger:
dotnet new webapi \
  -o /src/api \
  --no-openapi --no-https

# configure ASP.NET to listen on port 5000:
export ASPNETCORE_URLS=http://+:5000

# run the new project:
dotnet run \
  --no-launch-profile \
  --project /src/api/api.csproj
</code></pre></div></div>

<p>When you run this you’ll see lots of output from the build process - NuGet packages being restored and the C# project being compiled. The output ends with the ASP.NET runtime showing the address where it’s listening for requests.</p>

<p>Now your .NET 5 app is running inside Docker, and because the container has a published port to the host machine, you can browse to <a href="http://localhost:5000/weatherforecast">http://localhost:5000/weatherforecast</a> on your machine. Docker sends the request into the container, and the ASP.NET app processes it and sends the response.</p>

<h2 id="packaging-your-app-into-a-docker-image">Packaging Your App into a Docker Image</h2>

<p>What you have now isn’t fit to ship and run in another environment, but it’s easy to get there by building your own Docker image to package your app.</p>

<blockquote>
  <p>I cover the path to production in my Udemy course <a href="https://docker4.net/udemy">Docker for .NET Apps</a></p>
</blockquote>

<p>To ship your app you can use this <a href="https://github.com/sixeyed/blog/blob/master/dotnet-5-with-docker/Dockerfile">.NET 5 sample Dockerfile</a> to package it up. You’ll do this from your host machine, so you can stop the .NET app in the container with <code class="language-plaintext highlighter-rouge">Ctrl-C</code> and then run <code class="language-plaintext highlighter-rouge">exit</code> to get back to your command line.</p>
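<p>The sample is a standard multi-stage Dockerfile: the SDK image compiles and publishes the app, and the smaller ASP.NET runtime image carries the final package. A rough sketch of the same shape - the linked file will differ in detail - can be written out like this:</p>

```shell
# A minimal multi-stage Dockerfile along the same lines as the linked sample -
# build with the SDK image, run on the smaller ASP.NET runtime image:
cat > Dockerfile <<'EOF'
FROM mcr.microsoft.com/dotnet/sdk:5.0 AS builder

WORKDIR /src
COPY api/ .
RUN dotnet publish -c Release -o /out api.csproj

FROM mcr.microsoft.com/dotnet/aspnet:5.0

WORKDIR /app
COPY --from=builder /out/ .
ENTRYPOINT ["dotnet", "api.dll"]
EOF
```

<p>The multi-stage approach means the SDK and source code stay in the builder stage - the final image contains only the published app and the runtime.</p>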

<p><em>Use Docker to publish and package your WebAPI app:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># verify the source code is on your machine: 
ls /tmp/dotnet-5-docker/api

# switch to your local source code folder:
cd /tmp/dotnet-5-docker

# download the sample Dockerfile:
curl -o Dockerfile https://raw.githubusercontent.com/sixeyed/blog/master/dotnet-5-with-docker/Dockerfile

# use Docker to package from source code:
docker build -t dotnet-api:5.0 .
</code></pre></div></div>

<p>Now you have your own Docker image, with your .NET 5 app packaged and ready to run. You can edit the code on your local machine and repeat the <code class="language-plaintext highlighter-rouge">docker build</code> command to package a new version.</p>

<h2 id="running-your-app-in-a-new-container">Running Your App in a New Container</h2>

<p>The SDK container you ran is gone, but now you have an application image so you can run your app without any additional setup. Your image is configured with the ASP.NET runtime and when you start a container from the image it will run your app.</p>

<p><em>Start a new container listening on a different port:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># run a container from your .NET 5 API image:
docker run -d -p 8010:80 --name api dotnet-api:5.0

# check the container logs:
docker logs api
</code></pre></div></div>

<p>In the logs you’ll see the usual ASP.NET startup log entries, telling you the app is listening on port 80. That’s port 80 <em>inside</em> the container though, which is published to port 8010 on the host.</p>

<p>The container is running in the background, waiting for traffic. You can try your app again, running this on the host:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://localhost:8010/weatherforecast
</code></pre></div></div>

<p>When you’re done fetching fictional weather forecasts, you can stop and remove your container with a single command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker rm -f api
</code></pre></div></div>

<p>And if you’re done experimenting, you can remove your image and the .NET 5 images:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker image rm dotnet-api:5.0

docker image rm mcr.microsoft.com/dotnet/sdk:5.0

docker image rm mcr.microsoft.com/dotnet/aspnet:5.0
</code></pre></div></div>

<blockquote>
  <p>Now your machine is back in the exact same state as before you tried .NET 5.</p>
</blockquote>

<h2 id="what-about-net-6">What about .NET 6?</h2>

<p>You can do exactly the same thing for .NET 6, just changing the version number in the image tags. .NET 6 is in preview right now but the <code class="language-plaintext highlighter-rouge">6.0</code> tag is a moving target which gets updated with each new release (check the <a href="https://hub.docker.com/_/microsoft-dotnet-sdk/">.NET SDK repository</a> and the <a href="https://hub.docker.com/_/microsoft-dotnet-aspnet/">ASP.NET runtime repository</a> on Docker Hub for the full version names).</p>

<p>To try .NET 6 you’re going to run this for your dev environment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir -p /tmp/dotnet-6-docker

docker run -it --rm \
  -p 5000:5000 \
  -v /tmp/dotnet-6-docker:/src \
  mcr.microsoft.com/dotnet/sdk:6.0
</code></pre></div></div>

<p>Then you can repeat the steps to create a new .NET 6 app and run it inside a container.</p>

<p>And in your Dockerfile you’ll use the <code class="language-plaintext highlighter-rouge">mcr.microsoft.com/dotnet/sdk:6.0</code> image for the builder stage and the <code class="language-plaintext highlighter-rouge">mcr.microsoft.com/dotnet/aspnet:6.0</code> image for the final application image.</p>

<p>It’s a nice workflow to try out a new major or minor version of .NET with no dependencies (other than Docker). You can even put your <code class="language-plaintext highlighter-rouge">docker build</code> command into a GitHub workflow and build and package your app from your source code repo - check my YouTube show <a href="https://eltons.show/episodes/ecs-c2/">Continuous Deployment with Docker and GitHub</a> for more information on that.</p>

<blockquote>
  <p>Looking to learn more about Docker and .NET? Check out my guide on <a href="/understanding-microsofts-docker-images-for-net-apps/">Understanding Microsoft’s Docker Images for .NET Apps</a> or my comprehensive series on <a href="/tags/#docker">Docker containers and orchestration</a>.</p>
</blockquote>

<!--kg-card-end: markdown-->]]></content><author><name>Elton Stoneman</name><uri>/l/ps-home</uri></author><category term="dotnet" /><category term="docker" /><summary type="html"><![CDATA[Learn how to experiment with .NET 5 and .NET 6 using Docker containers. Step-by-step guide to running development environments, creating projects, and packaging apps without installing SDKs locally.]]></summary></entry></feed>