
NServiceBus on .NET Core - It's time


During Build 2017, Microsoft released .NET Core 2.0 Preview 1. While we previously determined it was too early to seriously consider adopting .NET Core, with this release we now believe that the current platform can support a comprehensive, reliable, and production-ready version of NServiceBus. As a result, we are happy to say...

Good News, Everyone

NServiceBus 7 will support .NET Core 2.0 running on any of the supported platforms. This initial release will include .NET Core compatible versions of the RabbitMQ transport and SQL persistence. We're not going to stop there, though. Over time, more NServiceBus packages will get .NET Core support.

Since we depend on the release of .NET Core 2.0, we can't provide a release date for NServiceBus 7 just yet. .NET Core 2.0 is currently expected to be released in Q3 2017, and we expect to release NServiceBus 7 shortly after that has happened.

Keen observers will already have noticed that we have made some changes to the NServiceBus 6 codebase to better align us with the goal of supporting .NET Core 2.0. For example, Encryption and Windows Performance Counters have been moved into dedicated repositories and NuGet packages. More changes are still required before we can fully compile against .NET Core, and those efforts are already underway.

For those of you who do not have plans to move your systems to .NET Core, we are not abandoning you. Future versions of NServiceBus will continue to run on the .NET Framework as well as running on .NET Core.

If you haven't already, subscribe to receive email updates about our .NET Core support and the release of NServiceBus 7.


Putting your events on a diet


Anybody can write code that will work for a few weeks or months, but what happens when that code is no longer your daily focus and the cobwebs of time start to sneak in? What if it's someone else's code? How do you add new features when you need to relearn the entire codebase each time? How can you be sure that making a small change in one corner won't break something elsewhere?

Complexity and coupling in your code can suck you into a slow death spiral toward the eventual Major Rewrite. You can attempt to avoid this bitter fate by using architectural patterns like event-driven architecture. When you build a system of discrete services that communicate via events, you limit the complexity of each service by reducing coupling. Each service can be maintained without having to touch all the other services for every change in business requirements.

But if you're not careful, it's easy to fall into bad habits, loading up events with far too much data and reintroducing coupling of a different kind. Let's take a look at how this might happen by analyzing the Amazon.com checkout process and discussing how you could do things differently.

What do I mean by event?

Before we get to the checkout process, let me be specific about what I mean by the word event. An event has two features: it has already happened and it is relevant to the business. A customer has registered, an order was placed, a new product was added—these are all examples of events that carry business meaning.

Compare this to a command. A command is a directive to do something that hasn't happened yet, like place an order or change an address. Often, commands and events come in pairs. For example, if a PlaceOrder command is successful, an OrderPlaced event can be published, and other services can react to that event.
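In NServiceBus terms, that pairing could look roughly like this. Treat it as a sketch: the PlaceOrder class, the handler name, and the handler body are illustrative rather than code from this article.

public class PlaceOrderHandler : IHandleMessages<PlaceOrder>
{
    public async Task Handle(PlaceOrder message, IMessageHandlerContext context)
    {
        // ...do the actual work of placing the order...

        // Announce to the rest of the system that it has happened.
        await context.Publish(new OrderPlaced { OrderId = message.OrderId });
    }
}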

Commands only have one receiver: the code that does the work the command wants done. For example, a PlaceOrder command only has one receiver because there's only one chunk of code capable of placing the order. Because there's only one receiver, it's quite easy to change, modify, and evolve the command and the handling code in lockstep.

However, events will be consumed by multiple subscribers. There might be two, five, or fifty pieces of code that react to the OrderPlaced event, such as payment processing, item shipping, warehouse restocking, etc. Because there can be many places subscribing to the event, modifying the event can have a large ripple effect through multiple different systems, as you'll see shortly.

Let's buy something

Let's go to Amazon to buy Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf, which is valuable reading for anybody building distributed systems. You visit Amazon, put the book in your shopping cart, and then proceed to checkout. What happens next?

Note: Amazon's actual checkout process is more complex than presented here, and it changes all the time. This simplified example will be good enough to illustrate the point without getting too complicated.

As you're guided through the checkout process, Amazon will gather a bunch of information from you in order to place the order. Let's briefly consider what information will be necessary for your order to be completed:

  • The items in your shopping cart
  • The shipping address
  • Payment information, including payment type, billing address, etc.

When you reach the end of the checkout process, all of this information will be displayed for your review, along with a "place your order" button. When you click the button, an OrderPlaced event will be raised containing all of the order information you provided, along with an OrderId to uniquely identify the order. The event could look something like this:

class OrderPlaced
{
    Guid OrderId
    Cart ShoppingCart
    Address ShippingAddress
    PaymentDetails Payment
}

In your Amazon-like system, there will be subscribers for this event that will spring into action once it has been published: billing the order, adjusting inventory levels, preparing the item for shipment, and sending an email receipt. There could be additional subscribers that manage customer loyalty programs, adjust item prices based on popularity, update "frequently bought with" associations, and countless other things. The important thing is that, a few days later, a new book arrives in a box on the doorstep.

So everything is great, right?

Event bloat

This OrderPlaced event decouples the web tier from the back-end processing, which makes you feel good about yourself but hides more insidious coupling that could get you into trouble later. It's like overeating at a big family gathering—it feels good in the moment, but eventually you're going to have a stomachache.

An event such as this robs each service of autonomy because they are all dependent upon the Sales service to provide the data they need. These different data items are locked together inside the OrderPlaced event contract. So, if Shipping wants to add a new Amazon Prime delivery option, that information needs to be added to the OrderPlaced event. Billing wants to support Bitcoin? OrderPlaced needs to change again. Because the Sales service is responsible for the OrderPlaced event, every other service is dependent upon Sales.

With each change to the OrderPlaced event, you'll need to analyze every subscriber, seeing if it needs to change as well. You may end up having to redeploy the entire system, and that means testing all of the affected pieces as well.

So really, you don't have autonomous services. You have a tangled web of interdependent services. The aim of event-driven architecture was to decouple the system so that changes to business requirements could be implemented by only targeted changes to isolated services. But with a fat event like the one shown above, this becomes impossible.

Congratulations, you've created Frankenstein's monster. In essence, you traded a monolithic system for an event-driven distributed monolithic system. What if you could untangle these systems so that they are truly autonomous?

Time for a diet

To trim down the event and get it into fighting shape, you need to put it on a diet. To do that, let's start over and analyze each piece of information in the OrderPlaced event and assign it to a specific service.


OrderId and ShoppingCart relate to selling the product, so those can be owned by Sales. ShippingAddress, however, relates to shipping the products to the customer, so it should be owned by a Shipping service. Payment relates to collecting payment for the products, so let's have that belong to a Billing service.

class OrderPlaced
{
    Guid            OrderId          // Sales
    Cart            ShoppingCart     // Sales
    Address         ShippingAddress  // Shipping
    PaymentDetails  Payment          // Billing
}

With these boundaries drawn, we can review the checkout process and see if there's a way to improve things.

Slimming down

The trick to slimming down our events and reducing coupling between services is to create the OrderId upfront. There's no law that all IDs must come from a database. An OrderId can be created when the user starts the checkout process.

You can start the checkout process by sending a CreateOrder command to the Sales service to define the OrderId and the items in the cart:

class CreateOrder
{
    Guid OrderId
    Cart ShoppingCart
}
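From the web application, sending that command might look something like this. This is only a sketch: it assumes an NServiceBus IMessageSession named messageSession and a cart variable, neither of which appears in the original article.

var orderId = Guid.NewGuid(); // created upfront, not handed out by a database

await messageSession.Send(new CreateOrder
{
    OrderId = orderId,
    ShoppingCart = cart
});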

The next step of the checkout process was selecting the shipping address. Rather than adding that data to the OrderPlaced event, what if you instead created a separate command?

class StoreShippingAddressForOrder
{
    Guid OrderId
    Address ShippingAddress
}

You can send the StoreShippingAddressForOrder command from the web application straight to the Shipping service that owns the data. The order hasn't even been placed at this point, so no packages are getting shipped just yet. When it does come time to ship the order, the Shipping service will already know where to send it.

If the customer never finishes the order, there's no harm in having completed these steps already. In fact, there are valuable business insights to be gained from analysis of abandoned shopping carts, and having a process to contact users who have abandoned shopping carts can prove to be a valuable way to increase sales.

Next in the checkout process, you must collect payment information from the customer. Since Payment is owned by the Billing service, you can send this command to Billing:

class StoreBillingDetailsForOrder
{
    Guid OrderId
    PaymentDetails Payment
}

The Billing service will not charge the order yet—just record the information and wait until the order is placed. If your organization does not want to bear the security risk of storing credit card information, the payment can be authorized now and captured after the order is placed.

All that's left is to place the order. By creating the OrderId upfront, we were able to remove most of the data that was in the original OrderPlaced event, sending it instead to other services that own those pieces of information. So the Sales service can now publish an extremely simple OrderPlaced event:

class OrderPlaced
{
    Guid OrderId
}

This slimmed-down OrderPlaced event is a lot more focused. All the unnecessary coupling has been removed. Once this event is published by Sales, Billing will take the payment information, which it has already stored, and charge the order. It will publish an OrderBilled event when the credit card is successfully charged. The Shipping service will subscribe to OrderPlaced from Sales and OrderBilled from Billing, and once it receives both, it will know that it can ship the products to the user.
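One way Shipping could correlate those two events is with an NServiceBus saga. The sketch below is illustrative only: the class names, the saga data properties, and the assumption that OrderBilled also carries the OrderId are mine, not the article's.

public class ShippingPolicy : Saga<ShippingPolicyData>,
    IAmStartedByMessages<OrderPlaced>,
    IAmStartedByMessages<OrderBilled>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<ShippingPolicyData> mapper)
    {
        // Both events are correlated to the same saga instance by OrderId.
        mapper.ConfigureMapping<OrderPlaced>(message => message.OrderId).ToSaga(saga => saga.OrderId);
        mapper.ConfigureMapping<OrderBilled>(message => message.OrderId).ToSaga(saga => saga.OrderId);
    }

    public Task Handle(OrderPlaced message, IMessageHandlerContext context)
    {
        Data.IsOrderPlaced = true;
        return ShipIfReady();
    }

    public Task Handle(OrderBilled message, IMessageHandlerContext context)
    {
        Data.IsOrderBilled = true;
        return ShipIfReady();
    }

    Task ShipIfReady()
    {
        if (Data.IsOrderPlaced && Data.IsOrderBilled)
        {
            // Both events have arrived, in either order: it's safe to ship the package.
            MarkAsComplete();
        }
        return Task.CompletedTask;
    }
}

public class ShippingPolicyData : ContainSagaData
{
    public Guid OrderId { get; set; }
    public bool IsOrderPlaced { get; set; }
    public bool IsOrderBilled { get; set; }
}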

System Diagram

Let's take a look at the two versions of the OrderPlaced event again:

// Before
class OrderPlaced
{
    Guid            OrderId          // Sales
    Cart            ShoppingCart     // Sales
    Address         ShippingAddress  // Shipping
    PaymentDetails  Payment          // Billing
}

// After
class OrderPlaced
{
    Guid OrderId
}

Which event would be the least risky to deploy to production? Which would be easier to test? The answer is the smaller event, with all of the unnecessary coupling removed.

Fighting shape

The benefit to slimming down our events is getting them into fighting shape to tackle changes in business requirements that are sure to come down the line. If we want to introduce Amazon Prime shipping or support Bitcoin as a form of payment, it's now a lot easier to do it without having to modify the Sales service at all.

To support Prime shipping, we would send a StoreShippingTypeForOrder command to the Shipping service during the checkout process. It would look something like this:

class StoreShippingTypeForOrder
{
    Guid OrderId
    int ShippingType
}

This would be the second command we send to the Shipping service, along with StoreShippingAddressForOrder. The addition of Prime shipping will change how the Shipping service prepares an order, but there's no reason to touch the OrderPlaced event or any of the code in the Sales service.

In a similar fashion, we could implement Bitcoin, a concern of the Billing service, in a few different ways. We could add Bitcoin properties to the PaymentDetails class used in the StoreBillingDetailsForOrder command. Or we could devise a new command specifically for Bitcoin and send that instead of StoreBillingDetailsForOrder. In either case, Billing would not publish an OrderBilled event until payment had been made in one of the two forms. After all, the Shipping service just cares that the order was paid for. It doesn't care how.
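For illustration, in the same shorthand as the other commands in this post, such a dedicated command might look like this (the name and property types are hypothetical):

class StoreBitcoinPaymentDetailsForOrder
{
    Guid OrderId
    BitcoinPaymentDetails Payment
}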

In any case, support for Bitcoin would be implemented solely by changing elements of the Billing service. Sales and Shipping would remain completely unchanged and would not have to be retested or redeployed. With less surface area affected by each change, we can adapt to changing business requirements much more quickly.

And that was kind of the point of using event-driven architecture in the first place.

Summary

In event-driven systems, large events are a design smell. Try to keep events as small as possible. Services should really only share IDs and maybe a timestamp to indicate when the information was effective. If it feels like more data than this needs to be shared between services, take it as an indication that perhaps the boundaries between your services are wrong. Think about your architecture based on who should own each piece of data, and put those events on a diet.

For more information on how to create loosely-coupled event-driven systems, check out our Introduction to NServiceBus tutorial.


About the author: David Boike is a developer at Particular Software who, due to an unfortunate affinity for pizza and burritos, finds it much easier to put his events on a diet than himself.

Asynchronously unload the dishwasher


In a previous blog post, I discussed a very complex and intricate process: how my family unloads our dishwasher using a chain of responsibility. We examined a happy-path scenario in which each person hands a dish to the next. Every step takes the same amount of time, and the process hums along like clockwork. You can almost hear us singing “Whistle While You Work” while we gleefully put away dishes.

Now let’s see what happens when we add a few distractions. (Hint: it involves a List<Func<Func<Task>, Task>> object.)

Let's say that, just when my son gives a dish to my wife, the phone rings and she decides to take the call. In the synchronous world, my son would freeze in location and time, dish in his hand, until my wife is finished. But in the asynchronous world, while my wife is on the phone, my son and I can play with Legos, do other chores, or make funny faces at my wife since there is nothing more fun than distracting someone on the phone.

In other words, only the process of unloading the dishwasher is frozen in time. While my wife is on the phone, my son and I can still do other things, and as soon as my wife returns, we can pick up the dishwasher unloading where we left off. So we maximize our occupancy even though one element in the chain is not responding.

Now let's see how this translates into code.

Unloading asynchronously

A single person (a link in the chain) that is not blocked can be represented as just a method, as in the synchronous version before. But this time, the method needs to be asynchronous. So the return type of the method needs to be changed from void to Task, and the method can be marked with the async keyword.

static async Task Person(Func<Task> next)
{
    // Implementation
    await next();
}

The Person method above has a single parameter named next. In the synchronous version, the parameter was of type Action. The Action type is a delegate pointing to any method that returns void and accepts zero parameters. In the asynchronous version, the return type is of type Task, so the next delegate is declared with type Func<Task>. Func<Task> is a delegate pointing to any method that returns Task and accepts zero parameters. As before, passing in the delegate allows us to compose multiple individual elements into a chain of responsibility. But this time, it's an asynchronous one.
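Put side by side, the change is small but important. The synchronous signature below is reconstructed from the description of the previous post, so treat it as a sketch:

// Synchronous link from the previous post: nothing to return, nothing to await
static void Person(Action next)
{
    // Implementation
    next();
}

// Asynchronous link: returns a Task and awaits the rest of the chain
static async Task Person(Func<Task> next)
{
    // Implementation
    await next();
}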

Let's define an asynchronous function that represents my son:

static async Task Son(Func<Task> next)
{
    // son can reach? return; else:
    Console.WriteLine("Son can't reach");
    await next();
}

When my son's chore is done, he calls the next delegate and awaits the outcome of the asynchronous operation, which is the continuation of the asynchronous chain.
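The remaining links aren't shown here, but judging from the output further below they could be written along the same lines (a sketch):

static async Task Wife(Func<Task> next)
{
    // wife can reach? return; else:
    Console.WriteLine("Wife can't reach");
    await next();
}

static async Task Husband(Func<Task> next)
{
    Console.WriteLine("Husband put dish away");
    await next();
}

static Task Done()
{
    Console.WriteLine("Dish put away!");
    return Task.CompletedTask;
}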

Now let's chain everything together:

public async Task ManualDishwasherUnloading()
{
    await Son(() => Wife(() => Husband(() => Done())));
}

The ManualDishwasherUnloading method contains the chain of the individual links. You can see how we've represented each person, or a link in the chain, as a method that matches the signature for a Func<Task>. In contrast to the synchronous version, the asynchronous version of ManualDishwasherUnloading needs to return a Task as well and await the outcome of the chain. Only by making the full call stack asynchronous can resources working on the dishwasher-unloading process be freed up when awaiting completion of elements in the chain.

We probably wouldn't enjoy manually writing complex asynchronous method chaining like that shown in ManualDishwasherUnloading over and over again. Luckily, there's a more flexible and maintainable way of doing this.

A better asynchronous chain of responsibility

A simpler and more generic approach is to create a list of functions which are executed one by one until the end of the list is reached:

public async Task MoreFlexibleDishwasherUnloading()
{
    var elements = new List<Func<Func<Task>, Task>>
    {
        Son,
        Wife,
        Husband,
        next => Done()
    };

    await Invoke(elements);
}

static async Task Invoke(List<Func<Func<Task>, Task>> elements, int currentIndex = 0)
{
    if (currentIndex == elements.Count)
        return;

    var element = elements[currentIndex];
    await element(() => Invoke(elements, currentIndex + 1));
}

In the previous post, this was done using a List<Action<Action>>. That was hard enough to wrap my head around, but it wasn't until I first wrote this code and declared List<Func<Func<Task>, Task>> that my head really exploded. Why not just List<Func<Task>>? Remember, the signature of the methods stored in the list is Task LinkInTheChain(Func<Task> next), and we want the ability to execute them in a generic way. The Invoke method takes a Func<Func<Task>, Task> from the list and invokes it by passing itself recursively as the next function parameter. Just like with the synchronous version, the process terminates when the end of the list is reached.

The output of this code would be

Son can't reach
Wife can't reach
Husband put dish away
Dish put away!

As soon as we start introducing more asynchronous links in the chain, the generic approach shows its merits. For example, say you want to ignore dishes that are still wet. We can surround a link in the chain with a wrapper that filters out exceptions:

static async Task IgnoreDishStillWetException(Func<Task> next)
{
    try
    {
        await next();
    }
    catch (DishStillWetException)
    {
    }
}

It's easy to add IgnoreDishStillWetException before any step in the chain of responsibility when it is needed.
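For example, placing it before Wife means a DishStillWetException thrown further down the chain is swallowed instead of failing the whole unloading process. The method name below is made up; it simply reuses the list and Invoke method from above:

public async Task DishwasherUnloadingThatToleratesWetDishes()
{
    var elements = new List<Func<Func<Task>, Task>>
    {
        Son,
        IgnoreDishStillWetException,
        Wife,
        Husband,
        next => Done()
    };

    await Invoke(elements);
}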

Now that we have the fundamentals in place, let's apply them to specific scenarios.

Message handling as a chain of responsibility

Handling a message from a queue can be done as an asynchronous chain of responsibility:

public async Task MessageHandlingChain()
{
    var elements = new List<Func<Func<Task>, Task>>
    {
        RetryMultipleTimesOnFailure,
        PickMessageFromTransport,
        DeserializeMessage,
        DetermineCodeToBeExecuted,
    };

    await Invoke(elements);
}
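As an illustration of how one of those elements could be built, here is a simplified retry wrapper in the same style. It is a sketch only and not how NServiceBus actually implements recoverability:

static async Task RetryMultipleTimesOnFailure(Func<Task> next)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            // Run the rest of the chain: pick up, deserialize, and handle the message.
            await next();
            return;
        }
        catch (Exception) when (attempt < 3)
        {
            // Swallow the failure and try again.
            // A real implementation would log and delay between attempts.
        }
    }
}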

The last few code snippets look an awful lot like the ones we had in the previous post about the synchronous chain of responsibility, except we've replaced Action<Action> with Func<Func<Task>, Task> and all methods returning void now return Task instead. This isn't a coincidence. In the asynchronous world, Task is the new void. Just like when my wife gets a call, or IO-intensive work is kicked off, the occurrence of an await is an opportunity for a thread to go work on other, more pressing tasks. When the asynchronous operation is completed, the continuation of the code is scheduled and executed. In contrast to the synchronous version, the thread driving our chain of responsibility will almost never be blocked and can work on hundreds or thousands of tasks concurrently.

For my son and me, this means we can work on many other chores like preparing dinner or doing the laundry, until my wife gets off the phone. This not only makes my wife happy, but it allows us to get more done in less time, ultimately optimizing for more family/idle time.

A quick note on performance: if we look at only one dishwasher unloading process and compare the synchronous version to the asynchronous one, the former might outperform the latter. In our household, though, we unload the dishwasher multiple times a day, especially when guests come over. By looking at many chores executed concurrently and interleaving them, the asynchronous version will drastically improve our family's overall chore throughput — or in the messaging world, the message handling throughput. Every worker that can be freed up is worth freeing up since it can participate in getting more work done with less context switching.

Summary

Like in the real world, when unloading the dishwasher, message handling needs to be truly asynchronous to avoid locking resources (such as threads) longer than needed. The asynchronous chain of responsibility achieves that by returning a Task instead of void. Message handling almost always requires a concurrency level greater than one. The higher the concurrency, the more important it is to free up threads. All freed-up threads can efficiently participate in handling messages, thus optimizing the system resource usage while minimizing the thread context switches.

For more information, check out my presentation on breaking the chain of responsibility asynchronously or our series on async/await best practices.


About the author: Daniel Marbach is a solution architect at Particular Software and sometimes wakes up in the middle of the night, fearing he forgot a ConfigureAwait(false) in his code.

The challenges of monitoring a distributed system


I remember the first time I deployed a system into production. We built a custom content management website backed by a single SQL Server database. It was a typical two-tier application with a web application and a database. Once the system was deployed, I wanted to see if everything was working properly, so I ran through a simple checklist:

  • Is my database up? (Yes/No)
  • Is my web server up? (Yes/No)
  • Can my web server talk to my database? (Yes/No)
A simple monitoring workflow

If the answers to these questions were all yes, then the system was working correctly. If the answer to any of those questions was no, then the system wasn't working correctly and I needed to take action to correct it.

Beyond the checklist

For a simple little website, that was all the monitoring I needed. It was easy to maintain and to automate, so I was often able to fix problems before the users even noticed them. But over time, the complexity of the systems that I've deployed and monitored has increased.

These days, I work mostly with distributed systems, which are typically made up of a larger number of processes running on many different machines. Processes can run on-premises or in the cloud, and some solutions may include a mix. Some of your processes may run on physical machines while others run on virtual machines. For some processes (in platform-as-a-service environments), you may not even care about the machine they run on. Monitoring a distributed system requires keeping track of all of these processes to ensure that they're running and running well.

Distributing a system over more processes increases the number of potential communication pathways exponentially. My original website was a typical two-tier application and had two major components (web server and database) with only the one communication pathway between them. If you distribute a system over just five processes, any two of them might need to communicate. That's 10 different point-to-point communication pathways that might fail. When the system is first deployed, a small subset of those 10 pathways might be important. But as the system grows organically over time, the set of pathways being used can change.
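(In general, n processes give n × (n − 1) / 2 potential point-to-point pathways, so five processes yield 5 × 4 / 2 = 10, and ten processes already yield 45.)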

Increased complexity isn't the only challenge you face when monitoring a distributed system. There are some common patterns and techniques seen in distributed systems that need to be taken into account when designing a monitoring solution.

When a failure isn’t really a failure

Distributed systems need to be designed to tolerate failure. It's common practice to introduce a persistent queue between distributed processes, which allows them to continue to communicate without both having to be online and available at the same time. That's great for the stability of your system, but it can hide real problems from you.

If a server restarts to apply OS updates, the rest of the system isn't impacted. The server will come online again in a few minutes and start processing its backlog of messages. If a process isn't running right now, it's not necessarily a sign of an inherent failure. It could just be routine maintenance. But it could also be a sign that something has gone wrong.

If you aren't watching carefully, a process can be offline for quite a while before anyone notices. For example, if your website is online and accepting orders but your back end isn't charging customer credit cards, it could be some time before someone realizes there’s no money coming in. When that happens, they usually won't have a smile on their face.

Failures don't just happen at the process level. Individual messages can fail as well. Sometimes these problems are caused by a programming error or by a malformed message. In these cases, either the process or the message needs to be manually modified, and then the message can be retried. More frequently, these problems are temporary (database deadlock, network glitch, remote server restarting), and a common approach to dealing with them is to wait a short while and then retry them automatically. Implementing message retries keeps messages moving through the system, but it can also mask growing problems.

I once worked on a system where nearly a third of the messages being processed by an endpoint failed at least once due to deadlocks occurring in a database. As these messages were retried, they never ended up in an error queue, and I just assumed that the endpoint was a little slow. Eventually, I tried to scale the endpoint out to get more performance out of it. All that did was put more pressure on the database and cause more deadlocks. I needed to address the deadlocks directly in order to solve the problem, but I hadn't seen them.

Monitoring at scale

Modern distributed systems are designed to scale out. The recent surge of cloud infrastructure and virtualization makes it much easier to spin up new processes than to spend money provisioning huge servers for individual systems. Knowing which parts of your system need to be scaled, when, and by how much requires collecting and analyzing large piles of data.

Once you do scale out, your data collection also scales. Not only do you have to collect data from more servers, but now you have to aggregate it so you can analyze your system as a whole rather than as individual parts.

Gathering the data for a single server isn't difficult. Modern operating systems and platforms make it very easy to collect any sort of information you might need. How much CPU is this process consuming? How many messages per second are being processed? Which queues are filling up? All of these questions are easy to ask and get answers for.

But the key to successfully monitoring a distributed system isn't just about gathering data; it’s about knowing which questions to ask, and knowing how to interpret the data to answer those questions at the right time.

Summary

System monitoring is important, but that isn't news to anyone. The problem is that, without a standard methodology, monitoring is a challenge, especially when dealing with distributed systems. Gathering data can flood you with information that doesn’t necessarily help you identify what needs to be fixed at the right time.

Coming up, we're going to discuss our philosophy on the subject; what we’re doing at Particular to help you ask the right questions; and, more importantly, how to answer those questions. In the meantime, check out the recording of our What to consider when monitoring microservices webinar for more details.


About the author: Mike Minutillo is a developer at Particular Software. He has three children, so he has experience watching a collection of unreliable processes attempt to collaborate.

Minivans and marathons


I read the script and performed my lines well. College, good jobs with increasing responsibility in corporate America, marriage and kids. When suburbia beckoned, it wasn't too hard to swap my briefcase for the diaper bag. At least for some period of time, home was a lot more interesting than my work experience had been. Children have a charming way, though, of exposing the insecurities we don't even know we have. My revelation came during the first opportunity to meet our five-year-old's teachers.

In preparation for this evening, the children were asked to draw pictures of their parents. Our task was to recognize our images through our daughter's eyes. My husband's was easy to find. The text included, among other accolades, "He is practicing to run a marathon." The board became increasingly bare as I tried to find mine. With only a couple left, the horror set in. "My mom always tells my middle sister and me to clean up the toy room. She has 3 kids. She has a beige minivan." Ouch.

I laughed it off with the reassurance that there will come a time for me to continue my professional journey. After all, I mostly enjoyed being a stay-at-home mom. I was comfortably in control of my kids' schedules, the household routine, and--so I thought--our plans for the future. Little did I know that an intercontinental move to Israel was around the corner, presenting me with an opportunity for a whole different kind of growth than I had ever imagined.

Once in Israel, challenges abounded to push me out of my comfort zone. Learning a new language, helping my kids adjust in a foreign school, and overcoming bureaucratic obstacles were just a few of these hurdles. This was a period of significant change for me. Fighting my way through the maze made me stronger and more independent, and it tremendously expanded my worldview. Once I felt settled, I decided to shift the scale of work/life balance more to the work side. With this decision, I assumed that my period of personal growth would be put on hold. In my experience, personal growth and career progression were a zero-sum game.

After a short search, I found a position working at Particular Software. I was drawn to it because the work itself looked diverse and challenging, and the work-from-home arrangement seemed like a great fit for me. To my surprise, I discovered a work culture that would provide yet another platform for personal growth.

A unique company culture

I've found many parts of Particular Software's culture helpful in my journey toward personal growth. Certainly, the organizational structure is a primary element, but there are specific components of that structure I found to be particularly influential.

Non-traditional hierarchy

I've always worked for companies with rigid organizational structures, and I didn't give it much thought. Working for a company with no formal hierarchy and no managers has shown me that the traditional organizational structure was actually constraining my growth rather than promoting it. Now, I am free to work on tasks that energize me and in areas where I can make the most impact. I'm not distracted by what will "get me ahead." I focus on improving my relationships with my colleagues, improving the processes to get work done, and producing a better product. Additionally, I'm encouraged to explore how I can make an impact in areas of the company I would not have ventured into in a traditional hierarchy.

Transitory teams

Work is organized very differently than in my past experiences. Previously, I had my individual tasks to complete, and success was measured by meeting deadlines and, beyond that, by someone's judgement as to the effectiveness of the result. In contrast, the majority of my work now is done as part of temporary teams that are formed around tasks. Working in these teams has forced me to recognize how clearly I had defined myself as a consummate facilitator, planner, and doer. Through this transition, I'm learning to share these roles and to participate freely without always needing to be in charge. In addition, I sometimes guide and coach others. Instead of continuing to rely on the skills that have always defined me, I'm developing new skills that benefit all areas of my life.

Peer feedback

Without a manager to give me feedback, I rely on my peers to let me know how I'm doing. I work with a variety of people across the organization who are in the best position to observe my behavior and performance. This is a powerful way to learn firsthand how my behavior affects others. I'm learning about my blind spots and constantly improving. The opportunity for personal retrospection is both more frequent and more empowering. Not only receiving feedback but also providing feedback to others has been beneficial.

Perpetual starting line

As I review the list above, I'm realizing that, to me, the lack of a formal hierarchy is the most significant element of Particular's work culture. This has created and nurtured the individual growth I've experienced. I see now that the zero-sum game I anticipated upon my return to full-time work was a fallacy. My work can and should be a significant contributor to how I grow and develop as a person, colleague, friend, family member, and community member.

I asked my daughter, now a young adult, what she would write if she were asked to repeat her childhood exercise. Hoping for some brilliant pearls of wisdom that would provide a perfect end to this post, she said, "Mom is a nag." Sigh. Unlike a marathon, personal growth has no finish line.

What interesting, unexpected personal growth opportunities have you had in your career? Tell me about them in the comments below.


About the author: Karen spends most of her time on activities that help make Particular staff love showing up for work each day. She happily barters her staffing insight for lessons on Git, GitHub, markdown, and other tools she never knew she loved.

Evolving NServiceBus persistence


While we've been working hard on supporting .NET Core lately, you may have noticed that we also released a brand new (and dare we say better?) persistence library for NServiceBus called SQL Persistence. The new persister supports multiple database engines and uses raw ADO.NET and native SQL queries, without the need for an intermediate ORM.

We dreamed up some powerful new features that would take NServiceBus persistence to the next level. Up until now, our primary method of persisting data in relational databases used NHibernate, which was making it impossible to realize those dreams. We decided it was time for NServiceBus to make an evolutionary leap forward in its persistence capability.

That leap is our new SQL Persistence, and we think you're going to like it. We believe it's better for you as a developer, better for DevOps, and better for the future evolution of your NServiceBus systems.

Better for you

The headline feature of SQL Persistence is its support for multiple engines, which currently includes Microsoft SQL Server, Oracle, and MySQL. We're evaluating support for additional engines, including PostgreSQL, MariaDB, and Amazon Aurora. This gives you the freedom to use the database engine you feel most comfortable supporting, whether on-premises or with cloud databases through Microsoft Azure or Amazon RDS. Each database engine supported by SQL Persistence is fully documented and backed by a suite of acceptance tests that is run for every engine type on each release.

Additionally, SQL Persistence is fully async from top to bottom, the same as NServiceBus 6. Since it only references the System.Data namespace, it has no blockers to full support for .NET Core 2.0, and we'll be including it in our plans to support .NET Core going forward1. It doesn't have a dependency on any ORM library, so you can easily use it with Entity Framework, Dapper, or the ORM of your choice to manage your business data. All it uses is a DbConnection.
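As a rough sketch of what the wiring looks like, based on the SQL Persistence documentation (the exact configuration API differs slightly between versions, so treat the details as illustrative):

var connectionString = @"Data Source=.;Initial Catalog=sales;Integrated Security=True";

var endpointConfiguration = new EndpointConfiguration("Samples.Sales");

var persistence = endpointConfiguration.UsePersistence<SqlPersistence>();
persistence.SqlDialect<SqlDialect.MsSqlServer>();
persistence.ConnectionBuilder(
    connectionBuilder: () => new SqlConnection(connectionString));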

Using SQL Persistence will also save you a lot of time if you're working with sagas. In SQL Persistence, each saga will get its own table, but it will store saga data as a JSON blob rather than as multiple interrelated tables. This means that changes to a saga will not require an update to the saga table schema or a migration script to fill in missing data. As sagas are meant to represent complex, always-changing business processes, this tends to be a frequent occurrence.
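For example, a saga data class along these lines (the names are illustrative) can gain a new property in a later release without any change to the saga table schema, because the whole object is serialized into the JSON column:

public class OrderSagaData : ContainSagaData
{
    public Guid OrderId { get; set; }
    public bool IsOrderBilled { get; set; }

    // Added in a later release: it simply starts appearing in the stored JSON.
    // No ALTER TABLE, no migration script.
    public DateTime? EstimatedDeliveryDate { get; set; }
}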

You and your company's database administrators will definitely appreciate not needing to modify the database schema every time business rules are updated. In fact, there are a lot of other features in SQL Persistence that DBAs will enjoy.

Better for DevOps

In a DevOps context, DBAs are going to want to be able to inspect, audit, and manage the SQL used by an application. SQL Persistence was created with the DBA and DevOps in mind.

SQL Persistence uses an MSBuild task to write table creation DDL scripts to the bin directory at compile time. During development, you can still have these scripts executed when an endpoint starts up. But at deploy time, you can pass these scripts off to your DBA to inspect and execute as part of a DevOps process with properly elevated permissions. You can even have the generated scripts written outside the bin directory so that they can be added to source control. This way, when schema changes are necessary, the diff will be visible in the changelog along with your code changes.

Your DBAs can also audit all the queries executed by the library. In fact, we use the library to generate the scripts and automatically include them in our documentation. There, you can view all the DDL and DML scripts for Microsoft SQL Server, Oracle, and MySQL.

These features, focused on DevOps and DBAs, bring a level of visibility that is hard to achieve when using an ORM.

Data evolution as a feature

We wanted to make it easier to evolve your NServiceBus system over the long term. Business requirements change all the time, and we wanted to help you react to those changes while respecting the "data at rest" already stored in the database.

To start, every entity stored by SQL Persistence will include the version of the NServiceBus.Persistence.Sql.dll assembly that created it. In the event of breaking changes between versions of the persistence library, having the version attached to every row allows any migration scripts we create to be that much more intelligent. Instead of making educated guesses about how to migrate data based solely on the shape of the data present, we'll know exactly what version of the persistence was responsible and how to address it.

We're also extending the ability to be version-aware to your saga data by including the version of the saga data's assembly with each saga entity. So, for example, if you were forced to make a breaking change to one of your saga data classes between versions 1.0 and 2.0, you would be able to detect the version while the saga was being loaded and take corrective action on the JSON data before it's deserialized into the new structure. With this ability, you can make gradual changes to each saga entity as it's loaded, rather than requiring downtime to manually update all existing saga instances.

Along with changing business requirements, sagas can change in other ways. SQL Persistence introduces a new feature called a transitional correlation ID to help. Imagine your company is acquired and you need to transition from the older company's OrderId to a new GlobalOrderId that contains a different value. The transitional correlation ID allows you to store both order values as table columns, either of which can be used to find the saga during a transitional period. Once all "in-flight" sagas have been updated and the older ID is not needed, it can be dropped, and the transitional ID becomes the new correlation ID.
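A sketch of how that could be expressed with the SQL Persistence saga API is shown below. The member names reflect my reading of the SQL Persistence documentation and the acquisition example above; treat the exact API surface as an assumption and check the documentation for your version.

public class OrderSaga : SqlSaga<OrderData>
{
    // Existing sagas are still found by the old identifier...
    protected override string CorrelationPropertyName => nameof(OrderData.OrderId);

    // ...while the new identifier is also written to its own column, so the saga
    // can be found by either value during the transition period.
    protected override string TransitionalCorrelationPropertyName => nameof(OrderData.GlobalOrderId);

    protected override void ConfigureMapping(IMessagePropertyMapper mapper)
    {
        // message-to-saga property mappings elided for brevity
    }
}

public class OrderData : ContainSagaData
{
    public string OrderId { get; set; }        // the older company's identifier
    public string GlobalOrderId { get; set; }  // the new identifier after the acquisition
}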

With these new features, we believe that it will be much easier to evolve your business systems, even in the face of breaking changes and shifting business requirements.

Summary

We're very proud of the new SQL Persistence. We believe it's an evolutionary leap forward for persistence in NServiceBus, and we plan to invest heavily in it in the future. Although we remain committed to supporting our other persistence libraries, in most cases, we believe that SQL Persistence should be your first choice for new development and that the only choice you should have to make is which database engine you're most comfortable using. If you start using it now, you'll get all the benefits described above, and you'll be better prepared for a potential move to .NET Core in the future.

So go ahead and take the new SQL Persistence for a spin. First, start with the Simple SQL Persistence Sample, and then read some more about how it all works in the SQL Persistence documentation. It's worth it.


About the author: David Boike is a developer at Particular Software who always prefers to write his own artisanal, handcrafted SQL and claims this qualifies him as old-school.

Footnotes

1 Support for .NET Core will depend upon support from the underlying client library. At the time of this writing, .NET Core is supported by Microsoft SQL Server and a prerelease build of the MySQL client, but a compliant Oracle package does not yet exist.

Decisions without managers


Decision making is tricky business. Decisions often move up and down the chain of command without the input of those best equipped to make those decisions. In smaller companies, there's often too much reliance on the CEO, and that doesn't scale as the company grows. Ultimately, we can easily end up in a situation where the input of those most knowledgeable is not considered.

At Particular Software, we've struggled with these types of issues as our company has grown, and that's led us to rethink our organizational structure. We've eliminated our old departments and switched to working in small groups, bringing together those with the right skills, knowledge and context to get the work done. We've even gone a step further and eliminated our management positions.

Everyone in favor...

Overnight, the former directors were stripped of their titles, but not one of them stormed out in protest. Instead, they chose to trust that their depth and breadth of knowledge would continue to guide staff and positively impact the organization. (And that's exactly what happened.)

We had a general concept of how this management (or rather, lack-of-management) philosophy could look in practice, but we realized the best results could only be achieved with involvement from everyone. With our compass set in this direction, we navigated the twists and turns by placing the steering wheel in the hands of the staff. All staff members have been involved in designing this organization, shaping it to be one in which we want to work. Guiding principles keep us from hitting any icebergs while still providing plenty of room for autonomy. There's no question the process is challenging... but it's also liberating!

Guiding principles

We know there's no ideal structure, and we definitely didn't get it right the first time. Our guiding principles proved to be critical, though, in keeping us moving toward a clearer vision.

  • No such thing as a "right decision": We don’t criticize or judge each other for making a decision that results in a negative or less-than-optimal outcome, nor are we excessively lauded for positive outcomes. We care about results, of course, but we recognize that there’s always an element of luck involved. If the decision-making process is sound, we have a better chance of long term success and we minimize the impact of events beyond our control.

  • Fail small, learn big: Trying new ideas is critical to building the great products and services our customers expect. However, we need to balance experimentation with investment. Using a "fail small, learn big" approach, we try new ideas in the smallest way possible. We may spike new software features to see if there's a viable option to pursue. This applies to process changes as well as software. For example, we recently tested providing peer feedback in a group setting to see if it would promote more feedback flowing through the organization. After three sessions with a small group, we decided that the idea was good, but not practical for rolling out to the whole organization. However, this experiment gave us two new ideas. These spikes allow us to fine-tune the processes before rolling them out to everyone, and potentially, having to roll them back if they don't work.

  • Humility, not heroism: Giving thoughtful consideration to the make-up of temporary teams helps mitigate the risk of relying on any one "rock star". We do this by deliberately selecting the members of these groups, valuing those who constructively challenge the status quo. We build in checkpoints to test our assumptions.

  • Timing isn't usually critical: Have you ever been on a project, missed an artificial deadline, only to have management give you another equally meaningless one? It makes you wonder why you couldn't have had that extra time from the beginning! We don't believe we need deadlines to stay motivated, keep the work moving forward, and get the job done. We don't assign deadlines or communicate them to our customers. Whether it's a project or a task, we work together until we have the desired outcome, without artificial pressure.

Building the structure

All our processes aim to assign work and decision making to those with the most knowledge and context. Here are a few highlights:

Maintainer groups

To avoid a diluted sense of responsibility and ensure continuity of knowledge, each code repository is assigned to a maintainer group of three or four developers. Anyone can submit pull requests, but only the maintainers can merge them. This also gives us ample coverage for critical support cases and mitigates the effect of vacations and sick time.

Task forces

Representatives from maintainer groups combine with staff from other areas of the company to form task forces where the work gets done. These are transitory teams that form as issues are raised and disband when issues are closed. Through this process, we are able to spread knowledge of our systems and process improvements throughout the organization.

Squads

As a developer-focused business, it makes sense for our staff to help prioritize the work to be done. This happens through permanent squads, whose primary responsibility is to prioritize tasks and ensure work starts, progresses and finishes.

Learning along the way

We're learning a lot about how to improve our structure and processes as we mature. For instance, ensuring everyone on a task force has input, without stalling progress, is an ongoing challenge. Sometimes the pace of an issue can be slow and frustrating since there’s no manager to intervene if we disagree. We're trying to improve on this by asking ourselves the following questions:

  • Is this issue too big? Should it be broken down into smaller, more manageable pieces?
  • Do we have the right skills and knowledge assembled on the task force?
  • If we don't agree, is there a "fail small, learn big" spike we can do to try out an idea?

Slowly, but surely

Although we occasionally miss the days when a manager would overcome our indecisiveness and just tell us what to do, our staff can't imagine going back to the old way. While the company embraced the new organizational structure and moved toward it quite naturally in the beginning, the past year has been more about making deliberate changes to tweak the model where needed. The trend is positive. Our task force model is adding the expected value and we predict it will continue to be beneficial as our company matures.

Given the pace of change in our organizational structure and processes, we're looking forward to seeing what it will look like in the future. If you’re interested in helping to build a community of like-minded, organizational and cultural explorers, consider joining the "peopletechconnect" Slack team by emailing me at karen.fruchtman@particular.net.


About the author: Karen Fruchtman spends most of her time on activities that help make Particular staff love showing up for work each day. She happily barters her staffing insight for lessons on Git, GitHub, markdown, and other tools she never knew she loved.

NServiceBus for .NET Core beta


Today we're happy to announce that you can start building production-grade NServiceBus systems on .NET Core. Although the bits are currently marked as beta, a release candidate with a go-live license is coming soon.

On NuGet, you can now find beta packages for:

Many of our samples have also been updated to work with .NET Core. Use the blue Switch Version dropdown button to switch to the version with a -pre suffix. Once you download and unzip the solution, you'll find a solution file ending in .Core.sln which targets .NET Core 2.0.

If you're looking for a sample to start with, we recommend Full Duplex, Simple RabbitMQ, or Simple SQL Server.

In the coming days, we will continue to release beta versions of all our other NuGet packages, including multi-targeted versions for .NET Core and the full .NET Framework for the components that can support it.

The next step after beta is a Release Candidate which will come with a go-live license. We know some of you may have a time-sensitive project that requires a go-live license. If that's the case, please contact us at support@particular.net so we can talk about your needs.

Lastly, we need your feedback! Come find us in the NServiceBus Gitter chat room to talk to the developers working on NServiceBus for .NET Core. We'd love to hear from you.

Happy hacking!


A new Azure Service Bus transport—but not just yet


If you've been looking forward to using .NET Core with NServiceBus on Azure, I'm afraid we've got some bad news. Instead of making their existing Azure Service Bus client library support .NET Core, Microsoft has released a brand-new incompatible client. This makes it impossible for us to upgrade the NServiceBus Azure Service Bus transport you know and love to support .NET Core as is, and forces us to write a brand-new transport as well.

Here's the full story.

A tale of two clients

NServiceBus uses a Microsoft client library to talk to Azure Service Bus, and there are now two different clients. The older WindowsAzure.ServiceBus client only supports the full .NET Framework, so there's no way our Azure Service Bus transport can support .NET Core using that.

But Microsoft has released a new Azure Service Bus client in the Microsoft.Azure.ServiceBus package, which targets .NET Standard 1.3 and the Windows 10 Universal App Platform, in addition to the full .NET Framework.

While it may seem like we could upgrade the message transport using the new client library, that's not going to be possible. The new client is not yet fully featured enough to build a capable message transport, let alone achieve feature parity with the functionality that currently exists in the Azure Service Bus transport. Indeed, Microsoft's new client library is a complete rewrite—there isn't a single API that hasn't changed in some way.

Perhaps most importantly, the new client does not support transactional semantics. This means that atomic sends and receives, where outgoing messages are only truly sent if the incoming message is processed successfully, are not possible with the new client.

This lack of transactional semantics alone prevents us from immediately delivering a .NET Core version of our Azure ServiceBus transport.

Complicating things even further, the new Azure Service Bus client isn't wire-compatible with the old client. In the new client, Microsoft has removed the serialization that was baked into the old client. This is a positive change but a breaking one. Workarounds are possible, but they're problematic. That's because it's not possible to know if a destination endpoint will be using the old or new client. Every endpoint using both the old and new clients would need to be "bilingual" in order to understand all messages, no matter which client was used at their source. This makes it impossible to upgrade one endpoint at a time.

The new client also does not include any management operations, by design. Using the new client, it's not possible to check if an entity (whether that be a queue, topic, or subscription) exists or to perform any sort of CRUD operation on it. These types of operations will need to be scripted when using the new client.

So what does all this mean for our Azure Service Bus transport?

The new deal

In the long term, the split in Azure Service Bus client libraries will be echoed by the creation of a new message transport, based on the new client library. Meanwhile, we'll continue to support the existing transport.

We'll update the NServiceBus.Azure.Transports.WindowsAzureServiceBus package to work with the next version of NServiceBus for the .NET Framework only, and continue to support this transport for customers that require transactional semantics. This transport will be supported without new features for the foreseeable future, as specified by our support policy.

We're building a new Azure Service Bus transport that targets .NET Standard 2.0 using the new Azure Service Bus client. This transport won't interoperate with the old transport, but we'll provide a way to move messages from the old transport to the new transport.

We'll release an alpha package as soon as possible that doesn't contain support for transactional semantics, rather than wait for Microsoft to fully support the feature. This will allow you to experiment with the new transport and give us feedback.

Summary

Just as we waited until the release of .NET Core 2.0 to release NServiceBus for .NET Core, we want to wait until the appropriate time to release an Azure Service Bus transport for .NET Core. It's not a question of if, but when, which will be dictated by the maturity of the new client.

If you have an Azure Service Bus project and you're itching to make a move to .NET Core, let us know! We'd love to stay in contact with you or get you involved in beta testing for the new transport before it's fully released. Send us an email at support@particular.net and we'll keep you in the loop.

Break that big ball of mud!


This article was originally published on the NDC 2016 Blog.


Have you ever had to deal with a function that had hundreds and hundreds of lines? Code that had duplication all over the place? Chances are you were dealing with legacy code that was written years ago. If you're a Star Wars fan like I am, it's like dealing with the Force. As Yoda would say, “Fear is the path to the dark side. Fear leads to anger. Anger leads to hate. Hate leads to suffering.”

In my 15+ years of coding, every single time I've dealt with legacy code, fear, anger, hate, and suffering were pretty common.

Fear: I fear that monster two-thousand-line function because I know what lies beneath. Absolutely nothing good!

Anger: I get angry when I see code repeated all over the place. On one occasion, I came across 50 modules that had the same ten lines of code. That code looked up a registry key to find some detail and then wrote back to HKLM. Remember when Windows Vista introduced new OS security features? Applications running with standard user access could no longer write to HKLM registry keys. To be Vista compliant, the same ten lines of code needed to change in not one but 50 modules!

Hate: Of course, the next natural emotion is hate, leading to a lot of WTF moments looking at the code. What in the world was that developer thinking? When the code was duplicated for the second time, why didn't the alarm bells go off, let alone by the 50th?

Suffering: All this naturally leads to suffering, just like the wise Yoda predicted. Because, at the end of the day, why the existing codebase has its quirks doesn't matter. It just does, and I'm the one that has to deal with it.

Unfortunately, I am not the only one in our industry who goes through these same emotions. Many of us experience them on a daily basis. You're probably in that same boat.

You think you could probably do better. The codebase is so bad that it needs a rewrite. You look at the smart developer team around you and think, "We've got this." Technology is much better than it was all those years ago when that legacy project began. You now have newer frameworks in your toolbox. What could possibly go wrong?

The "All New" Death Star?

The problem is that it requires time to wrap your head around the many nuances and nooks and crannies in that legacy code. While you might be convinced that these are the exact reasons that warrant the rewrite, chances are you'll miss a few details that might be important to your business. These oversights might end up as costly mistakes later.

You may not be able to get away from your responsibility of fixing existing bugs. You might also have to add new and important features to the existing codebase to satisfy customer needs. What if you still have to support the existing system? As you're rewriting the system, how are you going to deal with these new feature requests for customers?

If you've scanned the existing code and compiled a list of features, you probably know it's going to take you a long time to get the rewrite done. Months? Years? A lot can happen in that time frame. What was relevant when you first started may no longer be relevant when you finish.

How do you plan to sell the rewrite to business stakeholders? A rewrite based purely on technical concerns (like upgrading .NET Framework versions) won't sound like a convincing reason to them. If the multi-year rewrite plan does get approved by the business, what's to say you won't be making the same mistakes as your predecessors but with the technology of today? What if you accidentally end up building a new Death Star?

A New Hope

As part of supporting existing customers, it's likely you add new features to your system. Working on a greenfield project is every developer's dream. It's a place filled with unit tests, automated integration tests, and code written using the "most latest" framework in town. Adding new features to existing legacy code and supporting existing customers, on the other hand, is tricky. To do it right, without breaking existing features, you'll have to try and find your inner Jedi.

"Use the Force, Luke."

"Thank you, Obi-Wan. If only it were that simple!"

The young Luke Skywalker's job was simple: he fired those missiles and made the Death Star explode. You, on the other hand, now have to navigate a minefield. One line of code in the wrong place can break existing functionality. It can upset QA, product owners, and other developers, as it might lead to slipping deadlines. Even worse is when the bug isn't caught in the development, QA, and staging environments and only shows its nastiness in production.

While this paints a pretty dark picture, every problem has a solution.

Event-Driven Architecture to the Rescue

Event-driven architecture is a software design style that uses messaging techniques to convey events to other services. It promotes asynchrony. Events are nothing but simple data transfer objects (DTOs) that convey something significant that happened within your business system. They can signify a change, such as a threshold being crossed, that something just completed, that there was a failure, and so on. These events can then be used to trigger other downstream business processes.

Events typically contain the relevant identifier for the related entity, the date and time of occurrence, and other important data about the event itself. For example, the CustomerBecamePreferred event would contain relevant details such as the CustomerId and the CustomerPreferredFrom and CustomerPreferredUntil dates, along with any other important information about the event.
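As a rough sketch, such an event can be modeled as a plain class (the property types here are assumptions):

// Illustrative sketch of an event as a simple DTO, based on the example above.
public class CustomerBecamePreferred
{
    public Guid CustomerId { get; set; }
    public DateTime OccurredAtUtc { get; set; }          // when the event happened
    public DateTime CustomerPreferredFrom { get; set; }  // start of the preferred period
    public DateTime CustomerPreferredUntil { get; set; } // end of the preferred period
}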

When the condition in the code is met, an event is published. The interested parties of this event then take the appropriate action upon receiving it. These interested parties could be another service, which might publish another event and so on. In this model, the software system follows the natural business process.

The beauty of this model is that the publisher of the event just publishes the event. It has no knowledge of what the subscriber intends to do with it. Event-driven architecture leads to loosely coupled systems.

Back to the legacy conundrum. How can event-driven architecture help?

You can now use messaging patterns like Publish-Subscribe to evolve your system. The book Enterprise Integration Patterns is a fantastic read that covers over a hundred messaging patterns you can use to solve everyday problems using asynchrony.

Step 1: Add Visibility to Your Monolith

Instrumenting your monolith with events can help for a variety of reasons. First and foremost, it adds visibility inside your monolith.

You'll first need to analyze your monolith's code to find the lines that complete important actions. Now you can publish those actions as events. Repeat this process until you've captured all of the important actions your monolith performs as events. Adding the code to publish an event inside your monolith is less intrusive than attempting to write a brand new feature inside it. And doing this is simple, especially if you're using an available library like NServiceBus or MassTransit.
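For instance, with NServiceBus, instrumenting one of those important actions can be as small as a single publish call at the point where the action completes. The event class, property names, and the endpointInstance variable below are illustrative:

// Illustrative event published by the monolith when an order ships.
public class OrderShipped : IEvent
{
    public int OrderId { get; set; }
    public DateTime ShippedAtUtc { get; set; }
}

// Somewhere inside the monolith, right after the shipping logic completes.
// `endpointInstance` is an already-started NServiceBus endpoint.
await endpointInstance.Publish(new OrderShipped
{
    OrderId = order.Id,
    ShippedAtUtc = DateTime.UtcNow
});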

This strategy came in handy once when my team had to discover why a process was taking so long. After we had instrumented the legacy code with events, we had a subscriber module that received all these events. Using these events, we knew how long each of these critical business processes took to complete. We were able to identify the part that took too long. It turned out that an RPC call to a legacy service, whose response value was stored in the 72nd column in the database table, wasn't even necessary. After we had removed the offending line of code, the process was 30 times faster. Would we have found this without instrumenting? Eventually, after a few gray hairs and a whole lot of caffeine, yes. However, instrumenting with events provided visibility into a process that was opaque from the dawn of its creation, which turned out to be quite valuable in gathering other insights.

Instrumenting with events

Step 2: Add Your New Feature on the Outside

Create your feature as a new service that subscribes to the event published by the monolith. When this service receives the event, it takes whatever action needs to occur. This service could also end up publishing an event of its own to trigger further downstream business processes.
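A sketch of what such a subscriber might look like as an NServiceBus message handler, reusing the illustrative OrderShipped event from the earlier sketch (the handler name and the ShippingNotificationSent event are also illustrative):

using System.Threading.Tasks;
using NServiceBus;

// The new feature lives in its own service and reacts to the monolith's event
// without touching the monolith's code.
public class NotifyCustomerWhenOrderShipped : IHandleMessages<OrderShipped>
{
    public async Task Handle(OrderShipped message, IMessageHandlerContext context)
    {
        // Do the new feature's work here, e.g. send a shipping notification.

        // Optionally publish a follow-up event for further downstream processing.
        await context.Publish(new ShippingNotificationSent { OrderId = message.OrderId });
    }
}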

The Return of the Natural Order

The good news is that you've now completely decoupled your new service from the existing monolith. Even better, you can be assured that this new feature is not going to break other parts of your monolith. You've managed to isolate it.

Because the feature is on the outside, it doesn't fall victim to the old constraints of your monolith. Your service could be written using the latest C# language version, making use of the Elvis operator and other modern niceties. You can deploy this service without having to stop the monolith, and now you have all the freedom in the world to write unit tests and integration tests. That land with rainbow unicorns might still be around the corner after all!

Because you're on the outside of the monolith, your database technology constraints also disappear. You're now free to select the database technology of your choice to build this feature.

You may not always need a relational database. What you need might be better accomplished with a graph database or a NoSQL database driven by the requirements of your feature.

Strangler Pattern

You can use events to build new features on the outside, but how can you make the monolith itself obsolete? Again, the answer is events. You can use dedicated events whose only goal is to slowly funnel the relevant data from your monolith into new parts of the system, also built on the outside. You can deploy this new module to production. Since its goal is to gather the needed data from your monolith gradually, running it side by side with the monolith shouldn't be a problem. Once you've shoveled enough data from the monolith's database into the new storage, you can activate your new feature and remove the now-redundant parts of the old code. Granted, this is a little more intrusive, but you still control the risks.

Martin Fowler talks about a pattern called the Strangler Pattern, in which, slowly but steadily, the monolith loses its relevance and eventually dies.

When you've managed to pull enough data out of the legacy system using these events and funnel it through new, smaller services with their own databases on the outside, you're on target. You can kill your monolith, just as the strangler vine in Martin's article kills its host. With enough time, the monolith becomes completely obsolete. That's how you destroy the Death Star. That's how you bring balance to the Force.

So what's the catch?

Since this approach involves lots of little services, each doing its own thing based on the single responsibility principle, you could end up with a lot of services communicating with each other using messages. It becomes extremely important to audit these messages so you have a better handle on what's going on when debugging. Invest some time in finding the appropriate tools to trace these messages and visualize their flow, so you can understand your system from the messaging vantage point.

In Summary...

Use the event-driven architecture style to your advantage when dealing with legacy systems: first to add visibility to your monolith, then to build new features outside of it using messaging patterns like publish/subscribe. Over time, evolve the monolith into newer, smaller services driven by its events, until the monolith loses its relevance.

To start using these design patterns in your current code and learn how to use the publish-subscribe pattern with NServiceBus, check out the samples.

And finally, stay on target. Destroy your Death Star.

About the author: Indu Alagarsamy is currently using her force as a developer for Particular Software. She likes to build K-2SOs and other Lego Star Wars construction kits with her kids when she's not coding or talking about Event Driven Architecture.

10X faster execution with compiled expression trees


Good software developers will answer almost every question posed to them with the same two words.

It depends.

The best software developers don't stop there. They'll go on to explain what it depends on. But this common response highlights a fundamental truth about developing software: it's all about tradeoffs. Do you want speed or reliability? Maintainability or efficiency? A simple algorithm or the fastest speed?

It gets even worse when you don't even know exactly what the code is supposed to do until runtime. Perhaps you're building a runtime procedure from steps stored in config files or a database. You want it to be flexible, but you don't want to sacrifice performance.

This is exactly what happened during the development of the NServiceBus messaging pipeline. We wanted it to be flexible and maintainable but also wicked fast. Luckily, we found a way to have our cake and eat it too.

By building expression trees at startup and then dynamically compiling them, we were able to achieve 10X faster pipeline execution and a 94% reduction in Gen 0 garbage creation. In this post, we'll explain the secret to getting these kinds of performance boosts from expression tree compilation.

Stringing behaviors together

As described in our post on the chain of responsibility pattern, the NServiceBus messaging pipeline (represented by the BehaviorChain class) is composed of individual behaviors, each with its own responsibility. Each behavior accomplishes a specific task and then calls the next() delegate in order to pass responsibility to the next behavior in the chain.

NServiceBus behaviors can be used for all sorts of things. NServiceBus contains built-in behaviors both for processing incoming messages (like managing message retries, deserializing messages from a raw stream, or processing performance statistics) and for sending outgoing messages (like enforcing best practices, adding message headers, or serializing messages to a stream). Customers can also define their own behaviors to address any system-wide concern at an infrastructure level rather than in every single message handler. Message signing/encryption, compression, validation, and database session management are all easy to implement with an NServiceBus behavior.
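As an illustration, a custom behavior is a small class with a single Invoke method: do some work, call next() to hand control to the rest of the chain, and optionally do more work afterward. Here's a sketch based on the NServiceBus 6 behavior API; the timing logic is just an example:

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using NServiceBus.Pipeline;

// Sketch of a custom behavior that times the rest of the incoming pipeline,
// including the message handlers, for every message.
public class ProcessingTimeBehavior : Behavior<IIncomingLogicalMessageContext>
{
    public override async Task Invoke(IIncomingLogicalMessageContext context, Func<Task> next)
    {
        var stopwatch = Stopwatch.StartNew();

        await next(); // pass control to the next behavior in the chain

        Console.WriteLine($"Message {context.MessageId} took {stopwatch.ElapsedMilliseconds} ms");
    }
}

A behavior like this is registered with the pipeline at endpoint configuration time, for example via endpointConfiguration.Pipeline.Register(...).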

One way to see how these behaviors work is to look at a stack trace when an exception is thrown so we can see all the methods that are involved. Let's take a look at a stack trace thrown from a simplified version of the NServiceBus pipeline with only three behaviors, where each logical behavior is shown in a different color:

Initial pipeline stack trace

The important thing to note is that the behaviors themselves aren't the only pieces of code getting executed. As shown in the stack trace, each behavior requires three lines, or stack frames, each of which represents a real method call. Each method called increases the number of instructions the processor must execute, as well as the amount of memory allocated. As more memory is allocated, the garbage collector will be forced to run more often, slowing the system down.

In this case, each behavior requires two extra stack frames after its own Invoke(context, next) method (as shown above) in order to call the next behavior. The eight stack frames from BehaviorChain are all pure overhead, and they vastly outnumber the three stack frames from the actual behaviors.

So, in this version, we have 13 total stack frames, with three per behavior. A real NServiceBus pipeline would have many more behaviors, so it's important to focus on the per-behavior cost. At first glance, three may seem to be an acceptable amount of overhead in order to be able to dynamically compose the behaviors together. After all, software is about tradeoffs. In return for just a bit of overhead, we get small, maintainable behaviors and an extensible pipeline where it's easy to add, remove, or reorder individual steps.

It gets worse

For NServiceBus 6, we wanted to improve how the pipeline is composed from its individual behaviors so that it would be even more flexible and maintainable. We also wanted to fix some problems users were having getting their own custom behaviors registered in the correct position. What we found was that adding in the additional pieces to create that flexibility was also adding significant overhead to pipeline execution, and that overhead was adversely impacting the speed at which NServiceBus could process messages.

Although each behavior in NServiceBus is sufficiently isolated, composing them together turned out to be problematic in version 5. Adding a new behavior involved an API using RegisterBefore(stepName) and RegisterAfter(stepName), with step names essentially being "magic strings." It was too easy to create a behavior that wouldn't end up being registered in the right location in all cases, given that it could only define itself relative to other behaviors—some of which might not always exist.

To address this, we added the concept of pipeline stages in version 6. Behaviors would be grouped into "stages," one for each type of context—either the IncomingPhysicalMessageContext, the IncomingLogicalMessageContext, or the InvokeHandlerContext. Within each of these stages, the exact order of behaviors would not matter. This removes the problem of getting behaviors registered in the right order, as execution order within a pipeline stage no longer matters.

Within this pipeline, each stage would be separated by a special behavior in the framework called a connector, which would be responsible for transitioning between them. For example, the PhysicalToLogicalConnector is responsible for changing the IncomingPhysicalMessageContext containing the message as a raw stream into an IncomingLogicalMessageContext containing a fully deserialized message object.

Simplified pipeline

Now, the thing is that the stage connector is also a step in the pipeline, which means it will also increase the total depth of the stack trace. This, in turn, causes even more performance overhead. What's worse, a few extra pieces (a generic BehaviorInvoker and a non-generic BehaviorInstance) are required to deal with the realities of generic types in C#, further deepening the stack trace.

Behavior execution

Let's compare the previously shown NServiceBus 5 stack trace to one using the stage and connector concepts introduced in NServiceBus 6. In the previous stack trace, there are three behaviors. In the interest of brevity, we can eliminate one behavior and replace it with the PhysicalToLogicalConnector mentioned previously so that there are still the same number of total behaviors:

Non-optimized stack trace

Stages and connectors clearly come with a cost. The stack trace grew from 13 stack frames to a whopping 26! Notice how many more stack frames are required per colored behavior, compared to just three each in the earlier example. Behaviors now require seven stack frames each, while connectors are only a bit cheaper, requiring five frames. The dramatic increase in overhead doesn't bode well for pipeline performance.

Adding async/await to the mix would increase the execution overhead even more. With that, you can get some truly monstrous stack traces.

The need for speed

By adding stages and connectors to the pipeline, we successfully reduced API complexity for NServiceBus 6. But in doing so, we added so many additional layers that the performance of the pipeline suffered.

This is a classic software tradeoff between maintainability and efficiency. Which did we want? Well, we really wanted both.

If all we cared about was performance, we'd just hard-code the entire pipeline. Instead of iterating through a list of behaviors and doing backflips with generic types to invoke each behavior with the correct context, we'd write code that looked like this:

behaviorOne.Invoke(context0, context1 =>
{
    physicalToLogicalConnector.Invoke(context1, context2 =>
    {
        behaviorTwo.Invoke(context2, context3 =>
        {
            invokeHandlersBehavior.Invoke(context3, context4 =>
            {
                // End of pipeline
            });
        });
    });
});

From a performance perspective, this is as good as it gets. But then, if the pipeline was statically defined, there would be no way to insert new behaviors or modify the pipeline in any way.

So, we decided to try to find a way to use behaviors for their flexibility but then, at runtime, compile that pipeline down to replicate the optimized, hard-coded version shown above.

Compiling Expressions

In NServiceBus 6, we now compile the entire behavior chain into a single delegate containing each behavior directly calling the next, just as if we had hard-coded it. All of this is possible thanks to the magic of the Expression class.

We start by finding all of the registered behaviors and arranging them in order, according to the pipeline stage they target. Starting at the end of the list, we use reflection to get the MethodInfo corresponding to the behavior's Invoke(context, next) method.

At the heart of the optimization is this chunk of code, which builds a lambda expression from each pipeline step and then compiles it to a delegate.

static Delegate CreateBehaviorCallDelegate(IBehavior currentBehavior, MethodInfo methodInfo, ParameterExpression outerContextParam, Delegate previous)
{
    // Creates expression for `currentBehavior.Invoke(outerContext, next)`
    Expression body = Expression.Call(
        instance: Expression.Constant(currentBehavior),
        method: methodInfo,
        arg0: outerContextParam,
        arg1: Expression.Constant(previous));

    // Creates lambda expression `outerContext => currentBehavior.Invoke(outerContext, next)`
    var lambdaExpression = Expression.Lambda(body, outerContextParam);

    // Compile the lambda expression to a Delegate
    return lambdaExpression.Compile();
}

Here's the really important bit: You have to start at the end of the pipeline because the lambdaExpression.Compile() method only compiles down one level—it is not recursive. Therefore, if you assembled all the lambda expressions together and tried to compile them from the outermost level, you would not get the performance boost you were seeking.

Instead, each compilation is like a tiny fish getting eaten by a slightly larger fish. As you continue to work backward toward the beginning of the pipeline, each compiled fish gets eaten by successively bigger fish, until you have one very large fish that represents a fully compiled expression tree of the fully composed pipeline.
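Expressed as a sketch, the composition is a simple backwards loop over the ordered behaviors, where each iteration wraps the previously compiled delegate. This glosses over the generic context types, but it shows the shape of the algorithm; the `behaviors` list is assumed to be ordered from first to last pipeline step:

// Start with a delegate representing the end of the pipeline, then work backwards,
// feeding each compiled delegate into the next expression as its `next` constant.
// CreateBehaviorCallDelegate is the method shown above.
Delegate pipeline = new Func<Task>(() => Task.CompletedTask);

for (var i = behaviors.Count - 1; i >= 0; i--)
{
    var behavior = behaviors[i];
    var invokeMethod = behavior.GetType().GetMethod("Invoke");
    var contextParam = Expression.Parameter(invokeMethod.GetParameters()[0].ParameterType, "context");

    pipeline = CreateBehaviorCallDelegate(behavior, invokeMethod, contextParam, pipeline);
}

// `pipeline` is now a single compiled delegate representing the entire chain.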

With all the levels of abstraction for the "glue" holding the pipeline behaviors together optimized away, the compiled delegate looks and behaves just like the hard-coded example above. Already this is a big increase in pipeline performance.

But wait, there's more!

Because we had gone to the trouble of profiling memory allocations in the pipeline, we were able to squeeze even more performance out of it by changing how we implement our own internal behaviors.

Although there's a nice base class that simplifies the creation of behaviors, specifically to make life easy for our users, we deliberately avoided using it for our internal behaviors and implemented a slightly more complex interface instead. That saved an additional allocation of the base class as well as an extra couple of frames per behavior.

With that additional optimization, let's take a look at what the stack trace looks like now:

Precompiled stack trace

At only 10 stack frames (compared to 26, previously), we're now running a much tighter ship—even more efficient than the 13 frames before we added stages and connectors. Not only are there fewer methods in the stack, but we're also creating a lot fewer objects when processing each message. This means there's less memory allocated, which means there's less pressure on the garbage collector as well.

Results

Before deciding to include this optimization in NServiceBus 6, we ran a slew of benchmarks against it to see how it would perform. You can see the full details in the pull request. However, here are some of the highlights:

  • A real-life call stack, based on an example of the outgoing message call stack when publishing a message, was reduced from 113 lines to approximately 30 lines for a 73% improvement in call stack depth compared to the non-optimized version. With no infrastructure overhead, the change results in a dramatic size reduction for error messages, as well as call stacks that are easier to read and debug.
  • There was up to 10X faster pipeline execution for the cases where no exceptions were thrown and the pipeline completed successfully.
  • There was up to 5X faster exception handling, even in cases where an exception is thrown at the deepest level of the call stack. When an exception is thrown, it matters quite a bit how deep in the pipeline it's thrown from. An exception thrown earlier in the pipeline will not have to bubble up as far and will be handled even faster.
  • We saw a 94% reduction in Gen 0 garbage creation with the dramatic reduction in calls and allocated objects.

Summary

With the improvements to the behavior pipeline in NServiceBus 6, it's easier than ever to create and register your own custom pipeline behaviors to handle system-wide tasks on an infrastructure level. But rather than accept additional overhead to make the API improvements possible, we've supercharged the pipeline so it can process your messages faster and more efficiently than ever before.

This easy extensibility is one of the capabilities of NServiceBus that we're most proud of. If you'd like, you can find out how to build your own NServiceBus behavior. Or if you're new to NServiceBus, check out our introductory tutorial to get started.

About the author: Daniel Marbach is an engineer at Particular Software who has a very particular set of skills. Skills he has acquired over a very long career. Skills that are a nightmare for resource hogs everywhere. If you use LINQ where it's not necessary, he will look for you, he will find you, and he will…submit a pull request.

Maximizing fun (and profit) in your distributed systems


While you probably wouldn't expect this from a software infrastructure company, we opened a theme park! Welcome to Particular World.

Welcome to Particular World

Based on our experience running business systems in production, we know we need to monitor our theme park to make sure it's working properly. Luckily, there are tools in place that let us keep track of electricity and water usage, how much parking we have available, and how much trash the park generates.

This infrastructure monitoring helps us understand whether our theme park has the infrastructure it needs to operate. We can use this data to extrapolate when we need to upgrade the electrical system, add a new water pipe, add more bays to our carpark, or commission more trucks to haul away our trash. These same basic tools used for infrastructure monitoring would work whether we'd opened a theme park, a hospital, a police station, or a school.

Infrastructure monitoring is also common in the software industry. How many CPU cycles is a system using? How much RAM? What is the network throughput? Every software system depends on a few types of infrastructure (CPU, memory, storage, network), so it's no surprise that there are lots of off-the-shelf tools that provide this capability.

Infrastructure monitoring tools generally treat systems as "black boxes" that consume resources. They don't really provide you with any insight into what's going on inside the box, why the box is consuming those resources, or how well it's actually running.

To understand if our theme park is running efficiently or not, we'll need monitoring tools that understand how theme parks really work. We need monitoring tools that can understand what's going on inside the box.

So how do theme parks work?

First and foremost, theme parks have attractions, and along with attractions come lines. The lengths of these lines can vary depending on a number of factors: the attraction's popularity, the time of day, the time of year, weather, the average duration of a ride on the attraction, etc. While it's impossible to get rid of lines completely, understanding how they work can help us maximize our park's efficiency.

With that in mind, how do we monitor a theme park to see if it's running efficiently? Generally speaking, we want to make sure people are moving through our park at a good pace. So we could narrow that down to two questions. First, how many people can ride a particular attraction in an hour? Second, which attractions have the longest lines?

Knowing the answers to these questions will give us a richer understanding of the behavior of our theme park and help us change over time to make it a more efficient (and profitable) business.

Luckily, our experience with monitoring distributed systems seems to apply really well to theme parks. The types of distributed systems that we monitor are made up of individual components that exchange and process messages. Each component has a queue of messages to process, just like the attractions in a theme park. To see if our messaging system is running efficiently, we could ask the same two questions: how quickly are messages processed by a particular component, and which components have the greatest backlog (and how bad is it)?

This is what application monitoring is all about. It's a peek inside the black box that tells you why you're consuming the resources you're consuming.

Let's see how we can use this style of application monitoring to understand the behavior of our theme park and messaging systems.

How many people can ride an attraction?

Our first attraction is the Message Processor. It's a "Wild Mouse" style roller coaster, which means it has small cars designed to seat one passenger at a time.

The Message Processor

When we first opened, the Message Processor had a single car, and each ride took 20 seconds. Going at this speed, the Message Processor could handle 3 passengers per minute.

3 passengers per minute

After a while, the Message Processor started to slow down. These days, a single ride takes 30 seconds. That means our throughput has gone down to 2 passengers per minute, so we can't serve as many people as we could before. That's a problem.

2 passengers per minute

There are a couple of different approaches we could take to solve it. First, we could bring in a mechanic to try and tune the attraction back to its original speed. If we can get the average ride duration back to 20 seconds, then we can go back to servicing 3 passengers per minute. Alternatively, we could add another car to the Message Processor so that we can have 2 passengers on the ride at the same time. This doesn't do anything about the duration of each ride (which stays at 30 seconds), but it does increase our throughput to 4 passengers per minute.

4 passengers per minute

Eventually, we won't be able to add any more cars to the Message Processor. We'll need to open a second track, a copy of the Message Processor, to our theme park. This has the same effect on ride duration and throughput as adding another car but obviously comes at a much higher cost in infrastructure. There are some benefits, though. Now we can close one of the Message Processors for maintenance while still letting people ride the other one.

On a rainy (low traffic) day, our theme park will have fewer visitors. Even though the Message Processor retains the same ride duration (30 seconds) for a potential throughput of 120 rides per hour, there may only be 20 passengers in a given hour (actual throughput). As you can see, the throughput is heavily affected by the number of incoming passengers but is limited by the ride duration. The shorter our ride duration, the higher our potential throughput. It's important to monitor throughput and ride duration together. In fact, if we can predict when our throughput will be low, it's a great opportunity to close the Message Processor and get that mechanic in.

Closed for maintenance

When monitoring messaging systems, we also measure how fast messages are getting processed. Instead of ride duration, we measure processing time, which is how long it takes to process a single message. Processing time heavily influences the maximum throughput that a component can achieve. In order to maximize the number of messages a component can process, we need to minimize the time it takes to process each individual message.
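As a back-of-the-envelope illustration (the numbers are made up):

// Potential throughput is bounded by processing time and concurrency,
// just like ride duration and the number of cars on a ride.
var processingTime = TimeSpan.FromSeconds(0.5); // average time to process one message
var concurrency = 4;                            // messages processed in parallel

// 4 concurrent messages / 0.5 seconds each = 8 messages per second
var potentialThroughputPerSecond = concurrency / processingTime.TotalSeconds;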

When we start to hit the maximum throughput of a component, we can try to optimize a specific message handler (reducing processing time). But that may not be enough if, for example, the constraint is an external resource like a database or third-party web service. Eventually, a second copy of the component may be required (scaling out the process).

Which attractions have the longest line?

The second attraction that we added to our theme park is a waterslide called Critical Splash.

Critical Splash

Whenever someone wants to ride the Critical Splash, they walk up the stairs to the top, and as long as there's a clear slide, they can jump straight in and ride it to the bottom. If there isn't a clear slide, they'll wait in line until a slide clears up.

During the day, the length of the line for Critical Splash shrinks and grows as demand shifts. A shrinking line is better since that means the attraction is serving visitors faster than they're arriving. But is a growing line cause for alarm?

In short bursts, a growing line length usually isn't a problem. During the summer, for example, it's common for a busload of visitors to arrive at the park and all jump in line for Critical Splash. If we're monitoring the line's length for Critical Splash, we see this as a sudden spike in traffic which will eventually (we hope) go away naturally as the day progresses.

After an hour, if the line is still growing, then it might be time to take action. We need to increase the throughput of the attraction to shorten the line faster. We can do that by opening more slides (increasing concurrency) or by opening another instance of Critical Splash (scaling out the process) somewhere else in the park. We could also try to decrease the ride duration but it's pretty hard to make a waterslide go faster!

"I can't wait to get on the ride"

Something else we want to measure for Critical Splash (and indeed, for any attraction) is how long people spend waiting in line. Waiting around in line is boring, so the longer the line is, the more time our visitors will spend being bored. If we want to keep the wait time short, we need to keep the throughput high. And to do that, we need to keep the ride duration short.

Estimated wait time

If we want our theme park to be successful, we need to watch for increasing line lengths and wait times and take appropriate action to keep our lines under control and keep people moving.

We can monitor the "lines" (i.e., queues) of messaging systems with the same thinking. If the queue for a component spikes over a short period, that's something to keep an eye on. If it keeps going up over time, then action should be taken to increase message processing concurrency or to scale the component out.

For messaging systems, we measure critical time, which is the time between when a message is sent and when it's been completely processed. This takes into account the time it takes for a message to get from the sending component to the queue, how long it waits to get to the front of the queue, and how long it takes to be processed. The critical time metric for a component gives you a quick overview of how responsive that component is. In other words, if I send it a message now, how long will it take for it to be handled? A high critical time is a good indication that it's time to scale a component out to help it handle its backlog of messages.

Don't get your tickets just yet

I'll let you in on a secret: we didn't really open a theme park. But by now, you've probably realized where we're going with this post, which is to highlight the importance of going beyond infrastructure monitoring in your distributed systems.

To be clear, infrastructure monitoring is extremely important. Any park manager would want to keep track of the water and electricity usage of the park and even see a breakdown for each attraction. A spike in water usage might indicate a leak somewhere and that needs to be investigated, and tracking a steady increase in electricity usage lets you plan when you need to add more power lines. The paths between attractions need to be kept clear to allow visitors to move from attraction to attraction quickly.

When it comes to your distributed system, you should be using infrastructure monitoring in the same way. Sustained increases in RAM usage can indicate memory leaks, steady increases in storage can be extrapolated to determine when larger disks are needed, and network pathways need to be monitored to ensure that your components are able to communicate effectively.

But once your infrastructure monitoring is in place, don't forget to add on application monitoring to provide you with a deeper insight into how your components are behaving. It's important to know which components are running slow, which have large backlogs of messages to process, and how those backlogs are changing. This can help you diagnose issues quickly and can even provide clues on how to fix them. It can guide you to tune your components to ensure that messages spend less time waiting to be processed, keeping your system as a whole responsive.

If your system is built on top of NServiceBus, our new NServiceBus.Metrics package already calculates these key metrics for you. All you have to do is plug it into your endpoints. Check it out.


About the author: Mike Minutillo is a developer at Particular Software. His favorite theme park attraction is Space Mountain at Disneyland California. Space Mountain has an average processing time of 3 minutes, and even on a slow day, you're likely to wait in line for 45 minutes.

No Dogma Podcast with Adam Ralph


I'd like to share some highlights from a recent chat I had with Bryan Hogan on his No Dogma Podcast.

We kicked off with NServiceBus and how it helps with building distributed systems and microservices. We talked about general challenges such as coupling, communication, and fault tolerance. We also investigated some of the patterns that help, such as events, retries, and long-running processes. We wrapped up with the importance of system monitoring and what's next for NServiceBus.

If you don't have time to listen to all 48 minutes, here are some timestamps to help you home in on the bits you're most interested in:

  • 05:35 - How do you make sure messages are not lost?
  • 07:01 - Why do you need queues?
  • 10:52 - How did NServiceBus get its start?
  • 12:47 - What are the main benefits of NServiceBus?
    • 13:01 - Abstraction of underlying queuing system details
    • 14:22 - Deduplication & Outbox, turning at-least-once delivery into only-once processing
    • 15:30 - The Particular Software Platform tools for auditing, visualization, and monitoring
    • 16:48 - Publish/Subscribe
    • 17:51 - Long-running business processes (Sagas)
  • 26:20 - How does NServiceBus help us build loosely coupled systems?
  • 37:19 - How can I easily start using NServiceBus?
  • 40:37 - How do I monitor a distributed system like this?
  • 42:30 - Info on NServiceBus support for .NET Core, including containers

Happy listening!

P.S. There are plenty of chances to meet Particular Software folks throughout the year. Visit our events page for details.


About the author: Adam Ralph is a software developer at Particular Software. He gets his best ideas for conference talks while hiking up a mountain with a snowboard on his back.

NServiceBus 7 for .NET Core is here


It's a pretty cool time to be a .NET developer. Don't believe it? Check out this excerpt from a popular children's book¹:

Congratulations! Today is your day.
You're off to Great Places! You're off and away!

Maybe you like Linux or have a MacBook,
Or want to host code without breaking your checkbook.
The license for Windows can be a bit pricey.
Getting approval for more servers can be a bit dicey.

But now you have choices, it's a bit of a shocker.
You can even choose to deploy your apps using Docker!
With your skills in .NET no opportunity shall go by,
When you can even deploy on a Raspberry Pi.

And now NServiceBus is ready, we've got your back.
The ultimate cross-platform messaging stack!
You're off to Great Places! Today is your day!
There's more than Windows now, so…get on your way!

-Adapted from Oh, the Places You'll Go! by Dr. Seuss

In other words, NServiceBus 7 for .NET Core is here.

Cross-platform NServiceBus

With NServiceBus 7, you get all the benefits of NServiceBus with the cross-platform capabilities of .NET Core.

You can develop your code on a Mac and deploy on your favorite flavor of Linux. You can reap the DevOps benefits of hosting your endpoints in Docker containers and use container orchestration technologies like Kubernetes on their own or through managed services like Azure Kubernetes Service or Amazon Elastic Container Service for Kubernetes.

You can even run NServiceBus on a Raspberry Pi.

By building your systems on top of NServiceBus and .NET Core, you benefit from the continuous performance improvements being made to the framework. We see stories all the time about performance improvements being contributed to .NET Core through the power of open source. Low-level code in the .NET Framework that was previously considered good enough is now being optimized both by Microsoft and by community contributions. The release of .NET Core 2.1, with its significant improvements to the HTTP stack, is a great example of this. There's nothing better than being able to write code that gets faster over time.

But if you want to continue running on the regular .NET Framework, that's fine too. You don't have to switch. NServiceBus runs well on both frameworks, and we have the tests to prove it.

NServiceBus 7 is designed to be a smooth upgrade from NServiceBus 6, having just the changes necessary to support .NET Core. The upgrade will probably be a lot easier than you think.

Check out our NServiceBus 6 to 7 upgrade guide for more details.

Caveats

There are just a couple things to keep in mind.

We don't yet have support for everything on .NET Core, but we see no reason to hold back the release of NServiceBus 7 on anything that's left. We've documented the packages that don't yet support .NET Core and will work over time to get them migrated where possible.

Our broad suite of tests also detected poor performance with Azure Storage Queues when sending messages on .NET Core 2.0. We traced this to inefficiencies in the HTTP stack, which we verified Microsoft has fixed in .NET Core 2.1. We recommend all customers use .NET Core 2.1 in their systems now that it's generally available.

But really, it's just those two things. That's it.

So are we done?

Heck no!

NServiceBus 7 is just one piece of our larger vision to support all platforms, both on-premises and across the clouds.

And even though other pieces of our stack are still Windows-only right now, we're simplifying their installation and deployment so that everything goes as smoothly as possible for you until we finish migrating them to .NET Core as well.

How do I get it?

As always, you get NServiceBus as a NuGet package, but as of today, you won't need to check the "Include Prerelease" box to get the latest and greatest.

Check our NServiceBus 7 page for more information, links to samples, frequently asked questions, and more.

Footnotes

1 Not a real book.

Classic rock and async/await: Stop breaking the rules!

Vinyl record collection
Image: Vinyl by Dun.can, CC BY-SA 2.0

The universe demands some things must always occur in a certain order. Queen's "We Will Rock You" must be followed by "We Are The Champions." Same with Led Zeppelin's "Heartbreaker" -> "Living Loving Maid," Van Halen's "Eruption" -> "You Really Got Me," and Boston's "Foreplay" -> "Long Time." You have to. Every DJ knows this. It's the rule!

Async code in .NET has a similar rule. Calling a method that returns a Task must be awaited afterward. Like the Rule of We Will Rock You, the Rule of Awaiting Tasks is unwritten, but the compiler is a bad DJ—it provides no support in making sure you follow through.

We've seen our customers make this mistake too many times (we've even done it ourselves!). So we've updated NServiceBus to include a new Roslyn analyzer, which ensures this is one thing you can't screw up when using NServiceBus APIs. You should probably update to make sure you aren't making this mistake in your code right now.

The problem with async

Awaiting an async method isn't just a nice-to-have; it's mandatory. Otherwise, you open yourself up to a world of pain.

Consider this code:

public Task Handle(PlaceOrder message, IMessageHandlerContext context)
{
    db.SaveOrderDetails(message);
    return Task.CompletedTask;
}

So far so good. But now we want to publish an OrderPlaced event, so we add a line to do that:

public Task Handle(PlaceOrder message, IMessageHandlerContext context)
{
    db.SaveOrderDetails(message);
    context.Publish(new OrderPlaced { OrderId = message.OrderId });
    return Task.CompletedTask;
}

Now we have a problem. context.Publish() returns a Task, and that task needs to be awaited. If we forget, bad things start to happen:

  • The method execution returns to return Task.CompletedTask; before the Publish operation is completed.
  • The message being processed may complete (or not, randomly) before the Publish operation continues, meaning the current message context is no longer available.
  • The transaction wrapping the message handler completes before the message can be published.
  • The Publish operation fails with a transaction-related exception. Congratulations, you've now caused message loss.
  • The exception is never observed, so you don't find out what went wrong.
  • The compiler gives you absolutely no feedback that this could be a bad thing.

Worst of all, these problems become really hard to catch during development because it all boils down to when and in what order the scheduler decides to execute different async continuations. Forgetting an await is basically guaranteed to cause a race condition that you won't detect until you get production-level traffic. That's when you start observing weird and seemingly unrelated issues in production.
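For contrast, the correct version marks the method as async and awaits the publish operation, so the handler doesn't complete until the message has actually been dispatched:

public async Task Handle(PlaceOrder message, IMessageHandlerContext context)
{
    db.SaveOrderDetails(message);
    await context.Publish(new OrderPlaced { OrderId = message.OrderId });
}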

await just a minute!

But perhaps you’ll say that the compiler does provide feedback when you forget to await code. Well, that's true, but only if your method is already marked as async:

public async Task Handle(PlaceOrder message, IMessageHandlerContext context)
{
    db.SaveOrderDetails(message);
    // No await on context.Publish()
    context.Publish(new OrderPlaced { OrderId = message.OrderId });
}

Due to the addition of the async keyword on the method signature, this code generates the following compiler warning:

Warning CS4014: Because this call is not awaited, execution of the current method continues before the call is completed. Consider applying the 'await' operator to the result of the call.

But for being so important, this is only a warning! Unless your project has been set up to treat all warnings as errors (and it probably should be), this code will still compile just fine.

Even though the async keyword will allow this situation to at least be detected, it's far too easy to forget it for the simple reason that it's Visual Studio's fault. When you use Visual Studio's tooling to implement the IHandleMessages<T> interface, this is what you get:

public Task Handle(PlaceOrder message, IMessageHandlerContext context)
{
    throw new NotImplementedException();
}

Notice anything? That's right, no async. Visual Studio doesn't bother to add the keyword that might save you.

While it seems like this is maybe a problem that Microsoft should be fixing in Visual Studio, we've seen too many customers get burned by this. With Roslyn analyzers, we have the tools to fix it and keep our users from experiencing this pain.

We're not awaiting any longer

We're now shipping a Roslyn analyzer directly in the NServiceBus package that will detect when you use one of our async API methods and you don't immediately await or assign the task to a variable. If you forget, you'll get this compile-time error:

ERROR NSB0001: A Task returned by an NServiceBus method is not awaited or assigned to a variable.

Problem solved. It will no longer be possible to forget to play "We Are The Champions" after "We Will Rock You." The compiler will make sure that you do.

We wanted to make this analyzer available for all our customers, regardless of which version of NServiceBus you're using. If you're on NServiceBus 6, you can use the NServiceBus 6.5 package to get the analyzer. If you've already updated to NServiceBus 7, it's available in version 7.1.

One day Visual Studio may decide to fix this problem and require that tasks be assigned or awaited. If so, great! We'll happily remove the analyzer from NServiceBus at that time, content that we helped everyone to write better async code. But until that day, we've got your back.

Summary

Some may say the order in which you listen to songs by Queen doesn't matter. (I would not be one of those people.) But forgetting to await a Task definitely matters a whole lot. The production problems it causes are insidious and difficult to track down, yet they're entirely preventable.

That's why we're backporting this feature to NServiceBus 6, rather than shipping it only in NServiceBus 7. We think it's that important.

NServiceBus 6.5 and NServiceBus 7.1 are available on NuGet now. Check out the NServiceBus analyzer documentation, and then upgrade as soon as you can. You shouldn't have to worry about whether your async code is correct. We made that the compiler's job.

Unfortunately, the solution to the "We Will Rock You" problem still proves elusive…We'll get our engineers on it right away.


About the author: David Boike is a developer at Particular Software who loves Guns N' Roses but refuses to believe that they should be classified as classic rock.


Third-order effects and software systems


Interstate Highway System

At the height of the Cold War, the United States passed the Federal Aid Highway Act of 1956, giving birth to the Interstate Highway System. Fueled by the fear of foreign attack and the need to quickly transport troops and equipment across the continent, the network of protected access highways ended up transforming the nation’s economy and culture forever.

It was perhaps easy to predict a first-order effect: people would travel longer distances given the ease of doing so. A second-order effect was perhaps also easy to foresee: people would be much more likely to work or shop further away from home.

"It was easy to predict mass car ownership but hard to predict Walmart." – Carl Sagan

The third-order effects were much harder to see coming. With businesses cut off from direct access to highways, Main Street business districts atrophied, while large shopping malls began to flourish. People were able to travel further to do their shopping but wanted to get it all done in one place, so shops clumped together in successively bigger shopping complexes, usually situated near an exit to the freeway.

In software

Third-order effects don’t come about only from building massive continent-spanning highways. They can be observed in our software systems as well. Every dependency we add, indeed every decision we make, has the potential to bring about third-order effects we may not immediately be able to see.

Maybe we add some kind of software library to our system, with the expectation of being able to build features faster or have cleaner code. We might be pleasantly surprised when it also makes the system easier to monitor in production.

These kinds of immediate effects are usually why most people would start using NServiceBus. They need a way to reliably send asynchronous messages back and forth, or to scale out a software process, or to orchestrate long-running workflows with sagas. But if you were to use NServiceBus in your own system, what third-order effects might you also run into? Let’s take a look.

Bring on the junior developers

It can be scary to allow a junior developer to fiddle with a system. They don’t know all the ins and outs of the system, so it’s just too dangerous to let them muck around with critical code. The risk of breaking something is simply too great. So they get relegated to the unglamorous tasks of answering support cases until they somehow become experts at the system they’re not really allowed to touch.

What if the system was built to allow junior developers to actively participate?

This is exactly what the Single Responsibility Principle enables. By splitting more complex processes into distinct loosely-coupled steps, each containing just enough to accomplish one discrete task, junior developers can be much more effective. NServiceBus follows this approach through its use of message handlers for those loosely-coupled steps.

If we take the e-commerce domain as an example, then rather than having a giant OnOrderSubmitted method, the process gets divided up. Each of these tasks is represented by a single message and message handler:

  • Storing the order in our database
  • Processing payment
  • Sending a confirmation email
  • Assigning a task to a specific sales manager to check their client
  • Decreasing available inventory for the amount ordered
  • Updating the customer loyalty status if necessary
  • The list can go on and on…

Now we notice a third-order effect.

  1. Because our system uses asynchronous messages, we limit our message handlers to performing only what can reliably be carried out within a transaction.
  2. Because we limit what our message handler does, each message handler is simple, well-defined, and contains fairly few lines of code.
  3. Because the message handlers are simple, junior developers can easily tackle them.

From the requirements, an architect defines the message flow, and any developer can implement each task. The resulting code is isolated and easily testable: just invoke the handler with a sample message and make sure it behaves as expected, as the sketch below shows. Using events with the Publish/Subscribe pattern makes this even better, as junior developers can easily extend the system by creating a new subscriber to an already-published event, without having to touch existing code at all.
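For example, using the NServiceBus.Testing package, a handler can be exercised with a sample message and a testable context. The handler, message, and use of xUnit below are illustrative:

using System.Threading.Tasks;
using NServiceBus.Testing;
using Xunit;

public class ProcessPaymentHandlerTests
{
    [Fact]
    public async Task Publishes_payment_accepted_when_payment_succeeds()
    {
        var handler = new ProcessPaymentHandler();
        var context = new TestableMessageHandlerContext();

        await handler.Handle(new ProcessPayment { OrderId = "42" }, context);

        // The testable context records everything the handler sent or published.
        Assert.Single(context.PublishedMessages);
    }
}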

Let me try that one again

As we all are painfully aware, debugging can be a dull and monotonous affair.

You add some code. You wait for it to compile. The debugger starts. You navigate from one page to another, then another, add some items to a shopping basket, enter some test data in a form, submit the form, and then…you realize you accidentally used <= instead of <. Oops. You'll need to fix it and then start over from the beginning.

Rinse, repeat. All day long.

When you start using NServiceBus, you notice that you no longer need to do this.

  1. Because the system uses reliable messaging, when an exception occurs the message is returned to the queue.
  2. Because the message returns to the queue, you can fix the problem and then have the very same step of the flow reprocess that message, but with the new code.
  3. Because message processing is retried, you don’t need to repeat mind-numbing user interface actions over and over to get to the spot where the failure occurs.

In fact, it’s possible to stop using the UI for testing altogether. Unit tests can be written to directly test the message handler rather than manually clicking buttons and entering form data. Then the UI becomes an ultra-thin layer that translates incoming form data into messages.

End database spelunking

You don’t really want to dig around in a production database, and your DBA (if you’re lucky enough to have one) probably doesn’t want you to either.

But this is what happens sometimes when things fail. You got an email notification that an exception has occurred, but you don’t know exactly where. So now you must mentally step through the code that failed and double-check whether each step left any of the expected breadcrumbs in the database.

Did this first part happen? Yes, here I found the record it created. The next step succeeded as well, but then this other row is missing.

But now what? How do you modify the database to get things consistent again? How do you give the process a gentle kick so that it can continue, without duplicating the first few steps? How do you know that, by mucking around with the database directly, you didn't inadvertently violate a bunch of business logic? How can you be sure you won't break more than you fix?

Transactions are supposed to be the solution for this, but they fall well short. A transaction can't roll back a sent email, a web service call, or a push notification.

When we start dividing this process up into small steps, each of which can be individually retried, that whole problem goes away:

  1. Because the system uses reliable messaging, failed messages can be replayed through the original message handler after a bug is fixed.
  2. Because messages can be replayed, a failed process can be restarted from the point of failure, maintaining data consistency in the database.
  3. Because the database is always consistent, we don’t need to go spelunking through the database to fix anything anymore.

Your DBA ends up much happier, because they never trusted you in the production database anyway, and honestly, you didn't really want to be there in the first place.

Summary

Introducing messaging and NServiceBus isn’t a small decision. But once you start working with it and experience all these third-order effects, it is hard to imagine how you were ever able to work without them. Enabling junior developers, making debugging easier, and enabling processes to be restarted at the point of failure is just the tip of the iceberg.

If you’re interested in getting these third-order effects to work for you, check out our Quick Start tutorial where you’ll see first-hand how powerful building systems with asynchronous messaging can be.

Introducing the new Azure Service Bus transport for .NET Core

The wait is over! Today we’re releasing the new Azure Service Bus transport, which is fully compatible with NServiceBus 7 and .NET Core.

You will now be able to run NServiceBus endpoints using Azure Service Bus anywhere.

"With the release of the new NServiceBus Azure Service Bus transport, we are now able to take full advantage of .NET Core and Azure. Getting up and running was simple and we don't have to worry about managing and maintaining queue databases anymore. Being able to use NServiceBus on .NET Core means we are now able to run our endpoints as Windows services for our on-premises clients and in Linux containers on Azure for our SaaS customers, using the same code!"

Mark Gould, Solution Architect, Spindlemedia, Inc.

With this news, we’re rebranding the previous transport as the “legacy” Azure Service Bus transport. Just as Microsoft will not be adding new features to their legacy client, we won’t add any new features to the legacy transport that uses it. The legacy transport will receive only critical bug fixes and security patches from this point on.

However, you can rest easy knowing that you will have a documented and supported migration path that will allow you to gradually transition your current system to the new transport.

Here’s what you need to know.

The new transport

The new Azure Service Bus transport uses the also-new Microsoft.Azure.ServiceBus client library to target .NET Standard 2.0. For some time, this new client library from Microsoft was not feature-complete, which is the main reason we were unable to release the new transport at the same time as NServiceBus 7 for .NET Core.
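For the curious, wiring up the new transport on an NServiceBus 7 endpoint is a short piece of configuration. A minimal sketch, where the endpoint name and connection string are placeholders:

using System.Threading.Tasks;
using NServiceBus;

class Program
{
    static async Task Main()
    {
        var endpointConfiguration = new EndpointConfiguration("Sales");

        // Placeholder connection string for an Azure Service Bus namespace.
        var transport = endpointConfiguration.UseTransport<AzureServiceBusTransport>();
        transport.ConnectionString("Endpoint=sb://your-namespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...");

        var endpointInstance = await Endpoint.Start(endpointConfiguration);

        // ... run until shutdown ...

        await endpointInstance.Stop();
    }
}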

Some features that the legacy transport accumulated over the years have been left behind. Because of the clean break Microsoft made from the older Service Bus client, the new Azure Service Bus transport for NServiceBus represents a clean break from the (now) legacy transport as well.

In most cases, the features that have been removed were conceived of in the first few years of Azure Service Bus. Since then, the service has matured significantly, and many of these features are now provided directly by Microsoft at the service level. It simply doesn’t make sense to continue to support these features in a new transport when Microsoft’s implementation is much closer to the metal.

A prime example of this is multiple namespaces. The legacy transport offered the ability to provide the connection strings for multiple namespaces to enable some measure of high availability. Fast forward to the present, and the Premium Tier of Azure Service Bus now uses Availability Zones to provide high availability transparently to the client using one connection string.

Migration

No matter how you’re using the legacy Azure Service Bus transport today, we have a path available to migrate to the new transport when you’re ready.

If you’re already using the forwarding topology with the legacy transport, then your system is already compatible with the new transport, under certain conditions. Assuming these are met, you can upgrade directly to the new transport.

If you are using the endpoint-oriented topology, which we have advised against for some time, you will first need to migrate to the forwarding topology. However, this migration can be accomplished with zero system downtime.

To facilitate the migration, we released a minor version of the legacy transport that included a migration feature. This feature operates the transport in a migration state where an endpoint can continue to communicate with other endpoints running either topology.

In the first phase of migration, all endpoints are upgraded so that the migration feature is enabled, one endpoint at a time. Once all endpoints are using the migration feature, the second phase can begin, where all endpoints are then converted to use the forwarding topology. See the upgrade guide and migration documentation for more details.

Once on the forwarding topology, the migrated endpoints are then compatible with the new transport and can be upgraded at will, assuming the previously mentioned conditions are met.

Premium vs. Standard

Going forward, we recommend using Premium Tier namespaces to get features the transport no longer implements directly, such as high availability, because these features aren't available in the Standard Tier.

We recommend all customers use the Premium Tier for production workloads. The shared environment of the Standard Tier simply isn’t appropriate for anything but systems with very low message volumes or systems still in development.

When we attempted to do performance testing of the new transport to compare it to the legacy transport, we discovered that it is essentially impossible to run benchmarks on the Standard Tier. The performance of that tier is best described as completely random, as it is subject to all sorts of throttling and noisy neighbor problems.

Summary

The new Azure Service Bus transport is here, and it's the future. Anything new coming from Microsoft will only be available via the new transport. We want to do whatever we can to help everyone using the legacy transport migrate and upgrade as soon as possible.

We’d love to talk to you about planning your migration. Reach out to us via our support channels and we’ll be happy to talk.

You don't need ordered delivery

In our family it's a tradition that you get to decide what we'll have for dinner when it's your birthday. On my daughter's last birthday, she picked pizza. I took her to the nearby pizza shop to decide what pizza to get.

A large screen dominates one wall of the pizza place, showing each order as it progresses through each stage of preparation. As I was looking at the screen, I noticed some names suddenly switched places. Pizzas with fewer toppings could go into the oven sooner, and some took longer to bake than others. At each step on the way to the box, the time required depended on the pizza. My daughter's pizza needed additional preparation time, so other customers were able to leave before we were. In short, pizzas were not being delivered in the same sequence as they were ordered.

Is ordered delivery a requirement?

Just as at the pizza shop, we might intuitively think that certain processes require in-order processing. In practice, this is usually not true. Plenty of scenarios won't fit what we expected, and the business always adapts to them.

For example, imagine a payment arrives for an order we haven't received (yet). We could simply return the money, or we could wait a while for the order to show up. In another scenario, say a product is sold just after we ran out of inventory. We could automatically cancel the order. A better option is to automatically re-order stock and let the customer know it's been back-ordered. Or offer them a coupon for a different product.

The point is: our businesses can adapt to out-of-order delivery of information so our software should be able to as well.

To attempt to apply strict in-order processing would be to impose artificial limitations on our system. Guaranteeing message ordering is technically very difficult and, even when it works, comes with tradeoffs like lower message throughput and reduced scalability that hamper the system's ability to succeed. Consider our earlier pizza parlor and how many more orders it's able to process by filling them out of order, based on how quickly certain pizzas can be prepared, rather than solely on when the order was placed.

Let's have a look at why it is difficult from a technical standpoint to guarantee ordered delivery.

Exceptions

What happens when exceptions occur in message processing code? Even with the most robust code possible, exceptions can still happen. As developers, we're all too familiar with this. After all, it's why we write unit tests: to guard ourselves against the unexpected. But not everything is under our control, and a lot can go wrong.

We can have messages that throw an exception because of a transient error. And although not all errors are severe, we do need to deal with this. A lot of the time we can simply requeue the failed messages to solve the issue. But we should also be able to deal with "poison" messages, those that keep failing and should be put aside for a moment to be retried at a later time.

Whether we retry a poison message within a few seconds or days later is irrelevant. The point is that the poison message has been set aside and another message can now be taken from the queue, one that was supposed to be processed after it. The result is that messages we expected to arrive in order are now being processed out of order.
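This is exactly what recoverability settings in a messaging framework are for. Here's a rough NServiceBus sketch, where the endpoint name, retry counts, and delays are arbitrary illustrations:

using System;
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales");
var recoverability = endpointConfiguration.Recoverability();

// Retry a failing message a few times immediately, in case the error is transient...
recoverability.Immediate(immediate => immediate.NumberOfRetries(3));

// ...then set it aside and retry later with increasing delays, so the next
// message in the queue can be processed in the meantime.
recoverability.Delayed(delayed =>
{
    delayed.NumberOfRetries(2);
    delayed.TimeIncrease(TimeSpan.FromMinutes(5));
});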

Scalability

When a system is built with messaging as one of its foundations, the ability to scale out is hardly an afterthought. Unfortunately it makes ordered delivery virtually impossible to support.

The ability to scale out is a very powerful feature. Instead of scaling up, where you buy more powerful and expensive hardware, you scale out by having more servers processing messages. Every server basically competes for messages, doing their best to process as many as they can. Going back to our pizza place, this is similar to buying more ovens to bake more pizzas at once rather than upgrading the existing ones to process the same number of pizzas faster.

Although the servers can scale out, they have no knowledge of each other. This usually isn't an issue, except with ordered delivery. Messages that need to arrive in order might be processed on different machines. One of those machines could have less work or finish more quickly, resulting in messages being processed out of order. Even if you don't scale out and only have a single server, that server must process messages on a single thread, and therefore more slowly, for the same reason.
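To make the tradeoff concrete, here is roughly what forcing that single-threaded behavior looks like on an NServiceBus endpoint; the endpoint name is a placeholder:

using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales");

// Process messages one at a time on this endpoint instance. This preserves
// local processing order, but gives up the throughput of concurrent
// processing and of scaling out with competing consumers.
endpointConfiguration.LimitMessageProcessingConcurrencyTo(1);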

Back to the real world

When things happen out of order in the physical world, things usually work themselves out through various checks and balances. I may have ordered my pizza before someone else but if their pizza is done first, people don't stand around wondering what to do. I simply stand aside and let them take their pizza with no lasting harm done, aside from maybe a jealous stare at them for getting their pizza first.

Sometimes I'll order a pizza for carry-out on my way home. When I place the order, the shop will tell me roughly how long it'll take. If I get there early, I'll pay for it and wait until it's done. But maybe I get delayed in traffic or wait until the end of that Game of Thrones episode I'm watching for the tenth time. By the time I get there, my pizza is waiting for me so I pay for it and take it with me.

The important part of this scenario is that two events—the pizza being ready and the pizza being paid for—might happen out of order. But both need to be completed before the pizza can be delivered. In system modeling terms, we might have a DeliveryService that depends on an OrderPaid message from a PaymentService and a PizzaPrepared message from a KitchenService. If it gets the OrderPaid message first, it can't deliver yet because it doesn't know if the order has been prepared yet. In this case, you can imagine the customer constantly pinging the DeliveryService (i.e. the cashier) at regular intervals to see if it's finished yet.

[Diagram: messages passed between PaymentService, KitchenService, and DeliveryService]

Modeling out-of-order messages in software

Instead we can use an NServiceBus feature called sagas. These are message-driven state machines that automatically store state, deal with concurrency, and help us orchestrate long-running business processes.

Let's have a look at how a saga in NServiceBus deals with messages arriving out of order. With a little state, the saga can remember what has already happened and act accordingly. To keep the code simple, we'll use two flags.

class DeliveryPolicy : Saga<DeliveryPolicyData>,
    IAmStartedByMessages<OrderPaid>,
    IAmStartedByMessages<PizzaPrepared>
{
    public Task Handle(OrderPaid message, IMessageHandlerContext context)
    {
        Data.OrderPaid = true;
        return VerifyIfPizzaCanBeDelivered(context);
    }

    public Task Handle(PizzaPrepared message, IMessageHandlerContext context)
    {
        Data.PizzaPrepared = true;
        return VerifyIfPizzaCanBeDelivered(context);
    }

    Task VerifyIfPizzaCanBeDelivered(IMessageHandlerContext context)
    {
        if (Data.OrderPaid && Data.PizzaPrepared)
        {
            // ... send message that pizza can be delivered
        }

        return Task.CompletedTask;
    }

    // The saga data class and the message-to-saga mapping are sketched below.
}

In this example, when either message arrives, the state of the saga is altered. It then checks this state to see if it should continue with delivery. Let's assume the PizzaPrepared message arrives first. The saga marks the pizza as prepared and then checks to see if all the conditions have been met. They haven't, so the saga goes back into a holding pattern until the OrderPaid message arrives. At this point, VerifyIfPizzaCanBeDelivered determines that all conditions have been met and we can continue with the order.

But what if the OrderPaid message arrives first? Perhaps the KitchenService is backed up with orders and hasn't finished the pizza in time. In this case, the saga does virtually the same thing. It marks the order as paid and then checks its internal state to see if all conditions have been met to continue. They haven't, so again the order sits until a PizzaPrepared event arrives and completes the requirements to deliver the pizza.
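For completeness, the saga above also needs a data class to hold those two flags and a mapping that routes both messages to the same saga instance. A sketch, assuming both messages carry an OrderId property:

using NServiceBus;

public class DeliveryPolicyData : ContainSagaData
{
    public string OrderId { get; set; }
    public bool OrderPaid { get; set; }
    public bool PizzaPrepared { get; set; }
}

// Inside DeliveryPolicy, correlate both messages to the saga by OrderId.
protected override void ConfigureHowToFindSaga(SagaPropertyMapper<DeliveryPolicyData> mapper)
{
    mapper.ConfigureMapping<OrderPaid>(message => message.OrderId).ToSaga(saga => saga.OrderId);
    mapper.ConfigureMapping<PizzaPrepared>(message => message.OrderId).ToSaga(saga => saga.OrderId);
}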

Sagas provide a tool to solve all these ordering issues, as well as taking care of all the technical considerations like scale-out and optimistic concurrency, allowing you to focus on the business requirements instead.

Summary

From a technical perspective it is nearly impossible to have ordered delivery, deal with errors, and have a scalable system. At the same time, it is very unlikely that a business process actually requires ordered delivery. From both the business and the technical perspective, what we really need is the ability to adapt to these alternative scenarios.

Messages will arrive out of order, and they should be allowed to. Rather than trying to "fix" this, we can embrace it, ask the right questions, and offer alternative flows. We shouldn't accept a technical constraint that forces a customer to wait, while their pizza gets cold, just because some other customer ordered first.

If you're ready to start dealing with out-of-order delivery of messages, be sure to check out our saga tutorial.


About the author: Dennis van der Stelt is an engineer at Particular Software and is very strict about making sure everything is in the order correct.

Has Microsoft really changed?

People have a lot of opinions about the "new" Microsoft under CEO Satya Nadella. They've embraced open-source, including .NET Core. They declared Microsoft ❤ Linux. They acquired GitHub. It's been a wild ride for those of us used to the closed, dare I say grumpy Microsoft of the past.

But are things different today? When the rubber hits the road, is Microsoft really more open, more accessible, more helpful?

When we were building the Azure Service Bus transport for .NET Core we got a chance to find out.

The backstory

We want to make it as easy as possible for our customers to upgrade their systems with zero downtime. It’s important they be able to update one endpoint at a time. Turning off an entire system to run an irreversible conversion script while hoping everything turns out OK is not an option.

Making zero-downtime deployment work for our new Azure Service Bus transport for .NET Core proved to be a bit of a challenge.

The old transport had two completely different ways to organize Azure Service Bus topics and queues, called topologies. The forwarding topology is best for new projects, but we still had customers using the older endpoint-oriented topology, which dates from the early days of Azure and is incompatible with the forwarding topology in how events are distributed to subscribers.

Migrating is pretty straightforward on the forwarding topology. We wanted to provide a path forward for customers on the endpoint-oriented topology as well.

So, we decided to release one last version of the old transport that would include a migration feature. This would allow people to upgrade each endpoint, one at a time, to the migration mode version. When complete, you would then upgrade to the new topology, again one endpoint at a time.

Afterward, you’d be on the forwarding topology, and could easily upgrade to the .NET Core transport.

It was a great plan. Too bad it didn’t work.

Hop to it

During testing, we discovered the migration strategy had a fatal flaw.

The migration feature depended on an Azure Service Bus feature called auto-forwarding, which allows only three hops to protect against infinite or circular forwarding; any more and your message is dead-lettered. But three hops was all we needed.

However, when using the SendVia feature, which is used by NServiceBus to implement the SendsAtomicWithReceive transaction mode, the number of hops is affected even though it isn’t a user-driven hop.

We were only using three hops, but the broker counted the use of SendVia as a fourth hop. As a result, it dead-lettered all messages when we tested using that transaction mode.

Working as designed, won’t fix?

We had a serious hopping problem, and we needed Microsoft’s help to fix it. We contacted the Azure Service Bus team and told them about our plight. It’s easy to imagine that the old Microsoft would have said “Sorry, working as designed. Won’t fix.”

Luckily, that’s not what happened.

Microsoft agreed that system-driven hops should not count against the forwarding limit. They decided to roll out a change to Azure Service Bus to differentiate user-driven and system-driven hops. This would allow our migration feature the number of user-driven hops required to move messages around.

At first, Microsoft said it could take months. As a stopgap, they could enable it only for customers who specifically requested it. But before we got the chance to notify customers, they rolled it out globally. Our customers wouldn't need to worry about contacting Microsoft, and we wouldn't have to create any runtime checks to verify the customer's environment.

The Azure Service Bus team came through for us and as a result, we were able to ship the migration feature.

Eleventh-hour bugs

But that’s not the end of the story. We had customers with Go-Live licenses testing in production without issue for months. Suddenly, we started getting bug reports that we traced back to Microsoft's Azure Service Bus client library.

Not a problem. The Azure Service Bus client library is open source on GitHub, so we submitted a pull request to fix it. Even though we were close to the holidays, the pull request was merged the next day, and we had a new release the day after that.

So we released the new transport.

Summary

In the days of the old, closed Microsoft, things wouldn’t have happened this way. If Microsoft had even agreed there was a bug at all, it would have been months (years?) before we had any sort of resolution.

We planned workarounds, just in case. Most involved a bunch of unnecessary pain for our customers. As it turned out, we didn't need them.

So has Microsoft really changed? We think it’s fair to say that yes, they have.

Fallacy #3: Bandwidth is infinite

Everyone who is old enough to remember the sound of connecting to the Internet with a dial-up modem or of AOL announcing that "You've got mail" is acutely aware that there is an upper limit to how fast something can be downloaded, and it never seems to be as fast as we would like it.

The availability of bandwidth increases at a staggering rate, but we're never happy. We now live in an age when it's possible to stream high definition TV, and yet we are not satisfied. We become annoyed when we run a speed test on our broadband provider only to find that, on a good day, we are getting maybe half of the rated download speed we are paying for, and the upload speed is likely much worse. We amaze ourselves by our ability to have a real-time video conversation with someone on the other side of the world, but then react with extreme frustration when the connection quality starts to dip and we must ask "are you there?" to a face that has frozen.

Today, we have DSL and cable modems; tomorrow, fiber may be widespread. But although bandwidth keeps growing, the amount of data and our need for it grows faster. We'll never be satisfied.

From Udi Dahan's Advanced Distributed Systems Design Course, Day 1

Problem of scale

The real problem with bandwidth is not one of absolute speed but one of scale. It's not a problem if I want to download a really big movie. The real problem is if everybody else wants to download really big movies too.

When transferring lots of data in a given period of time, network congestion can easily occur and affect the absolute speed at which we're able to download. Even within a LAN, network congestion and its effect on bandwidth can have a noticeable impact on the applications we develop. This is especially true since we don't tend to notice these problems during development, when there is very little load and congestion is low.

This problem can commonly surface through the use of O/RM libraries, some of which will have the habit of accidentally fetching too much data. That data must then be transferred across the network, even if only some of it is used. Different tech stacks sometimes exacerbate this problem. In early ASP.NET Web Forms applications, for example, the default method for paging a DataGrid component was to load all of the data, and then accomplish the paging in memory. This held the possibility of loading millions of rows in order to only display ten.
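As an illustration, compare the two approaches in an Entity Framework-style LINQ query. The OrdersContext, Orders table, and column names here are made up:

using System.Linq;

static void LoadOrderPage(OrdersContext dbContext, int pageIndex)
{
    // Anti-pattern: fetch every column of every order, then page in memory.
    // Fine in development with fifty rows; painful with millions.
    var everything = dbContext.Orders.ToList();
    var inMemoryPage = everything.Skip(pageIndex * 10).Take(10).ToList();

    // Better: push the paging and the projection to the database so only the
    // ten rows and the few columns actually displayed cross the network.
    var projectedPage = dbContext.Orders
        .OrderBy(order => order.PlacedAt)
        .Skip(pageIndex * 10)
        .Take(10)
        .Select(order => new { order.Id, order.CustomerName, order.Total })
        .ToList();
}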

Additionally, we have to be mindful that bandwidth isn't only a concern at the network level. Disks, including those used by our databases, have bandwidth limitations as well. A complex join query may run quickly in development with a limited amount of test data. Run the same join query in a production scenario with millions of rows per table at the same time as dozens of other users, and you've got yourself a problem.

Solutions

In order to combat the 3rd fallacy of distributed computing, there are a few strategies we can use.

"Goldilocks" sizing

The first strategy is to realize that we can't eagerly fetch all the data. We have to impose limits and, to an extent, only download what we need. The big challenge is that we must strike a balance. To prevent running afoul of bandwidth limitations, we cannot download too much. But to prevent running afoul of latency (the 2nd fallacy), we must also be careful not to download too little!

Like Goldilocks (who probably would have made a wonderful systems architect but has a thing or two to learn about trespassing), we must carefully analyze the use cases in our system and make sure that the amount of data we download is not too big or too small but just right.

It might even be necessary to have more than one domain model to resolve the forces of bandwidth and latency. One set of objects can be tightly coupled to the structure of individual database tables to be used for updating data, while another set of classes deals with read concerns only, combining data from different tables and transforming it into exactly what is needed for that use case.
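A sketch of what that split can look like, with illustrative type and property names:

using System;

// Write model: mapped one-to-one to the Orders table and used only when
// updating data inside a transaction.
public class Order
{
    public Guid Id { get; set; }
    public Guid CustomerId { get; set; }
    public decimal Total { get; set; }
    public DateTime PlacedAt { get; set; }
}

// Read model: a denormalized shape combining data from several tables,
// containing exactly what the order-history screen needs and nothing more.
public class OrderHistoryLine
{
    public Guid OrderId { get; set; }
    public string CustomerName { get; set; }
    public string StatusDescription { get; set; }
    public decimal Total { get; set; }
}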

Sidestepping

Another thing to keep in mind is that bandwidth limitations and congestion in general will slow down delivery, which we can counteract by moving time-critical data to separate networks.

One strategy is to make use of the claim check pattern, in which large payloads are segregated into different channels and then referenced by URI or another identifier. Clients that are interested in the data can then choose to incur the download penalty for the large payload, but other parties can ignore it entirely.

This is especially useful in messaging systems, which work best when messages are small and can be transferred quickly. A good service bus technology will include an implementation of the claim check pattern to make dealing with large payloads more convenient.
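In NServiceBus, for example, this takes the shape of the data bus feature. A rough sketch, assuming a file share that all endpoints can reach; the share path and the message type are made up:

using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales");

// Large properties are written to the shared location and replaced in the
// message by a reference. Interested subscribers fetch the payload; everyone
// else ignores it and the message itself stays small.
var dataBus = endpointConfiguration.UseDataBus<FileShareDataBus>();
dataBus.BasePath(@"\\share\databus");

public class VideoUploaded : IEvent
{
    public string Title { get; set; }

    // Stored via the data bus instead of inside the message body.
    public DataBusProperty<byte[]> Video { get; set; }
}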

Summary

Whether designing distributed systems or just watching Netflix, the effect of bandwidth is so pervasive that it's almost like a second currency. No matter how much we have, we'll always want more, and as a valuable commodity, we need to be careful with how we use it.

If we have time-critical data, we may be able to move it to a separate network or cleave off the weight of a large payload using the claim check pattern. Otherwise, we're left with hard choices: balancing the limitation of bandwidth against the limitations imposed by latency. This underscores the need for experience in analyzing these use cases.

We've come a long way since dial-up modems, but we still perceive bandwidth as limited. And even though we know it's limited, we continue to act (when coding or architecting) as if it's infinite. Our need for bandwidth will never be satisfied, but we need to stop acting as though there will ever be enough.


About the author: David Boike is a developer at Particular Software who is supremely annoyed that his neighborhood still doesn't have gigabit fiber.
