
Productivity power-ups in the August Platform release


This month, we released an update to our platform, designed primarily to give you a much better user experience. If you have been frustrated by having to deal with large numbers of failed messages, or if you routinely deal with relatively large messages, we have included new features specifically to address your pain points. We've also added multiple productivity enhancements across the platform, so there's sure to be something in it for everybody.

Head over to our downloads page for the latest versions, or read on for the details.

Failed message grouping

At times, some event might lead to a large number of failed messages within ServicePulse. Maybe something failed within your infrastructure: a web service was unreachable, a disk filled up, a server restarted, or maybe a new version of an endpoint was released with a bug. This isn't a problem; this is your system dealing with failure as designed.

But dealing with all of those failed messages can be a real pain. Previously, ServicePulse gave you the ability to archive or retry all messages, or retry selected messages. Unfortunately, ServicePulse would only show you 50 messages at a time! This made it very difficult to retry all failed messages of a given type when they numbered in the hundreds. Worse yet, the retry process was slow and error-prone, and the more messages you attempted to retry at a time, the worse it became.

Much better would be to group similar failed messages, allowing you to retry an entire group and know that every message in it would be retried reliably, regardless of the number of messages involved.

We've been doing a lot of work on failed message handling across our platform, and we think you're going to like our new message grouping features.

Failed message groups

The new Failed Messages screen in ServicePulse automatically groups similar failed messages so that they can be handled in bulk. If two or more failed messages were caused by the same exception in the same place, then it is likely that they can be dealt with in the same way. Each group can be retried or archived as a single operation. You can get details for each group to see a full list of messages and deal with them individually if you still need fine-grained control.

When you first install the updated platform components, all of your unresolved failed messages will be automatically grouped as a background process. This may take some time, depending on how many unresolved failed messages you have, during which you will continue to see the old Failed Messages screen. Once the background process has completed, the new Failed Messages screen will appear.

When you choose to retry messages, whether an entire group or just a subset of messages, you will benefit from the updated retries capability of ServiceControl. We've invested a lot of effort to speed up the retry process and make it more robust. Large groups of messages will be split up into batches and dispatched separately, with progress reported for each batch.

Batched retries

The new retries capability uses a new queue named particular.servicecontrol.staging. Messages to be retried get batched and loaded into this queue. When the batch is fully staged, the entire contents are forwarded to their destination.

With these new changes to failed message groups, retrying or archiving hundreds of messages can now be handled with just a few clicks. The next time you have a flood of failed messages for whatever reason, the new failed message grouping features are going to save you a ton of time and headaches.

Very large messages

To audit all messages flowing through your system, you need to store copies of them. Most of the time, messages are fairly small, and this isn't a problem. But, to keep very large messages from affecting the performance of the system, previous versions of ServiceControl did not store the message body for audited messages larger than 100KB. Anything larger got discarded, and attempting to view the message within ServiceInsight would show only a blank window.

But 100KB is an arbitrary setting. You might regularly have messages that are just over that 100KB limit and still want to be able to view them in ServiceInsight. Maybe you are even willing to sacrifice some performance to have more information.

To allow auditing of larger messages, the maximum message body size is now configurable. Additionally, if the body of an audited message is larger than the configured maximum, ServiceInsight will now present a meaningful message.

Message body was too long

Now you can make the decision on how large is too large for yourself, based on what makes sense for your organization. If you ever run into a message that goes above the limit, ServiceInsight will tell you what's going on and what to do about it.
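If you're looking for the knob to turn, ServiceControl reads its settings from its app.config file. From memory, the relevant key is ServiceControl/MaxBodySizeToStore (a value in bytes), but treat the exact name as an assumption and verify it against the ServiceControl documentation for your version:

<appSettings>
  <!-- Assumed setting name; verify against the ServiceControl docs.
       Raises the maximum stored audit message body size to 200KB. -->
  <add key="ServiceControl/MaxBodySizeToStore" value="204800" />
</appSettings>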

More focused tooling

In this release, we've paid a lot of attention to common workflows. We have optimized our tools based on your suggestions, and we've made a bunch of improvements intended to help you get your job done more quickly, more efficiently, and with fewer distractions.

Stack trace coloring

Stack traces are invaluable for debugging, but previously we presented them as a wall of plain black text, which can be difficult for the brain to make sense of quickly. It would be ideal if the structure of the stack trace were readily apparent, so that you could understand it at a glance and continue with debugging.

To make stack traces in ServiceInsight easier to read, we now use syntax highlighting to display them in color. Stack traces displayed in the Headers tab, as well as in the Exception popup of the Flow Diagram tab (shown after clicking Exception type), will be colored.

Stack trace coloring

With this small change, it's now much easier to see the method name, file path, and line number where an exception occurred, so you can pull up the offending code and fix the problem.

Better 3rd party integration

It's possible for NServiceBus to ingest messages created manually on the queuing infrastructure, perhaps as part of a 3rd party integration, but these messages commonly lack information in specific headers. Missing headers create very odd output in some places, with values either missing or falling back to defaults like DateTime.MinValue, which causes a lot of confusion.

It would be much clearer if these missing values were communicated with meaningful descriptions that show their true nature. We have made these changes in ServicePulse and ServiceInsight.

In ServicePulse, Unknown will be displayed for missing Time Sent and Message Type values in the Failed Messages list.

Unknown headers in ServicePulse

The same goes for ServiceInsight in the Messages list, Flow Diagram, and Saga View.

Unknown headers in ServiceInsight

With this change, it's clear that the information is indeed not available, and not due to a random system error.

Streamlining ServiceInsight message list

In the previous version of ServiceInsight, the message list contained seven columns: Status, Message ID, Message Type, Time Sent, Critical Time, Processing Time and Delivery Time. This was too much information, making for a crowded grid, especially at lower monitor resolutions, where space is at a premium and horizontal scroll bars serve only to hide information from view.

ServiceInsight is primarily a tool to help you while you are developing and debugging your systems, so it would be more helpful to focus only on the information necessary for that purpose.

To accomplish this, we have streamlined the information displayed in ServiceInsight. Critical Time (the amount of time a message has to wait to be processed) and Delivery Time (the amount of time it takes for a message to be delivered to its destination) are related to monitoring, and are not that useful in ServiceInsight. Additionally, at times these calculated values were inaccurate due to clock drift between different servers, causing unnecessary confusion.

For these reasons, we have removed Critical Time and Delivery Time from the Messages list view to allow you an unobstructed view of the information that you really need to get an understanding of what your system is doing.

Usability improvements

We've made a lot of other improvements to make our tools easier to use.

In previous versions of ServiceInsight, message nodes were all the same size, regardless of content. Now, the width adapts to the length of its content.

Auto-sizing of message nodes in ServiceInsight

Now you can see everything you need, without resorting to a rollover tooltip.

In ServicePulse, we've reorganized the Endpoints Overview and Configuration screens to display their contents alphabetically, making the desired content easier to find. We've enabled text to wrap, where appropriate, so that you can see the full names of things without extra hassle. We've also introduced a new version notification, so you can always be sure you're running the latest and greatest version of ServicePulse.

Summary

A lot of great things went into this release of the platform, with the aim of removing some common pain points and frustrations, and optimizing things so that you can do your job better. Of course, we fixed a bunch of not-so-common bugs too. For the complete details, you can refer to the release notes for ServiceInsight 1.3.0, ServicePulse 1.2.0, and ServiceControl 1.6.0.

You can get the newest versions of these tools on our downloads page. We would love to hear what you think!


About the authors: Weronika Łabaj, Mike Minutillo, and John Simons are part of the Particular Software Engineering Team, who are passionate about writing software that makes other developers happy.


Goodbye microservices, hello right-sized services


If you've been to any conference recently or follow industry news, you have probably heard about microservices. They're everywhere. Whether it is called SOA done right, microservices, or just decomposing big balls of mud, the hype is there, and it's unavoidable. Even if you are not particularly interested in the subject, your client or manager may be asking you about it sooner than you think.

Micro is the new black

As is usually the case with new and complex subjects, microservices have raised varied, often contradictory opinions in the developer community. Sometimes people use the same words but have different things in mind; sometimes different names are used to refer to the same thing.

Some say there's nothing new under the sun. Others point out that we live in a different world than when SOA was first defined. Continuous Delivery, DevOps or cross-functional teams are more popular, which makes our current environment quite different from what we had only a few years ago. Even naming these things is controversial. Services, microservices, and autonomous components are all terms that have been thrown around to refer to almost (if not quite) the same thing.

For the purpose of this post, I consider the terms microservice, service, and component to be synonyms, leaving detailed discussion of the properties and differences between them out of scope. Clarifying this issue is a subject for another post. What is important here is that "the size question" comes up equally often no matter what label we put on that thing.

Each term comes with its own context, and there are ongoing discussions surrounding the definitions of these things. Over time, with some luck, that knowledge will be distilled, vocabularies unified, and beginners will have clear, consistent implementation guidelines.

But we're not quite there yet.

Does size matter?

Currently, one of the very few things that everybody agrees upon is that a component that is "small and focused on doing one thing well" is better than a big monolithic ball of mud. If you've worked with a huge system that was very hard to maintain, then you might agree that even if code is far from perfect, the smaller it is, the easier the work.

Smaller elements can be understood more quickly, are easier to modify and less likely to be accidentally broken. The smaller they are, the faster we can fix bugs and ship new features. If necessary, we can even throw away the whole microservice and rewrite it from scratch in a matter of weeks, instead of months or years.

It seems that size is kind of important. If it weren't, why would we talk about microservices and big balls of mud? No wonder one of the most frequently asked questions regarding services is how big they should be. What is the right size? How do we know if we're on the right track?

This is where things get complicated...

Be careful what you ask for

I recently had a conversation with a developer about his current project. He told me that they were doing something like microservices, but their services were not real microservices. I was puzzled. What did he mean by not real? He explained that a microservice shouldn't be bigger than 300 lines of code, but their services were bigger than that.

There are a lot of different metrics floating around, such as being able to rewrite each microservice in 6 weeks, or having a 2-pizza team per service. On the surface, these metrics appear more meaningful than lines of code, since they don't depend on the expressiveness of the programming language in use.

All these metrics can be useful rules of thumb, but what we often forget is the context. Each of these rules makes sense under certain circumstances. If you don't have the same team structure, work with a less experienced team or just use a different programming language, the specific value (or even type of metric) might not provide much benefit.

More importantly, don't forget that what you measure is what you get. If you tell everybody in your team that the limit is 300 lines of code, and you will track it, you will end up with small services. But that doesn't guarantee your system will be easy to modify, more reliable or that you will ship features faster.

The size of components is only one of the attributes of a system that make it easy to maintain. However, by focusing mainly on size we tend to ignore more important factors and risk ending up with the same old big balls of mud, only this time they are distributed.

Beyond lines of code

Fortunately, there are more meaningful things you can do. The key element is modeling services, in particular finding their proper boundaries and aligning them with bounded contexts or business capabilities. Finding Service Boundaries - illustrated in healthcare offers valuable insights and tips on how that looks in practice. In another example, a Norwegian company combined dozens of ultra-fine-grained services into just a few bigger, more focused ones.

On the surface, everything seems clear and easy, but the devil is in the details. Over time, you will identify gray areas. You will notice there's no single, well-organized source of truth with easy-to-follow guidelines. You need to do a lot of research and heavy thinking before you decide whose advice you will follow.

For example, one of the challenges with service boundaries is that it feels natural to align them with the user interface. Explore composite UIs in more detail to understand the alternative to a service per screen approach. Another natural expectation is that organization analysis will help us discover proper business boundaries. However, organizations are constantly evolving. Mergers, layoffs, and reorganizations happen frequently. People may work on cross-functional teams or have overlapping responsibilities. In the long run, trying to align service boundaries with organizational boundaries is very hard.

Only by digging deeper into these common challenges will you gain a better understanding of how to define services and their boundaries.

Focus on what matters

Rather than size, focus on service boundaries. Analyze business bounded contexts, align your services with business capabilities, and focus on making sure your components are truly autonomous. In the meantime, you'll probably find an answer to the "how big?" question that makes sense in your circumstances. You might also realize that as long as you focus on service boundaries, it doesn't really matter how big each service ends up being.

Finding the right boundaries is much more complicated than saying "each microservice has to be less than 300 lines of code." It requires (sometimes messy) work to better understand the underlying business context. That's why focusing on size is so tempting, even if it doesn't provide the expected benefits.

The good news is that there's already plenty of good advice available. More and more people share their experiences at conferences and talk about their journey from monoliths to (micro)services. You can learn from their successes and their mistakes.

However, you always need to be careful what you take away from those stories. After all, while everybody always talks about size, most people know that that's not what really matters.


About the author: Weronika Łabaj is a developer at Particular Software, passionate about exploring new paradigms, rediscovering familiar things by looking at them from fresh perspectives, and providing business value with software.

Encouraging an "I don't know" culture


I recently started as a software engineer with Particular. Being new means I've had plenty of opportunities to realize what I know and what I don't know.

With every new role, project, or technology, I've always found there is a lot to learn. I look at the team around me and see their good qualities, how they have it all together, and I realize how far I have to grow. They are experts, and I am supposed to know what I'm doing as well.

You respect the knowledge of your peers, and you want them to respect yours, too. Admitting that you don't know something is scary. Will you lose a bit of that respect? What if you admit not knowing something that was obvious to everyone else?

Saying "I don't know" doesn't have to be a bad thing. Just as one of the benefits of pair programming is knowledge sharing, saying "I don't know" gives someone the opportunity to help you. When we recognize that every individual brings a special set of skills , we can transcend the fear of judgment. Ultimately we are all working toward the same goal.

You can learn great things from your mistakes when you aren’t busy denying them.

Stephen R. Covey - The 7 Habits of Highly Effective People

Get rid of your assumption baggage

Saying "I don't know" on an individual basis uncovers gaps and encourages knowledge-seeking. Conversely, being afraid to say "I don't know" encourages a culture of perceived expertise instead of actual expertise.

We've all witnessed that awkward situation in which someone is unwilling to admit that they don't know something. It is a detriment to both the individual and the team.

There are a few pressures at play but at the core, shying away from saying "I don't know" reflects a desire to be seen as (1) intelligent and (2) credible. The irony is that failing to say "I don't know" will ultimately damage both pursuits. Pretending that you know things in order to be perceived as intelligent ultimately destroys your ability to be perceived as credible, leaving you with a reputation as neither. Admitting you don't know is actually a strong indicator that you are intelligent.

The ability to say "I don't know" is particularly helpful in the software development industry. Developers make many mistakes because of poor assumptions, which could be prevented by admitting we don't have all the answers.

If you tell the truth, it becomes a part of your past. If you lie, it becomes a part of your future.

Unknown

Lead by seeking truth

Here are some ways I think an "I don't know" culture can be grown:

Say "I don't know." There are countless opportunities to say "I don't know" and to follow up with "Let's find out." This simple act can encourage others to do the same.

Share mistakes. When we make mistakes, there are lessons to learn. Admitting to those mistakes will show your dedication to transparency and allow others to learn vicariously through you.

Rinat Abdullin provides an awesome example in his Lokad.CQRS retrospective. Here, he details how not knowing caused quite a bit of pain, and then he shares the lessons learned for posterity.

Seek critique. It's difficult for us to know our own deficiencies. We understand that about ourselves. Actively seeking feedback will help you improve yourself, and you'll find ways to share your discoveries with others.

Ask for help. People love to help and teach. This encourages a sense of community and fosters mentoring. You might even learn more than you bargained for.

Lead by example. People who seek to learn new things often find themselves in domains where they know very little. Those who are consistently learning will often need to say "I don't know." Being able to admit this shows that you embrace the fact that no person knows everything—and that includes you.

Hire the right people. Finding people who can admit deficits in their understanding will build and protect this culture of knowledge-seeking. One interview technique is to check how quickly a candidate will say that they don't know. Following that, you'll see how someone seeks education after discovering a gap in their knowledge.

Top-down buy-in. When leadership embraces these techniques, admitting "I just don't know" is more than just a permissible thing to say. It's an approach to problem-solving at your company.

It also pays dividends to practice egoless programming. No matter how much karate you know, there will always be someone else who knows more.

Put it into practice

Speaking from experience, when I don't know something I have found it much easier to speak up if I have seen people in leadership positions do so. Actions speak louder than words, and publicly saying "I don't know" will help other people do the same.

The next time you find that you don't know, speak up! You'll broaden your horizons and do something good for your team and culture, all at the same time.


About the author: Colin Higgins is a software engineer at Particular Software. He is red-green colorblind and isn't afraid to say he doesn't know if his clothes match, even though he can't learn from it. He is passionate about software and continuous improvement.

An organization deconstructed


It’s safe to say I knew next to nothing about the challenges of a software development company when I joined the Operations department at Particular Software two years ago. My background is in Human Resources, and that part of me was intrigued by the little I knew about the company – just past the start-up stage, 100% dispersed, flexible hours, growing fast.

As a passionate planner, it rocked my boat a bit that there was no master plan for the company. Three things were clear, though. One, I was part of a team passionate about our culture, organization, and products. Two, we wanted to build tools that developers were equally passionate about. And three, we wanted to build the kind of company where we all wanted to work.

Particular founder and CEO Udi Dahan says that we may not know exactly where we're going or how we're going to get there, but I trust him and my talented colleagues to keep things afloat as we work our way through the transition. We’re trying to make good decisions at every step along the way, to fail small and learn big. It’s a fascinating journey, and I plan to share our learnings through this series of blog posts.

In the beginning

When Udi decided to create a company to commercially license his open-source project, NServiceBus, he didn't plan on having employees working remotely from all over the world. His focus was only on bringing in the best people to create the best products for making developers better.

As the company grew, keeping everyone remote had advantages: no overhead or relocation, the broadest possible pool for great candidates, and a completely flexible work environment. We were able to provide 'round the world, 'round the clock support to our customers while still providing an excellent quality of life for our staff members.

After about two years, the company made a conscious decision to stay on this path – there would be no "home base." Like Udi's distributed system design philosophy, we also became officially "dispersed." And like NServiceBus itself, it just...worked. But, would this structure hold as our staff proceeded to grow dramatically?

Organizationally challenged

We saw how our growing staff allowed us to tackle projects we hadn't been able to address before, but there was a cost. While the department structure led to more cohesiveness and regular discussions within departments, there was much less of that between departments. Udi began to hear complaints that the processes were handled inconsistently between departments, leading to frustration. Also, with a hierarchical structure, we didn't always have the right combination of skill sets involved, leading to less than optimal decision making.

What? No departments?

Toward the end of 2014, Udi presented us with a vision for organizational change. Rather than optimize for the efficiency of certain groups of tasks, we would optimize around end-to-end processes. To do this, we would bring together, on an ad-hoc basis, individuals with the right skill sets to see a task through from beginning to end. These groups, called task forces, would be transitory and dissolve after completing a unit of work or making a decision.

This change included eliminating the director positions in the company, allowing for direct communication between individuals rather than forcing staff to follow a hierarchical path up and down a chain to resolve issues and get approvals. Decisions would be pushed down to the individual and task force level. (Lots more to talk about here in future posts!)

When Udi made this announcement, he was met with puzzled faces. Some were thinking, "Why mess with something that is working?" Others thought, "I don't get it, but let's see how this thing plays out."

My mind went right to the human resources implications. This change would have a tremendous impact on how we hire, evaluate performance, provide career development guidance, compensate our staff, etc. For me, this was a gold mine of opportunity, and I couldn't wait to see how it would shake out!

How’s it going?

For each problem, task, or project that arises, we are trying to apply these new principles. We’re deconstructing the whole into components and putting them back together in a way that makes more sense. Through these efforts, the vision is slowly becoming clearer. But, there are significant challenges. They include prioritization, overloaded to-do lists, trying to formulate processes while still being productive, and determining how final decisions are made.

We've already seen some positive results coming out of this. For one, there are better and faster decisions being made by a wider variety of people with multiple skill sets. Also, staff members participate in a broader range of activities of interest to them, energizing them to collaborate and work through issues together.

I’m encouraged. The challenges are surmountable. The results are promising. And I’m personally having a great time!


About the author: Karen Fruchtman oversees all activities that allow the staff at Particular to shine. Karen is energized by the opportunity to help craft a company culture that attracts and retains such a stellar staff. Blogging is a new venture, and just one example of how staff at Particular can explore new ways to enhance their careers. Intrigued? We're hiring!

The day I tried to make cookies and learned something about writing documentation


Have you ever wondered what it might feel like to read product documentation without the benefit of being an expert in the product? It’s a frustrating exercise, but it's usually something I can work around once I figure out the gap between what I know and what the reader knows.

It’s never been brought home so completely as it was the other day. I had a hankering for some home-baked cookies and, perhaps stupidly, I figured I’d cook them myself. After all, how hard can it be to follow a bunch of simple instructions to bake some cookies, right?

So I enthusiastically embarked on the cookie-baking journey. After a little trawling on the web, I gave a shout out to my friends, asking them to send me their favorite recipe.

This one came back from a trusted expert who runs cooking workshops and her own cake business. Suffice it to say, she knows what she's doing, and because of that I chose to try her recipe.

White Chocolate and Macadamia Cookies


Ingredients

  • 50g butter
  • 2 cups plain flour
  • 2 to baking soda
  • 1 cup sugar
  • 2.5 cups oats (put through the blender)
  • White chocolate chips/chunks
  • Macadamias
  • 1 cup brown sugar
  • 1/2 to salt
  • 2 eggs
  • Vanilla

Method

  1. Beat butter & sugars until light & fluffy.
  2. Add eggs one at a time, beating well.
  3. Add a splash of vanilla.
  4. Mix flour, oatmeal, baking soda & salt together. Add, mixing gently.
  5. Mix in chocolate & macadamias.
  6. Form mixture into balls and spread out on a lined baking tray.
  7. Bake at 160c for 10 min.
  8. Leave to cool for 10 min.
  9. Crank out your best Cookie Monster impersonation.

And the adventure begins

Bad Cookies

After measuring out the ingredients carefully, things started to fall apart. The first step should have read, “Soften butter by microwaving on high for 30 seconds,” then, “Using a whisk, beat butter until light and fluffy.” Neanderthal that I am, I threw the butter straight from the fridge into the bowl with the other stuff and started pounding it with my fists. It became neither light nor fluffy, although the batter got to a point where I felt that it was probably time to add the eggs.

I followed the addition of eggs with more pummeling, and then I hit my first conundrum. In Australia, “g” means grams, but when I got to this line "2 to baking soda" I had no idea what a “to” was. I tried to phone my friend but she was unreachable at the time, so I just went ahead and assumed that it was a typo and that she meant "teaspoon," but upon reflection, it could have meant "tablespoon."

How much is a “splash”? A “splash,” it says! And how do I “splash” vanilla seeds? I solved the latter by digging around for some vanilla essence, but I agonized over the “splash.” In the end, I figured a teaspoon would probably be enough, but to this day I still don’t know.

The next instruction was perplexing, too. What is mixing gently, and how does the lack of rigorous force change the way the cookies taste? I searched the internet for techniques on "mixing gently," but all I achieved was getting smears of cookie dough on my iPad. Rapidly losing patience and getting hungry, I just mashed what was in the bowl so that it looked evenly mixed before throwing in the macadamias and milk chocolate pieces.

I was convinced I was ready to bake. All I needed to do was to "form the mixture into balls and spread out on a lined baking tray"...but how big is a ball? What do I line my baking tray with?

Twenty minutes later, my first batch came out, and they were not the cookies I was looking for. They seemed really dry, so I added more butter and tried again. Another twenty minutes went by, and they still looked wrong. So I added a bunch of honey. I know it wasn't part of the recipe, but I saw that another recipe called for it, and I was desperate. The last batch came out looking somewhat like the cookies I had imagined creating, but after the first bite, well, let's just say there was no second bite.

Aftermath

This adventure got me thinking about the documentation I produced for software projects in the past. I never really gave any thought to who the person reading that documentation would be - their background, skills, or experience. I just assumed they'd be familiar with all the things that were obvious to me.

It's clear now that my first mistake was pulling all the ingredients out of the pantry. I should have read a book on cooking — maybe even read how to operate the oven instead of twiddling the dials until it looked right. Let's face it: we've all had days in our careers when that method of trial and error produced code that's still in production today.

Beginners shouldn't assume instructions are written with the reader's level of competence in mind. Likewise, authors must be mindful that the audience for their expertise might be completely oblivious to what they think is glaringly obvious.

When my friend handed me the recipe, she had assumed that I wanted to bake because she loves the process. In fact, all I wanted was cookies. The truth is many people don't care about your motivations or what you know. They only care that your writing makes their lives better in some way. They will happily find something else to read, or worse, write nasty comments if you don't live up to their expectations.

What to focus on next

It's easy to be dismissive of documentation, especially when you're in a big organization and the documentation hot potato is being tossed around. Next time, don't pass it on. Make it the most important thing you do that day.

Aside from spelling and grammar, you need to think carefully about your audience and optimize the information for them. Before you publish, read it out loud - even to your cat (who is probably sitting on your keyboard anyway). Does the information still make sense?

This might also raise some questions about the information you are providing. Think about the quality of your communication. Is your writing clear and precise? Are your assumptions about your audience reasonable?

At Particular Software, we endeavor to make our documentation as relevant and useful as possible. We think our docs are pretty good, and our team is always looking to improve them. If you notice something that is confusing or needs attention, you can simply click the "Improve this doc" button and submit those changes or any comments you have through our GitHub back-end.



About the author: Peter Giles is a developer at Particular Software who is passionate about riding his Harley and enjoys a good cookie.

The dangers of ThreadLocal


Languages and frameworks evolve. We as developers constantly have to learn new things and unlearn what we thought we knew. Speaking for myself, unlearning is the most difficult part of continuous learning. When I first came into contact with multi-threaded applications in .NET, I stumbled over the ThreadStatic attribute. I made a mental note that this attribute is particularly helpful when you have static fields that should not be shared between threads. When the .NET Framework 4.0 was released, I discovered the ThreadLocal class and how it does a better job of assigning default values to thread-specific data. So I unlearned ThreadStaticAttribute, favoring ThreadLocal<T> instead.

Fast forward to some time later, when I started digging into async/await. I fell victim to the belief that thread-specific data still worked the same way. I was wrong again, and had to unlearn again! If only I had known about AsyncLocal earlier.

Let's learn and unlearn together!

TL;DR

  • A Task is not a Thread. A Task is a future or promise that eventually gets executed by a worker Thread.
  • If you need ambient data local to the asynchronous control flow, for example to cache WCF communication channels, use AsyncLocal (available in .NET 4.6 and the Core CLR) instead of ThreadStaticAttribute or ThreadLocal.

There is no spoon

Some things just stick in our minds. When the world around us evolves, those things might no longer be true. All of us who started programming with .NET in the pre-4.0 era internalized knowledge of threads and how useful they are for executing intensive operations in parallel. We knew the difference between foreground and background threads and how important it is not to block the UI thread.

Then the Task Parallel Library was released and turned our world upside down. When looking at System.Threading.Tasks.Task, we suddenly felt like Neo from The Matrix and realized there is no spoon — or in our context, no Thread!

I'm not the first one to use the Matrix analogy to describe the difference between a Thread and a Task. Stephen Cleary has an excellent post from 2013, There is no thread, which uses the same analogy and dives deeper into the differences. I highly suggest reading it.

A Task under the Task Parallel Library is a future or promise. A Task is something you want to be done. In contrast, a Thread is one of the many possible workers who might perform that task. By the contract of a Task, we don't know whether it will be scheduled, whether it will be immediately executed, or whether it is already done the moment we declared it. The Task Parallel Library runtime has the built-in smarts to decide whether a task is executed on the thread that created it or if it needs to be scheduled on the worker thread pool or the IO thread pool. Furthermore, just because a thread was working on a given task doesn't mean that thread will execute all the continuations of that task.

This gets even more complex when we start introducing async/await into the equation. Any time you write an await statement (along with your friend ConfigureAwait(false)), the thread currently executing the task can yield back and start executing multiple other tasks. When the I/O operation completes, the remainder of the task (the continuation) is again scheduled as a Task on the currently responsible TaskScheduler. The previously responsible thread might pick up that task and continue working on it. Alternatively, any other available thread could do it. The following code illustrates this:

static dynamic Local;
static ThreadLocal<string> ThreadLocal = new ThreadLocal<string>(() => "Initial Value");

public async Task ThereIsNoSpoon()
{
    // Assign the ThreadLocal to the dynamic field
    Local = ThreadLocal;

    Console.WriteLine($"Before TopOne: '{Local.Value}'");
    await TopOne().ConfigureAwait(false);
    Console.WriteLine($"After TopOne: '{Local.Value}'");
    await TopTen().ConfigureAwait(false);
    Console.WriteLine($"After TopTen: '{Local.Value}'");
}

static async Task TopOne()
{
   await Task.Delay(10).ConfigureAwait(false);
   Local.Value = "ValueSetBy TopOne";
   await Somewhere().ConfigureAwait(false);
}

static async Task TopTen()
{
   await Task.Delay(10).ConfigureAwait(false);
   Local.Value = "ValueSetBy TopTen";
   await Somewhere().ConfigureAwait(false);
}

static async Task Somewhere()
{
   await Task.Delay(10).ConfigureAwait(false);
   Console.WriteLine($"Inside Somewhere: '{Local.Value}'");
   await Task.Delay(10).ConfigureAwait(false);
   await DeepDown();
}

static async Task DeepDown()
{
   await Task.Delay(10).ConfigureAwait(false);
   Console.WriteLine($"Inside DeepDown: '{Local.Value}'");
   Fire().Ignore();
}

static async Task Fire()
{
   await Task.Yield();
   Console.WriteLine($"Inside Fire: '{Local.Value}'");
}

The above code should be relatively straightforward to understand. The entry point is the method ThereIsNoSpoon, which calls into two methods TopOne and TopTen. These methods set the ThreadLocal value to ValueSetBy TopOne and ValueSetBy TopTen, respectively. Both methods call the method Somewhere, which prints out the value of the ThreadLocal and calls into DeepDown. DeepDown prints the value of the ThreadLocal again, and then kicks off an asynchronous method called Fire without awaiting it (hence the method Ignore, which suppresses the compiler warning CS4014). This code uses a tiny trick that lets us later reuse the code to demonstrate AsyncLocal. The methods used in the execution path access the dynamic static field Local. Both classes provide a Value property and, therefore, we can just assign either ThreadLocal or AsyncLocal without duplicating unnecessary code.
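The Ignore extension method itself isn't shown here. A minimal version might look like the following sketch (the actual implementation may differ):

using System.Threading.Tasks;

static class TaskExtensions
{
    // Accepting the task as a parameter is enough to suppress
    // compiler warning CS4014; the method intentionally does nothing.
    public static void Ignore(this Task task)
    {
    }
}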

The output of the above code looks similar to the following (your result might vary):

Before TopOne: 'Initial Value'
Inside Somewhere: 'ValueSetBy TopOne'
Inside DeepDown: 'ValueSetBy TopOne'
After TopOne: 'ValueSetBy TopOne'
Inside Fire: 'Initial Value'
Inside Somewhere: 'Initial Value'
Inside DeepDown: 'ValueSetBy TopTen'
Inside Fire: 'Initial Value'
After TopTen: 'ValueSetBy TopTen'

Before the execution of the TopOne method, the ThreadLocal has its default value. The method itself assigns the value ValueSetBy TopOne to the ThreadLocal. This value remains set until the Fire method is scheduled without being awaited. In our case, the Fire method gets scheduled on the default TaskScheduler, the ThreadPool. Therefore, the previously-assigned value is no longer bound to the ThreadLocal, leading to the Fire method only being able to read the ThreadLocal's default value. In our example, reading the initial value has no consequences. But what if you used the ThreadLocal to cache expensive communication objects such as WCF channels? You'd assume that those expensive objects are cached and reused when executing the Fire method. But they wouldn't be. This would create a dangerous hot path in your codebase, furiously creating new expensive objects with each invocation. You'd have a difficult time discovering this, until your shopping cart system crashed spectacularly on the busiest day of the holiday shopping season.

Long story short: a Task is not a Thread. Along with async/await, we must say goodbye to thread-local data! Say hello to AsyncLocal.

Bend it with your mind

AsyncLocal<T> is a class introduced in the .NET Framework 4.6 (also available in the new CoreCLR). According to MSDN, AsyncLocal<T> "represents ambient data that is local to a given asynchronous control flow."

Let's decipher that statement. An asynchronous control flow can be seen as the call stack of an asynchronous method call chain. In the example above, the asynchronous control flow from the view of the AsyncLocal starts when we set the Value property. So the two control flows would be

Flow 1: TopOne > Somewhere > DeepDown > Fire

Flow 2: TopTen > Somewhere > DeepDown > Fire

So the promise of the AsyncLocal is that we can assign a value to it that's present as long as we are inside the same asynchronous control flow. We'll prove that with the following code.

static AsyncLocal<string> AsyncLocal = new AsyncLocal<string> { Value = "Initial Value" };

public async Task BendItWithYourMind()
{
    // Assign the AsyncLocal to the dynamic field
    Local = AsyncLocal;

    // This code is the same as before but shown again for clarity
    Console.WriteLine($"Before TopOne: '{Local.Value}'");
    await TopOne().ConfigureAwait(false);
    Console.WriteLine($"After TopOne: '{Local.Value}'");
    await TopTen().ConfigureAwait(false);
    Console.WriteLine($"After TopTen: '{Local.Value}'");
}

The output of the above code looks similar to this:

Before TopOne: 'Initial Value'
Inside Somewhere: 'ValueSetBy TopOne'
Inside DeepDown: 'ValueSetBy TopOne'
After TopOne: 'Initial Value'
Inside Fire: 'ValueSetBy TopOne'
Inside Somewhere: 'ValueSetBy TopTen'
Inside DeepDown: 'ValueSetBy TopTen'
Inside Fire: 'ValueSetBy TopTen'
After TopTen: 'Initial Value'

As we can see, the code finally behaves as we anticipated from the beginning. Outside the asynchronous control flow, the AsyncLocal has its default value "Initial Value." As soon as we assign the Value property, it remains set even if we call the Fire method without awaiting it.

Sometimes Local just isn't local enough

We've just seen how [ThreadStaticAttribute] or ThreadLocal no longer works as expected when combined with asynchronous code using the async/await keywords. So if you want to build robust code that needs to access ambient data local to the current asynchronous control flow, you must use AsyncLocal and upgrade to .NET 4.6 or the Core CLR. If you can't upgrade your project to one of these platform versions, you can try to mimic AsyncLocal by using the ExecutionContext instead.
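For example, on the full .NET Framework (4.5 and later), the logical call context flows with the ExecutionContext across await points and can approximate AsyncLocal. The FlowLocal wrapper below is a hypothetical helper, not an established API; note that CallContext is not available on the Core CLR, and only immutable data should be stored this way:

using System.Runtime.Remoting.Messaging;

static class FlowLocal
{
    // LogicalSetData/LogicalGetData participate in ExecutionContext
    // flow, so values set here follow the asynchronous control flow.
    public static void Set(string key, string value) =>
        CallContext.LogicalSetData(key, value);

    public static string Get(string key) =>
        (string)CallContext.LogicalGetData(key);
}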

Now that you've seen AsyncLocal, let me tell you the unlearning isn't over yet! In the next installment, I'll show you how you can restructure your existing code so you won't even need ambient data anymore. Stay tuned, and don't bend too many spoons in the meantime!

Beyond ServiceMatrix


When we originally started developing ServiceMatrix, our vision was to develop a tool that could help design distributed systems. One of our main goals was to enable architects to graphically design an immediately executable solution so they could quickly iterate on their designs. We also wanted to help developers who were new to NServiceBus to get started more quickly and avoid many common pitfalls with messaging.

Unfortunately, even after years of effort, we just could not find the right balance between the "it just works" code generation experience and leaving developers enough control over how their solution was built. Although we were aware of many other companies who had struggled with this very issue, we also fell into that well-worn trap of thinking "this time will be different".

We haven't given up though, and have started down a different path instead - one that doesn't rely on code generation and is more integrated with the rest of the platform.

First - NServiceBus

One of the benefits of ServiceMatrix was that it made it very easy to define how messages should flow between your endpoints. ServiceMatrix also made it much easier to build sagas - generating the code responsible for mapping messages to saga state. What we came to realize was that the regular NServiceBus routing configuration and saga APIs were too complex.

The good news is that with the upcoming version 6 of NServiceBus, you'll already see some improvements in these areas - and going forward, we'll be rolling out even more simplifications.

Then - Visualization

One of the most requested features of ServiceMatrix, and one we could never quite figure out, was the ability to "import" an existing code base so that developers could get a visualization of how their system worked. There were just too many differences in how developers structured their solutions and how our code generation needed to work.

We think we've found a better way to visualize your existing solutions. Instead of trying to reverse-engineer your code and project structure, we're basing it on the stream of audited messages - something that is much more stable.

In short, you'll be able to get a graphical representation of the types of messages in your system and how they flow between the various endpoints, a kind of living documentation for how everything works - regardless of how you structure your code.

We've been evaluating this approach with a number of customers and are now releasing it as a Community Technology Preview at Particular Labs so that you can try it out.

Finally

We really do appreciate all the support you've given us over the years and remain as committed as ever to providing you with the best tools and infrastructure for your systems. Although at times we will make mistakes, we will be open and honest with you about them and do our very best not only to correct them but to make things right by you as well.

For those of you currently using ServiceMatrix, you can find more detailed information in our transition plan.

With thanks, Udi Dahan and the rest of the team at Particular

The Slippery Slope - Love your job, Live your life... Lose your mind


It all began so innocently.

Through a friend, I found a job at a small, young, high-tech company. It was a perfect fit for work-life balance—work from home and flexible hours. This job was completely in English, which was a major plus for someone who chose to move overseas and, 10 years later, still struggles with a new language! As a bonus, there was a growing human resources component to the job, tapping into my original career choice years ago before I stepped off the corporate ladder to be home with my kids.

Although I knew my friend loved his job and worked a lot, I figured my family commitments would keep me from falling into that trap. This was just a job, after all, not a return to a full-fledged career. So I threw myself into it, working through the steep technical learning curve. We used many systems, and working in a completely remote, multiple-time-zone environment had additional challenges. Once I settled in and became functional, I realized the amazing amount of potential for me to make an impact in the organization.

Prove it!

A part of me felt I had a lot to prove—yes, to the company, but even more so to myself and my daughters who hadn’t really ever seen this side of me. Each new process I learned, each pleasant interaction with a customer I had, each improvement I made gave me a jolt of pride and spurred me on to do more. There were continual changes in the organization and in our processes, and the work was endless. I changed from someone who avoided change to someone who thrived on it.

Even with all that, I wasn’t a glutton. I stuck to my exercise schedule, I met friends for coffee, I continued to volunteer, and the house was still standing. I was mindful of the other aspects of my life, and yet, as each month went by, I drifted down to my office each night to do just a bit more. "If I don’t," I’d tell myself, "I’ll just be overwhelmed tomorrow." And because our remote workforce is all over the world, the work never stopped coming in.

The slippery slope

My family began to tease me. Then they started to complain about my lack of availability and little things falling through the cracks. I thought they just weren't used to me working full time. My kids weren't little anymore and could certainly help out more around the house, and my husband could take on more of the household chores as well. They all did rise to the occasion to an extent, but I began to wonder when doing something you love becomes too much. Where is the line between feeling good about the work you do and feeling bad about what you're not doing?

I decided to sit down and add up the number of hours I was working each week. I was surprised when the total easily reached 60 hours. Seeing that number made me recognize that the scales had shifted too far. Although our staff members may know the company policy encourages a reasonable work week with lots of flexibility, they'll still look at what people are actually doing in order to interpret what is really expected. I felt it was my responsibility to show others that the way I'd been working shouldn't be the expected norm.

But I'm having fun!

The thing is, I’m still having fun. And I don't feel any effects of burnout. Still, I'm committed to cutting back. After all, I don't really know what will happen if I don't pace myself, and I'm in this for the long haul. I wonder, though...if I approach this experiment as part of my responsibility to the company, does that count as work? Hmmm…

I'm interested in hearing from you! Do you have any suggestions of things that have worked for you to scale back when working in a job you love? If so, hit me up over Twitter @KarenFruchtman.


About the author: Karen Fruchtman oversees all activities that allow the staff at Particular Software to shine. Karen is energized by the opportunity to help craft a company culture that attracts and retains such a stellar group of people. Blogging is a new venture for her, and just one example of how the team in Particular explores new ways of enhancing their careers. If you’re interested in learning more about us, check out our Careers Page.


But all my errors are severe!


I can't draw to save my life, but I love comics, especially ones that capture the essence of what it's like to be a software developer. They capture the shared pain we all go through and temper it with humor. Luckily, I no longer work for large corporations, so it's easier now to read Dilbert and laugh without also wincing.

But this comic from Geek&Poke hit me hard. The pain is just too fresh.

Simply Explained: Severity by Geek&Poke, reused via CC BY 3.0 license

This spoke to me, drenching me in waves of nostalgia. Of course, nostalgia is the wrong word. It indicates a longing and wistfulness that I do not feel in any way. Instead, it was the stark regret of a life of sorting through hundreds of errors, the result of a "Log Everything" outlook. It is not a happy memory.

And of course, all those errors I mentioned were severe.

All errors are not created equal...

Why is it that all errors are treated the same? They all seem to be logged as severe. Errors mean the code failed; the exception was not handled, and the code was unable to continue. The notion of a "minor error" does seem questionable. And so, every error is black or white. The code either works or it doesn't. There is no room for shades of gray. Every error is severe.

Every organization has its own unwritten rules on how to take care of these supposedly severe errors. The developer has the best of intentions: a desire to know when errors occur. However, it results in the logging of every single error, and most times those errors end up ignored. Not at first, of course. But after the first couple of investigations, a sense of "The Boy Who Cried Wolf" sets in and the errors go unheeded. That is, until something goes horribly wrong.

Errors, or more specifically exceptions, by their very nature are supposed to be exceptional. There are normal actions that occur through everyday use of your application, and then there are exceptions which must be reported.

And so, as software developers, we log and report the exceptions. They usually fall into one of two buckets: warnings and errors. Part of the fault lies with commonly used logging libraries that don't give us much ability to differentiate any further. Sometimes the developer writing the code randomly picks one of the two buckets, and even then the difference seldom holds any true meaning.

But this method describes an exception's severity, which is the wrong approach. Instead, we should be concentrating on a new axis entirely — one that requires us to think about our exceptions in a completely different light.

New breeds of exceptions

Let's think about exceptions in different terms. Suppose we run a piece of code and an exception occurs. What would happen if we just ran that code again right after it failed? We'd see that a few categories start to emerge.

Transient exceptions

Let's consider a common scenario. You have some code that updates a record in the database. Two threads attempt to lock the row at the same time, resulting in a deadlock. The database chooses one transaction to succeed and the other fails. The exception message Microsoft SQL Server returns for a deadlock is this:

Transaction (Process ID 58) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

This is an example of a transient exception. Transient exceptions appear to be caused by random quantum fluctuations in the ether. If the failing code is immediately retried, it will probably succeed. Indeed, the exception message above tells us to do exactly that.

Semi-transient exceptions

The next category involves failures such as connecting to a web service that goes down intermittently. An immediate retry will likely not succeed, but retrying after a short time (from a few seconds up to a few minutes) might.

These are semi-transient exceptions. Semi-transient exceptions are persistent for a limited time but still resolve themselves relatively quickly.

Another common example involves the failover of a database cluster. If a database has enough pending transactions, it can take a minute or two for all of those transactions to commit before the failover can complete. During this time, queries are executed without issue, but attempting to modify data will result in an exception.

It can be difficult to deal with this type of failure, as it's frequently not possible for the calling thread to wait around long enough for the failure to resolve.
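If the caller can afford to wait, one coping strategy is a retry helper that pauses between attempts. Here's a minimal sketch (it shares the weaknesses of the classic retry approach discussed below, so treat it as illustration rather than a recommendation):

using System;
using System.Threading.Tasks;

static class SemiTransientRetry
{
    public static async Task RetryWithDelay(Func<Task> action, int tries, TimeSpan delay)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action();
                return; // success
            }
            catch (Exception) when (attempt < tries)
            {
                // Semi-transient failures often resolve after a short
                // wait, so pause instead of retrying immediately.
                await Task.Delay(delay);
            }
        }
    }
}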

Systemic exceptions

Outright flaws in your system cause systemic exceptions, which are straight-up bugs. They will fail every time given the same input data. These are our good friends NullReferenceException, ArgumentException, dividing by zero, and a host of other idiotic mistakes we've all made.

Besides the usual mistakes in logic, you may have encountered versioning exceptions when deserializing objects. These exceptions occur when you serialize one version of an object, then attempt to deserialize it into another version of the object. Like the other exceptions mentioned above, you can rerun your code as often as you like and it'll just keep failing.

In short, these are the exceptions that a developer needs to look at, triage, and fix—preferably without all the noise from the transient and semi-transient exceptions getting in the way of the investigation.

Exception management, the classic way

Many developers, at one point or another, will create a global utility library and write code similar to the following:

public static void Retry(Action action, int tries)
{
    int attempt = 0;
    while (true)
    {
        try
        {
            action();
            return; // success
        }
        catch (Exception)
        {
            attempt++;
            if (attempt >= tries)
                throw; // retries exhausted, let the exception escape
        }
    }
}

While this might work in some transient situations, it becomes messy when you throw transactions into the mix. For example, this code could be wrapped within a transaction and the action itself may start its own transaction. Maintaining the proper TransactionScope can be a challenge.
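For example, here's a sketch of how the pieces can interact (ChargeCreditCard and order are hypothetical stand-ins for the real work):

using (var scope = new TransactionScope())
{
    Retry(() =>
    {
        // The action may open its own TransactionScope. If one failed
        // attempt aborts the ambient transaction, every subsequent
        // retry inside this scope will fail no matter what.
        ChargeCreditCard(order);
    }, tries: 3);

    scope.Complete();
}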

In any case, some exceptions will escape this construct and will be thrown, making their way into our log files or error reporting systems. When that happens, what will become of the data that was being processed? If we were trying to contact our payment gateway to charge a credit card for an order, we would have lost that information and probably the revenue that went with it, too. As the exception bubbled up from the original code, we lost the parameters to the method we originally called.

At that point, we have no choice but to display an error screen to our user and ask them to try again. And how much faith would you put in a payment screen that said, "Something went wrong, please try again"? (In fact, most payment screens explicitly tell you not to do this and actually take steps to prevent it.)

Furthermore, to avoid data loss, you'd need to log not only the exception but also every single argument and state variable involved in the request. Unfortunately, hindsight is always 20/20 on what information should have been logged.

So let's take the next step and make that logged data more explicit—modeling the combination of method name and parameters as a kind of data transfer object (DTO) or a message.
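Sticking with the payment example, such a message might look like this sketch (the type and property names are illustrative):

// Everything needed to charge the card, captured as plain data.
// If processing fails, this object can be stored and retried later.
public class ChargeCreditCard
{
    public Guid OrderId { get; set; }
    public decimal Amount { get; set; }
    public string CardToken { get; set; }
}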

Switching to a message-based architecture

With a "conventional" application comprised of many functions or methods all running within one process, we exchange data between modules by passing values into functions, then wait for a return value. This expectation of an immediately available return value is limiting, as nothing can proceed until that return value is available.

What if we structured our applications differently so that communicating between modules didn't have to return a value and we didn't have to wait for a response before proceeding? For example, instead of calling a web service directly from our client code and waiting for it to respond, we could bundle up the data needed for the web service into a message and forward it to a process that calls the web service in isolation. The code that sends the message doesn't necessarily need to wait for the web service to finish; it just needs to be notified when the web service call is done.

Once we start sending messages instead of calling methods, we can build infrastructure to handle exceptions differently. In the process, we'll gain all sorts of amazing superpowers. If there's a failure, we can retry the message if necessary since nobody is waiting on an immediate return value. This retry ability is powerful and forms the basis for how we can build better handling mechanisms for all our exceptions.

Handling transient exceptions

Consider this architecture: there's a message object which contains all of the business data required for an operation and a different object which contains code to handle the message—a "message handler." Since we wouldn't want to have to remember to include error-handling logic in each and every message handler, it makes sense to have some generic infrastructure code wrap the invocation of our message handler code with the necessary exception handling and retry functionality.
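For example, a handler for the ChargeCreditCard message sketched earlier might look like this (the shape is illustrative, not a specific framework's interface):

public class ChargeCreditCardHandler
{
    public void Handle(ChargeCreditCard message)
    {
        // contact the payment gateway using the data carried by the message;
        // any exception thrown here is dealt with by the surrounding infrastructure
    }
}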

With this architecture in place, we can apply a retry strategy to our messages as they come in. If the message handler succeeds without throwing an exception, we will be done. However, if an exception is thrown, our retry mechanism will immediately re-execute the message handler code. If we get another exception when the code is re-executed, we can try again and continue retrying until either the code succeeds or we reach some pre-determined cap on the number of retries.

Because transient failures are unlikely to persist for very long, this kind of infrastructure support for immediate retries would automatically take care of these types of exceptions.
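A minimal sketch of that wrapper, assuming the infrastructure hands us the handler invocation as a delegate (the names are illustrative):

static void InvokeWithImmediateRetries(Action handleMessage, int maxImmediateRetries)
{
    var attempt = 0;
    while (true)
    {
        try
        {
            handleMessage(); // invoke the message handler
            return;          // success, we're done
        }
        catch (Exception)
        {
            attempt++;
            if (attempt > maxImmediateRetries)
                throw; // hand the failure to the next layer (delayed retries)
        }
    }
}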

Handling semi-transient exceptions

Recall the earlier example of integrating with a third-party web service that is prone to occasional outages. Immediately retrying the code that calls it three or five or even 100 times isn't likely to help. For semi-transient exceptions like this, we should add a second retry strategy that would reprocess the message after some predetermined delay.

It's possible to have several rounds of retries with this strategy and we can increase the delay each time. For example, after the first failure, we could set the message handler to retry after 10 seconds. If that fails, we can retry after 20 seconds, then 40 seconds, and so on until we hit our predetermined maximum. (Many networks use this kind of exponential backoff to retransmit blocks of data in the event of network congestion.)
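As a sketch, that schedule could be computed like this (the numbers mirror the example above; they're not a recommendation):

static TimeSpan? NextRetryDelay(int round, int maxRounds)
{
    if (round >= maxRounds)
        return null; // no more delayed retries: treat the failure as systemic

    // 10s, 20s, 40s, ... doubling each round
    return TimeSpan.FromSeconds(10 * Math.Pow(2, round));
}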

With this strategy in place, we can be confident that almost all semi-transient errors will be resolved automatically, at which point the message will be processed successfully. If a message repeatedly fails, even after both sets of retries, we can be fairly sure that we are now dealing with a systemic exception.

Handling systemic exceptions

If we have a systemic exception, our retry logic may never break free, preventing a thread from doing useful work and effectively "poisoning" the queue unless the message is removed by some other means. For this reason, messages that cause systemic exceptions are called poison messages.

Our message-processing infrastructure must move these poison messages to the side, into some other queue so that regular processing can continue. At this point, we are reasonably confident that the exception is an actual error that needs some investigation.
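Here's a sketch of that outermost safety net (the delegate names are illustrative; a real implementation would also forward the message itself and its headers):

static void HandleWithSafetyNet(Action handleMessage, Action<Exception> moveToErrorQueue)
{
    try
    {
        handleMessage(); // includes the immediate and delayed retries
    }
    catch (Exception ex)
    {
        // all retries exhausted: set the poison message aside so that
        // regular processing can continue
        moveToErrorQueue(ex);
    }
}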

It's likely that someone would need to take a closer look at these poison messages to figure out what went wrong in the system before deciding what to do with them. Perhaps there was a simple coding error, and we can deploy a new version that will fix it. After the fix is deployed, because we still have our business data in a message, we can still retry it by returning the message to its original queue. Furthermore, we didn't lose any business data even though there was a bug in the system!

In an environment with multiple queues serving regular message processing logic, it would make sense to create a single centralized error queue to which all poison messages could be moved. That way, we would only need to look at a single queue to know that everything's working properly or how many system errors we need to deal with.

As an added benefit of moving poison messages to an error queue, the developer doesn't have to hunt through log files to get the complete exception details. Instead, we can include the stack trace along with the message data. Then the developer will have both the business data that led to the failure as well as the details of the failure itself—something not usually captured with standard logging techniques.

Summary

Getting 2413 severe exceptions overnight on a regular basis is a huge organizational failure, in addition to a technical one. It's bad enough that, due to the cognitive load of all these exceptions, they will probably get forgotten in an archive, unlikely ever to get fixed, relegated to a DBA to clean out periodically. What's worse is all of the countless developer hours, an organization's most precious resource, wasted in logging all these exceptions that can never be addressed practically.

Three things are needed to deal with these exceptions in a better way. First, we need to analyze exceptions based on how likely they are to occur, not on severity. Second, we need to employ an architecture based on asynchronous messaging so that we have the ability to replay messages as many times as necessary to achieve success. And finally, we need messaging infrastructure that takes into account the differences between transient, semi-transient, and systemic exceptions.

Once we do this, the transient and semi-transient exceptions that seem to "just happen" through no real fault of our own tend to fade away. Deadlocks, temporarily unavailable web services, finicky FTP servers, SQL failover events, unreliable VPN connections—the vast majority of these issues just resolve themselves through the retries, leaving us to focus on fixing the exceptions that actually mean something.

Additionally, we can use this pattern to add value to our system and our organization. Processes that fail in a message handler don't have to result in an error message displayed to a customer. This relieves pressure on first-level support, as many exceptions happen behind the scenes without the customer's knowledge, leading to a drastic reduction in support calls.

So if you have ever had to scan through thousands of "severe" exceptions looking for the real problem, you may want to take a look at message-based architectures to ease the burden of error reporting, freeing you up for more important things...

Like reading comics.

Dish washing and the chain of responsibility


In our house, cleaning out the dishwasher is a shared chore. My son starts the unloading process by removing a dish or utensil from the dishwasher. If he can put it away, then he does. If the proper location for the dish is out of his reach, then he passes it to his mother. She then goes through the same process; put the dish away if she can, or pass it off to the next person in line, which is me. When I get handed a dish I will put it away and, since I'm 6'4" (1.92m) tall, I can reach all of our cupboard space, which means that the process ends with me.

In our kitchen, the handing off of work can be thought of as an implementation of the chain of responsibility pattern. Each person in the family is a link in that chain. The chain of responsibility starts with my son removing a dish from the dishwasher and ends with me. The process of putting away dishes isn't all that different from the handling of messages in a message-based architecture. The system gets messages from a queue or stream and feeds them into the next transformation or piece of business logic.

Each person in our kitchen has a clearly defined set of questions that they ask themselves to determine if they should put the dish away or hand it off to the next process. With these clearly defined questions, each person, or link, has a well-encapsulated set of rules that they apply as they deal with each dish.

A chain of responsibility applies to more than just message-based architectures and emptying dishwashers. You'll find different variations of the pattern in frameworks and middleware like OWIN, FubuMVC, Express.js, and more. They usually share a common approach to the chain of responsibility implementation: nesting functions inside functions.

Here is a simple way to grasp the concept of the chain of responsibility.

TL;DR

  • The chain is essentially a list of links. To process a message, links are picked from the list one by one and executed recursively until the end of the list is reached.
  • The functional programming equivalent of the chain of responsibility pattern is function composition.

Unloading as a chain of responsibility

The process of unloading our dishwasher is, at its heart, a chain of responsibility. At our house, we have three steps in the chain and our output is an empty dishwasher. With this in mind, let's explore the code representation of our dishwasher-unloading chain so that we can see the chain and its links. A single person, a link in the chain, can be represented as just a method.

static void Person(Action next)
{
    // this link's own work goes here
    next(); // then hand off to the next link in the chain
}

The method Person above has a single parameter called next of type Action. The Action type is a delegate that can point to any method returning void and accepting zero parameters. Passing in the delegate allows us to compose multiple individual elements together into a chain of responsibility. As you can see in the sample below, the ManualDishwasherUnloading method contains the chain of the individual links. You can also see how we've represented each person, or a link in the chain, as a method that matches the signature for an Action.

public void ManualDishwasherUnloading()
{
    Son(() => Wife(() => Husband(() => Done())));
}

static void Son(Action next)
{
    // son can reach? return; else:
    Console.WriteLine("Son can't reach");
    next();
}

static void Wife(Action next)
{
    // wife can reach? return; else:
    Console.WriteLine("Wife can't reach");
    next();
}

static void Husband(Action next)
{
    Console.WriteLine("Husband put dish away");
    next();
}

static void Done()
{
    Console.WriteLine("Dish put away!");
}    

If I, as the husband, am the only person who can reach to put the dish away, the output of this code is:

Son can't reach
Wife can't reach
Husband put dish away
Dish put away!

Visualized, the code above looks like this:

Chain of Responsibility

Of course, deeply nested method chaining like we've shown in ManualDishwasherUnloading isn't something we'd enjoy writing by hand over and over. By itself, the code isn't complex, but as the method chain grows, visualizing and understanding it becomes more and more difficult. Changing or adding to a long chaining sequence can easily introduce errors or create unintended side effects.

Luckily, there's an even more flexible and maintainable way of building a chain of responsibility.

A better chain of responsibility

The purpose of the chain of responsibility is to create a composition in which the links work together and are executed in a predefined order. In the examples above, we did it by writing it out method call by method call. A simpler and more generic approach is to create a list of actions. All actions are then picked from that list and executed one by one until the end of the list is reached:

public void MoreFlexibleDishwasherUnloading()
{
    var elements = new List<Action<Action>>
    {
        Son,
        Wife,
        Husband,
        next => Done()
    };

    Invoke(elements);
}

static void Invoke(List<Action<Action>> elements, int currentIndex = 0)
{
    if(currentIndex == elements.Count)
        return;

    var element = elements[currentIndex];
    element(() => Invoke(elements, currentIndex + 1));
}

It might be confusing at first that we declare a List<Action<Action>>. Why not just List<Action>? It's because the signature of methods stored in the list is void LinkInTheChain(Action next). We want the ability to execute them in a generic way. The Invoke method takes an Action<Action> from the list, then invokes it by recursively passing itself as the next function parameter. The process terminates when the end of the list is reached.

The output of this code would be:

Son can't reach
Wife can't reach
Husband put dish away
Dish put away!

This type of generic approach is probably overkill if all we want to do is compose four methods together. But as you introduce more links into the chain, the generic approach quickly begins to shine. For example, say you want to surround a link with a wrapper that filters out exceptions or performs some other cross-cutting behavior:

static void IgnoreDishStillWetException(Action next)
{
    try {
        next();
    }
    catch(DishStillWetException) { }
}

It's easy to add IgnoreDishStillWetException around any step in the chain of responsibility when it is needed. As long as the cross-cutting behavior matches the same Action<Action> signature as the other links, you can create and add new links to the chain of responsibility.
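For example, wrapping a single link is just another lambda in the list, reusing the same elements from MoreFlexibleDishwasherUnloading:

var elements = new List<Action<Action>>
{
    Son,
    Wife,
    next => IgnoreDishStillWetException(() => Husband(next)),
    next => Done()
};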

Message handling as a chain of responsibility

Handling a message from a queue can also be implemented as a chain of responsibility.

public void MessageHandlingChain()
{
    var elements = new List<Action<Action>>
    {
        RetryMultipleTimesOnFailure,
        PickMessageFromTransport,
        DeserializeMessage,
        DetermineCodeToBeExecuted,
    };

    Invoke(elements);
}

Each operation, such as picking up the message from the transport of choice, can be thought of as a link in the message-handling chain. Because the chain is stored in a List, links can be added, removed, or reordered. You can make those changes at design time or at runtime. There is a lot of flexibility in this chain of responsibility.
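As an illustration, the first link above could be written in the same Action<Action> shape as every other link (a sketch; the retry count is arbitrary):

static void RetryMultipleTimesOnFailure(Action next)
{
    const int maxAttempts = 3;
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            next(); // run the rest of the message-handling chain
            return;
        }
        catch (Exception)
        {
            if (attempt >= maxAttempts)
                throw; // give up and let the infrastructure deal with it
        }
    }
}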

Finishing the chain

With the analogy of my family and me emptying the dishwasher, we saw how to conceptualize the chain of responsibility and its links as a simple chain of function calls wrapped in each other in a nested fashion. We have also seen that a message handling pipeline is, in itself, a chain of responsibility.

But while these chains are flexible, they are also very linear. As I showed in my previous post, Async/Await: It's time!, the domain of message handling is primarily focused on IO-intensive work, and the answer to IO-heavy workloads is a task-based API combined with the async/await keywords.

So, what would happen if we used async/await on each of the links in the chain of responsibility? You'll just have to await for me to cover that next() time... ;)

Additional readings

NSBCon 2015: All about Transports


Knowing which NServiceBus transport is best for your application is not easy. There are many factors involved in selecting a message transport: distributed transactions, legacy integration, cross-platform capabilities, and cloud deployments are a few that might be considered.

At NSBCon 2015 Andreas Öhlund outlines the different transports that are available for NServiceBus. He covers the highlights and lowlights of each. Rather than telling you which transport is the right one, Andreas provides you with the tools to make that decision yourself, within the context of your project.

If you're considering which transport to use on a new project or wondering if the transport you have chosen will continue to work for you, then watch Andreas's talk here:


Andreas Öhlund is a software engineer at Particular Software, the makers of NServiceBus. When not coding, he enjoys wandering randomly through furniture stores eating meatballs.

NSBCon 2015: Decomposing the Domain


In his presentation at NSBCon 2015, Gary Stonerock II talks about transforming a tightly-coupled synchronous process into an SOA/messaging-based solution using NServiceBus. Gary's team recently finished a rebuild of their clinical trial platform, and he shares with us some of the challenges, such as establishing stronger encapsulation and isolation, that they faced and how they overcame them.

If you are trying to figure out how to tackle a tightly coupled existing codebase, watch Gary's talk here:


Gary Stonerock II is the lead systems architect at Almac Clinical Technologies.

NSBCon 2015: Integration Patterns with NServiceBus


Dealing with legacy systems is difficult. Complete rewrites take time. Components and functionality need to be migrated in stages while the remainder of the application stays operational. There's also the issue of integrating with third party systems and the impact that these can have on any system you're trying to improve.

Jimmy Bogard deals with these types of legacy system problems on a daily basis. Most of his work is rescuing rewrite projects, some of which are the second or even third attempt to get rid of a legacy system. At NSBCon 2015, he talks about various integration patterns he uses in his projects and explains how NServiceBus helped him solve a few typical challenges.

If you're trying to deal with slow batch jobs, keep data synchronized between old and new systems, or handle a number of other scenarios, watch Jimmy's talk here:


Jimmy Bogard is the chief technical architect at Headspring, an Austin-based consulting company.

NSBCon 2015: Opening Keynote


Udi Dahan opens NSBCon 2015 by summarizing the current state of the NServiceBus ecosystem. He outlines the foundation of our practices over the last year by comparing Particular Software to a duck on water: it looks calm on the surface, but below the waterline, it's paddling like hell.

This ethos of continually working to stay ahead of the issues that NServiceBus users encounter sets the stage for a discussion on the current and future state of the platform. Udi explains the retirement of ServiceMatrix, new features in ServiceInsight, and the future of NServiceBus with the coming release of v6.

If you want to know more about the direction of the Particular Service Platform, watch Udi's keynote here:


Udi Dahan is the founder and CEO of Particular Software and the creator of NServiceBus.

NSBCon 2015: RavenHQ in the Cloud


Not all NServiceBus implementations are on-premise. More and more people are releasing applications into the cloud. Jonathan Matheus speaks at NSBCon 2015 about the challenges and rewards of working in the cloud. As the official cloud hosting provider for RavenDB, RavenHQ uses NServiceBus to handle database provisioning and usage-based billing.

If you want to hear about scaling, elastic scale, health monitoring, and cloud service availability with NServiceBus, Azure, and AWS, watch Jonathan's talk here:


Jonathan Matheus is an independent consultant and one of the founders of RavenHQ, the official cloud hosting provider for RavenDB.


NSBCon 2015: Full-Stack, Message-Oriented Programming with Akka.NET Actors


The Moore's Law party is over. We can no longer make processors more powerful simply by making them faster. We have no choice but to embrace multiple processors, multi-core processors, and multiple threads in our code.

But how do you do this? In his session at NSBCon 2015, Andrew Skotzko introduces us to message-based programming with the actor model and Akka.NET. Andrew shows how immutable messages, combined with the actor behaving as a unit of concurrency, make taking advantage of all those processors and cores much easier.

Check out the video of Andrew's talk here:


Andrew Skotzko is the co-founder and CEO of Petabridge and a contributor to Akka.NET.

NSBCon 2015: Behind the Scenes at Particular Software


Have you ever wondered how Particular Software makes NServiceBus? At NSBCon 2015, David Boike outlines the systems, tools, and methods used in Particular to manage a large number of GitHub repositories, along with the techniques that make sure all of their releases follow semantic versioning.

You're sure to find some gems to take to your development team in this video:


David Boike is a solution architect at Particular Software, the makers of NServiceBus. He would prefer it if everything was automated, but his children continue to resist all his attempts at doing so.

NSBCon 2015: Top Mistakes Using NServiceBus


One of the great things about NServiceBus is that it is so flexible. But with that flexibility comes the opportunity to shoot yourself in the foot. How, you ask? At NSBCon 2015, Kijana Woodard shares the top mistakes (14 in all) that, in his opinion, developers routinely make when using NServiceBus.

From row-based database operations instead of set-based ones to using callbacks as a permanent solution, Kijana covers a myriad of different issues that are inadvertently, or possibly intentionally, introduced to many NServiceBus projects. Watch Kijana's NSBCon 2015 presentation and learn how to avoid the pitfalls that many developers make.


Kijana Woodard is an independent consultant based in the Dallas/Fort Worth area. He is an NServiceBus champion and longtime proponent of messaging-based architecture. He’s made roughly half the mistakes in his talk himself and is only a little ashamed to admit it.

NSBCon 2015: Platform-Oriented Architecture


In the second keynote of NSBCon 2015, Ted Neward introduces the concept of Platform-Oriented Architecture (POA) as the logical successor to the SOA/REST architectural approaches in use today. POA is a developer-focused approach with an established communication backplane, an entity definition, a built-in agent model, and a set of expectations around various execution topics. Ted also talks about the relationship between POA and operating systems, programming languages, and database engines.

If you want to learn more about Ted's vision of POA, watch his video here:


Ted Neward is "The Dude of Software."

What Starbucks can teach us about software scalability


In 2004, Gregor Hohpe published his brilliant post "Starbucks Does Not Use Two-Phase Commit." When I read it, my time working at Starbucks during my college years suddenly became relevant. Over the years, I gradually realized there's even more that programmers can learn from the popular coffee chain.

Although many people may want to build scalable software, it can be much harder than it first appears. As we work on individual tasks, we can fall into a trap, believing all things are equally important, need the same resources, and happen synchronously in a predefined order.

It turns out they don't—at least not in scalable systems, and certainly not at Starbucks.

How to make coffee

Preparing coffee at Starbucks is a four-step process. First, customers stand in line (a queue) at the counter and place their orders, following the first-in-first-served rule. Second, the employee (barista) takes an order from the customer and accepts the payment. Third, they start preparing the drink. Fourth, when it's ready, they place the drink on the counter and call out the customer's name.

Starbucks basic process

Although this may sound like a reasonable model, it can quickly lead to long lines. It's impossible for one person to do more than one thing at a time, so customers start queuing up while the barista works through each order sequentially. If they want to serve more customers, they need to scale. Let's look at ways they can do that.

Scaling baristas

One way Starbucks can scale is to hire super baristas—very talented, fast-working, bright people. They'd need to invest heavily in their development, optimize every aspect of their work, and constantly improve their efficiency. In software, such an approach would be called scaling up (vertical scaling).

The problem with the scaling up strategy is that there's a limit to how fast (and how long) one person can work. At some point, even the super barista won't be able to meet the demand. When this happens, customers will leave the shop frustrated and may not come back.

Similarly, there's a limit to how far we can optimize our software if everything runs sequentially. We just can't buy a 200GHz CPU. Even the biggest CPUs are multi-core, with each core clocking at no more than 3–4GHz.

Another way Starbucks can scale is to organize the work in a way that allows adding more normal workers, which is the essence of concurrent processing in software. After one barista takes an order, another can start preparing it. The first barista can then take another order while the first order is prepared in parallel.

You might think that the best idea would be to consider concurrent processing only after the demand reaches a certain level. Unfortunately, it's not that simple. There's no magical switch that will allow us to turn on concurrency just when we need it. We need to prepare in advance.

Starbucks knows that. When a new store opens, even if they have only one employee per shift, everything needed for concurrency is in place from day one. They are ready to add more people at any time.

Lesson learned: We can't apply concurrent processing easily if we don't build our system in a way that supports it.

Now let's look at how Starbucks accomplishes this.

It starts with messaging

If you've ever ordered coffee at Starbucks, you might have noticed little boxes on the cup filled with symbols. These symbols are a sort of shorthand used by the baristas to quickly identify the drink as well as any extras (e.g., whipped cream, foam, etc.).

The cup, or message, is essential for communication between employees. It signals to the barista that a beverage needs to be created and the symbols written on it provide details on what kind of beverage to prepare. Even if the coffee shop isn't busy and there's only one person servicing customers up front, they will still add symbols to the cup.

At first glance, this might seem like extra work. But if a large group of customers suddenly enters the shop, the other employees from the back can immediately jump in to help. Without the need for any additional communication overhead, they can start making drinks based on the messages.

Lesson learned: Sudden spikes aren't problematic if we can easily add more workers anytime and divide the work among them. Using messages is one way to do that.

Divide and conquer

As described earlier, the whole coffee making process can be covered by a single employee—a barista. But the default setup at Starbucks is to have one employee (a cashier) taking orders and payments and another (a barista) making drinks.

Starbucks busy day

Usually, the slowest part of the process is preparing coffee, which is why multiple baristas prepare drinks when the shop gets busy. Often they'll take cups from the same pile and share the work evenly. This is an example of the Competing Consumers pattern.
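In code, competing consumers can be as simple as several workers reading from one shared queue. Here is a minimal in-memory sketch (PrepareDrink is a hypothetical method; a real system would use a durable message queue rather than a BlockingCollection):

var cups = new BlockingCollection<string>();

// Three baristas compete for cups from the same pile.
for (var i = 0; i < 3; i++)
{
    new Thread(() =>
    {
        foreach (var cup in cups.GetConsumingEnumerable())
        {
            PrepareDrink(cup); // whichever barista is free takes the next cup
        }
    }).Start();
}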

There can be scenarios, however, in which this approach runs into trouble. Let's say there are three baristas working with one coffee machine and one Frappuccino machine. Three customers order a coffee and the next one orders a Frappuccino. The person taking orders queues up four cups with the appropriate symbols on each. Each barista grabs one of the coffee cups. The first one starts making their drink, while the other two are blocked waiting for the coffee machine.

We can avoid this contention for resources by dividing up the work. One way to do this is to separate messages into more fine-grained types so that they can be handled differently. For example, we've seen how Starbucks uses the cup as a message to indicate that a drink needs to be prepared. But the system also differentiates between hot and cold drinks: hot drinks are served in paper cups and cold ones in plastic cups. When we receive three orders for hot coffee followed by one for a Frappuccino, we now have three paper cups and one plastic one in two different piles. The first barista grabs the paper cup from the first pile and starts preparing the drink. The second barista, seeing the coffee machine is busy, grabs the plastic cup from the second pile and uses the Frappuccino machine instead. Now we have drinks from both piles being prepared in parallel.

This kind of work division, in which baristas divide tasks and work in parallel, is called partitioning.

Starbucks parallel work

Lesson learned: It turns out partitioning is a crucial element of an effective scaling strategy. Not all work needs the same level of scaling. Small tasks that are done fast can be done by a single worker while multiple workers take care of the more demanding, slower tasks. By using partitioning, we can scale each activity independently.

Not all work is equally important

One of the things that makes Starbucks successful is that they've trained their staff in the importance of recognizing the regulars. Take the guy who comes in every morning to get two venti americanos and two grande lattes to bring to his team. Or the woman who every Wednesday orders a tall caramel macchiato and then stays in the shop for an hour to read her book.

If a barista notices the "tall caramel macchiato woman" entering the shop on Wednesday, they will start preparing her favorite drink even before she comes to the counter. The customer gets a pleasant surprise when she never has to say what she wants. The cashier already knows her usual drink, so they only ask her how she's doing and take the payment. Before the payment is completed, her coffee is already waiting for her at the counter.

You might be surprised how high a percentage of Starbucks' customers are regulars. Giving them the best possible experience is a high priority. Quite often, they end up getting their drinks faster than other customers. This makes them feel important and encourages them to come back, thus increasing their value to the company.

Starbucks with a regular customer

Lesson learned: Some tasks are more important than others. By organizing standard activities into reusable, independent building blocks, we can easily modify the process to provide superior service for the more valuable tasks when the need arises.

Not all mistakes are worth preventing

In all the examples above, Starbucks employees needed to verify that customers paid before receiving their coffee. To make sure that happens, baristas could ask customers to show their receipts before handing over a drink. But that's not how it actually works.

What Starbucks discovered is that very few people try to get coffee without paying. Their analysis showed it's more profitable to keep baristas focused on fulfilling orders instead of preventing the occasional lost coffee. If someone happened to take the coffee you ordered (which usually only happens by mistake), the barista would prepare a new one for you, no questions asked.

Starbucks more trust

Lesson learned: To build scalable systems, we need to embrace the idea that some failures are inevitable. It's too expensive to try to prevent them completely. Instead we should focus on making sure we can detect issues quickly and compensate for them when they arise.

Summary

What looked like a simple four-step process for making coffee evolved into an interesting business process. What seemed exceptional and rare at first glance turned out to be an essential aspect of the business.

Things like sudden spikes in demand or failures can happen multiple times per day. Designing a system that handles them well requires questioning common assumptions. Often the first model that comes to mind won't address such concerns. Also, there are many more exceptional situations to consider. For example, cancelling orders is an interesting problem all on its own.

As the example of Starbucks shows, if we followed a naïve approach, our business would not be able to expand to serve a larger number of customers. Our service level would drop as we got more and more customers, to the point where they would stop coming. Instead, we need to organize our work in such a way that we can meet increasing demand. In the end, building systems that scale is just as much about rethinking our business processes as it is about technology.

For more information about how to build scalable software, check out the following resources:

About the author: Weronika Łabaj is a developer at Particular Software. She is passionate about providing business value with software, exploring new paradigms, and challenging the obvious. At Starbucks, she always goes for a tall caramel macchiato.
