16 May 2018

7 secrets to scaling with microservices


Follow these steps to ensure a successful transition from monolithic app to distributed microservices. They worked for Sumo Logic.

Adopting microservices for modern applications is no longer a differentiator, but an imperative for organizations that want to stay relevant in today’s market. The pace of technological innovation has enterprises moving faster, smarter, and leaner, meaning the modernization of IT is required in order to move—and stay—ahead of competitors and scale the business.

While many organizations tell stories about their shift from a monolithic application to distributed microservices, the Sumo Logic story is even a bit more radical. Prior to starting Sumo Logic, my team spent almost 10 years building and scaling a similar solution, only on a monolithic architecture, delivered as enterprise software. We started Sumo Logic specifically to build the next generation of a log management system in a new way: as a distributed, scalable, multi-tenant service.

We took away many key lessons from our experience of incrementally implementing a very large-scale data processing system for logs and metrics. To save you from the difficulties we had to persevere through, I offer here seven tips on how to leverage microservices to scale your business.

#1. Implement product development units

First, you need to set up your teams. We currently apply a model called product development units (PDUs). PDUs are complete units that own a set of microservices. This helps keep smaller teams focused on specific projects and avoids complications that come with having too many hands on one set of microservices. This is essentially the idea of the “two-pizza team” rule made famous by Jeff Bezos at Amazon, meaning if you can’t feed a team with two pizzas, then the team is too large.

While PDUs are fairly strong because of their operational knowledge and expertise around specific parts of the product, there are downsides when you need to roll out crosscutting changes. (One example of a project that involved all of our microservices was the transition from Java 7 to Java 8.) Getting all the PDUs to execute the crosscutting change is not easy, but if your team communicates effectively it can be done. We usually nominate one person to own all the communication and cat-herding around such a change, and that person sends around a color-coded list of microservices to a relatively large distribution list. Folks don’t want to see their services show up there in RED.

#2. Shift your organization structure to encourage ownership

The way we’ve divided our teams has changed over time. When we started out, it was four or five developers and a handful of microservices. Now the team is much larger and we have 40 or 50 microservices. Initially, having each service owned by one person—or one team—made no sense, because the architecture was still rapidly evolving. But as the team grew, and the cornerstones of the architecture fell into place, we divided the services up among several teams.

It’s critically important that the people building the software take full ownership of it. In other words, they not only build software, but they also test, deploy and run it. They are responsible for the entire lifecycle. We like the accountability this creates. On the flip side, people also need to be given enough liberties to fully own their work and feel empowered to make decisions about the software. If they’re going to be the ones woken up by their code in the middle of the night, they also need to be trusted to make some game-time decisions.

When it comes down to it, this is very similar to a federal system in society, where multiple independent units come together to achieve a greater goal. This limits the independence of the units to some degree, but within the smaller groups, people are empowered to make as many decisions as possible on their own, within guidelines established at a higher level.

#3. Determine service boundaries

When establishing service boundaries, some of it comes down to pure intuition. However, as a starting point, like everybody else we grabbed a whiteboard and started identifying the main components of the system and drew boxes around them.

At first we established a hard boundary for microservices—a strategy we were very adamant about. We initially had separate repositories for everything, even back in the day when we just had three microservices. We had gotten burned before in previous jobs with monolithic systems because people often can’t even follow basic conventions of code organization, such as what should go into which Java package, and which package should never call some other packages. So in order to keep the coupling low, we were radical about this point. We ended up with a monolithic repository and softer boundaries eventually, but it works today because we taught all the teams to respect the boundaries.
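The article does not say how these boundaries are checked, so the following is only a minimal sketch of one possible approach in a shared repository: declare which modules each module may depend on, then flag any import that crosses an undeclared boundary. The module names and the `find_boundary_violations` helper are hypothetical and are not part of Sumo Logic's tooling.

```python
from typing import Dict, List, Set, Tuple


def find_boundary_violations(
    allowed: Dict[str, Set[str]],
    imports: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (importer, imported) pairs that are not declared in `allowed`.

    `allowed` maps a module name to the set of modules it may depend on.
    `imports` lists the dependencies actually observed in the code base.
    """
    violations: List[Tuple[str, str]] = []
    for importer, imported in imports:
        if importer == imported:
            continue  # a module may always use its own code
        if imported not in allowed.get(importer, set()):
            violations.append((importer, imported))
    return violations


if __name__ == "__main__":
    # Hypothetical modules: search may call index, but not the reverse.
    allowed = {"search": {"index", "common"}, "index": {"common"}}
    observed = [("search", "index"), ("index", "search"), ("index", "common")]
    for importer, imported in find_boundary_violations(allowed, observed):
        print(f"boundary violation: {importer} must not depend on {imported}")
```

A check like this could run in a build step, so that a softer boundary in a single repository is still visible whenever someone crosses it.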

From the perspective of initially factoring the system, domain experience also helps: the initial team had some experience in building log management systems. But of course there were some things we got brutally wrong. For example, we initially separated data indexing and searching, when they really should have been the same module or the same service. It’s a very cumbersome process to exchange what is essentially the same data between those clusters, and in addition to adding latency, it has nasty implications for your architecture. In fact, it took us two to three years to figure out how much this initial decision affected a core part of the architecture, but once we did, it made our developers’ lives much easier.

#4. Approach timing of services upgrades with caution

I cannot overstate this next point: Upgrading all services at once is a bad idea. There was once a period when, for weeks at a time, we deployed, saw something go off the rails, rolled back, and started over. Then we deployed, saw another thing go off the rails, rolled back, and started from scratch again. At some point we did this for three, four, five weeks in a row. We eventually realized that this was absurd, and there had to be a more efficient way to go about it. Of course this is common knowledge today, and maybe it was back in 2010, but sometimes people have to discover these fundamental truths themselves.

At the time, we already had about 25 services and a team of about 15 or 20 people. The alternative was to roll out upgrades service by service, which seemed equally absurd. How are you going to upgrade a service, restart it, and make sure it’s running properly 25 times within a two-hour maintenance window? The short answer: You can’t.

We landed somewhere in the middle. We invented a concept called “assembly groups,” which are smaller groupings of the 25 services. Anywhere between two and six of these services would be upgraded together, which turned out to be a much more realistic undertaking for the team.
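As a rough illustration of the assembly-group idea, the sketch below upgrades one group at a time and rolls back only the group that fails its health check. It is a sketch under stated assumptions, not the actual Sumo Logic tooling: the function name, the `deploy`, `healthy`, and `rollback` callbacks, and the service names are all hypothetical.

```python
from typing import Callable, Dict, List


def upgrade_in_groups(
    groups: Dict[str, List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Upgrade one assembly group at a time; return the groups that succeeded.

    Stops at the first group that fails its health check, after rolling back
    only the services in that group. Earlier groups stay upgraded.
    """
    # Validate every group before touching anything (two to six services each).
    for name, services in groups.items():
        if not 2 <= len(services) <= 6:
            raise ValueError(
                f"group {name!r} has {len(services)} services; expected 2 to 6"
            )

    completed: List[str] = []
    for name, services in groups.items():
        for service in services:
            deploy(service)
        if all(healthy(service) for service in services):
            completed.append(name)
            continue
        # Something went off the rails: undo this group only, newest first.
        for service in reversed(services):
            rollback(service)
        break
    return completed


if __name__ == "__main__":
    # Hypothetical groups and no-op callbacks, for illustration only.
    groups = {"ingest": ["receiver", "parser"], "search": ["index", "query"]}
    done = upgrade_in_groups(
        groups,
        deploy=lambda s: print(f"deploying {s}"),
        healthy=lambda s: True,
        rollback=lambda s: print(f"rolling back {s}"),
    )
    print("upgraded groups:", done)
```

The point of the design is that a failure costs one small group's worth of rollback rather than a full redeploy of all 25 services.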

#5. Embrace multiple ways of testing

Testing is difficult with microservices, especially once you move toward a continuous deployment model. To combat this, we invested—and continue to invest fairly heavily—in integration and unit testing, and we use a few different ways to test depending on each individual circumstance.

One approach is what we call a “local deployment,” where you run most of the services on a laptop to get a fully running system. However, a laptop with 16GB of RAM is currently stretched to its limits running all of these services, so that doesn’t scale easily.

The second variation is what we call a “personal deployment.” Everyone here has his or her own AWS account, and our deployment tooling is so fully automated that you can stand up a full instance of Sumo Logic in AWS in about 10 minutes. This is the benefit of being 100 percent born and bred in the cloud—specifically AWS.

The third way is what we call “Stubborn,” which is the name of a stubbing module we built. Stubborn lets you write stubs of microservices that behave as if they were the real service, and that advertise themselves in our service discovery as if they were real services. However, they are a dummy implementation that does something that you have control over. That is much more lightweight than running all of these services.

For example, if you’re working on search components, you always need the service that knows about which customers exist. You also need the service that knows about users and partitions, but you don’t really need the real version with all its complexity and functionality. You just need something that pretends like there’s a user here. We use Stubborn in cases like that.
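Stubborn itself is internal to Sumo Logic and its API is not described here, so the following is only a toy sketch of the general pattern: a stub advertises itself under the real service's name in service discovery and returns canned data that the test controls. `ServiceRegistry`, `StubUserService`, the `"user-service"` name, and the `search` function are all hypothetical.

```python
from typing import Dict, Optional


class ServiceRegistry:
    """Toy in-memory stand-in for a service discovery system."""

    def __init__(self) -> None:
        self._services: Dict[str, object] = {}

    def advertise(self, name: str, instance: object) -> None:
        self._services[name] = instance

    def lookup(self, name: str) -> object:
        return self._services[name]


class StubUserService:
    """Answers like a user service, but only from data the test supplies."""

    def __init__(self, users: Dict[str, dict]) -> None:
        self._users = users

    def get_user(self, user_id: str) -> Optional[dict]:
        return self._users.get(user_id)


def search(registry: ServiceRegistry, user_id: str, query: str) -> str:
    """Code under test: it finds the user service through discovery only."""
    users = registry.lookup("user-service")
    user = users.get_user(user_id)
    if user is None:
        raise PermissionError(f"unknown user {user_id!r}")
    return f"results for {query!r} in partition {user['partition']}"


if __name__ == "__main__":
    registry = ServiceRegistry()
    # The stub registers under the same name the real service would use.
    registry.advertise("user-service", StubUserService({"u1": {"partition": "p7"}}))
    print(search(registry, "u1", "error"))
```

Because the search code only ever looks the dependency up by name, it cannot tell the stub from the real service, which is what keeps this approach so lightweight.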

#6. Bring security front and center

Customers are placing growing importance on compliance with regulations and standards such as GDPR, HIPAA, and PCI.

Given this, security must be built from the ground up. Our deployment tooling is model-driven, and the model not only understands things like clusters and microservices, but also how they talk to each other. We can generate a pretty tight set of security controls from that model.

For example, on a network level, we can generate the firewall rules strictly from an understanding of who talks to whom. This is one of the places where AWS does some of the heavy lifting for us.
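To make the idea concrete, here is a minimal sketch of deriving allow rules from a model of who talks to whom, with everything else denied by default. The `Service` and `AllowRule` types, the service names, and the ports are assumptions for illustration; the real model and the way rules map onto AWS are not described in this article.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Service:
    name: str
    port: int  # port on which the service accepts connections


@dataclass(frozen=True)
class AllowRule:
    source: str
    destination: str
    port: int


def generate_rules(
    services: Dict[str, Service],
    calls: List[Tuple[str, str]],
) -> List[AllowRule]:
    """Emit one allow rule per declared caller-to-callee edge in the model.

    Anything not covered by a rule is assumed to be denied by default.
    """
    rules = set()
    for caller, callee in calls:
        if caller not in services or callee not in services:
            raise KeyError(f"unknown service in edge {caller!r} -> {callee!r}")
        rules.add(AllowRule(caller, callee, services[callee].port))
    return sorted(rules, key=lambda r: (r.source, r.destination, r.port))


def is_allowed(rules: List[AllowRule], source: str, destination: str, port: int) -> bool:
    """Default deny: traffic passes only if an explicit rule matches."""
    return AllowRule(source, destination, port) in rules


if __name__ == "__main__":
    # Hypothetical model: frontend calls search, search calls index.
    services = {
        "frontend": Service("frontend", 8080),
        "search": Service("search", 9000),
        "index": Service("index", 9100),
    }
    for rule in generate_rules(services, [("frontend", "search"), ("search", "index")]):
        print(f"allow {rule.source} -> {rule.destination}:{rule.port}")
```

Since the rules are computed from the model rather than written by hand, a service that gains or loses a dependency gets its network access adjusted the next time the rules are generated.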

Building on the cloud can be advantageous from a security perspective. You simply can’t do anything by hand, so everything has to be automated. And once you are starting to script and automate everything, it suddenly becomes much easier to tie everything down by default.

But security is not just architecture and code; in reality there’s a ton of process around it. Specifically, customers need to see audit reports. You can turn that into an advantage by looking at what the audits require in terms of controls, and then testing against those requirements yourself. When the auditors then test the same controls again, the repetition establishes good habits.

#7. Curate wikis to your organization’s specific needs

Scaling means accumulating a lot of information, and often rapidly. Many development teams use internal wiki pages that they can turn to in order to document intel and best practices. However, if left open to editing by anyone in the organization, these can become a mess of disorganized information. Content like this is hard to manage and maintain, so it’s best to designate a moderator or a curator who is responsible for maintaining your organization’s wiki.

While designating a moderator might feel too centralized or too burdensome for one person to take on, it can be difficult to incentivize a group of developers to keep wikis updated over time. Additionally, while technical employees (particularly at startups) often wear many hats, leadership should prioritize the upkeep of this documentation and best practices.

Wiki moderators should be strategic about the content they curate. Instead of including everything and the kitchen sink, keep it limited to the key concepts that developers need to understand. Too much content can be overwhelming and lead to underutilization.

Divide and conquer with microservices

At its core, microservices architecture is about dividing and conquering. The distributed system that needs to be built is going to be complex—that’s a given. One way to deal with this is to try to limit the complexity of any given part, and by following these seven tips, organizations can more easily tackle the transformation.

Adopting a microservices architecture is one of the best ways to scale both the system and the human systems around it, especially in light of the exploding complexity of functional—and especially non-functional—requirements. While transitioning from monolith to microservices may seem like a daunting task, if you can mobilize developers around the idea and get all hands on deck, not only will the transition be manageable, but it will simplify the process of scaling the organization. This takes a heavy burden off the backs of developers, and ultimately, builds a more efficient system that better serves customer needs.

As co-founder and CTO of Sumo Logic, Christian Beedgen draws on 18 years of experience creating industry-leading enterprise software products. Since 2010 he has been focused on building Sumo Logic’s multi-tenant, cloud-native machine data analytics platform, which is widely used today by more than 1,600 customers and 50,000 users. Prior to Sumo Logic, Christian was an early engineer, engineering director, and chief architect at ArcSight, contributing to ArcSight’s SIEM and log management solutions.