Neosperience Cloud’s journey from monolith to serverless microservices with Amazon Web Services
In this article, we’ll run through the constraints we faced over the years, the constraints that made Neosperience evolve from a monolith to a set of serverless microservices. Maintaining software while shifting to a completely different architecture is far more complicated than starting a new project from scratch. Even when it is possible, many aspects need to be considered carefully, from technical debt management to customer feature releases. It’s more or less like rebuilding a car while driving the 24 Hours of Le Mans.
We did it, and our story can inspire others facing the same challenges we encountered in the last decade. Let’s start from the beginning.
Neosperience is a company that has developed a cloud software platform leveraging cutting-edge technologies over the last decade; this means we have to evolve our product along both a spatial and a temporal dimension.
Evolutionary dimensions
Evolving a software architecture through Space means our architecture needs to support many different features at the same time. Each of them has specific requirements in terms of scalability and its own usage behavior.
For example, let’s consider our Gamification and Nudging Engine and our Image Memorability feature. The former is a service ingesting a massive number of small transactions per second. Clients invoke our actionPerformed endpoint every time a player fulfills an action. Then a set of conditions is checked, and additional steps may need to be performed: updating leaderboards, delivering badges, and assigning prizes. The latter is a façade service in front of a GPU-powered machine learning model. After a client uploads an image, a 20-second task starts to compute its memorability heatmap.
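To make that transaction flow concrete, here is a minimal sketch in Python; the Rule structure, the scoring, and the award logic are hypothetical illustrations, not the actual Gamification Engine implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Player:
    player_id: str
    score: int = 0
    badges: set = field(default_factory=set)

@dataclass
class Rule:
    condition: Callable[[Player, str], bool]  # predicate over player state and action
    reward: str                               # badge or prize identifier

def action_performed(player: Player, action: str, points: int, rules: list[Rule]) -> Player:
    """Ingest one small transaction: update the leaderboard score,
    then check every configured condition and deliver matching rewards."""
    player.score += points
    for rule in rules:
        if rule.condition(player, action):
            player.badges.add(rule.reward)
    return player

# Example: award a (hypothetical) badge once a player crosses 100 points.
rules = [Rule(condition=lambda p, a: p.score >= 100, reward="centurion-badge")]
action_performed(Player("p-1", score=95), "level_completed", points=10, rules=rules)
```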
The requirements of these services are often very different or even in contrast: scalability for small transactions on one side, reliability in handling big images to be processed on the other.
The second dimension, evolving your architecture through Time, requires your code to support changes every time a new wave of innovation hits the ground. In marketing automation, our target market, that happens every 4–6 months. Evolving through time has its roots in the often over-used and misunderstood concept of “being future proof.” Buzzwords apart, it’s a challenging goal to achieve because it means you have to support innovation without ever discarding and rewriting your codebase from scratch. Ever.
A good starting point is to focus on consistency between our functional requirements and our selling proposition. To achieve this, we started from our vision and then moved to other “collateral” aspects.
Neosperience’s business focus evolved with the Digital Customer Experience (DCX) market; therefore, flexibility and speed of releases became mandatory. Neosperience Cloud is a B2B2C cloud platform that provides marketers with a set of Machine Learning tools. Our goal is to enable them to understand what makes their customers unique by providing an actionable psychographic profile. Tailored profiles empower brands to engage their users in new and effective ways to grow their customer base.
This vision brings an additional requirement to the table: since Neosperience follows a B2B2C business model, our software must also be easily integrable with third-party platforms.
The age of the monolith
Matching all these constraints with a cloud architecture has not been an easy task, as it required implementing many design patterns while adapting them to a multi-tenant context. Martin Fowler’s Patterns of Enterprise Application Architecture had suggested, back in 2002, a set of principles that provided architectural advantages addressing some of these requirements. Nonetheless, we started our journey back in 2008 and initially had to settle for the best average architecture, one that allowed us to draw a baseline for development. Now we call that period “the age of the monolith.”
This architecture worked fine for a couple of years, allowing for a jump start of the whole Neosperience ecosystem, with a pretty fast initial time-to-market.
It was a multi-region deployment of different silos built with the same technology: a vast collection of Spring MVC endpoints written in Java and deployed on Apache Tomcat. Our entire codebase was packaged within a single WAR file and delivered by our Jenkins CI system with a straightforward workflow. Service coordination in this architecture was as simple as a method call, and patterns such as Factory and Façade ensured excellent reliability and robustness.
On the other hand, this architecture had many drawbacks related to scalability and lifecycle management. Scalability was a tedious problem with unpredictable impact. Large customers and well-known brands used our Couponing Engine to support free gift giveaways; this generated a massive number of coupons to be collected, shared, and redeemed over a single weekend. A worldwide comics publisher used Neosperience to deliver the digital replica of their content, bringing a massive number of downloads to our servers within each weekly comic release.
We had to plan for scalability and fine-tune our Amazon EC2 servers and auto-scaling to manage these constraints. Moreover, launching new instances, even in the cloud, requires a few minutes, during which our system would be perceived as unresponsive; we had to over-provision our services. Also, since everything was within a single package, we had to scale our entire cloud up and down, even services not used at all.
On the lifecycle side, things weren’t any better, because every new feature release or bug fix required a complete re-deploy of the whole stack. As an example, we had to migrate coupon persistence due to a feature upgrade: this meant shutting down the entire cloud for many hours.
Fitness function and breaking the monolith
In 2012 it was pretty clear this architecture had to change, but we needed a way to decide whether a change should be accepted or rejected. We started asking ourselves what was improving from one release to another. In the end, we had to define a fitness function. The function we settled on was the result of four variables:
- adherence to requirements, often called business happiness
- time to market, the inverse of development speed
- single scalability, the ability of a single endpoint to scale up and down independently from the others
- lifecycle coupling, the impact of a change in a piece of code on services belonging to other domains
We decided to adopt an architectural solution if and only if it improved this function on one or more variables. For a more in-depth discussion of Evolutionary Architectures, a great reference is Neal Ford’s excellent book Building Evolutionary Architectures. The overall outcome was to start breaking the monolith into smaller pieces.
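As a minimal sketch of how such a gate can work in practice, assuming each variable is normalized so that higher is better (so a shorter time to market scores higher):

```python
from dataclasses import dataclass, astuple

@dataclass
class Fitness:
    business_happiness: float   # adherence to requirements
    time_to_market: float       # scored as development speed: higher is better
    single_scalability: float   # per-endpoint independent scaling
    lifecycle_coupling: float   # change isolation across domains: higher is better

def should_adopt(current: Fitness, proposed: Fitness) -> bool:
    """Adopt a solution iff it improves the function on one or more variables.
    A stricter Pareto variant would also reject any regression elsewhere."""
    return any(new > old for old, new in zip(astuple(current), astuple(proposed)))

# Example: a proposal that improves scalability while leaving the rest unchanged.
before = Fitness(0.7, 0.5, 0.3, 0.6)
after = Fitness(0.7, 0.5, 0.8, 0.6)
assert should_adopt(before, after)
```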
Separation of concerns
Since moving all of Neosperience Cloud to a new architecture would have meant freezing any new feature delivery for many months, we had to choose wisely what to move and where to start. We followed Domain-Driven Design principles, which meant we started by outlining, as much as possible, all the boundaries between different concerns. Then we built interfaces between these domains in the form of REST endpoints (thus making all the pieces of Neosperience Cloud communicate using HTTP). Within a few iterations, we managed to make every domain well defined and bounded. Then we deployed every service as a standalone AWS Elastic Beanstalk application. I am not a big fan of Beanstalk now, but it was 2013, and immutable deployments without the need to provision servers were a great innovation. When everything worked like a charm, we upgraded our technology stack to Spring Boot and Spring Cloud, which allowed us to get rid of a lot of boilerplate.
Following this path, we achieved a significant result: deployments became immutable. This improvement brought us a set of benefits such as rollbacks, incremental rollouts, and blue/green deployments, among others.
At this stage, every service still used the same technology stack, which meant we were not (yet) benefiting from microservices, even though we were dealing with smaller components.
Here comes Serverless (and microservices too)
In 2014 Amazon Web Services released a new technology that would change the way we develop services: AWS Lambda. The disruption coming from this idea of “cloud code deployment” was initially underestimated by the developer community. A few months later, with the release of Amazon API Gateway, the new wave of innovation started to become the new normal for modern application development.
Adopting Lambda doesn’t mean using a new framework or stack. It is the first step towards a new kind of architecture tailored to business services. Together with these technologies, the power of AWS CloudFormation also came to our attention: it puts in the hands of developers the capability to define cloud resources using JSON (and now YAML).
Removing the infrastructure/DevOps headache means you are not constrained to a single stack anymore: developers can script deployment logic with their code. This pairs with tools such as CloudFormation to allow scripting not only service deployments but also the setup and configuration of the required cloud services, in the so-called Serverless computing model.
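As a minimal sketch of that idea, assuming boto3 credentials are configured, a developer can create an entire stack from a template held in code; the stack name and the single example resource here are hypothetical:

```python
import json
import boto3

# A CloudFormation template defined as a plain Python dictionary:
# the same artifact a developer would otherwise maintain as JSON or YAML.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "UploadsBucket": {"Type": "AWS::S3::Bucket"}  # one example resource
    },
}

# Deployment logic scripted right next to the application code.
cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(
    StackName="example-service-stack",
    TemplateBody=json.dumps(template),
)
```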
In 2015 we started evolving our architecture from AWS Elastic Beanstalk to serverless microservices, leveraging AWS CloudFormation, AWS Lambda, and Amazon API Gateway to achieve breakneck release cycles.
In the meantime, we understood we were at the dawn of a new paradigm shift in services architectures, with the introduction of events. Lambda isn’t only a fast and cost-effective way to run code in the cloud. Its capability to be invoked (triggered) by “something happening, when it happens” makes it suitable for on-demand computing.
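A minimal sketch of this event-driven style, assuming a Lambda function subscribed to S3 ObjectCreated notifications (the downstream processing step is a hypothetical placeholder):

```python
def handler(event, context):
    """Runs only when "something happens": here, an object uploaded to S3.
    The trigger itself is wired in the infrastructure definition, not in code."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Hypothetical placeholder: kick off the downstream task,
        # e.g. queueing a 20-second memorability computation.
        print(f"New upload: s3://{bucket}/{key}")
    return {"processed": len(records)}
```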
Moreover, you can have as many functions as you wish, since you are only charged for their execution time. This means you can split services into micro- and nanoservices as small as suits your needs. An AWS Elastic Beanstalk application usually contains several endpoints, and you have to pay for at least one Amazon EC2 instance per application. With AWS Lambda, billing becomes independent from the number of code units: one function running for 10 seconds costs as much as twenty functions running for half a second each (given the same memory allocation).
We decided that the best granularity for lifecycle management was defined by domain boundaries, thus following DDD principles, and in this way we achieved more efficient deployment and scalability. We used one AWS Lambda function to serve only one HTTP method, but we also kept all the same-domain endpoints in the same git repository, as sketched below. To address project management, we adopted the Serverless Framework (a natural choice in 2015, since neither AWS SAM nor AWS CDK was an option yet). With this tool, we filled the gap between CI and the provisioning of cloud resources.
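As a sketch of that layout, here are two hypothetical handlers from a single, couponing-flavored domain repository: each serves exactly one HTTP method and is deployed as its own Lambda function behind its own API Gateway route.

```python
import json

def get_coupon(event, context):
    """GET /coupons/{id} — deployed as its own Lambda function."""
    coupon_id = event["pathParameters"]["id"]
    return {"statusCode": 200, "body": json.dumps({"id": coupon_id})}

def create_coupon(event, context):
    """POST /coupons — a separate function, scaled and released independently,
    yet versioned in the same domain repository."""
    payload = json.loads(event["body"])
    return {"statusCode": 201, "body": json.dumps(payload)}
```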
Even though we thought we had found heaven, we couldn’t afford a massive migration and get rid of our technical debt overnight. We used a lazy adoption process, migrating services when our business team requested new features from them, or when technical constraints (scalability, bugs) surfaced. We did not shift our whole codebase to the new stack and architecture. This approach means “distributing the technical debt” across all of Neosperience Cloud and repaying it only when needed.
This new paradigm set the path for a technological evolution towards a more lightweight stack, shifting from Java/Spring to Node.js/Python, and for the adoption of data-tailored persistence. Every business domain has its own requirements in terms of architecture, and this reflects specific needs for databases or data storage. Even if Amazon S3 and MongoDB are versatile enough to support many use cases, we figured out that AWS managed offerings are often the best choice within each microservice. We adopted Amazon SNS and Amazon Kinesis Data Streams to decouple services and handle communication. We used Amazon SQS to implement fan-out patterns and event sourcing architectures (as for the Gamification Engine), as well as MongoDB to support sophisticated data model storage (as for content objects). Finally, Amazon Kinesis Data Firehose supports data ingestion. On top of that, Amazon SageMaker completes the overall picture, allowing machine learning model deployments.
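As a minimal sketch of that decoupling, a producing service can publish a domain event to an Amazon SNS topic and stay unaware of the consumers (queues, streams, or functions) subscribed to it; the topic ARN and payload here are hypothetical:

```python
import json
import boto3

sns = boto3.client("sns")

# Publish a domain event; subscribers (SQS queues, Lambda functions, ...)
# receive it without the producer knowing who they are.
sns.publish(
    TopicArn="arn:aws:sns:eu-west-1:123456789012:coupon-redeemed",  # hypothetical topic
    Message=json.dumps({"couponId": "abc-123", "userId": "u-42"}),
)
```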
Conclusions
Neosperience Cloud evolved through the years from a monolithic architecture to a heterogeneous set of smaller modern applications. Today, our platform counts 17 different business domains, each comprising 5 to 10 microservices, glued together by a dozen support services.
Neosperience Cloud is multi-tenant and deployed on several AWS accounts, to be able to reserve and partition AWS resources for each organization (a Neosperience customer). Every deployment includes more than 200 functions and uses more than 400 AWS resources through CloudFormation. Each business domain creates its resources at deploy time, thus managing their lifecycle through releases.
This evolution improved our fitness function on every variable: from scalability and lifecycle to time to market, which shifted from months down to weeks (even days for critical hotfixes). Infrastructure costs shrunk by orders of magnitude. Developers have full control of and responsibility for delivery, and innovation is encouraged because a failure impacts only a small portion of the codebase.
My name is Luca Bianchi. I am Chief Technology Officer at Neosperience and the author of Serverless Design Patterns and Best Practices. I have built software architectures for production workloads at scale on AWS for nearly a decade.
Neosperience Cloud is the one-stop SaaS solution for brands aiming to bring Empathy in Technology, leveraging innovation in machine learning to provide support for 1:1 customer experiences.