How to Scale Your Platform Service and Engineering Team

Introduction

The complexity of scaling a platform service and engineering team cannot be overstated. These elements are pivotal within any tech-driven organization, and their growth should be harmoniously aligned with overall business objectives. Beyond simply adding resources, effective scaling demands a strategic approach to ensure business growth is not compromised by inefficiencies.

Recognizing the Triggers for Scaling

To understand why a platform service needs to scale or adapt to accommodate growth, it helps to first consider an organization from a conceptual viewpoint. Core elements that influence the platform service within an organization include its goals, defined strategies to accomplish those goals, and an operating model geared towards optimizing efficiency. This operating model typically stipulates ways of working, accountabilities, and requisite organizational capabilities. Moreover, the organization hosts the market for the platform, comprising both existing and potential platform users.

Nested within the organizational ecosystem, the platform service delivers a suite of products or features, each with its own interface and usage patterns. This service is spearheaded by a team (or multiple teams in an enterprise context), which adheres to certain processes (spanning planning, delivery, and operations) and leverages specific technologies, such as public cloud, container orchestration, or observability tooling.

[Figure: Model of the triggers for scaling your platform service and engineering team]

Usage is a primary determinant of the platform's importance within the organization. As usage surges and the platform becomes mission-critical, ensuring availability becomes paramount, which means keeping time-to-recovery short. While automation is vital for keeping Mean Time to Recovery (MTTR) low, sufficient staffing is needed to maintain a pager duty rotation and deliver a genuine 24/7 platform service. This requirement varies significantly with the geographical spread of the enterprise: a global organization needs a different strategy than one whose platform service only offers incident support during business hours.
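To make the MTTR figure concrete, here is a minimal Python sketch of how it could be computed from an incident log. The function name and the sample incidents are hypothetical; a real platform team would pull these timestamps from its own incident tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time between detection and recovery.

    `incidents` is a list of (detected_at, recovered_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log, for illustration only.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),       # 30 minutes
    (datetime(2024, 1, 10, 22, 15), datetime(2024, 1, 10, 23, 45)),  # 90 minutes
    (datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 15, 0)),      # 60 minutes
]

print(mean_time_to_recovery(incidents))  # 1:00:00
```

Tracking this number per quarter, and separately for incidents inside and outside business hours, is one way to see whether the current rotation still matches how critical the platform has become.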

The breadth of the platform's feature set often correlates with its usage. A platform that facilitates the entire developer journey will necessitate an expansive set of tools, services, documentation, and code blueprints, all of which require maintenance. Expanding the feature set of platform services may also necessitate team growth due to increased user support demands. That said, scalability must underpin any approach to support.

One should exercise caution when contemplating team scaling due to code complexity or performance bottlenecks. These issues often signal the need for re-design or re-architecture rather than changes in team size. The team should be primed to design for scale from inception, echoing our earlier discussion on setting up a platform team.

Another trigger for scaling is a change in the organization's business context. If the business is expanding its digital services, it may necessitate a change in the platform services' feature set or require the platform to modify the interface through which it delivers its services. A typical example we see in most organizations is the need for data and analytics capabilities, which grows in tandem with digital business expansion. A growing business will also need to deal with more regulatory requirements, which may factor into a change in the platform services.

Crucial to note in the platform scaling context is the platform's ability to scale independently of usage levels. The wide availability of modern public cloud technology allows for vertical and horizontal scaling to manage concurrent users. This reinforces the importance of understanding the platform user market and usage patterns of platform services. The true adversary of a platform is not the unused feature; rather, it's the list of features with sparse usage or an interface that requires manual intervention from the platform team [1].

[1] Source: "Intercom on Product Management", which analyzes the usage of your product features and how you can influence adoption. https://www.intercom.com/resources/books/intercom-product-management
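The distinction between unused and sparsely used features can be sketched in a few lines of Python. This is only an illustration: the feature names, the team counts, and the 25% threshold are assumptions, not figures from the source above.

```python
def classify_features(usage, total_teams, sparse_threshold=0.25):
    """Group features by the share of teams that use them.

    `usage` maps a feature name to the number of teams using it.
    `sparse_threshold` is an assumed cut-off; tune it to your own context.
    """
    if total_teams <= 0:
        raise ValueError("total_teams must be positive")
    report = {"unused": [], "sparse": [], "adopted": []}
    for feature, teams in sorted(usage.items()):
        if teams == 0:
            report["unused"].append(feature)
        elif teams / total_teams < sparse_threshold:
            report["sparse"].append(feature)
        else:
            report["adopted"].append(feature)
    return report


# Hypothetical usage data: number of teams using each platform feature.
usage = {
    "ci-templates": 34,
    "secrets-vault": 18,
    "legacy-ftp-deploy": 3,
    "beta-cost-report": 0,
}
report = classify_features(usage, total_teams=40)
print(report["sparse"])  # ['legacy-ftp-deploy']
```

An unused feature can simply be retired. The sparse bucket is the one that deserves a conversation, because those features still carry a maintenance and support cost for only a handful of users.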

The Role of Standardization in a Growing Platform Service

In the previous section, we highlighted the necessity of maintaining a lean and effective feature set for a scalable platform service. This reduces the support and maintenance workload for the platform team. However, the breadth of this feature set is dictated not just by the platform service's scope but also by the diverse needs of the users. For instance, your platform service might offer a streamlined process for users to deploy their application in a cloud environment. Yet, the diversity of deployment needs arises from the myriad types of applications that users want to deploy, ranging from containers and Java packages to sets of serverless functions.

In the context of standardization, an ideal scenario for a platform would be to adhere to a company-wide standard that dictates how applications are built and packaged. This would enable the platform to offer services that conform to this standard. However, as an organization expands its digital services, it will invariably use a diverse set of application types, such as web applications, APIs, MLOps flows, and mobile applications. These applications could be sourced from SaaS, Commercial-Off-The-Shelf, or developed in-house. As this diversity grows, the platform service must make considered decisions about how to navigate this changing organizational landscape.

In response to the varied application landscape, a platform service could adopt one or more of the following approaches:

Cater to the Expanding Heterogeneity: In this scenario, the platform would likely need to expand its feature set to accommodate the growing diversity of applications. This would necessitate scaling the platform service due to the increased maintenance and support burden.
Aim to Reduce Variance: Reducing variance is a generic best practice from Lean manufacturing. Here, it means working with multiple stakeholders to define standardized solution architectures. These standards could dictate how an application is packaged, how it integrates with other services, and which cloud services it should use. While this approach might require more time and effort initially, it could reduce the overall cost of software engineering and limit the need to scale the platform service and team.
Distinguish Between Supported and Unsupported Application Types: Here, the platform could choose to support only a subset of application types while allowing users to utilize certain platform functions for the rest. This avoids the need to scale the platform team to support a broader range of applications, as the responsibility to address certain issues falls on the users. This scenario can also foster innovation within software engineering teams, as they are not bound by company standards.

Maintaining alignment and standards can become a complex task as the platform service scales. This alignment must account for growing users, stakeholders, and the broader organizational impact. Regular communication, clear guidelines, and iterative feedback loops can foster this alignment.

Likewise, upholding standards during growth requires an unwavering commitment to quality, even as demand surges. As the platform scales, standardization might evolve, progressing from simple code templates or wikis to more sophisticated shared services or fully configured 'golden paths'.

How Change Affects the Engineering Team and Way of Working

As a platform service scales, the platform engineering team's dynamics and interactions with users undergo significant changes.

Shifts in User Interaction

Automating manual processes and enabling self-service options become imperative for efficiency as the platform gains more users. However, while these investments enhance the scalability of the service, maintaining a human touch remains vital. Platform teams must judiciously invest time to ensure equitable treatment for all users.

In the early stages, platform services often focus on gaining user trust, emphasizing individualized onboarding and enabling. As users increasingly trust and adopt the platform, direct, extensive interactions with individual teams become unsustainable. One thing to note is that any interaction with a (potential) platform user is also an implicit feedback channel for the platform team. Your way of getting feedback may need to change to accommodate more platform users, whether for feedback on existing features or gathering ideas for new ones.

Rethinking onboarding and enabling becomes crucial as the platform service scales in adoption and, thus, usage.

Key Processes for Platform Adoption: Onboarding and Enabling

Onboarding

Objective: Facilitate seamless integration of users onto the platform.
Challenges: Automating platform adoption can be challenging without a streamlined onboarding process. This involves granting tool access, setting up environments, and more.
Scaling Strategy: Automation becomes pivotal, from providing accounts to validating user requests. It’s also essential to standardize the services offered. To enhance user experience, the platform should invest in user-friendly wikis, interactive wizards, and intuitive GUIs.
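As an illustration of automated request validation, the Python sketch below checks an onboarding request before any account or environment is provisioned. The request fields, the naming rule, and the allowed environments are all hypothetical; they stand in for whatever standards your platform service defines.

```python
import re
from dataclasses import dataclass

# Assumed standards, for illustration only.
ALLOWED_ENVIRONMENTS = {"dev", "test", "prod"}
TEAM_NAME = re.compile(r"^[a-z][a-z0-9-]{2,29}$")


@dataclass
class OnboardingRequest:
    team: str
    cost_center: str
    environments: list


def validate(request):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not TEAM_NAME.fullmatch(request.team):
        errors.append(
            "team must be 3-30 characters: lowercase letters, digits, "
            "or hyphens, starting with a letter"
        )
    if not request.cost_center.strip():
        errors.append("cost_center is required")
    if not request.environments:
        errors.append("at least one environment is required")
    unknown = sorted(set(request.environments) - ALLOWED_ENVIRONMENTS)
    if unknown:
        errors.append("unknown environments: " + ", ".join(unknown))
    return errors


ok = validate(OnboardingRequest("payments-api", "CC-1042", ["dev", "prod"]))
bad = validate(OnboardingRequest("X", "", ["staging"]))
print(ok)        # []
print(len(bad))  # 3
```

Returning every problem at once, rather than failing on the first one, lets a wizard or GUI show users everything they need to fix in a single round trip, which keeps the platform team out of the loop.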

Enabling

Objective: Assist users in maximizing their engineering potential, enabling optimal use of platform services.
Methods: Guidance can range from cloud-native architectures to Infrastructure-as-Code and building CI/CD pipelines.
Scaling Strategy: As the user base grows, direct interactions become less feasible. Thus, producing self-paced courses or delegating to a specialized enabling team becomes crucial. For instance, Instruqt, a Xebia Product company, excels in offering bite-sized learning tracks to foster tool adoption and enhance engineering skills organization-wide.

Release Management

Effective release management gains importance with scale. Initially, the emphasis is on quality delivery. As the service grows, leveraging Release Engineering best practices, a staple of Site Reliability Engineering, becomes valuable. For instance, canary releases can be employed, where new features are rolled out to a select group for validation. This approach offers an immediate feedback mechanism and facilitates easier rollbacks if needed.
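One common way to pick the select group for a canary release is to hash each user into a stable bucket. The Python sketch below shows that idea under stated assumptions: the function name, the team identifiers, and the feature name are hypothetical, and a real rollout would usually rely on a feature-flag or deployment tool rather than hand-written code.

```python
import hashlib


def in_canary(team_id, feature, percent):
    """Deterministically decide whether a team is in the canary cohort.

    The same team always lands in the same bucket for a given feature,
    so raising `percent` only ever adds teams to the cohort.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{team_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


# Hypothetical teams and feature name.
teams = [f"team-{n}" for n in range(200)]
cohort = [t for t in teams if in_canary(t, "new-pipeline-runner", 10)]
print(len(cohort))  # roughly 10% of the teams; the exact count depends on the hash
```

Because the decision depends only on the team and feature names, a rollback is a matter of setting the percentage back to 0, and widening the rollout never removes a team that already had the feature.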

Observability

While monitoring the platform is important from day one, knowing what is going on with the system becomes vital as the platform service grows in usage. Any minor issue with the platform can quickly blow up in the face of the platform team, resulting in service outages and hurting the reputation of the platform service.

The platform team needs to adopt best practices from SRE to add instrumentation in the right places and set up alerts on key Service Level Indicators. An early warning helps the platform team prevent incidents and allows the team to optimize platform design proactively.
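A minimal Python sketch of alerting on a Service Level Indicator follows, using an availability SLI and an error-budget burn rate. The 99.9% objective, the alert threshold of 10, and the request counts are assumptions chosen for illustration; in practice this logic would live in your monitoring system's alert rules.

```python
def availability_sli(good_requests, total_requests):
    """Share of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def burn_rate(sli, slo):
    """How fast the error budget is being consumed.

    A value of 1.0 means the budget lasts exactly the SLO period;
    higher values mean it runs out sooner.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1 - sli) / (1 - slo)


def should_alert(good_requests, total_requests, slo=0.999, threshold=10.0):
    """Alert when the measured window burns budget faster than `threshold`."""
    sli = availability_sli(good_requests, total_requests)
    return burn_rate(sli, slo) >= threshold


# Hypothetical measurements for the last hour.
print(should_alert(good_requests=9_800, total_requests=10_000))  # True
print(should_alert(good_requests=9_995, total_requests=10_000))  # False
```

Alerting on burn rate rather than on raw error counts gives the early warning described above: a 2% error rate against a 99.9% objective burns the budget about twenty times too fast, while a 0.05% error rate is well within it.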

Specialization and Organizational Design

Over time, a single team might struggle to support an entire developer journey due to a broadening feature set or growing security and compliance requirements. The solution? Multiple specialized platform services address different parts of the developer journey, such as cloud landing zones, CI/CD, or observability platforms.

Lastly, an Internal Developer Platform portal can be a game-changer when scaling. Serving as a single interface for the various platform services, it ensures a unified and delightful Developer Experience.

Conclusion

Scaling a platform service and engineering team has intricate challenges and immense opportunities. It's not just about expanding resources; it requires a nuanced approach grounded in understanding usage patterns, ensuring efficient standardization, and maintaining the delicate balance between automation and human touch.

In this digital era, the success of an organization is intertwined with its platform's scalability and the resilience of its engineering team. As the context in which the platform service operates changes, the engineering team needs to adapt or risk becoming an inefficient and costly burden for the organization.
