Episode: How Spotify Builds at Scale in the Age of AI - Niklas Gustavsson VP Engineering Spotify
Links and contact information
- Niklas Gustavsson on LinkedIn https://www.linkedin.com/in/protocol7/
Links and recommendations from the episode
The episode transcript
Hi, hello and a warm welcome to the Beyond Code Podcast, the interview podcast with the makers and experts of the tech scene. My name is Felix Becker. Great to have you with us again. Today's episode will once again be an international one, so I will switch to English right away and the rest of the episode will be in English. Welcome to the new episode of the Beyond Code Podcast, the podcast where we talk with experts and leaders from the tech industry.
Today's guest is someone who has helped shape the engineering behind one of the largest audio platforms in the world. Niklas Gustavsson is Chief Architect and VP of Engineering at Spotify. Over the past decade, he has been deeply involved in building and scaling Spotify's engineering platform, shaping its architecture, creating a world-class developer experience with Backstage, and enabling hundreds of teams to deliver features to hundreds of millions of users around the globe.
Spotify has also become famous for its engineering culture: the Spotify model with Squads and Tribes has influenced organizations across the entire tech industry. More recently, Spotify has been exploring how AI is changing the way software is built, and it is already seeing some impressive results. Without any further ado, Niklas, welcome to Beyond Code. It's great to have you on the show.
Thanks for having me. Great to be here.
Thank you, Nick. We have a question that I ask every guest. When was the last time you wrote code and what code was it?
Yeah, as things are these days, I have a few Claude sessions running in the background here, so let me actually check what they're doing. So I have one that is exploring porting an old legacy service of ours onto our more modern way of doing backend services. And then I have a few sessions ongoing that are doing some pretty mundane and boring basic stuff in our monorepo.
But yeah, it's always something ongoing in the background these days. It's a little bit different from what it used to be.
That's amazing. So you basically have AI running your job in the background while we have a great conversation. How important is it for you to still be technically hands-on in your VP, management-type role?
Yes.
So, yeah, it's a good question. I think I can answer it from two perspectives. One is from my own personal perspective, and the other is looking at it from the role that I have within Spotify. They're going to be fairly aligned, I think. But from my personal perspective, I thoroughly enjoy coding, both in the
traditional way and in the AI way. It's something that I always do a fair amount outside of work as well. At work, with the type of role that I have, although I do have a long fancy title, I'm still an IC. I contribute deeply to our tech strategies. Having a
deep awareness of how Spotify actually works under the hood is immensely important for me to be able to do my job. I try to set aside some share of my time to contribute code. That share will vary over time depending on what I'm focusing on, but there's always some share where I am contributing directly to our production software.
Yeah, that's great to hear. Before we get into the real nitty-gritty details of the conversation: I introduced a little experiment in the last podcast, where I ask my guests a question to bring over to the next guest, without them knowing who the next guest will be. My last episode was with Matthias Patzak, executive in residence at AWS and former CTO
of AutoScout. We talked in episode 11 a lot about AI strategy, and he brought this question for you: how do you organize yourself, and how do you organize yourself as a leader?
Yeah, it's a good question. This goes back a little bit to the previous question as well, I think. I try to think about how I spread my time between more long-term strategic work and more short-term reactive work. There are always things popping up every day that I need to react to and manage.
My goal is to have those split roughly 50-50, so that I can spend a fair amount of time both thinking about and driving our larger technology strategies. Some of the stuff that I mentioned or alluded to before is connected to that: the AI developer stuff, the way that we have been
investing into managing our software in a more scalable way over the last few years. All of those things fall into that bucket. And then there are things popping up every day. There might be incidents, or there might be decisions that we need to make quickly. So I try to focus time on that as well.
In terms of the very practical side, the routine that I have for myself is that I am completely ruled by my to-do app. I collect everything that I need to do into a to-do app, I'm sure everyone does that, and then I try to plan my days accordingly.
Everything goes in there, and then I do weekly and daily runs over those tasks. Unfortunately, there are always more incoming tasks than I have bandwidth for, so I need to manage that as well.
Great. So yeah, like always, more work than time, but this sounds great. When I look at my family and friends, everyone has a Spotify account; everyone is a customer of yours. Can you share a little bit about the massive scale Spotify runs at? From an engineering perspective, how large is Spotify?
Yeah, sure. So maybe it's good to set a little bit of context first. We run a very distributed architecture.
That was a very intentional decision made prior to me joining Spotify, at the very, very beginning. The founding team had never built a service like this before, so they didn't really know what they were doing, but they made some really good, and somewhat lucky, decisions early on. And one of those was: we think it would be a good idea to
decompose the architecture into fairly small components. I think they came to this from the Unix principles. So our architecture has always looked like that. We have the notes from the very first design meeting they had back then, where they figured out the, at that point, handful of backend services that they were going to build.
We've essentially scaled that architecture ever since. So what used to be a handful of backend services and a fairly monolithic desktop application back then is now many thousands of smallish components.
And that's true for the backend, that's true for our data pipelines, and that's now true for our apps as well: they are internally decomposed into many small components. That enables us to distribute them out over our teams, our squads as we call them. We have a fairly strong ownership model for those components, so each squad owns a set of components, potentially across clients and backend and data and so on.
And this has enabled us to scale, and teams can operate very independently of each other. So if you're a team and you own some backend service, you own that all the way: from ideating on the product that you're building, to how you're going to design it, to implementing that component, to operating it and being on call for it. And that means that
we have teams that are very deep domain experts in the thing they're trying to build, and they can design their software around it. We have a lot of guardrails, and we might come back to this, we have internal standards for how we want our components to be built. But for the design and business logic, we rely very heavily on the domain experts in our teams.
But that also means, and I'm sure we'll come back to that, that we have a lot of software to manage, and we need to think about how we manage that at scale. So, to try to answer your question with some numbers: we manage a fair amount of traffic at any given point. We have many, many millions of clients connected; I think it's like a hundred million or so concurrently connected clients typically.
We serve, I think, 11, 12 million requests per second of traffic to our backend. And then that fans out to many, many more requests within our backend. We run close to 3,000 individual services in production. And we do sort of similar 3,000-ish deployments to production every day.
And then we can talk about the infrastructure as well, but it's fairly large at this point. I can tell you a fun anecdote. I started at Spotify in 2011, and when I started...
There was a workshop where our CTO back then, who's still at Spotify but not as CTO these days, did a workshop internally on how we could scale our user system. When I joined, we had maybe one or two million users, something like that, if I remember correctly. And the workshop was about how we could scale our user system to manage a hundred million users. That just blew my mind when I was joining: that is a scale that
both seems unimaginable but is also quite exciting to reason about. But it also felt like we were never going to get to 100 million. Like, why are we even doing this workshop? And of course, today we have many more than 100 million users in our user system. He was much better at predicting that than I was.
And there's been a journey that we've been on. A lot of the systems have had to scale up over many, many years, across multiple orders of magnitude. And many of our systems have evolved ever since; not the particular user system, we actually redesigned that a few years ago, but for example the version of our playlist system that we still run was built around when I joined.
So a lot of systems have been able to scale up, and that speaks a little bit to this: decomposing your architecture has been very useful for us. That might not be true for every company and every problem, but for us, the way we organize ourselves and the way our traffic works, it works really well.
Oh, that's amazing, amazing scale. One question that comes to my mind: when you are fairly large and distributed, with small services and lots of small teams, and all of a sudden, from an end-customer perspective, there's a product idea that spans multiple teams, how do you
bring these ideas to the teams, give them the independence to run at a fast pace and make their own decisions, but also keep a smooth look and feel from the end-customer perspective?
Yeah, that's a great question. And it touches on a good way of looking at the contrasting part of this. So far I've described our underlying architecture and the way that we distribute it over many teams. But at the end of the day, we're shipping one user experience to our users, and we don't want that user experience to be fragmented just because we have all of these teams.
So at the same time that we're a highly distributed company, both in our architecture and in our teams, we're also a highly synchronized company. We spend a lot of time making sure that the thing we're building is coherent across our user experience and the strategies that we have internally. So we run this fairly elaborate
process to plan the work that we're doing, and we spend a lot of time applying our people and domain experts to really debate through the strategies that we're going to be executing on before we then go out and build them.
So when we're going to build a new feature for our users, there are typically many teams involved, and they spend time upfront debating through what we're going to build. Then we distribute the work out onto those teams, so different teams will build different parts of that experience. We typically run those as internal programs, where you have, you know, the product strategy lead for that, and someone that coordinates those launches.
So that's true for building new products. Then of course teams do a ton of other things: just maintaining and making sure that the software is in a good state and that it scales, managing incidents, and whatnot. A lot of that happens distributed within the teams. So again, a big share of the deployments that I mentioned are things coming out of the teams' day-to-day work, and another big share is things shipping as new features for users.
Yeah, sure. The daily work is done autonomously within the team; they run at a fast pace. And if you decide to bring in new features, then the collaboration happens. Is this something that program managers are doing, or is there also some involvement of the teams, since that takes time away from daily operations?
It's both. Yeah, so for the larger ones, we call them Bets, company Bets,
there's typically a program manager attached that will do a lot of the coordination. But then, of course, at the end of the day, the team will decide how they design that particular feature within their part of the product. So take that playlist team I mentioned before, with the playlist service, which actually used to be a single service and is now many different services supporting the playlist ecosystem.
If there's a feature that we're building that involves somehow modifying the playlist system, they are the ones who design how that change is done in that system. But they will participate in that larger feature or product that we're launching. It used to be that our teams were much more autonomous around what products we built, but that ended up producing a fairly fragmented user experience. So today we try
to do more of the synchronization that I talked about.
Great. When you look at e-commerce, lots of people share their war stories around Black Friday and the like. What is your Black Friday moment at Spotify? What is an event that you really prepare for, how do you prepare for something like this, and how do you operate at such a large peak scale?
Yeah, so generally speaking, we have fairly stable and predictable traffic. You mentioned e-commerce, but even more so compared to services that are highly dependent on viral or social events; we have less of that. Typically we have
two peaks a day. There's a morning peak, typically when people commute, and then there's a larger evening peak where people enjoy music and podcasts and whatnot. And that is very stable day over day. We can predict years in advance what the traffic is going to look like, so we can do fairly stable capacity planning and things like that. And then
we have, I'm going to say, two major events that we need to prepare a little bit differently for. One is actually New Year's Eve. Apparently people do a lot of partying on New Year's Eve, so they use Spotify quite a bit, and it's very coordinated: you can see exactly when midnight hits in different parts of the world in our traffic.
That is an event we used to do a lot of capacity planning for; we had teams on standby during New Year's Eve. Since then, we have improved our infrastructure, so we handle it pretty much fully automatically these days. But it used to be a big event at Spotify, and back in the day we had incidents almost every New Year's Eve, where some part of our system would get overloaded.
The other event is our Wrapped campaign. This is a marketing campaign that we do in early December, and it has a huge impact in terms of user interest. We see a huge peak both for the marketing
campaign itself, and of course it also drives a lot of traffic to Spotify. I would say today that is by far the event we spend the most time planning ahead for. Part of the excitement about Wrapped is that we do this instantaneous, coordinated launch around the world, so we immediately get this enormous
peak of traffic for it. So that one is still a bit exciting. The New Year's Eve one is less stress-inducing these days, but the Wrapped one is still pretty exciting every year.
Is your system pretty elastic, auto-scaling automatically, or do you need certain preparations for events like this? And how do you prepare your teams, how are you handling this?
Yeah, everything auto-scales these days, including databases and services and whatnot. And that's a big part of why the New Year's Eve event is much less manual work now; we can just manage it through our auto-scaling.
The auto-scaling also covers the daily pattern. We will have more Kubernetes pods running at peak than at the trough; I think it's like a 2x difference or so. And we're talking millions of pods here, so it's a pretty big difference if we can scale up and down during the day. So yeah, everything is auto-scaled.
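For readers curious what this kind of load-proportional scaling looks like mechanically: Kubernetes' Horizontal Pod Autoscaler follows a simple proportional rule, sketched below. This is a generic illustration of the HPA formula, not Spotify's actual configuration; the numbers are made up.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Proportional scaling rule (as used by the Kubernetes HPA):
    scale the replica count by the ratio of observed load to target load."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))

# Trough: 500 pods at 40% CPU against a 60% target -> scale down a bit.
trough = desired_replicas(500, 0.40, 0.60)
# Evening peak: per-pod load doubles -> replica count roughly doubles too.
peak = desired_replicas(trough, 1.20, 0.60)
```

With a stable twice-daily traffic curve like the one described, this rule alone produces the roughly 2x swing in pod counts between trough and peak.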
Great, fantastic. Let's switch over to the developer side of things. I read an article these days quoting your CEO, or co-CEO: Spotify says its best developers haven't written a line of code since December, thanks to AI. Can you take us a bit behind the scenes? How does Spotify actually leverage AI in the engineering workflow?
Sure. Yeah, so we have been, relatively speaking, a long-time user of AI developer tooling, since the early days of Copilot and so on. And today we have very high adoption of those tools internally. I've never seen adoption like we've seen for these.
We've gone through different adoption journeys with Copilot and Cursor and Claude Code, and all of those have been the fastest adoption curves I've seen for any tool we've launched at Spotify. So yeah, it has a lot of usage internally. And in terms of what you quoted there, from what Gustav said:
that is true. I don't know if I'm in that group, but I'm definitely a person who hasn't actually written a single line of code manually since sometime last year. I mentioned in the first question that I have a few Claude sessions running in the background, and that is how I write all my code these days.
Fantastic. The article also mentioned a service, or thing, called Honk. Can you explain a little bit what Honk is, what you are using it for, and how it helps you?
Mm-hmm.
Yeah, let me tell a little bit of a backstory first; it's maybe useful to set some background here. A few years ago, we had seen this immense growth of software within Spotify, and we saw very clearly that our developers were spending more and more time on maintaining that code.
We could see that in our metrics, but we could also see it in the feedback from our engineers. They were just tired of having to migrate from version seven to eight of some framework or whatever it was, or managing security incidents, or whatever the case would be. So a few years ago, we started investing into automating as much of that as we could. We did something we call fleet management internally.
We've published a few blog posts on this if people want to read up on it more closely. But essentially, fleet management is the ability to write code that will modify other code, or configuration, or whatever it might be, and then be able to schedule that over all your source code.
I might come back to this: when we first did this, we had essentially one Git repository per component. And I mentioned before that we have thousands of components, so that meant we had thousands of Git repositories. So we implemented this to be able to orchestrate running these,
we call them shifts, like codemods or whatever you might call them, across thousands of repositories, producing PRs for the changes, and then having verification that would allow us to automatically merge the vast majority of those changes. Some changes require manual review and approval, but most of them we could just merge.
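At its core, a deterministic shift like the ones described is just a function from source text to source text, applied to every repository in the fleet. A toy sketch, with an entirely made-up API rename (real shifts would parse the code properly rather than use regexes):

```python
import re

def shift_deprecated_call(source: str) -> str:
    """Toy codemod ('shift'): rewrite call sites of a hypothetical
    deprecated client API. A fleet-management system would run this
    over thousands of repos and open a PR per repo with the diff."""
    return re.sub(r"\bOldClient\.fetch\(", "NewClient.get(", source)

before = "resp = OldClient.fetch(url)"
after = shift_deprecated_call(before)
# after == "resp = NewClient.get(url)"
```

The appeal of the deterministic form is exactly what he notes later: the script itself can be reviewed once, and its output trusted enough to auto-merge.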
So that's something we've been doing; we've merged millions of those types of PRs since then, and done hundreds of migrations that are more or less fully automated. But one challenge with this is that it is very hard to write these scripts that modify code. One way of looking at it is that code has a very wide API surface, if you like.
I didn't necessarily realize that myself for the code I was writing, but imagine just having a simple method in an API: there are many, many ways of using that method. You can call it directly, you can invoke it as a lambda; there are many, many ways of doing it.
And that turns into a lot of complexity when you try to write these migration scripts. So what we learned, and others have learned this before us, this was not earth-shattering, we had talked to Google and other companies that had done this longer than we had, and they had discovered the exact same thing. There's this thing called Hyrum's Law, which came out of Google, from an engineer there who discovered the same thing.
Can you elaborate a little bit more on that law? I don't know, maybe our listeners don't know it.
Yeah, so Hyrum's
Law says, I'm not going to be able to quote it off the top of my head exactly, but essentially: with enough users of an API, someone will come to depend on every observable behavior of it. So you can imagine, you think you have this well-defined API, but then there are
quirks in the implementation, and for every one of those quirks, if you have large enough scale, someone is going to have a dependency on that quirk. And that turns out to be very true. So, you know, our backend is mostly Java, and
we have many of these code migrations that attempt to migrate some way of using Java to some other way: replacing some deprecated API, or standardizing on some code conventions, or whatever it might be. And we find so many edge cases in these, all the time. It is fascinating how challenging it is to modify code.
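Hyrum's Law shows up even in trivially small examples. Here, a caller depends not on a function's documented contract but on the exact wording of its error message, a quirk that any refactoring of the message would silently break. This is an illustrative toy, not Spotify code:

```python
def lookup(table: dict, key: str):
    """Documented contract: return the value, or raise KeyError if absent."""
    if key not in table:
        # The "quirk": the exact message text is observable behavior too.
        raise KeyError(f"no such key: {key}")
    return table[key]

def caller(table: dict, key: str):
    """This caller depends on the message format, not the contract.
    Rewording the error in lookup() would break it, Hyrum's Law in action."""
    try:
        return lookup(table, key)
    except KeyError as e:
        if "no such key" in str(e):
            return None
        raise
```

A codemod that only looks at the declared API of `lookup` would never notice this dependency, which is exactly why such migrations keep finding edge cases.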
So essentially, the effect of this is that, again, we've done many, many of these migrations, millions of PRs, but there's a pretty clear glass ceiling that you run into in terms of complexity: how complicated you can make those migrations. You can migrate a fairly simplistic API, but if the API gets a little bit more complex, your migration scripts are going to be super complicated.
Exponential
basically.
Yeah,
pretty much, yeah. So that meant it severely limited the types of migrations we could do. Anything that was a little bit more advanced, we could maybe automate part of, and some of it would need to be manual, but it limited us. So of course, when we started using LLMs for many other purposes, one of the immediate things we
started experimenting with was: can we use this to increase how complicated the migrations and changes we do through our fleet management infrastructure can be? We've done many iterations of that, and out of that came this thing we call Honk. What we originally imagined Honk being was just a properly
production-grade version of all of those experiments we had done. So Honk essentially started out as a wrapper around an LLM and an agent, with verification added after that: basically using an LLM as a judge to verify that the change that came out of the
first LLM aligns with the prompt. Because we were seeing, this was some time ago, that the LLMs would make a lot of weird changes to the code that were unintentional from the prompt's perspective. So we added that in, and we've been iterating on this since. I think we're on the sixth iteration of Honk at this point.
So that was the original intent, and that's still a huge use case for Honk today: doing exactly those more complicated migrations. You can imagine, in the Java world, that might be, for example... there are several of these value object libraries in Java, things like AutoValue, AutoMatter,
Immutables and whatnot, and the vast majority of that is now replaced by records in Java. So what we do now is use Honk to migrate our code from those old libraries to records. That's one example, and one that, again, was just too complicated to do with the deterministic scripts we were using before.
And sorry, Honk runs on top of this infrastructure that we built for fleet management; that's how we can schedule a job in Honk and whatnot. One thing we then pretty quickly realized was that Honk was also great for another case: you can run your Claude or Cursor or whatever locally, but sometimes you're on Slack talking about something and you just want to make a quick change. So we enabled Honk to be
invoked on Slack and in GitHub. So you can be in a Slack discussion, debating how do we want to change this, and then you just say: Honk, go make this change. And Honk will go off and come back some minutes later with: here's the PR for the thing you asked me about.
So this background-agent type of use case turned out to be very, very useful as well. Today, Honk supports both of these, and we leverage it quite heavily for both cases. That was a long-winded answer to your question, but hopefully it added a little bit of color to what we're doing.
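Structurally, the Slack flow he describes is a fire-and-forget dispatch: the bot picks up a request, queues a background agent job, and posts back a PR link when it's done. A minimal in-memory sketch; the trigger phrase, handler names, and the placeholder PR URL are all hypothetical, and the agent itself is stubbed out:

```python
import queue
import threading

jobs: "queue.Queue[tuple[str, str]]" = queue.Queue()
replies = []  # stand-in for posting back to the Slack thread

def on_slack_message(channel: str, text: str) -> None:
    """Triggered on each message; queue a job when someone addresses the bot."""
    if text.lower().startswith("honk,"):
        request = text.split(",", 1)[1].strip()
        jobs.put((channel, request))

def worker() -> None:
    """Background worker: run the agent, open a PR, post the link back."""
    while True:
        channel, request = jobs.get()
        # Placeholder for: run agent on the request, then open a PR.
        pr_url = f"https://example.invalid/pr-for/{request.replace(' ', '-')}"
        replies.append((channel, f"Here's the PR: {pr_url}"))
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
on_slack_message("#team", "honk, rename the flag")
jobs.join()  # in a real bot this would be fully asynchronous
```

The key property is that the requester's thread is never blocked: the conversation continues, and the result arrives minutes later as a reviewable PR.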
No, that's totally interesting.
Thank you for sharing the details. I still see a lot of this happening on the client: people have the tools you mentioned running locally, the engineers are steering everything and checking back. But I hear a lot right now about the industry moving from the client onto the server. So basically, how much
traffic do you see with agents on the server right now? What is truly autonomous for you? You mentioned fleet management. Where is the point of control, where do you still have the engineers in the loop? Is it only, and I mean "only" in quotes, at the PR, or where do you include humans? I imagine you have a bot in Slack, you toss it an idea over your first coffee,
and then you see a PR. Do you feel good with that, and how do you handle it?
Yeah, it's a great question. So we still believe in having humans in the loop for these, and in relying on human judgment for what changes to make. We keep our developers accountable for the changes they make, even if those changes happen to be made with Claude or Cursor or Honk; that doesn't much matter. And then we rely on humans for all our code review. So every PR... well,
let me qualify that a little bit. I mentioned before that we have these large-scale, deterministic, automated changes. For those, we do auto-merging for, I think, 85 or so percent. And the thinking there is that the human review happens on the
script that we use, and because that script is deterministic, we can trust it and allow ourselves to auto-merge. But for all human-authored changes or fully AI-authored changes, like those coming from Honk, we do manual review.
I will say that the amount of PRs has significantly increased as we've been adopting these tools. So we are also looking into using AI to increasingly review PRs and support developers in that part as well. But yes, we have humans in the loop there for now.
Yeah, because what I see is that when people let the agents run in the background, they get PRs with 10,000 lines of code or more, especially if the agents run in a loop and correct themselves. And at some point, this is not humanly handleable anymore. So can you share a little how you think about using AI to review
the code that AI has written, while still staying accountable?
Yeah, let me first talk about the first part of what you said, PRs becoming larger and so on, or containing, you know, more AI-slop-type code. As I mentioned, we keep our developers accountable for the changes that they generate. So
when that happens, the recipient of that PR is fully allowed to just close it: this is not a high-quality-enough change, or it is not reviewable. And then the creator of that change has to go back. This happened prior to AI, it was a pattern that existed before and happens now as well. So I don't think
things have changed substantially in terms of the principles there.
Yeah, I'm not seeing that being a big problem for us today. The volume of PRs is a big challenge for us, though. So, in terms of having AI support us in reviewing:
I don't think we're doing anything dramatically different from other companies here. In the same way that we have AI agents helping us write code, we have agents also helping us review code. And this is implemented both as
agents doing that automatically on PRs, so when you make a PR, there will be an agent that comes by, reviews it, and potentially posts some feedback, depending on whether it finds anything. That's something we're still fine-tuning, to make sure we have the right level of quality on those reviews and that they're actionable.
So that's one pattern. The other pattern that is very popular as well is to use your local agent to help you with reviews. I use this very frequently: when I'm reviewing some PR, I will ask my local Claude to take a look at that PR and review it for me. Then I review the review that Claude comes up with and see which parts I will give feedback on in the original PR.
So yeah, this is increasingly the case for all PRs within Spotify, that we apply this type of AI help in the review phase as well.
Did you see any other changes? I imagine, with more code and more PRs coming in from AI, how did you have to adapt your engineering pipeline and workflow when code is suddenly
produced much faster and at much higher scale? Is there something else you had to adapt to handle these masses?
I mean, one, it's a pretty mundane thing, but we are seeing some scalability effects on our CI systems and things like that. So that's something that we're managing at the moment is just being able to scale those up and manage a significantly higher rate of changes to our system. So CI deployment systems and similar. ⁓
Then on the softer side of this, we're also seeing that, of course, the way that we work and collaborate changes quite dramatically. For ⁓ example, the interaction between our product managers and our developers is changing quite a bit. So it used to be that as developer you would get... ⁓
PRDs are similar as input for the work that you were doing. And that's increasingly being replaced with early prototyping instead, because it's so easy to prototype something and we have multiple internal tools to do prototyping using or with the help of AI as well. So a lot of the ways that we work and like the whole process for how we build products is just fundamentally changing as we speak. ⁓
Yeah, I would expect that if we talk again in six months, it's going to look very different than it does today. And today looks very different than it did six months ago. So we're in the middle of this change at the moment.
Very interesting. Hearing that means you see a difference between vibe coding, prototyping, and AI-guided engineering. Is that a different type of work for you? Can you share your thoughts on that?
Yeah, very much so.
Yeah, so prototyping: we largely use prototyping as a throwaway mechanism, so you quickly prototype something to have something tangible. It used to be that we did fairly static designs for whatever we were going to build, because that was what the tools enabled us to do quickly. Now we can prototype it instead, and we can even prototype it within our apps. It's very different to be able to touch and feel and use a product with your own data, within our app experience, compared to looking at a static image of what the feature would look like. So that's what we're seeing very rapidly: that's the way teams choose to prove out the ideas they have. And then of course we can also use this for user research, so we can put a prototype in front of users during user research and get early feedback from them as well.
That's great. And do you also experiment with saving intent alongside the source code, meaning prompts, parts of the context, specs, or something else that guides the AI? Or is the source code still the single source of truth for you?
The source code is still the single source of truth for us; we're not doing that. We are doing some experiments around trying to collect insights and decisions that agents make at runtime. And we are constantly building out the ways we manage the context that we give to agents: the better prompting we can do of the agents, the better we think the agents work. So that's something we're very actively investing in as well.
Great. Do you have a number for how many agents are running during a workday at Spotify? Just out of curiosity, because I imagine it's a huge number.
We do have the instrumentation, but I don't actually know the number off the top of my head. I would imagine that developers have a few agents running on their laptops, and then we have many Honk agents running as well. So there are going to be quite a few running at any given point.
Okay.
Yeah, but this guides us directly to the next topic: measuring. How do you measure engineering impact, and how do you measure the quality of AI at Spotify? Or what do you measure at all?
Yeah, so that's also a good opportunity to take a bit of a step back. This is something we've been investing in for many years: we've basically instrumented all our infrastructure. I talked about CI before; that's one thing we instrument. First of all, all our code that lives in GitHub we ingest into BigQuery, the Google technology for ad hoc analysis that we use for most of this. So we ingest all our source code, and all the changes that go into it, into BigQuery.
We instrument all our systems, both the systems we use during development, again CI and things like that, and our production systems, and that also goes into BigQuery. So we have lots and lots of data sets in BigQuery on every deployment we do and every change to the code that we make. All our production systems, every pod running in Kubernetes and whatnot, we have in BigQuery. And on top of those core data sets, that allows us to build metrics that we track across our infrastructure and development practices. That's how we do analysis of code quality or various PR metrics, for example.
Typically the way you would implement this would be with something like Sonar, something you run as part of your build. But we've seen that that doesn't work at the scale we operate at, so this has been a much better way of doing it. You can have both a predefined set of metrics that we collect and measure every day, and ad hoc analysis on top of that. I was actually doing some of that earlier today.
So that's generally the infrastructure we have to measure these types of things. And we track many, many different metrics across architecture, code health, productivity, those types of things. I should be clear and say that we don't track those on an individual level; we do it at the aggregate level. So we look at how we're doing as a company, and then we break that down into our high-level teams. That's very effective for us to be able to reason about where we have
things that we need to invest in improving, and also to see that we can track improvements over time. And since you asked about AI and quality: we look at a number of metrics for code quality, and we can compare those between changes that we can attribute to being authored by AI versus not, and so on. That's something we do, and that attribution has been a little bit fuzzy on our side, but it's rapidly improving now that we're getting better APIs for attributing AI changes. So yeah, we measure and compare these all the time.
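As a sketch of the kind of comparison this enables: given change records carrying an AI-attribution flag, one coarse quality proxy is the revert rate per cohort. The field names below are hypothetical, not Spotify's actual schema, and the data is invented for illustration.

```python
from collections import defaultdict

# Hypothetical change records, as they might come out of an ad hoc
# query over ingested change data; ai_authored / reverted are
# illustrative field names, not a real schema.
changes = [
    {"ai_authored": True,  "reverted": False},
    {"ai_authored": True,  "reverted": True},
    {"ai_authored": True,  "reverted": False},
    {"ai_authored": False, "reverted": False},
    {"ai_authored": False, "reverted": True},
    {"ai_authored": False, "reverted": False},
    {"ai_authored": False, "reverted": False},
]

def revert_rate_by_cohort(rows):
    """Revert rate per cohort: one coarse proxy for code quality."""
    totals = defaultdict(int)
    reverts = defaultdict(int)
    for row in rows:
        cohort = "ai" if row["ai_authored"] else "human"
        totals[cohort] += 1
        if row["reverted"]:
            reverts[cohort] += 1
    return {c: reverts[c] / totals[c] for c in totals}

print(revert_rate_by_cohort(changes))
# → {'ai': 0.3333333333333333, 'human': 0.25}
```

In practice such comparisons are trend analyses over large populations of changes, as Niklas notes, rather than judgments about individual PRs.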
And do you think that's important in the age of AI, where the chance of AI slop code is high? To have such measurement in place so you can make data-driven decisions on the quality that you see, or does it give you a better feeling if you...
I think it's very important. I will say that code quality is incredibly hard to measure in an objective way, so I will not claim that we have nailed it, and I'm not aware that anyone has perfectly nailed it either. But we're certainly trying, and we're looking at those metrics. We might not be able to track very small changes, but we can see larger trends over time with fairly high confidence. And yeah, I think it's super important. The same is true on the productivity side, for example: knowing that the things we're doing actually have the effect we expect them to. So that's something we always do across our developer experience and infrastructure investments and whatnot; they're always leveraging this data.
What are your key metrics for productivity?
There are a number of metrics. We do a regular engineering survey that goes out to all our engineers, and one thing in there, for example, is self-reported productivity. So that's one aspect of measuring. And then we look across a large number of indicating metrics. That might be things like the PR frequency we talked about before. We look at time to change: what's the time it takes from me starting to code on something until it ships to our users? Then we look at lots and lots of contributing metrics to that, such as CI times or deployment times. And then we try to look across these to see the trends that we're looking for.
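A minimal sketch of one such metric, time to change, assuming per-change timestamps for when coding started and when the change shipped; the data below is invented for illustration, and real pipelines would aggregate this from CI and deployment events:

```python
from datetime import datetime
from statistics import median

# Hypothetical (first-commit, shipped-to-users) timestamp pairs.
changes = [
    ("2024-05-01T09:00", "2024-05-01T15:30"),
    ("2024-05-01T11:00", "2024-05-02T10:00"),
    ("2024-05-02T08:15", "2024-05-02T09:45"),
]

def time_to_change_hours(rows):
    """Median hours from starting to code to shipping to users."""
    fmt = "%Y-%m-%dT%H:%M"
    durations = [
        (datetime.strptime(done, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600
        for start, done in rows
    ]
    return median(durations)

print(time_to_change_hours(changes))  # → 6.5
```

The contributing metrics Niklas mentions, CI times and deployment times, would be computed the same way over their own event pairs and then tracked as trends.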
Very interesting. One thing that came to my mind: usually you have a planning phase where you have boards, you have your tickets, and you plan out your work. And you mentioned a couple of minutes ago that people discuss something on Slack, say, yeah, that's a great idea, let's build it, and Honk goes off and builds it. Is that a starting point for your measurement, or is there still something that goes onto a board and follows the classical workflow? Because I imagine there could be a lot of side quests where people have great discussions and just send out agents, and measuring those seems, at least to me, harder than having a ticket on a board and tracing all the information until the thing lands in production.
Yeah, for good and bad, we're not a very process-heavy company, and we're not a very ticket-heavy company. I would actually love to be able to measure the intent more than we're able to do. But the way things get described, how teams plan their work, is pretty messy for us, because there's a lot of variation there; we haven't standardized how that works. And like you said, a lot of changes come out of people discussing something on Slack or within their team, rather than going through a formalized process where we're able to capture the intent and the plans early on. So a lot of the metrics and instrumentation we have starts when coding starts, because that's really when we can start instrumenting things. And yeah, that has pros and cons. I know other companies are able to do more of that, because they have a closer integration between how they plan work, how they describe it in a ticket or whatever, and the code change that results from it. But we're unfortunately not able to do that, at least not with any high precision.
Okay.
That's interesting. And there's also a new group coming closer to the source code. I mean, product managers are now able, with these tools, to prototype or at least produce some code, and they're usually not at the engineering level of, say, DevOps engineers. How do you incorporate this group into the whole workflow? Or is that not so new for Spotify?
We have lots of PMs who contribute code as well, so that distinction is not that hard in our case. And we have PMs who haven't done that. But I'm going to say a lot of that boundary is also breaking down now because, as we talked about before, contributing code has become dramatically easier over the last few months. That's true both for the prototyping part, where you just super quickly flesh something out and don't need to do the code review or think about how it's going to be deployed to hundreds of millions of users, and for making those production changes. So yeah, I would expect that we will have many more changes to our production software coming from non-traditional, non-engineering roles in the future. And that's already increasing today.
That's great to hear. I would like to make another switch and talk about Backstage, another big success story for Spotify. What does Backstage mean for you and your developer teams, and what was the original idea that led to building something like Backstage?
Ah yeah, good, another opportunity to tell some history. Backstage started out trying to solve a small but important problem for us. It used to be called System-Z; I don't actually know exactly why we called it that. The problem we needed to solve, many, many years ago, and this comes back to the collaboration we've been talking about, and incidents and things like that, was essentially: who should I go talk to about some particular piece of code? So there's an incident, some backend service has fallen over, and I need to find the team that owns that backend service, or maybe, back in the day, the person who owned it. Or I'm building something new, I want to call the playlist API: who should I go talk to about that API? I need some changes to it. So the very first thing we started to build was just that service directory: who owns a particular piece of software.
We started building that out, and one of the first things we did, and I don't know if it was clever or just accidentally clever, was to pretty quickly connect getting capacity for your thing, like getting servers to run your backend service or data pipeline on, to this service discovery database. So if you wanted ten servers to run your backend service, you had to register it in the service directory. That created a very strong incentive for everyone to register their software in there, so in a very short period of time we got essentially 100% coverage of the metadata: who's the owner, what's the state of this, is it in production, is it experimental, what's the reliability tier for this thing? All of that type of metadata we got for a hundred percent of our software very quickly. That was the accidentally clever part. And then maybe the actually clever part came from something we were struggling with, and I think many companies struggle with: we had tons of different tools that our developers were using, and these were usually pretty shitty. There were many, many poor websites that had been spun up to look at logs from deployments, or to manage capacity, or whatever it was.
Dashboards all over the place.
Yeah, I mean, it was all well intended. People were trying to help our developers solve problems, but we solved it in this super fragmented way, all over the place. And the experience as a developer was that you were jumping around between a dozen different tools to do a single thing, and all of them sucked in various ways. But now, since we had this catalog of all our software, it was pretty easy to start building functionality around it. So while we have this place with a record of all our backend services, why not build a deployment view into that same tool? Very quickly, this tool transitioned into a plugin-based model, where all our infrastructure engineers could build plugins and plug them in there instead of building their own custom tool on the side. So we started shutting down all of these crappy, fragmented tools and moving them into Backstage, and Backstage became what we usually call a single pane of glass for the developer. I can go in there and find everything that relates to my backend service, or take a look at your data set and find its schema, or whatever it might be, and see all its statuses in this single place. That was a real productivity boost, but in terms of developer happiness it was also a huge lift for us. And now, of course, we've externalized that: we made the core piece available as an open source project, we're selling a managed version of it, and we're selling additional products on top of Backstage.
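As a concrete illustration of the registration mechanism Niklas describes: in the open source version of Backstage, a piece of software is registered in the catalog with a small descriptor file, conventionally `catalog-info.yaml`, carrying exactly the kind of metadata he mentions (owner, lifecycle, type). The names below are made up; the field structure is the open source format and may differ from Spotify's internal setup.

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: playlist-service            # hypothetical service name
  description: Serves playlist reads and writes
  annotations:
    github.com/project-slug: example-org/playlist-service
spec:
  type: service                     # what kind of software this is
  lifecycle: production             # e.g. experimental or production
  owner: team-playlists             # the team to go talk to
```

Because capacity and tooling hang off this record, there is a strong incentive to keep it accurate, which is how the catalog stayed close to 100% complete.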
Yeah, that's a recent announcement, that you now have an offering where Spotify serves Backstage as a service. I want to double-click on that plugin situation. How do your engineering teams actually contribute plugins? Do they see it as the task of the platform team to provide and maintain the plugins, or is it more like, hey cool, I can just provide my own plugin and I'll be happy to maintain it? What's the balance between plugins from the platform team and plugins that come from the feature teams?
Yes, it's both. I'm going to say that we have many plugins; I don't actually know the number, but well more than 100 plugins today. My guess would be that the vast majority of those are from our platform teams, because that's their job.
But there are definitely plugins provided by other teams. Those are more specialized plugins. So there might be a team that manages their own system, and they might have some admin UI for that, and that typically ends up being a Backstage plugin as well. For things that are exposed to our engineers, it's pretty much all in Backstage these days.
Yeah, great. Can you share a story about a Spotify-specific use case or integration that's not in the standard product, something you added recently or depend on very heavily?
I was going to say we have a lot of custom plugins in there. Some of those we're planning to make available externally in the future; some might make less sense to release because they're very custom to what we do. I don't know if I have a good example of something added recently, but there are certainly lots of custom things, so let me take one. This was not added recently, but it's a pretty interesting example of what you can do. At Spotify we have something called Connect, which is how you can transfer playback between different devices. Let's say in my apartment I have a bunch of speakers; I can take Spotify and transfer playback of whatever I'm playing to those speakers. This can get pretty complex: users can have many devices, those devices have many different capabilities, and it gets very hard for us to debug when we have issues in Connect.
So something that the Connect team built, and again, this is a team that is not in our platform organization but a feature team: they built a way to essentially trace Connect traffic, Connect interactions, in a Backstage tool. I can go in, log in with my user, connect my user to that tool, and then see, live, all of those interactions and how they happen, which is a super powerful way to debug when I have an issue with Connect. So that's one example of a tool that is very specialized, which we will obviously never externalize because it would make no sense for anyone else, and that is deeply integrated into Backstage.
Thanks. And the whole fleet management setup, Fleetshift, is that also centered around Backstage?
Yeah, Fleetshift.
Yes, yeah. So there are essentially three pieces to Fleetshift. One is what I talked about before: the scripts that make the actual change. They end up getting packaged into a Docker container, and that's how they are deployed and run.
Then there's a manifest, which is essentially a bunch of metadata around how this script is going to be run; in our case, that's a Kubernetes custom resource definition. And then there's the UI, which is in Backstage, where you can go in, look at your shift, and see an overview of all the changes that the script has made.
If you are the owner of a shift, you will typically use that to debug how your shift is going. It has opened 2,000 PRs: how many of those have been merged, how many have failed, and what's the error log when they failed, either when transforming the code or in CI when the code was being built? So yeah, all of that happens in Backstage as well.
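The transformation scripts themselves can be very plain. Here is a toy sketch of the idea, not Spotify's actual Fleetshift code: a script that walks a checked-out repo and rewrites a pinned dependency version, returning the changed files that the surrounding machinery would then turn into a PR per repo.

```python
import pathlib
import re
import tempfile

def run_shift(repo_root: pathlib.Path, old: str, new: str) -> list[str]:
    """Apply one mechanical change across a checked-out repo.

    Returns the relative paths of files that were modified; a fleet
    management system would package this logic in a container, run it
    against each target repo, and open a PR with the result.
    """
    changed = []
    for path in sorted(repo_root.rglob("*.gradle")):
        text = path.read_text()
        updated = re.sub(re.escape(old), new, text)
        if updated != text:
            path.write_text(updated)
            changed.append(str(path.relative_to(repo_root)))
    return changed

# Demo against a throwaway "repo" containing one build file.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "build.gradle").write_text('implementation "com.example:lib:1.2.3"\n')
    print(run_shift(root, "com.example:lib:1.2.3", "com.example:lib:1.3.0"))
    # → ['build.gradle']
```

The manifest Niklas mentions would then record, for the fleet system, where to find this script's container image and which repos to run it against.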
Amazing. Let me make one more switch, to another success story from Spotify. Lots of people have talked about the Spotify model, with tribes and chapters and squads, and it's been out there for quite a while now and aged pretty well. How did it change for you? How did you evolve the Spotify model, how do you work today in the age of AI, and what is still left over from the original model?
Yeah, that's an interesting question. I don't know that this came through properly when it was published back in the day, but one thing that is always true about how we work is that it has been in constant change. I haven't actually read the published material for quite a while now, but I'm going to say almost everything in it is different within Spotify today.
Part of that is because we've learned; we constantly run experiments on whether there are other ways we can organize ourselves or do things. Part of it is that we have grown as a company and scaled up quite a bit, so we've needed to add more structure to how we work.
And then maybe the biggest thing, which I was talking about before, is the level of synchronization and planning we do, and how we build products across many teams. In that original material there was a lot of talk about the degree of autonomy our teams have, and I'm going to say that has changed quite a bit since we published it. As we talked about before, there's now a mix: there are parts where the team is autonomous, and then there are more guardrails around our technology standards, but also around the products and features we're building. So that is quite different from when we wrote those documents. And I think a large part of the autonomy question comes down to scale. There used to be a single team owning a big part of our user experience, and of course they could be super autonomous, because everything they did lived within that team. But today, both because we're more ambitious as a company and want to build more ambitious products, and because there are simply many more teams, since things are broken down in a more fine-grained way, we need to synchronize and coordinate across many teams for the things we want to build. So yeah, lots and lots of things have changed since we published that.
That's interesting. So maybe it's time for a version two, a new write-up of how you work today. I hear collaboration is a huge part of it. How do you manage know-how transfer? Is it, because of the standards and guardrails you have, not so hard to transfer know-how? How do you organize the collaboration?
Yeah, I don't know about the publishing part. One thing we learned from publishing it originally was how it took on a life of its own. And again, the way we work nowadays is very different from what it was; I think that material was from around 2015. Publishing something now would suffer the same fate: we would publish a snapshot, and in six months we would work very differently. And of course, with the AI developments we talked about before, the rate of change is going up even more. So I suspect that we will probably not publish anything anytime soon. That would be my bet.
Yeah, I think the core message is that the change is just massive, and the basic idea is to be a learning organization that's able to change.
Yeah, I think that's a good summary. The other caveat I would add is that we of course try to design things that make sense for our context: the way we want to design our organization, the type of product we're building. Building an audio streaming service is very different from building other products. I think it's always useful; I talk a lot with other companies and try to learn from how they do things. But then you need to take those learnings and figure out in which way they apply to what you're doing, in our case at Spotify. Just taking another company's model wholesale and trying to implement it at your own company, I don't think that's ever going to be a good idea. So that's probably also a reason why attempting to publish a snapshot describing our complete model again didn't make a ton of sense back then, and wouldn't make more sense now.
All right, thank you very much. We've spoken for more than an hour now. When people want to find you on the internet or reach out to you, what's the best way to find and connect with you?
Yeah, I'm not a very active user on social media or anything like that, so it's probably not very useful to follow me. Spotify has an engineering blog; that's probably where I would point people. A lot of the stuff we've talked about today, like the fleet management work, we publish on our engineering blog, and we try to be fairly open about how we work. So that would be my best recommendation for where to read about what we're doing. We publish fairly frequently there, so hopefully that gives a good picture of what we do.
Very cool. I will put it in the show notes so everybody can find the engineering blog. In the beginning I asked you a question from a former guest, and now it's your turn: do you have a question that I can take to my next interview and ask my next guest, without knowing who it will be?
Sure thing. I love war stories. We have this within Spotify: we do a yearly Halloween event where we tell stories about incidents we've had and things like that. So I would love to hear what's the most exciting, stressful, or painful story they've gone through in terms of managing large-scale services online. I always love those types of stories.
Fantastic question. That's also something I'm really interested in. We'll find out in the next episode what my guest answers. Thank you so much, Niklas, it was great talking to you. Thank you for sharing all these insights about Spotify and about yourself. It was a pleasure to have you on the podcast. Thank you.
Cool, thanks for having me. This was a great discussion.
And to everyone else, thank you for listening. I think it was really exciting to hear what Spotify is doing; I learned a lot and thought it was really great. If you liked it, please like the podcast, subscribe, and we'll see you in the next episode. Until then, thank you, ciao.