Wait and Hope

After 15 years, today is my final working day at Mozilla.

When people leave Mozilla, they frequently exercise their privilege to send one final email to the entire company saying goodbye. I’ve elected not to do that and am instead posting my thoughts here. Call it hubris, but there aren’t many people left at Mozilla who can appreciate what 15 years means. Most of my colleagues have already moved on.

2020 has been hard. Layoffs at Mozilla, and the threat of more layoffs, made this a particularly rough year. As a manager, putting on a brave face for others has left me emotionally spent at the end of every week. This is on top of the malaise associated with a decade of declining market share (and associated relevance) for Firefox.

As I reach the end of my tenure at Mozilla, inevitably I look back to try to figure out what I could have done differently to make Mozilla more successful. Did I miss a window of opportunity somewhere to help Firefox succeed? Might this year have been avoided, or its impact softened?

In broad strokes, sure, I could have worked longer or harder, pushed to get projects completed faster or to a higher standard. More specifically, if we had accelerated our transition from tinderbox to buildbot, or from buildbot to Taskcluster, could we have kept better pace with competitors? Maybe we could have recognized the scaling needs sooner and avoided migrating our entire continuous integration infrastructure twice?

The safe answer is that, yes, there are many things I could have done differently, but hindsight is also 20/20.

When I started this reminiscence, I felt like maybe my impact had decreased over time. It was tempting to think that my influence peaked back in 2005 when it was just 25 of us hacking together on Firefox under the Can Bridge in Ellis St.

But that’s absolutely not true.

Mozilla, at its core, is about people. The manifesto is an invitation. This is a long game; the changes that Mozilla wants to effect in the world aren’t best measured in quarterly earnings reports.

As a manager at Mozilla, I’ve had the opportunity to hire dozens of people. I’ve helped interns develop into kick-ass engineers. I’ve touched the careers of countless people and hopefully instilled some fundamental values along the way. Many of those people are no longer with Mozilla. This is a good thing, both for them and for Mozilla.

The world needs more Mozilla. In an industry largely bereft of introspection and in many cases lacking a moral compass, the Mozilla diaspora has some serious work to do. At the end of the day, if all I’ve done is helped spread Mozilla values out into the wider world, I’m happy with that legacy.

Mozilla has gone through big changes this year. I don’t know if those changes are enough for it to be successful, but I am hopeful. As part of the old guard, I am happy to step aside at this juncture to create space and opportunity for the new guard in my stead.

I’m starting a new adventure as a Senior Development Manager at Unity in January. I’ll be taking my Mozilla values with me.

Mozilla Unity transition

Taskcluster: CI for Engineers

My team at Mozilla has been working towards something special for over two years.

When I joined the team, we felt that we had a pretty good internal product in Taskcluster, the task execution framework that supports Mozilla’s continuous integration (CI) and release processes. It served Mozilla’s CI needs well, and was scaling admirably compared to previous solutions.

But could it be more?

Would people outside of Mozilla benefit from Taskcluster, and could they deploy it? Perhaps more importantly, could we develop a community of users around Taskcluster that would be self-sustaining?

We were determined to find out.

We started by taking a hard look at the Taskcluster platform and found a few big impediments to wider adoption. First, we would need to reduce the setup complexity. We would also need to reduce the number of cloud accounts required to get started. At the time, Taskcluster required at least two separate cloud providers (AWS and Azure) and a Heroku account to launch.

Over the past year, we removed the need for Azure as a back-end data store and removed the need for Heroku for deployments. Now if you have a Kubernetes environment set up, you’re ready to install Taskcluster. You’ll still need AWS S3 access for artifact storage, but we’re working to make that configurable too.

While we were making all these changes behind the scenes, we were thinking about how we would actually try to garner more interest in Taskcluster outside of Mozilla. In true Mozilla fashion, we have always been developing Taskcluster in the open, but that doesn’t necessarily mean we were discoverable. How could we specifically target the kinds of users who would benefit the most from Taskcluster?

Out of the blue in August, a developer from a mobile game company contacted us to let us know that she had successfully deployed Taskcluster, and had a few suggestions for improvements, complete with patches.

Just like that, Taskcluster was in the wild.

Is Taskcluster right for you?

From talking with Ricky, the co-founder and principal programmer at Well Played Games who was the first to successfully deploy Taskcluster outside Mozilla, we learned a lot about the decision points that might lead someone to choose Taskcluster:

Taskcluster has given us more flexibility than any of the CI solutions we’ve used in the past. It is well engineered, letting us easily pick and choose the components we need, and quickly replace any that don’t suit our use cases. Its native support for Kubernetes meshes perfectly with our tech stack.
Ricky Taylor, Co-Founder, Well Played Games

So, is Taskcluster right for you? The short answer is “maybe.”

If your build and test pipeline is straightforward, there are simpler solutions out there for you. If you only support one platform, there is probably a more targeted solution for your use case.

However, if your CI needs are more complex, Taskcluster may be exactly what you need.

Here are some examples of use cases where Taskcluster might make sense for you:

You already have a person or team of people dedicated to your CI pipeline.
You currently support >1 CI system, probably for different platforms.
You have on-premise or custom hardware that you need to integrate into your CI pipeline.
Your current CI system is hitting a bottleneck or ceiling.
You are considering writing your own bespoke CI system to address any of the above concerns.

All of those are pretty good indications of CI complexity in our experience.

We’ve adopted “CI for Engineers” as the tagline for Taskcluster. Taskcluster will not solve your CI problems on its own, out-of-the-box, but a software engineer who understands your CI needs can make it do just about anything.

Better still, your software engineer doesn’t have to go it alone. We’re already building a community of Taskcluster users who can offer support to each other. Ricky from Well Played Games has already contributed new features that have been incorporated into Taskcluster and consulted on others.

If Taskcluster seems like a good fit for your CI needs, we encourage you to join other Taskcluster users and developers in Matrix or in Discourse.

Home page: https://taskcluster.net/
Documentation: https://docs.taskcluster.net/
Code: https://github.com/taskcluster/taskcluster/
Matrix: https://chat.mozilla.org/#/room/#taskcluster:mozilla.org
Discourse: https://discourse.mozilla.org/c/taskcluster/

If you’d like to investigate a live instance, the Mozilla Taskcluster deployment for community projects can be found here (no sign-in required): https://community-tc.services.mozilla.com/

Taskcluster - CI for Engineers

Mozilla Taskcluster continuous integration ci

Managing: team expectations

I don't know what I expected - Arrested Development

One of my biggest challenges when I began managing the Taskcluster team was simply getting my reports to talk to each other in a productive way. Per Conway’s Law, the micro-service architecture of Taskcluster reflected the knowledge silos on the team. Communication was erratic at best. Fortunately, transitions offer an ideal opportunity to establish norms, revisit old ways of work, and perhaps even try something new.

Following the lead of a colleague, at the start of my tenure I sat my new team down for an entire day and hashed out the communications issues. What emerged at the end of that discussion was a document that recorded all the expectations we had for each other as teammates. The document was part aspiration and part contract, but was essential for establishing a baseline of trust that we could use to work together going forward.

Fast-forward to 2020, and Mozilla is going through yet another transition. As the make-up and scope of my team changes again, it is useful to revisit the expectations document to make sure everyone is still on the same page. After consulting with the team, I also decided to publish our Team Expectations doc on GitHub in the hopes that it might benefit others.

Taskcluster Team Expectations

This is partially self-serving: the Taskcluster team has many community contributors and the occasional intern, and we hope that by sharing our expectations more widely, we’ll foster a better contribution environment around our project.

If you’re interested in performing a similar exercise with your own team, be prepared to devote the time. I’d budget at least 4 hours to this process, depending on your team’s current level of dysfunction. We had the fortune in the before times to be able to do this in-person, but a series of video calls would accomplish the same goal.

Content-wise, the headings we came up with (Accountability, Communications, Planning and The Design Process: RFCs, Implementation and Review, Triage, Dealing with outages) are a good jumping-off point for the discussion but may or may not make sense depending on your field or responsibilities. Having done the process with a few different teams now, I’ve found it’s important not to over-structure this at the start. There is a lot of value here in digression because that’s where you are most likely to find the areas where expectations are currently mismatched or unmet.

If you do try this out, please let me know how it went, especially if you end up evolving the process. Hopefully it meets your expectations. ;)

management Taskcluster expectations Mozilla

New to me: the Taskcluster team

All entities move and nothing remains still.
– Heraclitus, as referenced by Plato

At this time last year, I had just moved on from Release Engineering to start managing the Sheriffs and the Developer Workflow teams. Shortly after the release of Firefox Quantum, I also inherited the Taskcluster team. The next few months were *ridiculously* busy as I tried to juggle the management responsibilities of three largely disparate groups.

By mid-January, it became clear that I could not, in fact, do it all. The Taskcluster group had the biggest ongoing need for management support, so that’s where I chose to land. This sanity-preserving move also gave a colleague, Kim Moir, the chance to step into management of the Developer Workflow team.

Meet the Team

Let me start by introducing the Taskcluster team. We are:

We are an eclectic mix of curlers, snooker players, pinball enthusiasts, and much else besides. We also write and run continuous integration (CI) software at scale.

What are we doing?

The part I understand is excellent, and so too is, I dare say, the part I do not understand…
– Socrates, in reference to Heraclitus

One of the reasons why I love the Taskcluster team so much is that they have a real penchant for documentation. That includes their design and post-mortem processes. Previously, I had only managed others who were using Taskcluster…consumers of their services. The Taskcluster documentation made it really easy for me to plug in quickly and help provide direction.

If you’re curious about what Taskcluster is at a foundational level, you should start with the tutorial.

The Taskcluster team currently has three big efforts in progress.

1. Redeployability

Many Taskcluster team members initially joined the team with the dream of building a true, open source CI solution. Dustin has a great post explaining the impetus behind redeployability. Here’s the intro:

Taskcluster has always been open source: all of our code is on Github, and we get lots of contributions to the various repositories. Some of our libraries and other packages have seen some use outside of a Taskcluster context, too.
But today, Taskcluster is not a project that could practically be used outside of its single incarnation at Mozilla. For example, we hard-code the name taskcluster.net in a number of places, and we include our config in the source-code repositories. There’s no legal or contractual reason someone else could not run their own Taskcluster, but it would be difficult and almost certainly break next time we made a change.
The Mozilla incarnation is open to use by any Mozilla project, although our focus is obviously Firefox and Firefox-related products like Fennec. This was a practical decision: our priority is to migrate Firefox to Taskcluster, and that is an enormous project. Maintaining an abstract ability to deploy additional instances while working on this project was just too much work for a small team.
The good news is, the focus is now shifting. The migration from Buildbot to Taskcluster is nearly complete, and the remaining pieces are related to hardware deployment, largely by other teams. We are returning to work on something we’ve wanted to do for a long time: support redeployability.

We’re a little further down that path than when he first wrote about it in January, but you can read more about our efforts to make Taskcluster more widely deployable in Dustin’s blog.

2. Support for packet.net

packet.net provides some interesting services, like baremetal servers and access to ARM hardware, that other cloud providers are only starting to offer. Experiments with our existing emulator tests on the baremetal servers have shown incredible speed-ups in some cases. The promise of ARM hardware is particularly appealing for future mobile testing efforts.

Over the next few months, we plan to add support for packet.net to the Mozilla instance of Taskcluster. This lines up well with the efforts around redeployability, i.e. we need to be able to support different and/or multiple cloud providers anyway.

3. Keeping the lights on (KTLO)

While not particularly glamorous, maintenance is a fact of life for software engineers supporting code that is running in production. That said, we should actively work to minimize the amount of maintenance work we need to do.

One of the first things I did when I took over the Taskcluster team full-time was halt *all* new and ongoing work to focus on stability for the entire month of February. This was precipitated by a series of prolonged outages in January. We didn’t have an established error budget at the time, but if we had, we would have completely blown through it.

Our focus on stability had many payoffs, including more robust deployment stories for many of our services, and a new IRC channel (#taskcluster-bots) full of deployment notices and monitoring alerts. We needed to put in this stability work to buy ourselves the time to work on redeployability.

What are we not doing?

With all the current work on redeployability, it’s tempting to look ahead to when we can incorporate some of these improvements into the current Firefox CI setup. While we do plan to redeploy Firefox CI at some point this year to take advantage of these systemic improvements, it is not our focus…yet.

One of the other things I love about the Taskcluster team is that they are really good at supporting community contribution. If you’re interested in learning more about Taskcluster or even getting your feet wet with some bugs, please drop by the #taskcluster channel on IRC and say Hi!

Mozilla Taskcluster management

Experiments in productivity: the shared bug queue

Maybe you have this problem too

You manage or are part of a team that is responsible for a certain functional area of code. Everyone on the team is at a different point in their career. Some people have only been there a few years, or maybe even only a few months, but they’re hungry and eager to learn. Other team members have been around forever, and due to that longevity, they are go-to resources for the rest of your organization when someone needs help in that functional area. More-senior people get buried under a mountain of review requests, while those less-senior engineers who are eager to help and grow their reputation get table scraps.

This is the situation I walked into with the Developer Workflow team.

This was the first time that Mozilla had organized a majority (4) of build module peers in one group. There are still isolated build peers in other groups, but we’ll get to that in a bit.

With apologies to Ted, he’s the elder statesman of the group, having once been the build module owner himself before handing that responsibility off to Greg (gps), the current module owner. Ted has been around Mozilla for so long that he is a go-to resource not only for build system work but also for the many other projects he’s been involved with, e.g. crash analysis. In his position as module owner, Greg bears the brunt of the current review workload for the build system. He needs to weigh in on architectural decisions, but also receives a substantial number of drive-by requests simply because he is the module owner.

Chris Manchester and Mike Shal by contrast are relatively new build peers and would frequently end up reviewing patches for each other, but not a lot else. How could we more equitably share the review load between the team without creating more work for those engineers who were already oversubscribed?

Enter the shared bug queue

When I first came up with this idea, I thought that certainly this must have been tried at some point in the history of Mozilla. I was hoping to plug into an existing model in bugzilla, but alas, such a thing did not already exist. It took a few months of back-and-forth with our resident Bugmaster at Mozilla, Emma, to get something set up, but by early October, we had a shared queue in place.

How does it work?

We created a fictitious meta-user, core-build-config-reviews@mozilla.bugs. Now whenever someone submits a patch to the Core::Build Config module in bugzilla, the suggested reviewer always defaults to that shared user. Everyone on the team watches that user and pulls reviews from “their” queue.

That’s it. No, really.

Well, okay, there’s a little bit more process around it than that. One of the dangers of a shared queue is that since no specific person is being nagged for pending reviews, the queue could become a place where patches go to die. As with any defect tracking system, regular triage is critically important.
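To make the mechanics concrete, here is a minimal sketch of the idea in Python. It is a toy model, not how bugzilla implements anything: the `SharedQueue` class, the bug numbers, the peer names, and the seven-day staleness threshold are all assumptions made up for illustration. Only the shared reviewer address comes from the setup described above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# The shared meta-user from the post; everything else here is hypothetical.
SHARED_REVIEWER = "core-build-config-reviews@mozilla.bugs"


@dataclass
class ReviewRequest:
    bug_id: int
    requested_on: date
    reviewer: str = SHARED_REVIEWER  # defaults to the shared queue


class SharedQueue:
    """Toy model of a review queue that team members pull from."""

    def __init__(self) -> None:
        self.requests: List[ReviewRequest] = []

    def submit(self, bug_id: int, requested_on: date,
               reviewer: Optional[str] = None) -> ReviewRequest:
        # A requester can still name a specific person (an escalation);
        # otherwise the request lands on the shared user.
        request = ReviewRequest(bug_id, requested_on,
                                reviewer or SHARED_REVIEWER)
        self.requests.append(request)
        return request

    def pending(self) -> List[ReviewRequest]:
        # Requests still assigned to the shared user, i.e. unclaimed.
        return [r for r in self.requests if r.reviewer == SHARED_REVIEWER]

    def pull(self, peer: str) -> Optional[ReviewRequest]:
        # A peer claims the oldest unclaimed request, if there is one.
        unclaimed = self.pending()
        if not unclaimed:
            return None
        oldest = min(unclaimed, key=lambda r: r.requested_on)
        oldest.reviewer = peer
        return oldest

    def stale(self, today: date, max_age_days: int = 7) -> List[ReviewRequest]:
        # Unclaimed requests at least max_age_days old: the triage list.
        cutoff = today - timedelta(days=max_age_days)
        return [r for r in self.pending() if r.requested_on <= cutoff]
```

In a weekly triage meeting, the output of something like `queue.stale(date.today())` is the list you would walk through, so that nothing sits on the shared user without an owner.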

Is it working?

In short: yes, very much so.

Subjectively, it feels great. We’ve solved some tricky people problems with a pretty straightforward technical/process solution and that’s amazing. From talking to all the build peers, they feel a new collective sense of ownership of the build module and the code passing through it. The more-senior people feel they have more time to concentrate on higher level issues or deeper reviews. The less-senior people are building their reputations, both among the build peers and outside the group to review requesters.

Numerically speaking, the absolute number of review requests for the Core::Build Config module is consistent since the adoption of the shared queue. The distribution of actual reviewers has changed a lot though. Greg and Ted still end up reviewing their share of escalated requests — it’s still possible to assign reviews to specific people in this system — but Mike Shal and Chris have increased their review volume substantially. What’s even more awesome is that the build peers who are *NOT* in the Developer Workflow team are also fully onboard, regularly pulling reviews off the shared queue. Kudos to Nick Alexander, Nathan Froyd, Ralph Giles, and Mike Hommey for also embracing this new system wholeheartedly.

The need for regular triage has also provided another area of growth for the less-senior build peers. Mike Shal and Chris Manchester have done a great job of keeping that queue empty and forcing the team to triage any backlog each week in our team meeting.

Teh Future

When we were about to set this up in October, I almost pulled the plug.

Over the next six months, Mozilla is planning to switch code review tools from mozreview/splinter to phabricator. Phabricator has more modern built-in tools like Herald that would have made setting up this shared queue a little easier, and that’s why I paused…briefly.

Phabricator will undoubtedly enable a host of quality-of-life improvements for developers when it is deployed, but I’m glad we didn’t wait for the new system. Mozilla engineers are already getting accustomed to the new workflow and we’re reaping the benefits *right now*.

Mozilla bugzilla Developer Workflow management

Welcome, Connor!

Connor McDavid — This is *not* our Connor.

This post is *ahem* several months overdue, but I’m happy to welcome Connor Sheehan to the team.

Connor was a two-time intern with the Mozilla release engineering team. In that capacity, he became well acquainted with some of the bottlenecks in our CI system. We’ve brought him onboard to assist gps with stabilizing and scaling our mercurial infrastructure.

Welcome, Connor!

Mozilla hg vcs newhire hiring

Work Week Logistics, Revisited

I’ve written before about how to be productive when distributed teams get together and was anxious to try it out on my “new” (read: six-month-old) team, Developer Workflow. As mentioned in that previous post, we just had a work week in Mountain View, so here’s a quick recap.

Process Improvements

We often optimize work week location around where the fewest people would need to travel to attend. While this does make things logistically easier, it also introduces imbalance. Some people will have traveled very far, while some people will be able to sleep in their own beds. Conversely, the local people may feel they need to go home every night in order to be with their partners/families/cats and may miss out on the informal bonding that can happen at group dinners and such.

We had originally intended to meet in San Francisco, but other conferences had jacked up hotel rates, so we decided to decamp to the Valley. I offered to have the SF residents book rooms to avoid the daily commute up and down the peninsula. They didn’t all take me up on it, but it was an opportunity to put everyone on more equal footing.

Schedule-wise, I set things up so that we had our discussion and planning pieces in the morning each day while we were still fresh and caffeinated. After lunch, we would get down to hacking on code. Ted threw together a tracking tool to help visualize the Makefile burndown. Ted is also great at facilitating meetings, keeping us on track especially later in the week as we all started to fade.

Accomplishments

So what did we actually get done? Like the old adage about a station wagon full of tapes, never underestimate the review bandwidth of 4 build peers hacking in a room together for an afternoon. We accomplished quite a bit during our time together.

Aside from the 2018 planning detailed in the previous post, we also met with mobile build peer Nick Alexander and planned how to handle mobile Makefiles. The mobile version of Firefox now builds with gradle, so it was important not to step on each other’s toes. Another huge proportion of the remaining Makefiles involve l10n. We figured out how to work around l10n for now, i.e. don’t break repacks, to get a tup build working, and we’ve set up a meeting with the l10n team for Austin to discuss their plans for langpacks and a future that might not involve Makefiles at all. The l10n stuff is hairy, and might be partially my fault (see previous comment re: cargo-culting), so thanks to my team for not shying away from it.

On a concrete level, Ted reports that we’ve removed 13 Makefiles and ~100 lines of other Makefile content in the past month, much of which happened over the past few weeks. Greg has also managed to remove big pieces of complexity from client.mk, assisted by reviews from Chris, Mike, Nick and other build peers. We’re getting into the trickier bits now, but we’re persevering.

All in all, a very successful work week with my “new” team. I continue to find subtle ways to make these get-togethers more effective.

Mozilla distributed teamweek workweek logistics

Introducing The Developer Workflow Team

I’ve neglected to write about the *other* half of my team, not for any lack of desire to do so, but simply because the code sheriffing situation was taking up so much of my time. Now that the SoftVision contractors have gained the commit access required to be fully functional sheriffs, I feel that I can shift focus a bit.

Meet the team

The other half of my team consists of 4 Firefox build system peers. My team consists of:

When the group was first established, we talked a lot about what we wanted to work on, what we needed to work on, and what we should be working on. Those discussions revealed the following common themes:

We have a focus on developers. Everything we work on is to help developers be more productive, and go more quickly.
We accomplish this through tooling to support better/faster workflows.
Some of these improvements can also assist in automation, but that isn’t our primary focus, except where those improvements are also wins for developers, e.g. faster time to first feedback on commit.
We act as consultants/liaisons to many other groups that also touch the build system, e.g. Servo, WebRTC, NSS etc.

Based on that list of themes, we’ve adopted the moniker of “Developer Workflow.” We are all build peers, yes, but to pigeon-hole ourselves as the build system group seemed short-sighted. Our unique position at the intersection of the build system, VCS, and other services meant that our scope needed to match what people expect of us anyway.

While new to me, Developer Workflow is a logical continuation of the build system tiger team organized by David Burns in 2016. This is the same effort that yielded sea change improvements such as artifact builds and sccache.

In many ways, I feel extremely fortunate to be following on the heels of that work. During the previous year, all the members of my team formed the working relationships they would need to be more successful going forward. All the hard work for me as their manager was already done! ;)

What are we doing

We had our first, dedicated work week as a team last week in Mountain View. Aside from getting to know each other a little better, during the week we hashed out exactly what our team will be focused on next year, and made substantial progress towards bootstrapping those efforts.

Next year, we’ll be tackling the following projects:

Finish the migration from Makefiles to moz.build files: A lot of important business logic resides in Makefiles for no good reason. As someone who has cargo-culted large portions of l10n Makefile logic during my tenure at Mozilla, I may be part of the problem.
Move build logic out of *.mk files: Greg recently announced his intent to remove client.mk, a foundational piece of code in the Mozilla recursive make build system that has existed since 1998. The other .mk files won’t be far behind. Porting true build logic to moz.build files and removing non-build tasks to task-based scripts will make the build system infinitely more hackable, and will allow us to pursue performance gains in many different areas. For example, decoupled tests like package tests could be run asynchronously, getting results to developers more quickly.
Stand up a tup build in automation: this is our big effort for the near-term. A tup build is not necessarily an end goal in-and-of itself (we may very well end up on bazel or something else eventually), but since Mike Shal created tup, we control enough of the stack to make quick progress. It’s a means of validating the Makefile migration.
Move our Linux builder in automation from CentOS 6 to Debian: This would move us closer to deterministic builds, and has alignment with the Tor Project, but requires we host our own package servers, CDN, etc. This would also make it easier for developers to reproduce automation builds locally. glandium has a proof-of-concept. We hope to dig into any binary compatibility issues next year.
Weaning off mozharness for builds: mozharness was a good first step at putting automated build configuration information in the tree for developers. Now that functionality could be better encapsulated elsewhere, and largely hidden by mach. The ultimate goal would be to use the same workflow for developer builds and automation.

What are we not doing

It’s important to be explicit about things we won’t be tackling too, especially when it’s been unclear historically or where there might be different expectations.

The biggest one to call out here is github integration. Many teams at Mozilla are using github for developing standalone projects or even parts of Firefox. While we’ve had some historical involvement here and will continue to consult as necessary, other teams are better positioned to drive this work.

We are also not currently exploring moving Windows builds to WSL. This is something we experimented with in Q3 this year, but build performance is still so slow that it doesn’t warrant further action right now. We continue to follow the development of WSL and if Microsoft is able to fix filesystem performance, we may pick this back up.

Mozilla developer tools build system tup workflow
