Head of ML Infra

This job is no longer open

Working hours

🌎 Given that we are an all-remote company and hire almost anywhere in the world, we don’t have a particular time-zone preference for this role. However, you may need to be available for non-recurring urgent meetings outside of working hours.

Why this job is exciting

We are creating a machine learning team at Sourcegraph, aimed at creating the most powerful coding assistant in the world. Many companies are trying, but Sourcegraph has a unique advantage: Our rich code graph. In the world of prompting LLMs, context is key, and for creating the right context, Sourcegraph’s code data is simply the best you can get: IDE-quality, global-scale, and served lightning fast. Cody is already outperforming the pack, but we aim to take the lead in machine learning advancements on coding assistant quality. You can help us unlock Cody’s full potential, delivering a product that accelerates development in a way we only see every 10-15 years.

To head up this effort, we are looking for a seasoned and deeply technical ML engineering leader, with a strong AI background and experience with both smaller models and the new LLM ecosystem, who can help us deliver the world’s best coding assistant and ML-powered developer tooling. And if you happen to have an entrepreneurial streak, you’re in luck: we have an enterprise distribution pipeline, so whatever you build can be deployed straight to enterprise customers with some of the largest codebases in the world, without all the go-to-market hassle you’d encounter in a startup.

Within one month, you will…

  • Meet your team, which initially consists of 3 to 5 ML engineers (2 are already on the team).

  • Start building a trusting relationship with your direct reports and peers.

  • Come up to speed on the current state of machine learning in the Cody ecosystem.

  • Be set up for local development and familiar with Cody’s architecture.

  • Define our short-term roadmap for ML Infrastructure on GCP.

  • Ship a substantial feature, experiment, or evaluation.

Within three months, you will…

  • Set up the at-scale infrastructure for running benchmarks that compare coding assistants.

  • Have defined a strategy for how we will address getting GPUs at scale for various personas.

  • Have defined a rough roadmap for how to cost-optimize our ML spend.

  • Have defined our on-prem/self-hosted roadmap and recommended configurations for ML infra.

  • Be up to speed and driving Sourcegraph’s ML Infra strategy.

Within six months, you will…

  • Have hired a world-class team of ML engineers.

  • With the help of our research team, have delivered an ML-driven quality, benchmarking, and evaluation framework for coding assistants that runs at scale.

  • Have established a longer-term roadmap that keeps us aligned with expected advances in LLMs.

  • Be running dozens to hundreds of experiments with prompting, embedding, fine-tuning and other techniques.

About you 

You have been working squarely in ML Infra since LLMs landed, if not longer.

  • You’re deeply familiar with at least one end-to-end system for ML pipelines at scale, and you are broadly familiar with the competing options in the space and when each is appropriate.
  • In an ideal world, you are most deeply familiar with GCP’s machine learning stack and have extensive practice operationalizing PyTorch experiments on it. Experience with Apache Spark is also a plus.
  • You should be the kind of person who lives and breathes GPUs, and you should come armed with opinions about how best to deploy and cost-optimize Cody for our various customer classes, from large enterprises to casual hackers, particularly when it comes to cloud-side deployments.
  • In a perfect world, you would already be comfortable with the options enterprise customers might want for self-hosted ML infra and for running their own pipelines, e.g., other cloud-hosted offerings and/or open-source software. Although we are pushing hard to have everything on GCP, the market is evolving rapidly and we could, for instance, come across customers who want to provide their own GPUs.
  • Any familiarity with deploying enterprise SaaS is a huge bonus, because it is part of the role. However, it’s something you can pick up if you are already familiar with cloud options.
  • Bonus if you have any background in graph theory or anything that would be relevant to our code graph, which plays a key role in the production of both training data and in acting as a source of truth for verifying model outputs.
  • We would love it if you are actively following developments in open-source models and training systems, and can come prepared with opinions about when and to what extent we should adopt them. Or more importantly, how we set up infrastructure that will tell us when they are ready, by evaluating their performance on Cody tasks.

Best of all, we’d love it if you already have an opinion about Cody, have tried it, and already have a vision for how you can help us make it even better!

Level

📊 This job is an M4. You can read more about our job leveling philosophy in our Handbook.

Compensation

💸 We pay you an above-average salary because we want to hire the best people who are fully focused on helping Sourcegraph succeed, not worried about paying bills. You will have the flexibility to work and live anywhere in the world (unless specified otherwise in the job description), and we’ll never take your location or current/past salary information into account when determining your compensation. As an open and transparent company that values equitable and competitive compensation for everyone, our compensation ranges are visible to every single Sourcegraph Teammate. To determine your salary, we use a number of market and data-driven salary sources and target the high end of the range, ensuring that we’re always paying above market regardless of where you live in the world.

💰 The target compensation for this role is $243,000 USD base.

📈 In addition to our cash compensation, we offer equity (because when we succeed as a company, we want you to succeed, too) and generous perks & benefits.

Interview process [~5.5 hours total]

Below is the interview process you can expect for this role (you can read more about the types of interviews in our Handbook). It may look like a lot of steps, but rest assured that we move quickly and the steps are designed to help you get the information needed to determine if we’re the right fit for you… Interviewing is a two-way street, after all!

We expect the interview process to take ~5.5 hours in total. 

👋 Introduction Stage - we have initial conversations to get to know you better…

  • [30m] Recruiter Screen with Grace Bohl
  • [45m] Technical Background with Beyang Liu
  • [30m] Hiring Manager Screen with Steve Yegge

🧑‍💻 Team Interview Stage - we then delve into your experience in more depth and introduce you to members of the team…

  • [60m] Resume Deep Dive with Grace Bohl
  • [45m] Technical Deep Dive with Dominic Cooney and Julie Tibshirani
  • [60m] Peer Interview with Erika Rice Scherpelz and Chris Pine
  • [Async] Pairing Exercise with the team

🎉 Final Interview Stage - we move you to our final round, where you meet cross-functional partners and gain a better understanding of our business and values holistically…

  • [30m] Values Interview
  • [30m] Leadership Interview with Quinn Slack
  • We check references and conduct your background check

Please note: you are welcome to request additional conversations with anyone you would like to meet but didn’t get to during the interview process.
