Sr. Software Engineer - Spark Compute Infrastructure, ML Platform

This job is no longer open

Would you like to manage our Spark compute infrastructure and optimize the ML Spark pipelines that power Netflix recommendations? We think of the Netflix service as hundreds of millions of different products, each serving a uniquely personalized experience to one of our 200+ million members. One of the teams powering this effort is the ML Platform Data & Feature Infra team, which is responsible for building the scalable and efficient compute infrastructure used to train our personalization ML models.

The Opportunity

In this role, you will manage the Spark compute infrastructure used to train the ML algorithms that power Netflix personalization. You will drive operational excellence through tooling and automation, and you will work closely with ML researchers and engineers to scale their ad hoc explorations and manage production ML pipelines. This role will give you intimate knowledge of Netflix personalization while working for a unique and pioneering company that is redefining how video content is consumed globally.

Here are some examples of the types of things you would work on:

  • Optimize the ML Spark pipelines for both resource and latency efficiency, and help with capacity planning for our compute infrastructure
  • Increase research productivity by quickly troubleshooting Spark performance issues and removing roadblocks to adoption of our compute infrastructure
  • Build tools and automation to make the infrastructure more robust and to report on cluster cost, utilization, and efficiency
  • Manage a large-scale Spark cluster (several thousand EC2 instances) that powers the ML production pipelines fueling innovation in Recommendations research
  • Collaborate with our Big Data Platform teams to build, deploy, and upgrade our compute infrastructure using the latest open source libraries

To learn more, here are some talks/blog posts from the team:

Minimum Qualifications

  • 4+ years of relevant experience managing large-scale distributed data systems
  • Strong automation mindset and a passion for root cause analysis and strategies to mitigate issues
  • Experience with big data technologies such as Spark, Mesos/YARN/Kubernetes, HDFS, or Elasticsearch
  • Experience with performance tuning and debugging scalability issues in Spark applications
  • Excellent communication and people engagement skills
  • Expertise in scripting languages
  • Experience with cloud computing platforms such as Amazon Web Services (AWS)

Preferred Qualifications

  • Exposure to functional languages such as Scala
  • Experience working with notebooks such as Jupyter or Polynote
  • Experience working with container platforms (e.g., Docker)

Netflix is an equal opportunity employer and strives to build diverse teams from all walks of life. We offer a unique culture of freedom and responsibility with a clear long-term view. We recommend reading through these to understand what working at Netflix is like.