No More Green Screen

"heck u green screen, now machine learning is my best friend." ~~abraham lincoln

Howard, Reborn.

TL;DR: This week, I wrote the logic for processing real user-given input videos and completely rewrote the first half of the network, and have gotten some amazing results!

Hi, I’m Kai. If you’re new here, let me catch you up to speed:

  1. I’m building a neural network which removes the background from videos of people.

  2. It’s based on this paper by Soumyadip Sengupta and his team at the University of Washington.

  3. The neural network is named Howard, but if you really like, you can call him Howie.

The way this network functions is simple – it takes two videos as input: one of a person in some environment, and one of just the environment. As long as the camera motion is similar in both shots, the network will output an image sequence where the person has been “cut out” from the environment. If you’ve ever used Zoom’s virtual background feature, it’s like that – but better.

Seasoned Experts vs. One Inexperienced Boi

A week ago, I had just barely gotten a janky, low-quality version of Howard to fully work for the first time. I wrote about it here.

The reason Howard wasn’t working very well is because I have pride issues – by which I mean instead of just implementing the proven algorithm from the academic paper written by folks way smarter than me, I thought to myself “ooh, what if I just tried to come up with my own original architecture? What could go wrong?”

Turns out that in a battle of wits between a teenager with no experience whatsoever and seasoned field experts, the experts tend to win.

This week, I finally just decided to bin my old, terrible architecture and exactly replicate the architecture in Sengupta et al.’s paper. And guess what! It works spectacularly!

Here’s how my old architecture fared on some test videos:

Aaand here’s how the new, expert-designed architecture does on those same test videos:

Spot the difference, I dare you.

Making Howard Almost Usable

I’d like to talk about how those videos were actually made, because that was the other thing I did this week.

Up until this week, Howard had only ever tried to operate on fake, synthesized training data. Training data is special because it contains not just an example of an input Howard might see, but also an example of what the correct output for that input is. This means that Howard can learn from his mistakes while operating on the training data.

In a real use case, though, Howard won’t have the right answer – he’ll just have two video files given to him by a user, and he’ll have to make his best guess at what the right way to cut the subject out is.

So, you say, just feed those videos to him and see what he spits out! Well, that doesn’t quite work for one big reason:

The two user-generated videos very likely have significantly different camera translation, rotation and clip length.

This is a problem because it means that the “naive solution” of just taking frame X from the subject video and frame X from the background video and having Howard crunch them doesn’t work at all.

In machine learning, there’s a concept called the train/test domain gap. The idea is that for many problems worth solving with ML, the data you’d feed to the neural network in a real-world use case and the data you have available to train that network on come from different statistical distributions. In my case, the gap exists because there’s no large, accessible dataset of moving-camera videos of people in their natural environments, matching videos of just those environments, and ground-truth cutouts of those people. I’d be really happy if that kind of dataset existed, but it doesn’t, so I have to settle for the next-best thing: synthesized training data.

The data that I actually train Howard on is similar to the ideal data, but still comes from a very different distribution – for example, in the training data, the background behind the person and the background that is fed to the network are only ever slightly misaligned, whereas in real use cases, the two are very likely significantly misaligned.

So if I can’t make the training data have the same statistical distribution as the real-world use-case data, then I need a way to do the opposite – make the real-world use-case data match the training data’s distribution as closely as I can. Following the advice of Mr. Sengupta, who kindly provided me with some very useful resources and pointers after I emailed him, I’m accomplishing this by using homographies.

A homography (in the context of this project) is a kind of transformation you can apply to an image. Specifically, it’s a transformation that warps the perspective, translation, rotation, and scale of a given query image such that it aligns with a given test image. This is very handy because it means I can warp the user-given background frames to match more closely with the user-given foreground frames, thus significantly closing the domain gap between the real-world and train data!
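To make that concrete: a homography is just a 3x3 matrix applied to pixel coordinates in homogeneous form. Here’s a minimal numpy sketch of what “warping” means mathematically – the function name and the example matrix are mine, purely for illustration; in practice a library like OpenCV estimates and applies the matrix for you:

```python
import numpy as np

def apply_homography(H, points):
    """Apply a 3x3 homography to an (N, 2) array of pixel coordinates."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # lift to homogeneous coords
    warped = pts_h @ H.T
    return warped[:, :2] / warped[:, 2:3]  # divide out w to get back to 2D

# A toy homography that scales by 2 and translates by (5, 3).
H = np.array([[2.0, 0.0, 5.0],
              [0.0, 2.0, 3.0],
              [0.0, 0.0, 1.0]])

corners = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
print(apply_homography(H, corners))
```

In the real pipeline, of course, H isn’t hand-written like this – it’s estimated from matched feature points between the two frames.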

To sum it all up, here’s what Howard actually does when I give him a pair of real-world videos I want him to process:

For every source frame from the video of the person in their environment…

Example source frame


  1. Grabs a few background frames that are temporally nearby from the video of just the environment (one example is pictured):

Example background frame

  2. Detects hundreds of unique features in each background frame
  3. Matches those features with features detected in the source frame:

Feature matching being done on a background and source frame

  4. Applies a homography transformation to line up the matched sets of features as closely as possible
  5. Picks the background frame from the set that aligns the closest with the source frame after being warped:

Best bg frame after warpage

  6. Feeds the source frame and the selected warped background frame to Howard, who processes them and outputs the result as an image with an alpha channel.

This results in a sequence of RGBA PNG files being output to a directory, which I can then use in any modern compositing software however I like!
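The frame-selection step in that pipeline can be sketched in a few lines. This is an illustrative version (the actual alignment score I use may differ): given the source frame and a set of already-warped candidate background frames, score each candidate by mean absolute pixel difference and keep the best one.

```python
import numpy as np

def pick_best_background(source, warped_candidates):
    """Pick the warped background frame that best matches the source frame.

    Uses mean absolute pixel difference as the alignment score; this is
    just an illustration, not necessarily the exact metric in my pipeline.
    """
    errors = [np.mean(np.abs(source.astype(float) - c.astype(float)))
              for c in warped_candidates]
    best = int(np.argmin(errors))
    return best, errors[best]

# Toy example: candidate 1 is a perfect match for the "source" frame.
source = np.full((4, 4, 3), 128, dtype=np.uint8)
candidates = [np.zeros((4, 4, 3), np.uint8),
              np.full((4, 4, 3), 128, np.uint8),
              np.full((4, 4, 3), 255, np.uint8)]
idx, err = pick_best_background(source, candidates)
print(idx, err)  # 1 0.0
```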


This week was great! Even though I made a ton of progress, I still have a long way to go before Howard is ready to be shared fully with the world. That said, I’m totally excited for the next couple of weeks of work, because I do feel like I’m nearing the finish line.

Thanks for reading,


P.S. – The most fun part of all of this frequently turns out to be the craziness Howard outputs when I accidentally break him! Here are some of my favorite psychedelic art pieces I inadvertently caused him to make:

Psychedelic art piece 1 Psychedelic art piece 2 Psychedelic art piece 3

Who knew Howard was such a talented surrealist?

Follow me on TikTok, Twitter, or YouTube

Howard has been upgraded to FHD!

TL;DR: I implemented a refinement network that upscales Howard’s 480p outputs to 1080p, based closely on the method proposed by Sengupta et al.

Hi! I’m Kai. If you’re new here, here’s the rundown:

  1. I’m building a neural network based on this paper that removes the background from videos.
  2. Given a video of a person, and a video of the background, the goal is for the network to output a video of the person “cut out” from the background.
  3. The neural network is affectionately dubbed “Howard”.

Last week, I got to the point where Howard was outputting low-resolution images called alpha mattes. These mattes serve as a guide for how transparent every pixel of the final output should be. Let’s take this image as an example:

example composite

A perfect alpha matte for this image would look like this:

good alpha matte

As you can see, the image is solid white for areas that should be 100% opaque, like the person’s nose, solid black for bits of the background that need to be cut out, and somewhere in between for wispy details like hair.
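That white/black/in-between meaning maps directly onto the standard alpha compositing equation, out = alpha * fg + (1 - alpha) * bg. A tiny numpy sketch with toy values, just to show the semantics:

```python
import numpy as np

def composite(foreground, background, alpha):
    """Standard alpha compositing: out = alpha * fg + (1 - alpha) * bg.

    alpha is a float matte in [0, 1]: 1.0 = fully opaque (keep the person),
    0.0 = fully transparent (show the new background), in between = wispy.
    """
    a = alpha[..., None]  # broadcast the (H, W) matte over the 3 color channels
    return a * foreground + (1 - a) * background

fg = np.full((2, 2, 3), 200.0)   # "person" pixels
bg = np.zeros((2, 2, 3))         # new background
alpha = np.array([[1.0, 0.5],
                  [0.0, 1.0]])   # opaque, wispy, transparent, opaque
print(composite(fg, bg, alpha)[:, :, 0])
```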

Now, let’s compare the perfect matte above to one which Howard produced (not using the same input image as above, but the point still stands):

howard’s try

It’s… lackluster. There are lots of ways Howard’s outputs could be made better, but this past week, I decided to focus on one in particular: resolution. The inputs to Howard are 1080p frames of video, but at the beginning of this week, Howard’s outputs were limited to 480p.

480p vs 1080p

Enter the refinement network. The refinement network first upsamples the low-resolution output from Howard’s first half, then takes tiny square patches, each only 4 by 4 pixels, and refines them. The result of this refinement can be seen below:

Original 480p output:


Refined 1080p output:


So… it’s still a weird blob, but at least it’s in full, glorious 1080p.

Now, the way this refinement network (as proposed in the paper) works is actually quite clever. The main idea is that refining every single 4x4 patch of a 1080p image would be way too much unnecessary work.

This is because alpha mattes are mostly made up of solid white and solid black regions that can be upscaled or downscaled arbitrarily without loss of quality – that is to say, a blurry solid white patch is the same as a sharp solid white patch, because they’re both still just solid white.

So, what Sengupta and his team do to reduce unnecessary computation is make Howard generate not just a coarse 480p alpha matte, but also an image called an error map.

Howard’s coarse generator inputs and outputs

Basically, the brighter a pixel is in the error map, the less confident Howard is about his results for that area of the image. This is useful to us, because it means we can use the error map to tell the refinement network where to focus its attention! By not having the refinement network waste time refining the solid white or solid black areas, we get our 1080p result nice and crisp while using a fraction of the computation power and system memory to pull it off.
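Here’s a toy numpy sketch of that selection idea (patch size and scoring are illustrative, not the paper’s exact procedure): carve the error map into 4x4 patches, score each patch by its mean error, and return the origins of the k worst ones – those are the only patches the refinement network would touch.

```python
import numpy as np

def select_patches(error_map, k, patch=4):
    """Return the (row, col) origins of the k patches with the highest mean
    error. Only these patches get sent through the refinement network;
    everything else keeps its cheap upsampled value."""
    h, w = error_map.shape
    # Average the error within each non-overlapping patch-by-patch tile.
    scores = error_map.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    flat = np.argsort(scores, axis=None)[::-1][:k]  # indices of the k worst patches
    rows, cols = np.unravel_index(flat, scores.shape)
    return [(int(r) * patch, int(c) * patch) for r, c in zip(rows, cols)]

# Toy 8x8 error map: only the top-left 4x4 patch has high error.
err = np.zeros((8, 8))
err[:4, :4] = 1.0
print(select_patches(err, k=1))  # [(0, 0)]
```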

I’m not sure if I ever would have come up with something like this on my own! It’s a super clever technique, and it’s only one of many things I’ve… uh… academically borrowed from Sengupta and his team’s research paper :)


This post is a brief one, but I hope you enjoyed seeing the progress Howard made this week. On the agenda for the coming week, I have plans to work on the system Howard will use to actually accept and ingest video files (so far he’s only ever tried training data). Look out for a post next week regarding the pain I go through as I try to get feature matching and homography solving working!

Thanks for reading,


Follow me on TikTok, Twitter, or YouTube

The Story So Far

4 months of progress towards eliminating green screen from playoff contention.

Hi, I’m Kai. I’m an 18y/o filmmaker who’s using machine learning to help me make better movies.

I like to use visual effects in the short films I make (and publish on TikTok and YouTube). I frequently use green screen in my workflow.

Me in front of a green screen

Green screen is extremely handy because it lets me “cut out” a foreground element (like a person) from the background. This process is known as keying. Then I can composite that foreground on top of any other background I want:

Me in front of a cyberpunk city

The problem is, green screen is a pain to work with. It takes up a lot of space, it must be wrinkle-free and evenly lit, and you’ve got to frame every shot so that the subject is always in front of it. Overall, not ideal. We can do better.

Doing better

This project started after I watched this video from the excellent channel Two Minute Papers on YouTube. The video covers a paper from the University of Washington which shows that deep learning can be used to do the job of green screen when there’s no green screen present.

Sengupta’s research

Example inputs and outputs to the network created by Soumyadip Sengupta and his team.

Fascinated, I soon found an updated v2 of the original paper which could run in real-time. Both v1 and v2 seemed very promising, but they both had one major drawback – they couldn’t handle moving cameras. This was a bit of a dealbreaker for me, because I really like moving camera shots.

I decided I wanted to make my best effort to extend the research I’d found. The only problem was that I barely knew anything about machine learning.

I spent the next 11-ish weeks learning as much as I could about ML. I started with Andrew Ng’s iconic introductory ML course (thanks for the recommendation, HN!), and worked my way up to Coursera’s Deep Learning and GAN specializations.

After completing these, I began work on replicating the algorithm presented in the v2 paper from the UoW. Even though their code is open-sourced on GitHub, I wanted to write my own implementation for two reasons –

A.) I knew I’d get a much deeper understanding of the problem if I did.

B.) It would be nice to legally own the copyright to the code I wrote.

After grappling with PyTorch errors for a few weeks and bumbling about without a clue what I was doing, I finally got my model to output something that wasn’t random noise:

first reasonable output

This was about three weeks ago at the time of writing. Things have been moving quickly since then.

Making the thing do the thing

From there, it was a long process of iterating on my network design, bringing it closer and closer to Sengupta et al.’s architecture. Write a few lines of code, test, debug, repeat.

The training data that I give to Howard (yes, the neural network’s name is Howard. You can call him Howie if you prefer) consists of thousands of images that I create by compositing different green-screen clips of people onto random backgrounds, a form of data synthesis.

first training example second training example
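A minimal sketch of that synthesis step (toy shapes and values, not my actual pipeline code): composite the keyed-out foreground over a random background using the ground-truth matte from the green screen key, and keep the matte as the training label.

```python
import numpy as np

def synthesize_example(fg, gt_alpha, bg):
    """Make one training pair by compositing a keyed-out foreground onto a
    random background. The composite becomes the network input; the matte
    we already know (from the green screen key) is the training target."""
    a = gt_alpha[..., None]                 # broadcast matte over color channels
    composite = a * fg + (1 - a) * bg
    return composite, gt_alpha

rng = np.random.default_rng(0)
fg = np.full((4, 4, 3), 1.0)                # keyed-out "person" (all white here)
gt_alpha = np.zeros((4, 4))
gt_alpha[1:3, 1:3] = 1.0                    # the person's silhouette
bg = rng.random((4, 4, 3))                  # random background image
x, y = synthesize_example(fg, gt_alpha, bg)
```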

Howard’s goal is to produce what’s known as an alpha matte. An alpha matte is an image used to create the cut-out of the foreground. Wherever the matte is white, the original image is preserved, and wherever the matte is black, the image becomes transparent.

In order to train Howard, I compare his guesses for what he thinks the matte should look like against the ground-truth real matte (which I extracted from the green screen clip when I originally synthesized the data). Based on that comparison, Howard can make changes to himself in order to get better and better at guessing what the matte should be.
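That “comparison” is a loss function. My actual training objective follows the paper and combines several terms, but the simplest piece is an L1 (mean absolute error) loss on the matte, which a quick numpy sketch can show:

```python
import numpy as np

def l1_matte_loss(predicted, ground_truth):
    """Mean absolute error between predicted and ground-truth mattes.
    (The real training objective combines several losses; L1 on the matte
    is just the simplest piece of the comparison described above.)"""
    return float(np.mean(np.abs(predicted - ground_truth)))

gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0                  # ground-truth silhouette
bad_guess = np.full((4, 4), 0.5)    # Howard early in training: unsure everywhere
good_guess = gt.copy()              # a perfect guess

print(l1_matte_loss(bad_guess, gt), l1_matte_loss(good_guess, gt))  # 0.5 0.0
```

The lower the loss, the better the guess – training nudges Howard’s weights in whatever direction shrinks this number.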

Here, you can see some outputs from an early version of Howard, who at the time was a very simplified version of the paper’s architecture. Notice how the mattes are blurry, indistinct, and don’t look anything like the silhouettes of the people in the training data:

first blurry silhouette second blurry silhouette

After a few weeks of tweaking and making Howard more complex (and closer to the paper), he started producing sharper outputs that were much more accurate. Here are some outputs he made a few days ago:

first sharp output second sharp output

When I saw those outputs, I was ecstatic, because they prove that my idea is viable!

The really fun part about this whole process so far, though, has been the outputs Howard generates when I write the wrong code. At one point, I included too many batch normalization layers, which resulted in some interesting black and white detail soups:

bw_detail_soup1 bw_detail_soup2

Later on, I started requiring the network to generate a color foreground as part of its output, producing full RGB psychedelic trippiness in the early stages of training when Howard is still figuring out what he’s supposed to be doing. These are my personal favorites:

psychedelic1 psychedelic2


All in all, these past 4 months have been wild. I’ve gone from knowing almost nothing about machine learning to writing my own implementation of a research paper in PyTorch! One Ask HN post has changed my whole life.

I’ve thoroughly enjoyed the whole process, and I’m very much looking forward to the next couple weeks, because my gut tells me the network is really going to start taking shape soon. I’m on the home stretch, and I’m really looking forward to the day I get the whole darn thing to just work 100%. It’s gonna be epic.

Thanks for reading,

– Kai

P.S. I’m a part of the Pioneer tournament! Look out for my username, mkaic, on the global leaderboard :)