Twitch is a video streaming website designed specifically to let people who play computer games share footage of themselves playing. It's the electronically intercommunicated version of the crowds that stand around an arcade cabinet as it racks up a new high score. Due to the nature of the content, the demographic of users for this platform is mostly men, specifically sexually frustrated men. Because of this there is a high demand for female streamers, who appeal to both the viewers' interests and sexuality. Female streamers tend to gather much larger fan bases much more easily, simply due to scarcity. I'm not here to discuss the ethics behind taking advantage of sexually frustrated men, or encouraging women to sell their bodies...
I'm here because there's obviously some cash to be made with this.
Now, this is by no means the most practical approach to exploiting this market, but it is the most /cyb/, and that's all that really matters in life. I got bored, and well, bad things happen when I'm bored... So I hacked together a quick system to generate fake Twitch cam streamers, using tons of sample data, some copy-pasta from GitHub, and a little bit of something special (the tears of a sad programmer at 4 AM).
tl;dr: Have you read Idoru? I want to do Idoru.
Gathering The Sample Data:
The whole thing starts with the sample data. Twitch streamers are the perfect target, as they give out tons of sample data for us each day. Even better, these streamers normally keep their webcam footage as a small "facecam" window in the bottom corner of the screen. Because these videos are a smaller size, it means we can get away with a soykafyer resolution, and have to train fewer neurons for our network. Streamers also normally sit around almost completely static, behind an unchanging background.
I know, it screams GAN
This data format gives us a massive advantage against the two fundamental bottlenecks of deep neural networks: sample data (constant streams) and computational force (low resolution).
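To make the "small window, low resolution" point concrete, here's a rough sketch of cutting the facecam out of a full frame and shrinking it. Everything specific here is an assumption for illustration: the bottom-right corner, the 320x180 window, the 4x shrink, and the function names `crop_facecam` and `downsample`. The frame is random noise standing in for a decoded video frame.

```python
import numpy as np

def crop_facecam(frame, cam_h, cam_w):
    """Return the bottom-right cam_h x cam_w window of an (H, W, 3) frame."""
    return frame[-cam_h:, -cam_w:, :]

def downsample(image, factor):
    """Shrink an (H, W, C) image by averaging factor x factor pixel blocks."""
    h, w, c = image.shape
    h, w = h - h % factor, w - w % factor  # trim so the blocks divide evenly
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

# Stand-in for one 720p frame; a real one would come from a decoded stream.
frame = np.random.randint(0, 256, size=(720, 1280, 3)).astype(np.float32)

facecam = crop_facecam(frame, cam_h=180, cam_w=320)  # assumed facecam size
small = downsample(facecam, factor=4)                # assumed shrink factor

print(facecam.shape, small.shape)  # (180, 320, 3) (45, 80, 3)
```

At these made-up sizes the network only has to deal with 45 x 80 pixels per frame instead of 720 x 1280, which is where the savings in neurons come from.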
Creating Key Frames:
Now, I've already mentioned that I intend to use a GAN to generate the results, and I understand that this might put a bad taste in your mouth. GANs typically produce images, and most people claiming to have AI that makes video just place randomly generated images next to each other. Sometimes, if you're lucky, they'll shove two frames in the same feature vector, but it's still just a measly 2 frames, and lord knows I don't have the funding to crunch more than that at once. This is not my intent at all: I want a genuine fluent video, not some jumpy slide show. "But Jarlold-sama, how will you achieve this with reasonable computing power?" Well, how does a turtle eat a duck? One piece at a time!
But for now, we start with the key frames, the individual images that would make up the soykaf slideshow we normally get. And to do this, we're going to use a Generative Adversarial Network, or a GAN. I'm sure this is well and redundant for most of you, so I'm going to split the paragraph, and if you know what a GAN is already, just skip the next block alright? Good.
Yes, I'm aware my writing is trash and I should have paid attention in my high school English classes instead of memorizing Neuromancer, but this is an AI blog, not a time travel blog. Nothing I can do about it now.
Generative Adversarial Networks:
Eeeeh, you use so many big words the little ones in between don't make sense anymore!
Don't worry, it's not as complicated as it sounds. The whole system starts with two deep neural networks, and if you don't know what those are... Leave. Now. Yes, you. Parking lot, car, goodbye!
Right, so two deep neural networks. The first one is trained over the repository of images in order to solve a simple classification problem: image-of-thing-we-want-to-generate or not-an-image-of-the-thing-we-want-to-generate. So say we wanted to generate dogs, then we'd train this network to identify pictures of dogs. Just your standard categorical convolutional shenanigans.
The second is the generative network. For this we're just going to need a few layers (densely connected in our case, but depending on what you want to do there are reasons and ways to use another convolutional network). The network will get a bunch of random data (from np.random.normal) and try to create an image from it.
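Here's a minimal sketch of what "noise in, image out" looks like for a densely connected generator, written in plain NumPy so it runs anywhere. This is not the network used in this post: the latent size, hidden width, image size, activations, and untrained random weights are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

NOISE_DIM = 100        # assumed size of the random input vector
HIDDEN = 256           # assumed hidden layer width
IMG_H, IMG_W = 45, 80  # assumed greyscale output resolution

# Untrained weights; in a real GAN these are what training adjusts.
w1 = rng.normal(0.0, 0.02, size=(NOISE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(0.0, 0.02, size=(HIDDEN, IMG_H * IMG_W))
b2 = np.zeros(IMG_H * IMG_W)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def generator(noise):
    """Map a batch of noise vectors to a batch of images with pixels in [-1, 1]."""
    hidden = leaky_relu(noise @ w1 + b1)
    flat = np.tanh(hidden @ w2 + b2)
    return flat.reshape(-1, IMG_H, IMG_W)

# The random data the generator starts from, one row per image we want.
noise = np.random.normal(0.0, 1.0, size=(4, NOISE_DIM))
images = generator(noise)
print(images.shape)  # (4, 45, 80)
```

With untrained weights the output is just grey static, of course. The point is only the shape of the thing: every different noise vector gives a different image.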
These two neural networks are typically referred to as the "discriminator" and the "generator"
respectively. The discriminator is always trying to come up with ways to tell which images the generator
makes, and the generator is constantly trying to fool the discriminator. Thus these two networks are in
constant competition, which has the side effect of generating our images. That makes it a generative adversarial
network. Roll credits.
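The competition can be written down as two loss functions pulling in opposite directions. The sketch below uses the common binary cross-entropy formulation (with the "non-saturating" generator loss), which is an assumption on my part and not necessarily what the lifted code does. The function names and the example probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real images scored 1 and generated images scored 0."""
    return (binary_cross_entropy(d_real, np.ones_like(d_real))
            + binary_cross_entropy(d_fake, np.zeros_like(d_fake)))

def generator_loss(d_fake):
    """The generator wants its images scored 1, i.e. mistaken for real ones."""
    return binary_cross_entropy(d_fake, np.ones_like(d_fake))

# Discriminator's "probability this is real" for a batch of real images.
d_real = np.array([0.9, 0.8, 0.95])

# Case 1: the discriminator easily spots the fakes.
d_fake_spotted = np.array([0.1, 0.2, 0.05])
# Case 2: the generator has learned to fool it.
d_fake_fooled = np.array([0.9, 0.85, 0.95])

print(generator_loss(d_fake_spotted) > generator_loss(d_fake_fooled))
print(discriminator_loss(d_real, d_fake_spotted) < discriminator_loss(d_real, d_fake_fooled))
```

Both comparisons come out True: whatever lowers one network's loss raises the other's, and training alternates between the two.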
That's All Swell But Does It Actually Fuuarking Work You Dunce?
Right! The most important part of any project, the actually-doing-it stage. So we could start by flinging together a GAN system from scratch in TensorFlow, but I'd rather make a proof-of-concept before we go too far. So for efficiency's sake, let's grab some random guff's code from GitHub and modify it to fit our purpose. So kudos to this lad, whose code I right lifted. I do intend to make a better suited GAN from scratch of course. But for now all I really did was change the shape of the two networks to match my training data, and drop in some more hidden layers (since this use case is a fair bit more complicated). This way we can get a quick taste for what sort of results, if any, we can look forward to.
Alright, here's the part that everyone's been looking forward to!
(The second gif is the results of using the same process but more frames from the original video)
Well, they aren't the worst results, right? I mean they're visibly a Twitch streamer girl (at times), you can even see headphones! But there are totally problems too. For one, I need a more diverse dataset. Originally I only downloaded 2 streams, because I have slow winter tundra internet and wanted to start training as soon as possible. However, because of this you can see the images are almost always identical, and bear a striking resemblance to their predecessors. Hell, at one point one of the streamers started to play around with some Christmas lights, and even that managed to make its way into the results.
I don't know about you, but to me this is super lame.
The other issue, which I may be misunderstanding, is the weird star-like patterns. If I had to guess, this means our convolutions are too small, and the max-pooling is too small. Just a hunch though.
Round 2: Fight!
Alright, if at first you don't succeed, wait 4 months then try again!
Solving The Over Fitting:
To go about solving our over fitting we have a few options to explore:
+ Adding more diverse sample data
+ Adding more sample data
+ Decreasing training time
+ Adding one or more Dropout layers
So obviously two streams really wasn't a very diverse data set, so I should stop being lazy and download more types of sample data. Then of course more sample data (as in more frames) could help to weed out irregularities in the data, such as the Christmas lights from before. Decreasing training time isn't an option since the network is far from functional. And adding a Dropout layer has some complexities to it, which are in the next paragraph because I'm not an English major.
If you don't know, a Dropout layer (in Keras anyways) is just a layer that randomly turns off
n% of the neurons between two layers. This makes it pretty tricky for the neural network to
simply memorize a bunch of frames, and instead forces it to actually solve the fuuarking problem.
Which is pretty handy. I'm not sure how well this will work with image generation though, see normally
dropout layers get used in classification (and sometimes regression) problems, because there the network
shouldn't be memorizing anything. But here, in our generative network, we're expecting some amount
of memorizing to happen, right? If our only inputs are noise, then the network has to memorize artifacts
that are in all the pictures (like eyes, or headphones?). I could be wrong of course, and it could be that
the network has extrapolated the rules that cause these artifacts (which I really hope is the case because
that'd be rad as fuuark). So hey, only way to know is to test right? I'm going to add a couple
low-percentage dropout layers, and see if that works.
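For the curious, here's roughly what a dropout layer does under the hood, as a NumPy sketch and not Keras's actual source. It uses the usual "inverted dropout" trick: during training a random fraction of values is zeroed and the survivors are scaled up by 1 / (1 - rate) so the expected sum stays the same, and at inference time the layer does nothing. The function name `dropout` and the 20% rate are just for illustration.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly `rate` of the activations while training; pass through otherwise."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a neuron survives
    return activations * mask / keep_prob             # rescale the survivors

rng = np.random.default_rng(0)
acts = np.ones((1000, 100))  # pretend output of some hidden layer

dropped = dropout(acts, rate=0.2, rng=rng)
zero_fraction = float(np.mean(dropped == 0))
print(round(zero_fraction, 2))  # close to 0.2, varies slightly with the seed
print(dropped.max())            # 1.25, survivors scaled by 1 / 0.8

# At generation time nothing is dropped.
same = dropout(acts, rate=0.2, rng=rng, training=False)
```

Because a different random set of neurons vanishes on every training step, no single neuron can be relied on to store a specific frame.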
S C I E N C E !
Solving The Star Pattern Things:
So maybe I'm wrong and these patterns are just a spray-and-pray way that the generative network is using to avoid stitching together smaller output pieces, or maybe Twitch streamers are just way more sparkly than I thought. But for the sake of soykaf and giggles, I'm going to try increasing the size of the convolution and the pooling. I mean it would make sense: this network was first designed for lower resolution images, and now it's running higher ones, so it should need larger convolutions, right?
For sake of clarity, when I say "larger convolutions" I mean "convolutions that take larger chunks of context around each pixel".
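A quick way to see what "larger chunks of context" means: put a single bright pixel in an empty image, run a convolution over it, and count how many output pixels were affected. The naive convolution below is a sketch for illustration only (plain loops, uniform averaging kernels, sizes 3 and 7 picked arbitrarily), not the layers used in the network.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive zero-padded 2D convolution for odd-sized square kernels; output matches input size."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)  # zero padding on every side
    out = np.zeros(image.shape, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = np.sum(padded[y:y + k, x:x + k] * kernel)
    return out

# One bright pixel in the middle of an otherwise empty image.
image = np.zeros((15, 15))
image[7, 7] = 1.0

touched_by_size = {}
for size in (3, 7):
    kernel = np.ones((size, size)) / size**2  # simple averaging kernel
    touched_by_size[size] = int(np.count_nonzero(conv2d_same(image, kernel)))

print(touched_by_size)  # {3: 9, 7: 49}
```

Each output pixel of the 7x7 convolution sees a 49-pixel neighbourhood, against 9 for the 3x3 one, so one layer covers much more of a higher resolution face.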
Okay It's Cooking:
So I wrote and uploaded this while I was waiting for the network to train (which might take multiple days on my out-of-date hardware), so if you're seeing this: I appreciate your frequent checking of my website, but also come back in a few days. Or E-mail me so we can play Minecraft while I wait, my contact info's in the about page!
Not-Okay It's Not-Cooking:
Obviously it's time to do away with the copy-pasted code we used for the proof of concept and write a network properly from scratch. I threw one together in Keras and started iterating over our new more-diverse dataset, then went to fiddle around with different network architectures. And that's when I had a couple thoughts about the project that made me rip it all to pieces (and build it back much better!)
Firstly: why did I diversify the data set? I don't want to regenerate entirely random Twitch streamers each frame, I want to generate the same Twitch streamer in different poses, so I un-diversified the dataset before continuing the training.
Secondly, I figured out how to get rid of those star patterns. They're caused by a mix between a lack of neurons and average pooling. (I think I wrote max pooling before; that's wrong, I used that in the discriminator model, but that's what I get for hacking a network together instead of making my own.) Anyway, what I assume is happening is that the model is unsure of the resolution of the image due to a lack of neurons, so it uses the pooling to create a "low resolution" version, which lacks details but at least covers the general shape of the image. This means that our law of diminishing returns is on steroids, because the network can't really optimize neurons when they're all in use. While experimenting, I found the nicest model I got came from using 3 densely connected layers in the generator (and some heavy dropout layers to hopefully lower the chances of overfitting, but really a little overfitting is fine, as long as the output image can be controlled by the normal distribution we feed it, which might require data from a more animated streamer).
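If the difference between the two kinds of pooling is fuzzy, here's a tiny NumPy sketch. It only shows what each pooling operation does to a single sharp detail; it doesn't reproduce the star patterns themselves. The helper `pool2d` and the 4x4 test image are made up for the example.

```python
import numpy as np

def pool2d(image, size, mode):
    """Non-overlapping size x size pooling over a 2D image ('max' or 'average')."""
    h, w = image.shape
    h, w = h - h % size, w - w % size  # trim so the blocks divide evenly
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")

# A single sharp detail in the top-left corner.
image = np.zeros((4, 4))
image[0, 0] = 1.0

avg = pool2d(image, 2, "average")
mx = pool2d(image, 2, "max")
print(avg[0, 0], mx[0, 0])  # 0.25 1.0
```

Average pooling smears the detail down to a quarter of its brightness, which is the "low resolution version" effect described above. Max pooling keeps the peak but throws away where exactly it was inside the block.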
Anyway, I re-coded it from a blank document, and now it's cooking for real with some nice results.
GAN vs Autoencoders:
So I re-made the GAN and ran it for fun, and the results are pretty neato- but I've come to a decision making point. At first I was making a GAN network, with the plan to generate a new Twitch streamer entirely. The general flow of the completed network would be something like this:
[ GAN ] --> [ Recurrent-Convolutional to make it a video] --> Twitch stream
The pro of this is that we can generate a brand new streamer, who doesn't look like any other. But the negative is a lack of control: we can't really choose what position, facial expression, or reaction the streamer would show.
The other idea I had was to make an autoencoder network to feed into the recurrent one, something like this:
[ Position Specifying Data] --> [ Auto-Encoder ] --> [ Recurrent-Convolutional ] --> Twitch stream
But this presents us with a new problem, if we train the auto-encoder on too specific of a data set, then we're going to end up with a streamer from our database.
We could vary the data so the key components of the autoencoder can control certain aspects of appearance, but that brings a lot of entropy and can make our network less stable (for example, check the difference in consistent results from here versus here). That design would look something like this:
[ Position And Appearance Specifying Data] --> [ Auto-Encoder ] --> [ Recurrent-Convolutional ] --> Twitch stream
I'm not entirely certain which network design to go for. I feel like having non-unique streamers is a good way to get a lawsuit, and not having control over the network prevents it from reaching its full utility. Ultimately, the design that makes the most sense to me would be the last, an autoencoder that builds both position and appearance, but that will probably produce results that need light manual grooming before they can be passed into the recurrent network.
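To show what "position and appearance specifying data" could look like in practice, here's a very rough sketch of just the decoder half of such an autoencoder: it takes a position code and an appearance code, glues them together, and turns them into a frame. None of this is the actual design. The code sizes, the single dense layer, the untrained weights, and the name `decode` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

POSITION_DIM = 8       # hypothetical size of the pose / expression code
APPEARANCE_DIM = 16    # hypothetical size of the "which streamer" code
IMG_H, IMG_W = 45, 80  # assumed greyscale frame size

# Untrained decoder weights; training the autoencoder would set these.
w = rng.normal(0.0, 0.02, size=(POSITION_DIM + APPEARANCE_DIM, IMG_H * IMG_W))
b = np.zeros(IMG_H * IMG_W)

def decode(position, appearance):
    """Turn (position, appearance) codes into frames with pixels in (0, 1)."""
    code = np.concatenate([position, appearance], axis=-1)
    pixels = 1.0 / (1.0 + np.exp(-(code @ w + b)))  # sigmoid
    return pixels.reshape(-1, IMG_H, IMG_W)

# One fixed appearance (the same streamer) shown in five different positions.
appearance = rng.normal(size=(1, APPEARANCE_DIM))
positions = rng.normal(size=(5, POSITION_DIM))
frames = decode(positions, np.repeat(appearance, 5, axis=0))
print(frames.shape)  # (5, 45, 80)
```

The idea is that holding the appearance code still while moving the position code gives the control the plain GAN lacks, and those frames are what would get handed to the recurrent network.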
Who's Ready To Make Some Science?:
Of course, I am a data-scientist, and I'm not going to make a decision without running tests and analyzing the results :P So below you can see the results of the first steps to the three proposed designs. Note none of them are trained all the way through, because that would take a lot of time, ~2/3rds of which would be wasted.
GAN (Dense & Undertrained):
Specific Auto-Encoder (Color):
I haven't done this one, I'm reading a paper on regularization first.
Other GAN Related Things To Read:
Generative Adversarial Networks paper (hosted on arXiv, which is run by Cornell University)
A Bunch of GAN in Tensorflow Examples