And just like that, out of the blue, Google drops its latest AI tool: Lumiere. Lumiere is, at its core, a text-to-video AI model: you type in text and the neural networks translate that into video. But as you'll see, Lumiere is a lot more than just text-to-video. It lets you animate existing images, creating video in the style of that image or painting, as well as do things like video inpainting and animating specific sections within images. So let's look at what it can do, the science behind it (Google published a paper explaining what they improved), and I'll also show you why the artificial "brains" that generate these videos are much weirder than you can imagine. This is Lumiere from Google Research, a space-time diffusion model for realistic video generation. We'll cover what a space-time diffusion model is a bit later, but right now, this is what they're unveiling.
First of all, there's text-to-video. These are videos produced from various prompts, like "US flag waving on massive sunrise clouds," "funny cute pug dog feeling good listening to music with big headphones and swinging head," "snowboarding Jack Russell Terrier," and so on. I've got to say, these are looking pretty good. If these are good representations of the sort of results we can get from this model, this would be very interesting. For example, take a look at this one: an astronaut on the planet Mars making a detour around his base. It's looking very consistent. This one looks like a medicine tablet of some sort floating in space, but again, everything is looking very consistent, which is what they're promising in their research. It looks like they've found a way to create a more consistent shot across different frames, or temporal consistency as they call it. Here's image-to-video. This one is a bit nightmarish, but that's the only scary-looking one; everything else looks really good. They're taking images and turning them into little animations: a bear walking in New York, for example, or Bigfoot walking through the woods. These started with an image that then gets animated, and they're looking pretty good. Here are the Pillars of Creation animated, which is pretty neat, kind of a 3D structure. They're also showing stylized generation.
This uses a target image to make something colorful or animated. Take a look at this elephant right here. One thing that jumps out at me is how consistent it is; there's no weirdness going on. In a second we'll take a look at other leading AI models that generate video, and I've got to say this one is probably the smoothest looking. Here's another example: here's the style reference image, so they want this style, and then they prompt "a bear twirling with delight," for example, and it creates a bear twirling with delight, or a dolphin leaping out of the water, in the style of that image. Here are the same or similar prompts with a different style reference, and I've got to say it captures the style pretty well; here's that kind of neon, phosphorescent glowing look. They also introduce a space-time U-Net architecture, and we'll look at that towards the end of the video, but basically it sounds like it forms an idea of the entire video at once. While other models seem to go frame by frame, this one has a sense of what the whole thing is going to look like from the very beginning. And there's video stylization: here's a lady running, this is the source video, and here are the various crazy things you can turn her into, and the same thing with a dog, a car, and a bear.
Cinemagraphs are the ability to animate only certain portions of an image, like the smoke coming out of this train. This is something that Runway ML, I believe, recently released, and it looks like Google is hot on their heels with basically the same capability. Then we have video inpainting: if a portion of an image is missing, you can use AI to guess at what that region should look like. The part where the hand comes in is very interesting, because it seems kind of advanced. Notice that in the beginning he throws the green leaf into the missing portion of the image, and then you see him come back into the visible part of the frame throwing a green leaf or two, so the model assumes that whatever is in the missing region will also be green leaves. Interestingly, though, I do feel like I can spot a mistake here: the leaves it fills in are fresh-looking, as opposed to the cooked ones on this side. It knows to put in green leaves as the guy is throwing them, matching the fresh leaves here, but it misses the point that these are cooked leaves and those are fresh. Still, it's very impressive that it's able to guess at what's happening in that moment.
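Nothing from the Lumiere paper is quoted below, but the general shape of mask-conditioned generation, which covers both inpainting and cinemagraphs, is easy to picture. Here is a minimal sketch, assuming a hypothetical `denoise_video` stand-in for a real video diffusion model: you hand the model the known pixels plus a binary mask marking the missing (or, for a cinemagraph, the allowed-to-move) region, and only the masked area gets filled in.

```python
import numpy as np

# Toy video: 16 frames, 64x64 RGB, values in [0, 1].
T, H, W, C = 16, 64, 64, 3
video = np.random.rand(T, H, W, C).astype(np.float32)

# Binary mask: 1 = keep the original pixels, 0 = region the model must fill in
# (for a cinemagraph, 0 would instead mark the region that is allowed to move).
mask = np.ones((T, H, W, 1), dtype=np.float32)
mask[:, 20:44, 20:44, :] = 0.0          # hypothetical missing square

# Conditioning signal: the masked video plus the mask itself as an extra channel,
# a common way to tell a generative model which pixels are trustworthy.
masked_video = video * mask
conditioning = np.concatenate([masked_video, mask], axis=-1)  # (T, H, W, 4)

def denoise_video(noisy, cond):
    """Hypothetical stand-in for a video diffusion model's denoiser."""
    return noisy * 0.5 + cond[..., :3] * 0.5   # placeholder arithmetic only

# After generation, paste the known pixels back so only the masked area changes.
generated = denoise_video(np.random.rand(T, H, W, C).astype(np.float32), conditioning)
result = video * mask + generated * (1.0 - mask)
print(result.shape)  # (16, 64, 64, 3)
```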
And this is where, if you've been following some of the latest AI research, these neural nets get a little bit weird. We'll come back to that at the end, but how they're able to predict things like what happens here is strange: no one codes it to know that this is probably a cake of some sort, nobody tells it what this thing is, it guesses from clues that it sees on screen. How it does that is really, really weird, let's just say. This is pretty impressive.
Here we're able to change the clothes the person is wearing throughout these shots; notice the hat and the face remain consistent across all the shots, whereas the dress changes based on a text prompt. As you watch this, think about where video production for movies, serial TV shows, and so on is going to be in 5 to 10 years. Will something like this let everyday people sitting at home create stunning Hollywood-style movies with whatever characters and settings they want? With AI-generated video and AI voices, we could create a movie starring Hugh Hefner as a chicken, for example.
Really fast: this is another study, called "Beyond Surface Statistics," out of Harvard. It has nothing to do with the Google project we're looking at, but this paper tries to answer the question of how these models create images and videos. As it says here, these models are capable of synthesizing high-quality images, but it remains a mystery how these networks transform, let's say, the phrase "car in the street" into a picture of a car in a street. In other words, when a human says "draw a picture of a car in a street," or "a video of a car in a street," how does the model do it? How does it translate that into a picture? Do these models simply memorize superficial correlations between pixel values and words, or are they learning something deeper, such as an underlying model of objects like cars and roads and how they are typically positioned? There's a bit of an argument going on in the scientific community about this. Some AI scientists say it's all just surface-level statistics: the models are memorizing where the pixels go and can reproduce certain images, and that's it. Others say no, there's something deeper going on, something new and surprising that these AI models are doing. So what the researchers did is take a model that had been fed nothing but 2D images, images of cars and people and ships and so on, but that was never taught anything about depth, like where the foreground or the background of an image is; it wasn't taught what the focus of the image is, what a car is, anything like that.
And here's what they found. This is the decoded image as it's built up, from step one to, finally, step fifteen, where, as you can see, this is a car. A human being could point at this and say, "that's a car." If you ask what in the image is closest to you, the person taking the picture, you'd say, well, probably this wheel is closest; this is the foreground and the main object, and that back there is the background, far away, while this is close. But the reason you can look at this image and know all that is because you've seen these objects in the real world, in 3D; you can probably imagine how this scene would look if you were standing off to the side, viewing it from another direction. The AI model that made this has no idea about any of that. All it has ever seen is a bunch of 2D images, just pixels arranged on a screen. And yet, when we dive in and try to understand how it builds these images from scratch, this is what we start to notice. Early on, while it's building this image, this is what the depth of the image looks like: very early on it knows that this thing is in the foreground, closer to us, and this right here, the blue, is the background, far from us. Now, looking at this early image, you can't possibly tell what it's going to be; you can't tell until much, much later. Maybe here we can begin to see some of the lines, you can see the wheels and maybe guess what it is, but at the beginning you have no idea. And yet the model knows that something right here is in the foreground and something is in the background, and towards the end it knows this is close and this is far. There's also the salient object, meaning the focus, the main object: it knows the main object is here. It doesn't know what a car is, it doesn't know what an object is, it just knows that this is the focus of the image. Only much later do we realize that, yes, in fact, this is a car.
And so this is the conclusion of the paper: their experiments provide evidence that the Stable Diffusion model (an image-generating AI), although trained solely on two-dimensional images, contains an internal linear representation related to scene geometry. In other words, after seeing thousands or millions of 2D images, it seems like, inside its neural network, inside its "brain," it develops something that lets it build a 3D-like representation of the scene it's drawing, even though it's never been taught what 3D means; and again, a lot of people dispute this, but some of this research makes it look that way. It picks out a salient object, the main central object it needs to focus on, versus the background of the image, as well as information related to relative depth, and these representations emerge early: before it starts painting the colors or the little shapes, the wheels and the shadows, it first lays out the 3D space on which it's going to paint the image. And here they say these results add nuance to the ongoing debates (and there are a lot of ongoing debates about this) about whether generative models can learn more than just surface statistics. In other words, is there some sort of understanding going on, maybe not human-like understanding, but is it just statistics, or is there something deeper happening?
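To make the probing idea concrete, here is a minimal sketch (not the paper's actual code) of a linear probe: you collect a diffusion model's intermediate activations for a set of images, together with per-pixel labels such as foreground versus background, and fit a simple linear classifier on top. If that linear readout predicts saliency or depth well, the information was already sitting in the activations. The feature array below is random placeholder data standing in for real activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder "internal activations": N pixels, each with a D-dimensional
# feature vector taken from some intermediate layer of a diffusion model.
rng = np.random.default_rng(0)
N, D = 5000, 64
features = rng.normal(size=(N, D))

# Placeholder per-pixel labels: 1 = salient object, 0 = background.
# In the real study these come from an off-the-shelf saliency/depth estimator.
labels = (features[:, :3].sum(axis=1) > 0).astype(int)  # synthetic signal

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# The probe itself is deliberately simple: a linear classifier. If it scores
# well, the property is "linearly decodable" from the internal representation.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```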
And this is Runway ML, another of the leading text-to-video AI models; you might have seen the clips. As you can see, this is what they're offering, and people have made full movies with it, maybe not hour-long, but 10- or 20-minute movies that are entirely AI-generated. It's similar to what Google is offering, although I've got to say, after looking at Google's work and then this one, Google's does seem just a little bit more consistent; there seems to be less shifting of shapes going on, it's just a little more consistent across time. They have a lot of the same features, like stylization from a reference video using an image as the style reference. But the interesting thing is that in the last few months, around December 2023, Runway ML introduced something they call General World Models.
Saying we believe the next major advancement in AI will come from systems that understand the visual world and its Dynamics they’re starting a long-term research effort around what they call General World models so their whole idea is that instead of the video AI models creating little Clips here and there
With little isolated subjects and movements that a better approach would be to actually use the neural networks and them building some sort of a world model to understand the images they’re making and to actually utilize that to have it almost create like a little world so for example if you’re creating
A clip with multiple characters talking then the AI model would actually almost simulate that entire world with the with the rooms and the people and then the people would talk talk to each other and it would just take that clip but it would basically create much more than
Just a clip like if a bird is flying across the sky it would be simulating the wind and the physics and all that stuff to try to capture the movement of that bird to create realistic images and video so they’re saying a world model is an AI system that builds an internal
Representation of an environment and it uses it to simulate future events within that environment so for example for Gen 2 which is their model their video model to generate realistic short video it has developed some understanding of physics and motion card still very limited struggling of complex camera controls or
Object motions amongst other things but they believe and a lot of other researchers as well that this is sort of the next step for us to get better at creating video at teaching robots how to behave in the physical world like for example the nvidia’s foundation agent
Then we need to create bigger models that simulate entire worlds and then from those worlds they pull out what we need whether that’s an image or text or a robot’s ability to open doors and pick up objects all right but now back to Lumiere A Spacetime diffusion model for
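That definition of a world model (an internal state of an environment plus the ability to simulate future events within it) can be made concrete with a toy sketch. Everything below is hypothetical, a ball moving under gravity, not Runway's actual system; it only shows the "internal state, step, rollout" shape of the idea.

```python
import numpy as np

# A world model in the most stripped-down sense: something that holds an
# internal state of an environment and can simulate what happens next.
class ToyWorldModel:
    GRAVITY = np.array([0.0, -9.8])

    def __init__(self, position, velocity):
        self.position = np.asarray(position, dtype=float)
        self.velocity = np.asarray(velocity, dtype=float)

    def step(self, dt: float = 0.1) -> None:
        """Advance the internal state by one time step."""
        self.velocity += self.GRAVITY * dt
        self.position += self.velocity * dt

    def rollout(self, n_steps: int) -> np.ndarray:
        """Simulate future events and return the predicted trajectory."""
        trajectory = []
        for _ in range(n_steps):
            self.step()
            trajectory.append(self.position.copy())
        return np.stack(trajectory)

# Simulate future events within the environment, then read out whatever slice
# of the simulation you actually need (here, just the positions over time).
model = ToyWorldModel(position=[0.0, 10.0], velocity=[2.0, 0.0])
print(model.rollout(5))
```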
All right, but now back to Lumiere, a space-time diffusion model for video generation. Here they have a number of examples of text-to-video, image-to-video, stylized generation, and so on. With Lumiere they're trying to build a text-to-video diffusion model that can create videos portraying realistic, diverse, and coherent motion, which they call a pivotal challenge in video synthesis. The new thing they introduce is the space-time U-Net architecture, which generates the entire temporal duration of the video at once; in other words, it thinks through what the whole video is going to look like from the start. That's in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution, basically meaning they work in stages: they generate a few far-apart frames first and then fill in the ones in between. They're saying that approach makes global temporal consistency difficult, meaning an object might look a certain way in the first second of the video but by second five it looks completely different.
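To make that contrast concrete, here is a toy sketch of the cascaded approach being criticized, with plain linear interpolation standing in for the learned temporal super-resolution stage (a real system uses a neural network there, and Lumiere instead generates the whole clip jointly rather than in two stages):

```python
import numpy as np

T_KEY, H, W, C = 5, 32, 32, 3   # a few distant keyframes
UPSAMPLE = 4                    # in-between frames to fill per gap

# Stage 1 (stand-in): pretend a base model produced these distant keyframes.
keyframes = np.random.rand(T_KEY, H, W, C).astype(np.float32)

# Stage 2 (stand-in): "temporal super-resolution" as naive linear interpolation.
# A learned model would hallucinate plausible motion here; interpolation just
# cross-fades, which is why consistency between keyframes can drift.
def temporal_upsample(frames: np.ndarray, factor: int) -> np.ndarray:
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for i in range(factor):
            t = i / factor
            out.append((1 - t) * a + t * b)
    out.append(frames[-1])
    return np.stack(out)

full_clip = temporal_upsample(keyframes, UPSAMPLE)
print(full_clip.shape)  # (17, 32, 32, 3): (T_KEY - 1) * UPSAMPLE + 1 frames
```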
Here they compare ImagenVideo against theirs (Lumiere): they sample a few clips and look at an X-T slice. You can think of an X-T slice a bit like a stock chart, where you plot the price of a stock over time. Here X is the spatial dimension, where things sit across the width of the image, and T is the temporal dimension, time. So you take one line of the image, this green line, and track that same line of pixels across every frame of the video; the result shows you how consistent that slice of the scene stays over time.
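As a quick illustration (not from the paper), extracting an X-T slice from a video tensor is just a matter of fixing one row and keeping all the frames:

```python
import numpy as np

# Toy video: 30 frames of 128x128 RGB.
T, H, W, C = 30, 128, 128, 3
video = np.random.rand(T, H, W, C)

# Fix a single horizontal line of the image (the "green line" in the figure)
# and keep every frame: the result has one spatial axis (width) and one time axis.
row = H // 2
xt_slice = video[:, row, :, :]     # shape (T, W, C)

# Viewed as an image, smooth diagonal streaks mean objects moving coherently;
# jittery or broken streaks mean the content is flickering between frames.
print(xt_slice.shape)  # (30, 128, 3)
```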
Looking at the slices, the ImagenVideo one starts out going pretty well, then it messes up and gets a bit crazy, and then goes back to doing okay, whereas the Lumiere one is pretty good throughout, with maybe some funkiness in one frame. Same thing here: Lumiere is pretty good, maybe a little funkiness, but overall very good, whereas in the ImagenVideo clip there's a lot of nonsense happening; you can't tell how many legs the animal has or whether it's missing one. In the Lumiere clip, I feel like you can see each of the legs pretty distinctly, and their positions remain consistent across time, or at least remain consistently easy to see. I've got to say, I can't wait to get my hands on it. As of right now I don't see a way to access it; this is just a preview, but hopefully they'll open it up for testing soon and we'll be able to check it out.
Interestingly enough, they also compare how well their model performs against the other state-of-the-art models in the industry. The two I'm familiar with are Pika and Gen-2; those are the two I've used. They're saying their video is preferred by users in both text-to-video and image-to-video generation. Blue is theirs and orange is the baseline, and there seem to be pretty big differences on every single measure. On video quality it beats every other model, and on text alignment, which probably means how faithful the result is to the prompt (if you type in a prompt, how accurately the video represents it), ImagenVideo looks like the closest competitor, but Lumiere still beats most of the others by quite a bit. On video quality for image-to-video it beats them as well, with Gen-2 probably the next best. And here they provide a side-by-side comparison.
For example, the first prompt is "a sheep to the right of a wine glass." Pika is not great, because there's no wine glass. Gen-2 consistently puts it on the left. AnimateDiff just has two glasses and maybe a reflection of a sheep. ImagenVideo does the same thing, with the glass on the left. ZeroScope has no glasses that I can see, although it does have sheep. And of course theirs, Lumiere, the Google one, seems to nail it every single time: the glass is on the right. Although I've got to say, Gen-2 is great apart from confusing left and right, and the same goes for ImagenVideo, actually. I feel like Gen-2's sheep is much better quality; that's a good-looking sheep. I should probably rephrase that: that's a well-rendered sheep. Versus ImagenVideo, where that's a weird-looking thing that could almost be a horse or a cow if you just look at the face, and Google is again excellent. Here's "teddy bear skating in Times Square": this is Google, this is ImagenVideo, with weirdness happening again, and that's Gen-2, again pretty good, though the bear is facing away. I also just noticed that some models took "skating" to mean ice skates, whereas here it looks like roller skates or a skateboard. It looks like in the study they simply showed people two clips and asked which they liked more, based on motion and quality. Well, if you're an aspiring AI cinematographer, this is really good news: consistent, coherent images that can produce near-lifelike scenes at this point.
I'm sure some people will complain about this or that, but you've got to realize how quickly this stuff is progressing. Just to give you an idea, this is roughly what AI-generated video looked like about a year ago; can you tell that it's improved just a little bit? I'm not sure exactly when this was made, but I'd say a year or a year and a half ago, and this thing gets nightmarish: weird blocky shapes, things not being consistent across scenes. What are we even looking at here, is this a mouth, is this a building? And here's something from about four months ago from Pika Labs. As you can see, it's much better, much more consistent. The humans maybe still look a little weird, but it's better; it can put you in the moment. If you're telling a story that isn't necessarily about everything looking realistic, something like this can be created pretty easily, and since it's new and novel, this might become a whole new movement, a new genre of filmmaking that's exciting and never before seen, and, most importantly, easy to create at home with a few AI tools.
Anybody out there with the creative talent to tell the stories in their mind, without being limited financially by capital, will be able to create AI voices and AI footage, and maybe even have ChatGPT help with some of the story writing. And what's more, the next generation of things people are working on are simulations where you create the characters and then let them loose in a world: they get simulated, the stories play out in that world, and then you pick and choose what to focus on, which scenes and which characters to bring to the front. You basically act as the world builder: you build the worlds, the characters, the narratives, and AI assists you in creating the visuals, the voices, and so on. You can be 100% in control of it, or control only the things you want and let the AI generate the rest. So to me, if you're interested in moviemaking and you like these sorts of styles (which, by the way, will quickly become much more realistic), I would be looking at this right now, because right now is the moment it's emerging into the world and getting really good; by next year it's going to be a lot better. Well, my name is Wes Roth, and thank you for watching.