_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
   URI   The state of AI for hand-drawn animation inbetweening
       
       
        brcmthrowaway wrote 1 day ago:
        If an AI could ever capture the charm of the original hand drawn
        animation, then it's over for us
       
          yosefk wrote 1 day ago:
          If an AI can't make animators more productive, it's as close to over
          for hand drawn animation as it has been for the last decade
       
        meindnoch wrote 1 day ago:
        Convert frames to SVG, pass SVG as text to ChatGPT, then ask for the
        SVG of in-between frames. Simple as.
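
         (Taken literally, that pipeline is only a few lines; here is a sketch,
         assuming potrace is installed and ask_llm is a stand-in for whichever
         chat API you would call - the prompt and the helper names are made up.)

           import subprocess

           def frame_to_svg(pbm_path, svg_path):
               # potrace vectorizes a bitmap; -s requests SVG output
               subprocess.run(["potrace", "-s", pbm_path, "-o", svg_path],
                              check=True)

           def ask_for_inbetween(svg_a_path, svg_b_path, ask_llm):
               # ask_llm(prompt) -> str is a placeholder for any chat client
               prompt = (
                   "Here are two SVG key frames of a 2D character.\n"
                   "FRAME A:\n" + open(svg_a_path).read() + "\n"
                   "FRAME B:\n" + open(svg_b_path).read() + "\n"
                   "Return only the SVG of a plausible in-between frame."
               )
               return ask_llm(prompt)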
       
        Solvency wrote 1 day ago:
        i don't get it. we essentially figured this out in the first Matrix by
        having a bunch of cameras film an actor and then used interpolation to
        create a 360 shot from it.
        
        why can't this basic idea be applied to simple 2d animation over two
        decades later?
       
          criddell wrote 1 day ago:
          What was interpolated in the Matrix? I was under the impression they
          were creating 1 second of 24 fps video by combining images shot on 24
          individual cameras.
       
        bschmidt1 wrote 1 day ago:
        Really cool read, I liked seeing all the examples.
        
         I wonder if it would be beneficial to train on lots of static views of
         the character too - not just the frames - so that permanent features
         like the face get learned as a chunk of adjacent pixels. Then when you
         go to make a walking animation, the relatively low amount of training
         data on moving legs, compared to the high repetition of faces, would
         cause only the legs to blur unpredictably while the faces stay more
         intact - the overall result might be a clearer-looking animation.
       
          yosefk wrote 1 day ago:
          Almost certainly a good idea. I'm about to start trying things in
          this direction
       
        TiredGuy wrote 1 day ago:
         The state of the art in 3D pose estimation and pose transfer from video
         seems to be pretty accurate. I wonder if another approach might be to
         infer a wireframe model for the character, then tween that instead of
         the character itself. It would be like the vector approach described in
         the article but with far fewer vertices; then, once you have the tween,
         use something similar to pose transfer to map the character's most
         recent frame onto the pose.
        
         Training on a wireframe model seems like it would be easier, since
         there are plenty of wireframe animations out there (at least for
         humans): you could take them, remove the in-between frames, and try
         to infer them.
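
         (To make the skeleton-tween step concrete, here is a minimal sketch in
         Python: easing 2D joint positions between two key poses. The joint
         count, the ease function and the array shapes are illustrative
         assumptions; the pose-estimation and pose-transfer stages are assumed
         to exist elsewhere.)

           import numpy as np

           def ease_in_out(t):
               # smoothstep: slow-in / slow-out instead of constant speed
               return t * t * (3.0 - 2.0 * t)

           def tween_pose(pose_a, pose_b, t):
               # pose_a, pose_b: (num_joints, 2) arrays of 2D joint positions
               s = ease_in_out(t)
               return (1.0 - s) * pose_a + s * pose_b

           # e.g. 3 in-betweens for a 5-joint stick figure
           pose_a = np.random.rand(5, 2)
           pose_b = np.random.rand(5, 2)
           inbetweens = [tween_pose(pose_a, pose_b, t) for t in (0.25, 0.5, 0.75)]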
       
        wantsanagent wrote 1 day ago:
         Frankly I'm surprised this isn't much higher quality. The hard thing in
         the transformer era of ML is getting enough data that fits the
         next-token or masked-language-modeling paradigm; in this case,
         inbetweening is exactly that kind of task, and every hand-drawn
         animation in history is potential training data.
        
        I'm not surprised that using off the shelf diffusion models or
        multi-modal transformer models trained primarily on still images would
        lead to this level of quality, but I am surprised if these results are
        from models trained specifically for this task on large amounts of
        animation data.
       
          yosefk wrote 1 day ago:
          They're indeed not diffusion models, though they are trained on
          animation data as well as specifically designed for it (the raster
          papers at least.) I'm very hopeful wrt diffusion, though I'm looking
          at it and it's far from straightforward.
          
          One problem with diffusion and video is that diffusion training is
          data hungry and video data is big. A lot of approaches you see have
          some way to tackle this at their core.
          
          But also, AI today is like 80s PCs in some sense: both clearly the
          way of the future and clumsy/goofy, especially when juxtaposed with
          the triumphalism you tend to hear all around
       
        atseajournal wrote 1 day ago:
        Animation has to be the most intriguing hobby I'm never planning on
        engaging with, so this kind of blog post is great for me.
        
        I know hand-drawn 2D is its own beast, but what's your thought on using
        3D datasets for handling the occlusion problem? There's so much
        motion-capture data out there -- obviously almost none of it has the
        punchiness and appeal of hand-drawn 2D, but feels like there could be
        something there. I haven't done any temporally-consistent image gen,
        just playing around with StableDiffusion for stills, but the
        ControlNets that make use of OpenPose are decent.
        
        3D is on my mind here because the Spiderverse movies seemed like the
        first demonstration of how to really blend the two styles. I know they
        did some bespoke ML to help their animators out by adding those little
         crease-lines to a face as someone smiles... pretty sure they were
         generating 3D splines, though, not raster data.
        
        Anyway, I'm saving the RSS feed, hope to hear more about this in the
        future!
       
          wongarsu wrote 1 day ago:
           Maybe there is also value in 2D datasets that aren't hand drawn. A
           lot of TV shows are made in Toon Boom or Adobe Animate (formerly
           Macromedia Flash). Those also do automatic inbetweening, but with a
           process that's closer to CSS animations: everything you want to move
           independently is its own vector object that can be moved, rotated
           and squished, and the software just interpolates the frames in
           between with your desired easing algorithm. That's a lot of data
           that's available in those original project files and nontrivial to
           infer from the final product.
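
           (Roughly what that kind of project file encodes, as a sketch: a
           per-layer transform at each key frame, interpolated with an easing
           curve. The Transform fields and the cubic ease below are
           illustrative, not Toon Boom's or Animate's actual data model.)

             from dataclasses import dataclass

             @dataclass
             class Transform:
                 x: float
                 y: float
                 rotation: float
                 scale: float

             def ease_out_cubic(t):
                 # decelerating ease, one of the stock easing curves
                 return 1.0 - (1.0 - t) ** 3

             def tween(a: Transform, b: Transform, t: float) -> Transform:
                 # interpolate each channel independently, like a classic tween
                 s = ease_out_cubic(t)
                 lerp = lambda p, q: p + (q - p) * s
                 return Transform(lerp(a.x, b.x), lerp(a.y, b.y),
                                  lerp(a.rotation, b.rotation),
                                  lerp(a.scale, b.scale))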
       
            yosefk wrote 1 day ago:
            I doubt you can learn much out of tweened flat cutouts beyond
            fitting polynomials to data points. The difficulty with full
            animation is rotation & deformation you can't do at all with
            cutouts. (Puppet warp/DUIK cutouts are much less rigid than Flash
            but the above still applies)
       
          yosefk wrote 1 day ago:
          The 2nd paper actually uses a 3D dataset, though it explicitly
          doesn't attempt to handle occlusion beyond detecting it.
          
          I sort of hope you can handle occlusion based on learning 2D training
          data similarly to the video interpolation paper cited at the end. If
          3D is necessary, it's Not Good for 2D animation...
          
          AI for 3D animation is big in its own right; these puppets have 1
          billion controllers and are not easy for humans to animate. I didn't
          look into it deeply because I like 2D more. (I learned 3D modeling
          and animation a bit, just to learn that I don't really like it...)
       
        empath-nirvana wrote 1 day ago:
        The results are actually shockingly bad, considering that I think this
        should be _easier_ than producing a realistic image from scratch, which
        ai does quite well.
        
         I don't have more than a fuzzy idea of how to implement this, but it
         seems to me that key frames _should_ be interchangeable with in-between
         frames, so you could train it such that if you start with key frames
         and generate in-between frames, then run those in-between frames back
         through the ai, it regenerates the key frames.
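
         (One way to phrase that as a training objective is a cycle-consistency
         term, sketched below with a placeholder inbetween() model and plain
         numpy; this is just one reading of the idea, not something from the
         article or a specific paper.)

           import numpy as np

           def cycle_loss(k0, k2, k4, inbetween):
               # inbetween(a, b) -> predicted frame halfway between a and b
               # (placeholder for whatever model is being trained)
               i1 = inbetween(k0, k2)      # in-between of the first key pair
               i3 = inbetween(k2, k4)      # in-between of the second key pair
               k2_rec = inbetween(i1, i3)  # the in-betweens should bracket k2
               return np.mean((k2_rec - k2) ** 2)

         Trained on triples of consecutive real frames, this would be a
         self-consistency signal added to whatever reconstruction loss the
         model already uses.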
       
          the8472 wrote 1 day ago:
          Animation is much lower framerate than live video, motion can be
          extremely exaggerated and the underlying shape can depend on the
          view, i.e. be non-euclidean.
          Additionally there are fewer high-frequency features (think leopard
          spots) that can be cues about how the global shape moves (leopard
          outline).
          And of course things are drawn by humans, not captured by cameras,
          which means animation errors will be pervasive throughout the
          training data.
          
          These things combined mean less information to learn a more difficult
          world model.
       
          gertlex wrote 1 day ago:
           I only scrolled through the article, reading snippets and looking at
           pictures, but the pictures of yoga moves were what made me think
           "this is hard". Specifically, interpolating from a leg that's
           visible and extended to a leg that is obscured/behind other limbs...
           it will be impressive/magical when the AI correctly distinguishes
           between possibilities like "this thing should fade/vanish" and "this
           thing should fold and move behind/be obscured by other parts of the
           image".
       
          6gvONxR4sf7o wrote 1 day ago:
          > I think this should be _easier_ than producing a realistic image
          from scratch
          
          Think of this in terms of constraints. An image from scratch has self
          consistency constraints (this part of the image has to be consistent
          with that part) and it may have semantic constraints (if it has to
           match a prompt). An animation has the same self-consistency
           constraints, but also has to be consistent with entire other images!
          The fact that the images are close in some semantic space helps, but
          all the tiny details become so important to get precisely correct in
          a new way.
          
          Like, if a model has some weird gap where it knows how to make an arm
          at 45 degrees and 60 degrees, but not 47, then that's fine for
          from-scratch generation. It'll just make one like it knows how (or
          more precisely, like it models as naturally likely). Same with any
          other weird quirks of what it thinks is good (naturally likely): It
          can just adjust to something that still matches the semantics but
          fits into the model's quirks. No such luck when now you need to get
          details like "47 degrees" correct. It's just a little harder without
          some training or modeling insight into how an arm at 45 degrees and
          47 degrees are really "basically the same" (or just that much more
          data, so that you lose the weird bumps in the likelihood).
          
          I wouldn't be surprised if "just that much more data" ends up being
          the answer, given the volume of video data on the internet, and the
          wide applicability of video generation (and hence intense research in
          the area).
       
          BlueTemplar wrote 1 day ago:
           Same, I would have thought that edge detection would have been among
           the first problems to get solved!
       
          yosefk wrote 1 day ago:
           It's counterintuitive, but less so considering that it's way easier
           for a human, too, to draw something from scratch than to inbetween 2
           key frames!
          
          (I guess we're used to machines and people struggling at opposite
          things so this is counter counter intuitive, or something...)
          
           Animation key frames are not interchangeable with inbetween frames,
           since the former try to show as many body parts as possible in
           "extreme" positions, though it's not always possible for all parts
           due to so-called overlapping action. This is not to say you can't
           generate plausible "extremes" from inbetweens; acting-wise, key
           frames definitely have the most weight.
          
          AI being good at stills is true, though it takes a lot of prompting
          and cherry picking quite often; most results I get out of naively
          prompting the most famous models are outright terrifying.
       
        aredox wrote 1 day ago:
         Now that's a great use for AI! Inbetweening has always been a
         thankless job, usually outsourced to sweatshop-like studios in Vietnam
         and - even recently - North Korea (Guy Delisle's "Pyongyang" comic is
         about his experience as a studio liaison there).
         
         And AI has less room to hallucinate - it is more a kind of
         interpolation - even if in this short example, the AI still
         "morphs" instead of cleanly transitioning.
        
        The real animation work and talent is in keyframes, not in the
        inbetweening.
       
          commieneko wrote 1 day ago:
           As an animator for 40-plus years, I can tell you that in-betweening
           is a very difficult job. The fact that it's often cheaply outsourced
           says more about the people paying for the animation simply not
           caring about the quality. The results are seldom good.
          
           How much poor-quality in-betweening hurts the performance for the
           audience is a complicated discussion. Animation that is _very_ bad
           can often be well accepted if other factors compensate (voice
           acting, design, direction, etc.)
          
          A good in-betweener is not simply interpolating between the keys. For
          hand drawn animation at least, there's a lot more going on than that.
          
           We'll leave out any discussion of breakdowns here. For one, it's a
           difficult concept to explain, much more difficult than 'tweening.
           The other reason is that different animators will give different
           opinions on what a breakdown is or does.
          
          I will say, though, I think that properly tagged breakdown drawings
          could significantly improve the performance of ai generated
          in-betweens.
          
           Anyone who is seriously interested in the process should read the
           late, great Richard Williams' book, _The Animator's Survival Kit_.
          This is especially true for those who want to "augment" the process
          with machine learning. The book is very readable, even for
          non-artists. And he gets into the nitty gritty of what makes a good
          performance, and the mechanics behind it.
          
          Edit: Another good resource, and relevant to 3D animation as well, is
          Raf Anzovin's _Just To Do Something Bad_ blog. He has many posts on
          what he calls "ephemeral rigging" that are absolutely fascinating. Be
           aware that the information is diffused throughout the blog and not
          presented in a form for teaching. His opinions are fairly
          controversial in the field. But I think he is onto something. ( [1] )
          
   URI    [1]: https://www.justtodosomethingbad.com/
       
            yosefk wrote 1 day ago:
            Post author here - would be very interesting to hear more of your
            thoughts on this! It's not easy to find a pro animator willing to
            consider the question given the current level reached by AI methods
       
              commieneko wrote 1 day ago:
               I would suggest reading the Williams book as a place to start.
               Thomas and Johnston's _The Illusion of Life_ is also a must.
               Thomas and Johnston were two of Disney's Nine Old Men. The
               Preston Blair book, simply called _Animation_, is good.
              
              The thing about animation is that it is not about interpolation.
              It's about the spacing between drawings. The methods developed by
              animators were not at all mathematical, but something that they
              felt out by experimentation (trial and error).
              
               The math that does enter into it is directly related to the
               frame rates. If animation had started in modern times, with frame
              rates of 30 fps or 60 fps, it would have been a very different
              animal. And much harder!
              
               At 12 fps or 24 fps you have a very limited range of "eases" that
               can be done. So while eases do figure into it, it's the arcs, the
               articulation, and the perceived mass of the parts of the
               character that make it seem alive. Looking only at the contours
               and the in-betweens misses all the action.
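
               (To make the "limited range of eases" point concrete: at 24 fps
               a half-second move on ones is only 12 drawings, so a spacing
               chart is just a dozen fractions. A minimal sketch, with made-up
               quadratic eases rather than anything from this thread:)

                 def spacing_chart(frames, ease):
                     # fraction of the move completed at each drawing
                     return [round(ease(i / (frames - 1)), 2)
                             for i in range(frames)]

                 slow_out = lambda t: t * t            # bunched near the start
                 slow_in = lambda t: 1 - (1 - t) ** 2  # bunched near the end

                 # a half-second move on ones at 24 fps = 12 drawings
                 print(spacing_chart(12, slow_out))
                 print(spacing_chart(12, slow_in))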
              
               An awareness of the graphic nature of the drawings and the
               stylization of figures and faces is also critical. Cartooning
              is its own artform and it is tied directly to the way human
              brains make sense of what the eye sees. Getting more realistic
              often takes you further from your destination.
              
              Storytelling is also a core part of good animation. Making a
               character seem to think and react, like it is alive, can be done
               by a good animator. But you won't get there by imitating the real
              world directly. Rotoscoping has very limited use in good
              character animation and storytelling. It's all about abstracting
              out what the brain feels is important and what it expects. You
              can get away with murder if you caricature the right details.
              
              When I've worked with training new animators, one of the points I
              stress is that it is articulation and the perceived mass of the
              character that really sells a performance. The best art style in
              the world is nearly useless if the viewer doesn't buy into the
              notion that they are watching a thinking person reacting with a
              physical body to events in an interactive world.
              
               My feeling is that you will get further if you build articulated
               rigs and teach the ai to make them move, 2D or 3D. There is
               footage of tiny AI-driven robots in a Google experiment that are
               learning to play soccer. The ai is learning to make them move
               and solve problems (running around the soccer field and scoring
               goals). Very natural-looking behavior (animation!) develops
               almost automatically from that.
              
               Trying to solve the problem by dealing with lines, contours, and
               interpolation seems very far away from the important parts of
               animation.
              
              Just my two cents worth.
              
              Get a copy of the Williams book, it's on Amazon. Read his
              thoughts, he explains things much better, and more
              entertainingly, than I do. Sharpen up your pencil and start
              making some simple walks. Simple stick figures and tube people
              work just fine. And you may find that you enjoy the art form.
              Even if you don't become an animator yourself, the exercises will
              deepen your appreciation and understanding of the art form.
       
          yosefk wrote 1 day ago:
          Actually inbetweening is really hard (and I think requires talent)
          and used to be a big way to learn enough to become a key animator.
          And I would worry about AI eliminating this learning route if
          classical animation wasn't struggling to survive at all
       
            aredox wrote 1 day ago:
            It's hard like translating novels - you have to match someone
            else's style, which is why it's thankless.
            
            I don't know if that's really a good pathway to become a key
            animator - how many inbetweeners are there for one key animator?
       
        chris_st wrote 1 day ago:
        Not sure who would fund this research? Perhaps the Procreate Dreams
         folks [1]. I'm sure they'd love to have a functional auto-tween
        feature.
        
        
   URI  [1]: https://procreate.com/dreams
       
        nicklecompte wrote 1 day ago:
        Great read - I learned quite a bit about this. The only quibble I had
        is at the end, and it's a very general layperson's view:
        
        > But what’s even more impressive - extremely impressive - is that
        the system decided that the body would go up before going back down
        between these two poses! (Which is why it’s having trouble with the
        right arm in the first place! A feature matching system wouldn’t have
        this problem, because it wouldn’t realize that in the middle
        position, the body would go up, and the right arm would have to be
        somewhere. Struggling with things not visible in either input keyframe
        is a good problem to have - it’s evidence of knowing these things
        exist, which demonstrates quite the capabilities!).... This system
        clearly learned a lot about three-dimensional real-world movement
        behind the 2D images it’s asked to interpolate between.
        
        I think that's an awfully strong conclusion to draw from this paper -
         the authors certainly don't make that claim. The "null hypothesis"
         should be that most generative video models have a ton of yoga
         instruction videos in their training data, shot very similarly to the
         example shown, and that here the AI is simply repeating similar frames
         from similar videos. Since
        this most likely wouldn't generalize to yoga videos shot at a skew
        angle, it's hard to conclude that the system learned anything about 3D
        real-world movement. Maybe it did! But the study authors didn't come to
        that conclusion, and since their technique is actually
        model-independent, they wouldn't be in a good position to check for
        data contamination / overfitting / etc. The authors seem to think the
        value of their work is that generative video AI is by default
        past->future but you can do future->past without changing the
        underlying model, and use that to smooth out issues in interpolation. I
        just don't think there's any rational basis for generalizing this to
        understanding 3D space itself.
        
        This isn't a criticism of the paper - the work seems clever but the
        paper is not very detailed and they haven't released the code yet. And
        my complaint is only a minor editorial comment on an otherwise
        excellent writeup. But I am wondering if the author might have been
        bedazzled by a few impressive results.
       
          yosefk wrote 1 day ago:
          You're technically correct, there's no basis to argue that a "3D
          representation" was learned as opposed to "enough 2D projections to
          handle the inputs in question." That said, the hand which did not
          exist in either of the original 2D frames makes an appearance. I
          think calling wherever it was pulled out of "the 3rd dimension" is
          not wrong; it was occluded in both 2D inputs and functionally you had
          to know about its existence in the 3rd dimension to show it even if
          you technically did it by learning how pixels look across frames.
          
           You can also see much more 3D-ish things in the paper, with 2 angles
           of a room and a video created by moving the camera between them. Of
           course in some sense it adds to my point without detracting from
           yours...
       
            nicklecompte wrote 1 day ago:
            My problem is that this behavior can be attained by a shallower and
            non-generalizable understanding. Instead of realizing that the hand
            is blocked, perhaps the system's model is equivalent to the hand
            disappearing and reappearing with a "swipe." This understanding
            would not be obtained by a 3D modeling of human anatomy, but rather
            a hyperfocused study of yoga videos where the camera angle is just
            like the one shown in the paper (it is mostly the cliched camera
            angle that raises my suspicions). An understanding like this would
            not always generalize properly, e.g. instead of a hand being
            partially occluded in a skew video it visibly pops in and out, or
            the left hand strangely blends into the right.
            
            There's a general issue with generative AI drawing "a horse riding
            an astronaut" - art generators still struggle to do this because
            they just can't generalize to odd scenarios. I strongly suspect
            this method has a similar issue with "interpolate the frames of
            this yoga video with a moving handheld camera." AFAIK these systems
            are not capable of learning how 3D people move when they do yoga:
            they learn what 2D yoga instructional videos look like, and only
            incidentally pick up detailed (but non-generalizable) facts about
            3D motion.
       
        kleiba wrote 1 day ago:
        A blog post from the same guy that used to maintain the C++ FQA!
        
   URI  [1]: https://yosefk.com/c++fqa/
       
          AlexandrB wrote 1 day ago:
          One of my favourite bits of content written about C++. Highly
          recommended.
       
        xrd wrote 1 day ago:
        This was really fun. It captured a lot of thinking on a topic I've been
        interested in for a while as well.
        
        The discussion about converting to a vector format was an interesting
        diversion. I've been experimenting with using potrace from inkscape to
        migrate raster images into SVG and then use animation libraries inside
        the browser to morph them, and this idea seems like it shares some
        concepts.
        
         One of my favorite films is A Scanner Darkly, which used a technique
         called rotoscoping - as I recall, a combination of hand-traced
         animation and computers then augmenting it, or vice versa. It sounded
         similar. The Wikipedia page talks about the director Richard Linklater
         and also the MIT professor Bob Sabiston, who pioneered that derivative
         digital technique. It was fun to read that. [1]
        
   URI  [1]: https://en.m.wikipedia.org/wiki/Rotoscoping
   URI  [2]: https://en.m.wikipedia.org/wiki/Bob_Sabiston
       
          whywhywhywhy wrote 1 day ago:
          > and that used a technique called rotoscoping
          
          Technically it's "interpolated rotoscoping" using a custom tool
          called Rotoshop, which takes vector shapes drawn over footage then
          smoothly animates between the frames giving a distinct dream-like
          look to it.
          
           Rotoscoping is where you work at a traditional animation framerate,
           drawing over live action, but each frame is a new drawing and
           doesn't have the signature shimmery look of Scanner Darkly and
           Waking Life, so I think it's worth pointing out the distinction.
          
   URI    [1]: https://en.wikipedia.org/wiki/Rotoshop
       
        shaileshm wrote 1 day ago:
        Great article!
        
        This is one of the most overlooked problems in generative AI. It seems
        so trivial, but in fact, it is quite difficult. The difficulty arises
        because of the non-linearity that is expected in any natural motion.
        
        In fact, the author has highlighted all the possible difficulties of
        this problem in a much better manner.
        
         I started with a simple implementation, trying to move segments
         around the image using a segmentation mask + ROI. That strategy
         didn't work out, probably because of either a mathematical bug or
         data insufficiency. I suspect the latter.
        
        The whole idea was to draw a segmentation mask on the target image,
        then draw lines that represent motion and give options to insert
        keyframes for the lines.
        
         Imagine you are drawing a curve from A to B. You divide the curve into
         A, A_1, A_2, ..., B.
        
        Now, given the input of segmentation mask, motion curve, and whole
        image, we train some model to only move the ROI according to the motion
        curve and keyframe.
        
         The problem with this approach is in sampling the keyframes and
         maintaining consistency --making sure the ROI represents the same
         object-- across subsequent keyframes.
        
        If we are able to solve some form of consistency, this method might be
        able to give enough constraints to generate viable results.
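
         (A sketch of the data side of that setup, with made-up shapes and
         names: the model would be conditioned on the image, the segmentation
         mask for the ROI, and points sampled along the drawn motion curve.
         Nothing here is the commenter's actual code.)

           import numpy as np

           def sample_motion_curve(control_points, n):
               # resample the drawn curve A, A_1, ..., B into n evenly spaced
               # target points (piecewise-linear, by arc length)
               pts = np.asarray(control_points, dtype=float)
               seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
               d = np.concatenate([[0.0], np.cumsum(seg)])
               ts = np.linspace(0.0, d[-1], n)
               xs = np.interp(ts, d, pts[:, 0])
               ys = np.interp(ts, d, pts[:, 1])
               return np.stack([xs, ys], axis=1)

           def make_training_example(image, roi_mask, curve, step, n_keys=8):
               # condition on (image, mask, target ROI offset at this keyframe)
               targets = sample_motion_curve(curve, n_keys)
               offset = targets[step] - targets[0]
               return {"image": image, "mask": roi_mask, "offset": offset}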
       
          yosefk wrote 1 day ago:
           I've currently shelved 3K more words on why it's hard if you're
          targeting real animators. One point is that human inbetweeners get
          "spacing charts" showing how much each part should move, even though
          they understand motion very well, because the key animator wants to
          control the acting
       
       