Spatio-temporal graph-based video representation learning
Current video understanding systems accurately recognize patterns in short video clips.
However, they fail to capture how the present connects to the past or future in a world of never-ending visual stimuli.
Existing video architectures tend to hit computation or memory bottlenecks after processing only a few seconds of video content.
So, how do we enable accurate and efficient long-term visual understanding? An important first step is a model that can practically
run on long videos. To that end, we propose a novel video representation method based on Spatio-Temporal Graph Learning (SPELL), which equips
video models with long-term reasoning capability.
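To make the idea concrete, a minimal sketch of one common way to build such a graph is shown below: each frame (or region) embedding becomes a node, and edges connect nodes within a temporal window. The function name `build_spatiotemporal_graph` and the window parameter `tau` are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def build_spatiotemporal_graph(features, tau=2):
    """Illustrative sketch (not SPELL's exact construction): connect each
    frame-level node to its temporal neighbors within `tau` timesteps.

    features: (T, D) array of per-frame embeddings.
    Returns the node features and a directed edge list of (src, dst) pairs.
    """
    T = features.shape[0]
    edges = [(i, j) for i in range(T) for j in range(T)
             if i != j and abs(i - j) <= tau]
    return features, edges

# Toy example: 5 frames with 4-dimensional embeddings, window of 1 step.
feats, edges = build_spatiotemporal_graph(np.random.rand(5, 4), tau=1)
print(len(edges))  # 4 adjacent frame pairs, each linked in both directions -> 8
```

Because the number of edges grows with the window size rather than with the full video length, graph message passing over such a structure can scale to long videos where dense attention over all frame pairs would not.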