🚧 ⛏️ 🛠️ 👷

Long-form video representation learning (Part 5: Modeling visual relationships and actions)

Click the title for the details.

Current video understanding systems accurately recognize patterns in short video clips, but fails to process a video content over a few seconds due to computation and memory bottleneck. We propose a video representation method based on a spatio-temporal graph learning (SPELL) to equip it with long-term reasoning ability…

Long-form video representation learning (Part 4: latest and greatest in long-form video reasoning)

Click the title for the details.

Current video understanding systems accurately recognize patterns in short video clips, but fails to process a video content over a few seconds due to computation and memory bottleneck. We propose a video representation method based on a spatio-temporal graph learning (SPELL) to equip it with long-term reasoning ability…

Long-form video representation learning (Part 3: Sparse Transformers for video representation)

Click the title for the details.

Current video understanding systems accurately recognize patterns in short video clips, but fails to process a video content over a few seconds due to computation and memory bottleneck. We propose a video representation method based on a spatio-temporal graph learning (SPELL) to equip it with long-term reasoning ability…

Long-form video representation learning (Part 2: video as temporal graphs)

Click the title for the details.

Current video understanding systems accurately recognize patterns in short video clips, but fails to process a video content over a few seconds due to computation and memory bottleneck. We propose a video representation method based on a spatio-temporal graph learning (SPELL) to equip it with long-term reasoning ability…

Long-form video representation learning (Part 1: video as object-centric spatio-temporal graphs)

Click the title for the details.

Current video understanding systems accurately recognize patterns in short video clips, but fails to process a video content over a few seconds due to computation and memory bottleneck. We propose a video representation method based on a spatio-temporal graph learning (SPELL) to equip it with long-term reasoning ability…

Graph and geometric learning in 2024

Click the title for the details.

We interviewed a cohort of distinguished and prolific academic and industrial experts in an attempt to summarise the highlights of the past year and predict what is in store for 2024. Past 2023 was so ripe with results that we had to break this post into two parts. This is Part I focusing on theory & new architectures, see also Part II on applications….

Open MatSci ML ToolKit 1.0

Click the title for the details.

Intel Labs recently released the Open MatSci ML Toolkit version 1.0 on August 31, making training of advanced AI models on materials data more accessible for materials evaluation and discovery…..

Bundle-Adjusting Accelerated Neural Graphics Primitives

Click the title for the details.

To address the current restrictions on knowing accurate camera poses a-priori as well as lengthy training time, we propose a novel approach called Bundle-Adjusting Accelerated Graphics Primitives (BAA-NGP) that can learn to estimate camera poses and optimize the radiance field simultaneously with 10 to 20 times speedup…

GraVi-T: A software library for long-term video understanding based on spatio-temporal graphs

Click the title for the details.

GraVi-T is an open-sourced toolbox for long-term video understanding based on spatio-temporal graph-based representations.

Temporal Learning of Sparse Video-Text Transformers

Click the title for the details.

We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost.

Spatio-temporal graph based video representation learning

Click the title for the details.

Current video understanding systems accurately recognize patterns in short video clips, but fails to process a video content over a few seconds due to computation and memory bottleneck. We propose a video representation method based on a spatio-temporal graph learning (SPELL) to equip it with long-term reasoning ability…