Blogs
🚧 ⛏️ 🛠️ 👷
Foundation Models in Graph & Geometric Deep Learning
Foundation Models in language, vision, and audio have been among the primary research topics in Machine Learning in 2024, whereas FMs for graph-structured data have somewhat lagged behind. In this post, we argue that the era of Graph FMs has already begun and give a few examples of how one can use them today…
Long-form video representation learning (3-series blog)
Current video understanding systems accurately recognize patterns in short video clips, but fail to process video content longer than a few seconds due to computation and memory bottlenecks. We propose a video representation method based on spatio-temporal graph learning (SPELL) to equip them with long-term reasoning ability…
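As a rough illustration of the idea, the sketch below (not the SPELL implementation; the layer sizes, temporal window, and mean-aggregation step are illustrative assumptions) treats per-frame features as graph nodes, connects frames that fall within a short temporal window, and runs one message-passing step, so long-range reasoning scales with the number of edges rather than with dense frame-to-frame attention.

```python
# Minimal sketch of a spatio-temporal graph over video frames (assumed sizes).
import torch
import torch.nn as nn

def temporal_adjacency(num_frames: int, window: int) -> torch.Tensor:
    """Binary adjacency connecting frames whose indices differ by <= window."""
    idx = torch.arange(num_frames)
    adj = (idx[None, :] - idx[:, None]).abs() <= window
    return adj.float()

class SimpleSpatioTemporalGNN(nn.Module):
    def __init__(self, in_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)
        self.msg = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_frames, in_dim); adj: (num_frames, num_frames)
        h = torch.relu(self.proj(node_feats))
        # one message-passing step: mean aggregation over temporal neighbours
        norm = adj / adj.sum(dim=1, keepdim=True)
        h = torch.relu(h + norm @ self.msg(h))
        return self.head(h)  # per-frame predictions

# usage: 900 frames (~30 s at 30 fps) of pre-extracted visual features
feats = torch.randn(900, 512)
adj = temporal_adjacency(900, window=15)
logits = SimpleSpatioTemporalGNN()(feats, adj)  # shape (900, num_classes)
```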
Graph and geometric learning in 2024
We interviewed a cohort of distinguished and prolific academic and industrial experts in an attempt to summarise the highlights of the past year and predict what is in store for 2024. The past year, 2023, was so rich with results that we had to break this post into two parts. This is Part I, focusing on theory & new architectures; see also Part II on applications…
Open MatSci ML Toolkit 1.0
On August 31, Intel Labs released version 1.0 of the Open MatSci ML Toolkit, making it easier to train advanced AI models on materials data for materials evaluation and discovery…
Bundle-Adjusting Accelerated Neural Graphics Primitives
To address the current requirement of knowing accurate camera poses a priori, as well as lengthy training times, we propose a novel approach called Bundle-Adjusting Accelerated Neural Graphics Primitives (BAA-NGP) that learns to estimate camera poses and optimize the radiance field simultaneously, with a 10 to 20 times speedup…
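The sketch below is a toy illustration of the bundle-adjustment idea rather than the BAA-NGP code: per-camera pose parameters are made learnable next to a small radiance-field MLP, and a single photometric loss sends gradients to both. The tiny MLP, the translation-only pose update, and the random stand-in ray samples are assumptions made for brevity.

```python
# Toy sketch: jointly optimizing camera pose corrections and a radiance field.
import torch
import torch.nn as nn

num_cams, rays_per_cam = 4, 64

# learnable per-camera pose corrections: 3 for rotation (axis-angle), 3 for translation
pose_params = nn.Parameter(torch.zeros(num_cams, 6))

# toy radiance field: maps a 3D point to an RGB colour
radiance_field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

optimizer = torch.optim.Adam([pose_params, *radiance_field.parameters()], lr=1e-3)

points = torch.randn(num_cams, rays_per_cam, 3)     # stand-in for sampled ray points
target_rgb = torch.rand(num_cams, rays_per_cam, 3)  # stand-in for observed pixel colours

for step in range(100):
    # apply the learnable translation offset (rotation update omitted for brevity)
    shifted = points + pose_params[:, None, 3:]
    pred_rgb = torch.sigmoid(radiance_field(shifted))
    loss = ((pred_rgb - target_rgb) ** 2).mean()     # photometric loss
    optimizer.zero_grad()
    loss.backward()                                  # gradients flow to poses *and* field
    optimizer.step()
```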
GraVi-T: A software library for long-term video understanding based on spatio-temporal graphs
GraVi-T is an open-source toolbox for long-term video understanding based on spatio-temporal graph representations.
Temporal Learning of Sparse Video-Text Transformers
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, at a fraction of the computational cost.
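To make the two forms of sparsity concrete, here is a minimal sketch (not the SViTT code; the local attention window, keep ratio, and norm-based saliency score are assumptions): edge sparsity masks out query-key pairs outside a temporal window, and node sparsity keeps only the most salient visual tokens before attention runs.

```python
# Sketch of edge sparsity (masked attention) and node sparsity (token pruning).
import torch
import torch.nn.functional as F

def edge_sparse_attention(q, k, v, window: int):
    # q, k, v: (num_tokens, dim); only tokens within `window` positions attend to each other
    n, d = q.shape
    scores = q @ k.T / d ** 0.5
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def node_sparsify(tokens, keep_ratio: float):
    # score tokens by feature norm (a stand-in for a learned saliency score)
    keep = max(1, int(tokens.shape[0] * keep_ratio))
    scores = tokens.norm(dim=-1)
    top = scores.topk(keep).indices.sort().values  # keep temporal order
    return tokens[top]

tokens = torch.randn(256, 64)                      # visual tokens from a video clip
tokens = node_sparsify(tokens, keep_ratio=0.5)     # node sparsity: 128 tokens remain
out = edge_sparse_attention(tokens, tokens, tokens, window=8)  # edge sparsity
```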
Spatio-temporal graph based video representation learning
Current video understanding systems accurately recognize patterns in short video clips, but fail to process video content longer than a few seconds due to computation and memory bottlenecks. We propose a video representation method based on spatio-temporal graph learning (SPELL) to equip them with long-term reasoning ability…