
AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

Google Research AI blog

The resulting AVFormer model achieves state-of-the-art zero-shot performance on three different AV-ASR benchmarks (How2, VisSpeech, and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (i.e., LibriSpeech).


Unsupervised and semi-supervised anomaly detection with data-centric ML

Google Research AI blog

Using data-centric approaches, we show state-of-the-art results in both settings. In “SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch”, we propose a novel semi-supervised AD framework that yields robust performance even under distribution mismatch with limited labeled samples.


Foundation models for reasoning on charts

Google Research AI blog

In light of these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables.


RO-ViT: Region-aware pre-training for open-vocabulary object detection with vision transformers

Google Research AI blog

Various techniques such as image-text pre-training, knowledge distillation, pseudo labeling, and frozen models, often employing convolutional neural network (CNN) backbones, have been proposed. To address this, we propose cropped positional embeddings (CPE). We are also releasing the code here.
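The snippet names cropped positional embeddings but not their implementation. As a rough sketch of the underlying idea (randomly crop the whole-image positional embedding table during pre-training and resize the crop back to the model's grid), the code below uses NumPy; the function name, crop-size policy, and nearest-neighbor resize are our own assumptions, not the paper's exact recipe:

```python
import numpy as np

def cropped_positional_embedding(pos_emb, grid, rng):
    """CPE sketch: randomly crop and resize whole-image positional embeddings.

    pos_emb: (grid, grid, dim) whole-image positional embedding table.
    Returns a (grid, grid, dim) table corresponding to a random crop,
    resized back to the full grid via nearest-neighbor sampling.
    """
    # Sample a random crop (assumed: at least half the grid on each side).
    h = rng.integers(grid // 2, grid + 1)
    w = rng.integers(grid // 2, grid + 1)
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    crop = pos_emb[top:top + h, left:left + w]
    # Resize the crop back to (grid, grid) by nearest-neighbor indexing.
    rows = np.clip((np.arange(grid) * h / grid).astype(int), 0, h - 1)
    cols = np.clip((np.arange(grid) * w / grid).astype(int), 0, w - 1)
    return crop[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
pe = rng.normal(size=(14, 14, 32))  # 14x14 patch grid, 32-dim embeddings
cpe = cropped_positional_embedding(pe, 14, rng)
print(cpe.shape)  # (14, 14, 32)
```

The intuition is that at detection time the model sees region crops, so pre-training on embeddings that mimic crops of a larger image narrows the train/test gap.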


Google Research, 2022 & beyond: Algorithms for efficient deep learning

Google Research AI blog

We proposed a theoretical architecture that would “remember events” in the form of sketches stored in an external LSH table with pointers to modules that process such sketches. We also proposed a new constrained optimization algorithm for automating hyperparameter tuning. T5-Large models have <1% nonzero entries.
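The “<1% nonzero entries” figure is a density measurement on sparse models. As a hypothetical illustration of how such a density is produced and checked, here is a simple magnitude-pruning sketch (the function name and pruning rule are ours, not the method used in the post):

```python
import numpy as np

def prune_to_density(w, density):
    """Magnitude-prune a weight matrix so only `density` of entries stay nonzero."""
    k = int(w.size * density)
    # Threshold at the k-th largest magnitude; zero out everything smaller.
    thresh = np.sort(np.abs(w), axis=None)[-k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

rng = np.random.default_rng(1)
w = rng.normal(size=(512, 512))
sparse_w = prune_to_density(w, 0.01)
nnz = np.count_nonzero(sparse_w) / sparse_w.size
print(f"density: {nnz:.3f}")  # ~0.010, i.e. <1% nonzero entries
```

Sparsity at this level only pays off in wall-clock time when paired with kernels or hardware that skip the zeros, which is part of what the efficiency work addresses.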


Recent advances in deep long-horizon forecasting

Google Research AI blog

A number of neural network–based solutions have been able to show good performance on benchmarks and also support the above criterion. However, other work has suggested that even linear models can outperform these transformer variants on time-series benchmarks. Left: MSE on the test set of a popular traffic forecasting benchmark.
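To make the linear-baseline claim concrete: a single least-squares map from a lookback window to the forecast horizon is the kind of model that has rivaled transformer variants on these benchmarks. The sketch below (helper name and toy data are ours) fits such a map with NumPy:

```python
import numpy as np

def fit_linear_forecaster(series, lookback, horizon):
    """Fit one linear map from the last `lookback` points to the next `horizon`.

    Returns W of shape (lookback, horizon), fit by least squares over all
    sliding windows of the training series.
    """
    X, Y = [], []
    for t in range(lookback, len(series) - horizon + 1):
        X.append(series[t - lookback:t])
        Y.append(series[t:t + horizon])
    X, Y = np.asarray(X), np.asarray(Y)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Toy daily-periodic series: a linear map extrapolates the pattern exactly.
t = np.arange(400)
series = np.sin(2 * np.pi * t / 24)
W = fit_linear_forecaster(series, lookback=48, horizon=24)
pred = series[-48:] @ W  # forecast the next 24 steps
truth = np.sin(2 * np.pi * (t[-1] + 1 + np.arange(24)) / 24)
mse = np.mean((pred - truth) ** 2)
```

On real benchmarks the comparison is of course harder than on a clean sinusoid, but the same few-parameter structure is what made the linear baselines surprisingly competitive.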


Sparse video tubes for joint video and image vision transformers

Google Research AI blog

We demonstrate that this model is scalable, can be adapted to large pre-trained ViTs without requiring full fine-tuning, and achieves state-of-the-art results across many video classification benchmarks. Importantly, it outperforms all state-of-the-art methods trained jointly on image+video datasets.
