Benchmark and Model - Nonprofit Technology

A new AI test is outwitting OpenAI, Google models, among others

Mashable Tech

MARCH 25, 2025

are nowhere near achieving AGI (Artificial General Intelligence), according to a new benchmark. The Arc Prize Foundation, a nonprofit that measures AGI progress, has a new benchmark that is stumping the leading AI models. According to the ARC-AGI leaderboard, OpenAI's most advanced model o3-low scored 4 percent.

Test

Test Model Google Benchmark

OpenAI’s o3: AI Benchmark Discrepancy Reveals Gaps in Performance Claims

TechRepublic

APRIL 22, 2025

The FrontierMath benchmark from Epoch AI tests generative models on difficult math problems. Find out how OpenAI's o3 and other AI models performed.

Benchmark

Benchmark Generation Model Test

Building Resilient Funding Models: Essential Tips for Nonprofit Finance Professionals

sgEngage

NOVEMBER 20, 2024

Finance professionals can create models to forecast future revenue, allowing you to anticipate growth potential across various streams. Set performance benchmarks (e.g., This model isn’t just for gyms or museums—it can work for advocacy groups, community organizations, and more. The good news?

Professional

Professional Fund Model Build


DeepSeek upgrades V3 model with more parameters, open-source shift

TechNode

MARCH 25, 2025

DeepSeek released an updated version of its DeepSeek-V3 model on March 24. The new version, DeepSeek-V3-0324, has 685 billion parameters, a slight increase from the original V3 model's 671 billion. The company has not yet released a system card for the updated model. 72B and Llama-3.1-405B,

Open Source

Open Source Model Open License

The 2025 BMW M5 Touring review: Way more power, way too much weight

Ars Technica

MARCH 31, 2025

For decades, it's been the benchmark by which all big, fast four-doors have been judged, but after spending a week with the all-new $125,275 G99-generation M5 Touring, I can't help but wonder if that era is coming to a close.

Review

Review Benchmark Generation Guide

What Are Foundation Models?

NVIDIA AI Blog

FEBRUARY 11, 2025

Like the prolific jazz trumpeter and composer, researchers have been generating AI models at a feverish pace, exploring new architectures and use cases. In a 2021 paper, researchers reported that foundation models are finding a wide array of uses. Earlier neural networks were narrowly tuned for specific tasks. See chart below.)

Foundation

Foundation Model Language Train

OpenAI's o3 and o4-mini hallucinate way higher than previous models

Mashable Tech

APRIL 19, 2025

By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly higher than o1. OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 ” Evaluation benchmarks are tricky. GPT-4o scored 1.5 percent, GPT-4.5 UPDATE: Apr.

Model

Model Benchmark Evaluation Rate

Contextual AI’s new AI model crushes GPT-4o in accuracy — here’s why it matters

VentureBeat

MARCH 4, 2025

Contextual AI unveiled its grounded language model (GLM) today, claiming it delivers the highest factual accuracy in the industry by outperforming leading AI systems from Google, Anthropic and OpenAI on a key benchmark for truthfulness. The startup, founded by the pioneers of retrieval-augmented

Model

Model Benchmark Language Google

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

DeepMind Blog

DECEMBER 17, 2024

Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations

Benchmark

Benchmark Evaluation Language Model

One of our favorite Samsung microSD cards drops to an all-time-low price

Engadget

MARCH 13, 2025

The 512GB model is down to just $33, which is a record-low price and one heck of a deal. We called the sequential and random read speeds respectable in our benchmark tests. To that end, the 512GB model can fit over 200,000 photos in 4K and over 300,000 images in smaller formats.

Benchmark

Benchmark Time Camera Advice

Apple Mac Studio M4 Max review: A creative powerhouse

Engadget

MARCH 13, 2025

The Mac Studio is Apple's ultimate performance computer, but this year's model came with a twist: It's equipped with either an M4 Max or an M3 Ultra processor. While the M3 Ultra model appears highly capable for creative pros and engineers, it starts at $4,000 and goes way up from there.

Review

Review Test Comparison Model

FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

Google Research AI blog

FEBRUARY 17, 2023

In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. While human evaluation is the best way to be sure of model quality, it is often slow and expensive.

Awareness

Awareness Benchmark Language Evaluation

Data-centric ML benchmarking: Announcing DataPerf’s 2023 challenges

Google Research AI blog

MARCH 30, 2023

The key to both is a deeper understanding of ML data — how to engineer training datasets that produce high quality models and test datasets that deliver accurate indicators of how close we are to solving the target problem. Despite the importance of data, ML research to date has been dominated by a focus on models. LAION or The Pile ).

Benchmark

Benchmark Challenge Data Train

Amtrak’s new trains will have bigger windows, comfier seats, and higher speeds

Fast Company Tech

MARCH 31, 2025

Nicer seats, bigger views In an announcement released earlier this month, Amtrak revealed a first look at the specs and interiors of its Airo design, and they're a marked improvement to the rail service's existing models. In 2024, Amtrak saw a record ridership of 32.8 million passengers, up from 28 million the year before.

Train

Train Training Sacramento France

Running Code and Failing Models

DataRobot

FEBRUARY 10, 2021

Even if all the code runs and the model seems to be spitting out reasonable answers, it’s possible for a model to encode fundamental data science mistakes that invalidate its results. These errors might seem small, but the effects can be disastrous when the model is used to make decisions in the real world.

Model

Model Benchmark Metrics Train

Learning from peers to refine your nonprofit funding strategy

Candid

JANUARY 29, 2025

Benchmarking against peers can help you refine your assumptions about what being financially sustainable could look like or develop entirely new assumptions. In this article, we'll outline a three-step process adapted from our report Finding Your Funding Strategy: Benchmarking 101, tailored to U.S.-based But where should you start?

Fund

Fund Strategy Learning Nonprofit

Foundation models for reasoning on charts

Google Research AI blog

MAY 26, 2023

Existing models built for these tasks relied on integrating optical character recognition (OCR) information and their coordinates into larger pipelines but the process is error prone, slow, and generalizes poorly. To solve questions in DROP, the model needs to read the paragraph, extract relevant numbers and perform numerical computation.

Chart

Chart Model Foundation Language

Kolena, a startup building tools to test AI models, raises $15M

TechCrunch

SEPTEMBER 26, 2023

Kolena, a startup building tools to test, benchmark and validate the performance of AI models, today announced that it raised $15 million in a funding round led by Lobby Capital with participation from SignalFire and Bloomberg Beta.

Test

Test Model Tools Raise

The most innovative companies in artificial intelligence for 2025

Fast Company Tech

MARCH 18, 2025

Previously, the stunning intelligence gains that led to chatbots such as ChatGPT and Claude had come from supersizing models and the data and computing power used to train them. o1 required more time to produce answers than other models, but its answers were clearly better than those of non-reasoning models.

Companies

Companies Model Train Training

Google Research, 2022 & Beyond: Language, Vision and Generative Models

Google Research AI blog

JANUARY 18, 2023

I will begin with a discussion of language, computer vision, multi-modal models, and generative machine learning models. Language Models The progress on larger and more powerful language models has been one of the most exciting areas of machine learning (ML) research over the last decade. Let’s get started!

Language

Language Model Generation Research

PaLM-E: An embodied multimodal language model

Google Research AI blog

MARCH 10, 2023

Posted by Danny Driess, Student Researcher, and Pete Florence, Research Scientist, Robotics at Google Recent years have seen tremendous advances across machine learning domains, from models that can explain jokes or answer visual questions in a variety of languages to those that can produce images based on text descriptions.

Language

Language Model Train Training

With Evals, OpenAI hopes to crowdsource AI model testing

TechCrunch

MARCH 14, 2023

Alongside GPT-4, OpenAI has open sourced a software framework to evaluate the performance of its AI models. Called Evals, OpenAI says that the tooling will allow anyone to report shortcomings in its models to help guide improvements. It’s a sort of crowdsourcing approach to model testing, OpenAI explains in a blog post.

Model

Model Test Benchmark Open Source

Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages

Google Research AI blog

MARCH 6, 2023

Posted by Yu Zhang, Research Scientist, and James Qin, Software Engineer, Google Research Last November, we announced the 1,000 Languages Initiative, an ambitious commitment to build a machine learning (ML) model that would support the world’s one thousand most-spoken languages, bringing greater inclusion to billions of people around the globe.

Language

Language Arts Model University

Barkour: Benchmarking animal-level agility with quadruped robots

Google Research AI blog

MAY 26, 2023

Yet, while researchers have enabled robots to hike or jump over some obstacles, there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables.

Benchmark

Benchmark Policy Skills Course

Google Analytics and Benchmarks: HAWK

M+R

APRIL 16, 2024

Hey data friends, it’s our favorite time of the year, the birds are singing, the flowers are blooming, you can sip your iced coffee outside and read Benchmarks! Instead of tracking sessions, GA4 uses an event-based data model. This made the Benchmarks’ website data much more difficult to analyze. What does that mean?

Benchmark

Benchmark Analytics Google Comparison

Hippocratic is building a large language model for healthcare

TechCrunch

MAY 16, 2023

The tranche, co-led by General Catalyst and Andreessen Horowitz, is a big vote of confidence in Hippocratic’s technology, a text-generating model tuned specifically for healthcare applications. Hippocratic’s benchmark results on a range of medical exams. “The language models have to be safe,” Shah said.

Language

Language Model Build Train

AI is coming for the laptop class

Recode by Vox

MARCH 13, 2025

The newest reasoning models from top AI companies are already essentially human-level, if not superhuman, at many programming tasks, which in turn has already led new tech startups to hire fewer workers. Fast AI progress, slow robotics progress If you've heard of OpenAI, you've heard of its language models: GPTs 1, 2, 3, 3.5,

Laptop

Laptop Classes Job Model

Deci raises $9.1M to optimize AI models with AI

TechCrunch

OCTOBER 27, 2020

Deci, a Tel Aviv-based startup that is building a new platform that uses AI to optimize AI models and get them ready for production, today announced that it has raised a $9.1 Using its runtime container or Edge SDK, Deci users can also then serve those models on virtually any modern platform and cloud.

Model

Model Raise Israel Learning

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

Google Research AI blog

JUNE 2, 2023

Building audiovisual datasets for training AV-ASR models, however, is challenging. In contrast, the models themselves are typically large and consist of both visual and audio encoders, and so they tend to overfit on these small datasets. LibriSpeech ). LibriSpeech ). Unconstrained audiovisual speech recognition.

Model

Model Audio Avatar Phase

Google’s Gemini 2.5 Pro is Better at Coding, Math & Science Than Your Favourite AI Model

TechRepublic

MARCH 26, 2025

Gemini 2.5 Pro is a multimodal, reasoning model that outperforms competitors from OpenAI, Anthropic, and DeepSeek on key benchmarks.

Model

Model Benchmark Google News

Intel Core CPU Clock-for-Clock Benchmark Test

TechSpot

OCTOBER 26, 2023

A clock-for-clock (IPC) test of Intel LGA 1700 processors: we're comparing the 12th-gen, 13th-gen, and "new" 14th-gen CPU models to offer insight into their architectural improvements, if any.

Test

Test Benchmark Model

OpenAI's Hot New AI Has an Embarrassing Problem

Futurism

APRIL 21, 2025

Bucking the Trend OpenAI launched its latest AI reasoning models, dubbed o3 and o4-mini, last week. According to the Sam Altman-led company, the new models outperform their predecessors and "excel at solving complex math, coding, and scientific challenges while demonstrating strong visual perception and analysis."

Problem

Problem Model Rate Social Network

Porsche Taycan Turbo S leapfrogs Tesla Model S Plaid to become fastest production EV at the Nürburgring

TechSpot

AUGUST 11, 2022

The Nürburgring racetrack has been home to decades of vehicle testing and benchmarking, and a place where automakers continue to push the limits of ICE and EV models for bragging rights. The latest feat has been accomplished by Porsche, whose Taycan Turbo S four-door posted the fastest lap time for.

Model

Model Benchmark Product Test

Family-line selection optimizer

The AI Alignment Forum

APRIL 22, 2025

I would bet that lie-proof benchmarks will be difficult and expensive to make and that the lie-proofing techniques won't easily generalize outside of coding tasks. If you RL+BP+SGD a model to avoid doing X, then the model will learn to avoid X enough that it never gets punished in training. are terribly dishonest creatures.

Benchmark

Benchmark Model Technique Train

Characterizing Emergent Phenomena in Large Language Models

Google Research AI blog

NOVEMBER 10, 2022

Posted by Jason Wei and Yi Tay, Research Scientists, Google Research, Brain Team The field of natural language processing (NLP) has been revolutionized by language models trained on large amounts of text data. Scaling up the size of language models often leads to improved performance and sample efficiency on a range of downstream NLP tasks.

Language

Language Model Train Training

DeepSeek launches new AI model with 671 billion parameters, rivaling GPT-4o

TechNode

DECEMBER 27, 2024

DeepSeek announced the release and open-source launch of its latest AI model, DeepSeek-V3, via a WeChat post on Tuesday. Users can now interact with the V3 model on DeepSeek’s official website. version, the new model’s generation speed has tripled, with a throughput of 60 tokens per second. trillion tokens. 72B and Llama-3.1-405B,

Model

Model Open Source Benchmark Interaction

ASUS Zenbook A14 review: A lightweight in every sense

Engadget

MARCH 7, 2025

The A14 is an ideal machine for writing on the go, since you can travel with it effortlessly and it offers a whopping 18 hours and 16 minutes of battery life (according to the PCMark 10 benchmark). In the PCMark 10 battery benchmark, the Zenbook A14 lasted 18 hours and 16 minutes.

Review

Review Laptop Benchmark Video

The best gaming laptops for 2025

Engadget

MARCH 13, 2025

A cheap gaming laptop in this price range will definitely feel a bit flimsier than pricier models, and they'll likely skimp on RAM, storage and overall power. In general, 15-inch laptops will be the best balance of immersion and portability, while larger 17-inch models are heftier, but naturally give you more screen real estate.

Laptop

Laptop Game Test Rate

DeepSeek releases new models Janus-Pro and JanusFlow on Lunar New Year’s Eve

TechNode

JANUARY 29, 2025

By decoupling visual encoding, the model improves adaptability and performance across various tasks. It has outperformed OpenAI’s image-generation model, DALL-E 3, in benchmark tests. Consistent with previous models in the Janus series, Janus-Pro is open-source. Tencent , in Chinese]

Model

Model Oracle Open Source Benchmark

Why Sustainable AI is the Next Step for a Better Digital Future

Forum One

NOVEMBER 26, 2024

In fact, training a single advanced AI model can generate carbon emissions comparable to the lifetime emissions of a car. And with the rapid advancement of generative AI models potentially slowing down, this provides a unique opportunity to take a breath and reimagine and mature our approach.

Digital

Digital Impact United States Integration

Ryzen 7 9800X3D could be the first 3D V-Cache CPU to outpace the standard model's frequency

TechSpot

OCTOBER 11, 2024

The leak comes from the Chinese video platform Bilibili, where someone shared a Cinebench R23 benchmark run supposedly from the unannounced 9800X3D chip. The most eye-catching specification is its 4.7GHz base clock, which would be a staggering 900MHz higher than the standard Ryzen 7 9700X's 3.8GHz base frequency.

Benchmark

Benchmark Model Video Platform

BIPOC donors are driving bold change in philanthropy 

Candid

DECEMBER 23, 2024

As the only cross-racial high-net-worth donor network dedicated to racial justice, DOCN provides an organizing space for and builds solidarity among these donors so they can become more impactful philanthropists, acting as a model for the broader philanthropic community to give more inclusively.

Donor

Donor Change Philanthropy Benchmark

Accelerating Text Generation with Confident Adaptive Language Modeling (CALM)

Google Research AI blog

DECEMBER 16, 2022

Posted by Tal Schuster, Research Scientist, Google Research Language models (LMs) are the driving force behind many recent breakthroughs in natural language processing. Models like T5, LaMDA, GPT-3, and PaLM have demonstrated impressive performance on various language tasks. The encoder reads the input text (e.g.,

Language

Language Model Generation Local

DeepMind tests the limits of large AI language systems with 280-billion-parameter model

The Verge

DECEMBER 8, 2021

Language generation is the hottest thing in AI right now, with a class of systems known as “large language models” (or LLMs) being used for everything from improving Google’s search engine to creating text-based fantasy games. One key finding of the paper is that the progress and capabilities of large language models are still increasing.

Language

Language Model Test System
