October 15, 2025

Comparative Performance Analysis of Leading AI Models in ToltIQ

Overview

This report evaluates four leading AI models used at ToltIQ: ChatGPT 5, ChatGPT 4.1, Gemini 2.5 Pro, and Claude 4 Sonnet. Each model was run inside the ToltIQ platform on prompts based on Amazon public filings and designed to mirror Private Equity analytical tasks.

Authors

  • Steiner Williams - Private Equity AI Researcher (Led the benchmarking exercise, data analysis, and reporting.)
  • Maya Boeye - Head of AI Research (Provided conceptual guidance, supervised the research process and report review.)

Methodology and Criteria

Models were evaluated using six weighted criteria:

  • Accuracy (30%)
  • Relevance (20%)
  • Reasoning (20%)
  • Problem Solving (10%)
  • Industry Relevance (10%)
  • Human Opinion (10%)

Human evaluators manually scored responses using rubrics without knowing which model produced each response.
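
The six weights above sum to 100%, so an overall score can be read as a weighted average of the per-criterion scores. The report does not state the exact aggregation formula, so the following is only a minimal sketch under that assumption. The function name, the dictionary keys, and the example scores are hypothetical and are not taken from the report.

```python
# Weights as listed in the report (30/20/20/10/10/10); they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "relevance": 0.20,
    "reasoning": 0.20,
    "problem_solving": 0.10,
    "industry_relevance": 0.10,
    "human_opinion": 0.10,
}


def overall_score(criterion_scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores.

    Assumption: each criterion is scored on the same 0-10 scale and the
    overall score is a plain weighted sum. The report does not confirm this.
    """
    missing = WEIGHTS.keys() - criterion_scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Hypothetical per-criterion scores for one response (illustrative only).
example = {
    "accuracy": 9.0,
    "relevance": 8.0,
    "reasoning": 8.5,
    "problem_solving": 7.0,
    "industry_relevance": 8.0,
    "human_opinion": 9.0,
}

print(round(overall_score(example), 2))  # prints 8.4
```

Because Accuracy carries 30% of the weight, a one-point change in accuracy moves the overall score by 0.3 under this assumption, three times the effect of a one-point change in any 10% criterion.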

Model Analysis

Model             Score
ChatGPT 5         9.15/10
Claude 4 Sonnet   8.01/10
Gemini 2.5 Pro    7.73/10
ChatGPT 4.1       7.64/10

ChatGPT 5

Score: 9.15/10

Strongest performer distinguished by meticulous numerical handling with exact decimal places. Frequently uses disclaimers when contextual information appears incomplete and produces highly detailed tables with comprehensive headers.

Private Equity Applications: LBO validation, CIM comparisons, and IC memo drafting where precision and completeness are critical.

Claude 4 Sonnet

Score: 8.01/10

Emphasizes risks and limitations more than peers. Structures responses with moderate depth, addressing all prompt elements systematically without unnecessary elaboration.

Private Equity Applications: Red team analysis, regulatory diligence, and operational reviews to stress-test assumptions and uncover vulnerabilities.

Gemini 2.5 Pro

Score: 7.73/10

Most comprehensive in providing historical context and background. Delivers narrative-driven analysis rather than structured tables. Where it does produce tables, it reverses rows and columns but keeps the figures accurate; human evaluators preferred this layout.

Private Equity Applications: Sector due diligence highlighting industry evolution, scenario modeling for exit planning, and portfolio monitoring through narrative framing.

ChatGPT 4.1

Score: 7.64/10

Concise and action-oriented with strong emphasis on strategic recommendations. Employs approximation symbols (e.g., "~38–40%"), prioritizing directional accuracy over absolute precision. Relies heavily on bulleted lists for easy scanning.

Private Equity Applications: Deal screening, rapid market sizing, and executive summary preparation where brevity matters more than full precision.

Conclusion

"ChatGPT 5 sets the benchmark for precision and structured analysis in Private Equity, while the other models provide complementary strengths across diligence, scenario planning, and risk assessment."

ToltIQ maintains model-agnostic access, with ChatGPT 5 as the default for high-precision functions, Claude 4 Sonnet applied in specific due diligence workflows, and Gemini 2.5 Pro in beta testing for narrative-driven use cases.
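
The assignment described above can be pictured as a small routing table. This is an illustrative sketch only, not ToltIQ's implementation. The task labels, the `Route` class, the `pick_model` function, and the choice to fall back to ChatGPT 5 for unknown task types or when beta access is off are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    beta: bool = False  # True if the model is still in beta testing


# Task labels are hypothetical; model assignments follow the report's conclusion.
ROUTES = {
    "high_precision": Route("ChatGPT 5"),
    "due_diligence": Route("Claude 4 Sonnet"),
    "narrative": Route("Gemini 2.5 Pro", beta=True),
}

# Assumption: unknown task types fall back to the high-precision default.
DEFAULT = ROUTES["high_precision"]


def pick_model(task_type: str, allow_beta: bool = False) -> str:
    """Return the model name for a task type, skipping beta routes unless allowed."""
    route = ROUTES.get(task_type, DEFAULT)
    if route.beta and not allow_beta:
        return DEFAULT.model
    return route.model


print(pick_model("due_diligence"))               # prints Claude 4 Sonnet
print(pick_model("narrative"))                   # prints ChatGPT 5
print(pick_model("narrative", allow_beta=True))  # prints Gemini 2.5 Pro
```

Keeping the mapping in one table makes it easy to swap a model in or out, which is the practical point of model-agnostic access.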