There’s a lot of noise right now about AI in engineering.
Most of it focuses on content generation and speed: faster drafting, faster modelling, faster documentation, and so on. But in structural engineering, speed matters; accuracy matters more.
So we ran a test, not to see what AI can produce, but to get a clear view of how it fails, so that we could answer the question:
Can AI generate calculations that an engineer would actually be comfortable signing off on?
The experiment
One of our team members built a large set of structural calculations using AI (Anthropic’s Sonnet 4.5). The starting point was grounded in real material: existing spreadsheets, prior calculations, and internal design assumptions. From there, AI was used to assemble these into structured calculations and translate them into functional CalcTree calculations.
Then, every calculation was manually reviewed against the applicable codes, with detailed notes on where the AI got things wrong.
Findings
An impressive baseline
We generated and reviewed 60 structural calculations using this workflow.
- 22 were fully correct
- 27 had minor issues (formatting, small inconsistencies)
- 11 had fundamental issues (logic, missing checks, or incorrect application)
At a high level, that’s ~80% usable as-is or with minor fixes.
But in structural engineering, “mostly right” isn’t enough. The bar is effectively 100%, because the output needs to be signed off.
Failures at the margins
The failures weren’t obvious. They were subtle, and often sat at the margins.
At a glance, everything looked reasonable: the structure was sound, the equations familiar, the numbers plausible. But under deeper scrutiny, issues emerged.
In some cases, the model misinterpreted intent. In others, it “completed” calculations by introducing factors that weren’t required. There were also consistent gaps around missing checks, partial application of rules, and incorrect parameter selection.
These aren’t catastrophic failures. In most cases the overall approach was correct, but something important was missing or misapplied, and the model presented the result with unearned confidence, almost ‘pretending’, to make the answer look right.
Which is exactly why they’re dangerous: the output was over-confident precisely where it was wrong.
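To make that concrete, here’s a minimal sketch of this failure mode in Python. This is a hypothetical illustration, not one of the reviewed calculations; the section properties and the 0.9 capacity factor are assumed values:

```python
# Hypothetical illustration of a subtle AI error: an extra,
# unrequired factor slips into an otherwise correct formula.
# All values below are assumptions, not real project data.

fy = 300.0    # steel yield strength, MPa (assumed)
Zx = 1.41e6   # section modulus, mm^3 (assumed)
phi = 0.9     # capacity reduction factor (assumed code value)

# Correct design capacity: phi * fy * Zx
M_correct = phi * fy * Zx / 1e6        # kN.m

# The subtle failure: phi applied a second time. The result is
# still plausible and of the right order of magnitude.
M_subtle = phi * phi * fy * Zx / 1e6   # kN.m

print(f"correct: {M_correct:.0f} kN.m, with extra factor: {M_subtle:.0f} kN.m")
```

Both numbers pass a glance test; only tracing the derivation against the design code reveals the discrepancy.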
One of the clearest patterns was how sensitive the results were to input quality.
With structured inputs, outputs improved significantly. When inputs were vague, the model filled in the gaps, often incorrectly.
But even with good context, the outputs were still only “mostly right”.
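As a hypothetical illustration of what ‘structured’ means here (the field names and values below are ours, not taken from the experiment), compare a vague request with an input that pins down everything the model would otherwise have to guess:

```python
# Vague input: the model must guess the design code, section,
# steel grade, and restraint conditions, and often guesses wrong.
vague_prompt = "Check this steel beam for bending."

# Structured input: every assumption is stated explicitly.
# (Field names and values are illustrative, not from the experiment.)
structured_input = {
    "check": "bending capacity",
    "design_code": "AS 4100",       # assumed code reference
    "section": "460UB74.6",         # assumed member size
    "steel_grade_MPa": 300,
    "design_moment_kNm": 250.0,
    "restraint": "fully laterally restrained",
}
```

The second form doesn’t guarantee a correct output, but it removes the gaps the model would otherwise fill on its own.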
The real bottleneck: verification
This creates a new kind of bottleneck. Not generation, but getting from “mostly right” to something an engineer can be confident in and actually sign off on.
In practice, that gap is where all the work sits:
- checking assumptions
- confirming every required check was actually performed
- tracing how values were derived
- identifying where something “looks right” but isn’t
And that process is slow. Not just because it requires human judgement, but because the tools we use weren’t designed for it.
This isn’t a new issue in AEC; it existed with human-only work too. But AI content generation is amplifying it.
Because now you’re reviewing something that is:
- fast to generate
- mostly correct
- designed to look correct
- but occasionally wrong in subtle ways
So the problem isn’t generating calculations. It’s reducing the effort required for a human to understand, verify, and trust them.
How this shapes how we build AI at CalcTree
AI that supports judgement, not replaces it
These findings reinforce how we’re approaching AI in CalcTree.
AI shouldn’t behave like a black box that produces answers. It should produce high-quality work, but in a way that supports human judgement.
That means:
- doing the heavy lifting
- structuring calculations clearly
- making it easy for an engineer to review and verify
- and being transparent about how confident it is in its results
The goal isn’t to replace judgement; it’s to help engineers apply it faster while maintaining accuracy.
Design principle: minimise latency between AI and human judgement
The key idea is simple: minimise the time between AI generating work and a human understanding and verifying it.
That’s the real bottleneck: not generation or accuracy in isolation, but how quickly someone can:
- understand what was done
- validate it
- and build enough confidence in it to apply it on a real project
How CalcTree makes this practical
CalcTree is designed to reduce that gap. Instead of separating generation, review, and documentation, everything sits in one place:
- Review in context — comment directly on calculations, capture decisions alongside the logic
- Source material side-by-side — verify assumptions against inputs immediately
- Full logic visibility — trace how results were derived, not just what they are
Once verified, workflows can be turned into templates and reused so accuracy compounds over time.
Final thought
AI hasn’t removed the need for engineers. If anything, it’s made their role more focused. Because the value isn’t just producing calculations anymore; it’s knowing whether they’re right.
So the question isn’t:
Can AI generate calculations?
It’s:
Would you sign off on them?
And until the answer is yes, the focus shouldn’t be on generating more; it should be on making outputs:
- understandable
- reviewable
- and trustworthy
That’s the layer that actually unlocks adoption.