There’s a lot of noise right now about AI in engineering.
Most of it focuses on content generation and speed: faster drafting, faster modelling, faster documentation, and so on. But in structural engineering, speed matters; accuracy matters more.
So we ran a test, not to see what AI can produce, but to get a clear view of how it fails, so that we could answer the question:
Can AI generate calculations that an engineer would actually be comfortable signing off on?
The experiment
One of our team members built a large set of structural calculations using AI (Anthropic’s Sonnet 4.5). The starting point was grounded in real material: existing spreadsheets, prior calculations, and internal design assumptions. From there, AI was used to assemble these into structured calculations and translate them into functional CalcTree calculations.
Then, every calculation was manually reviewed against the applicable codes, with detailed notes on where the AI got things wrong.
Findings
An impressive baseline
We generated and reviewed 60 structural calculations using this workflow.
- 22 were fully correct
- 27 had minor issues (formatting, small inconsistencies)
- 11 had fundamental issues (logic, missing checks, or incorrect application)
At a high level, that’s ~80% usable as-is or with minor fixes.
But in structural engineering, “mostly right” isn’t enough. The bar is effectively 100%, because the output needs to be signed off.
Failures at the margins
The failures weren’t obvious. They were subtle, and often sat at the margins.
At a glance, everything looked reasonable: the structure was sound, the equations familiar, the numbers plausible. But under deeper scrutiny, issues emerged.
In some cases, the model misinterpreted intent. In others, it “completed” calculations by introducing factors that weren’t required. There were also consistent gaps around missing checks, partial application of rules, and incorrect parameter selection.
These aren’t catastrophic failures. In most cases the overall approach was correct, but something important was missing or misapplied, and the model presented the result with unearned confidence, almost ‘pretending’, to make the answer look right.
Which is exactly why they’re dangerous: the output was over-confident precisely where it was wrong.
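To make that concrete, here’s a minimal sketch of this failure mode in Python. This is a hypothetical illustration, not one of the reviewed calculations; the section properties and the 0.9 capacity factor are assumed values:

```python
# Hypothetical illustration of a subtle AI error: an extra,
# unrequired factor slips into an otherwise correct formula.
# All values below are assumptions, not real project data.

fy = 300.0    # steel yield strength, MPa (assumed)
Zx = 1.41e6   # section modulus, mm^3 (assumed)
phi = 0.9     # capacity reduction factor (assumed code value)

# Correct design capacity: phi * fy * Zx
M_correct = phi * fy * Zx / 1e6        # kN.m

# The subtle failure: phi applied a second time. The result is
# still plausible and of the right order of magnitude.
M_subtle = phi * phi * fy * Zx / 1e6   # kN.m

print(f"correct: {M_correct:.0f} kN.m, with extra factor: {M_subtle:.0f} kN.m")
```

Both numbers pass a glance test; only tracing the derivation against the design code reveals the discrepancy.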
One of the clearest patterns was how sensitive the results were to input quality.
With structured inputs, outputs improved significantly. When inputs were vague, the model filled in the gaps, often incorrectly.
But even with good context, the outputs were still only “mostly right”.
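As a hypothetical illustration of what ‘structured’ means here (the field names and values below are ours, not taken from the experiment), compare a vague request with an input that pins down everything the model would otherwise have to guess:

```python
# Vague input: the model must guess the design code, section,
# steel grade, and restraint conditions, and often guesses wrong.
vague_prompt = "Check this steel beam for bending."

# Structured input: every assumption is stated explicitly.
# (Field names and values are illustrative, not from the experiment.)
structured_input = {
    "check": "bending capacity",
    "design_code": "AS 4100",       # assumed code reference
    "section": "460UB74.6",         # assumed member size
    "steel_grade_MPa": 300,
    "design_moment_kNm": 250.0,
    "restraint": "fully laterally restrained",
}
```

The second form doesn’t guarantee a correct output, but it removes the gaps the model would otherwise fill on its own.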
The real bottleneck: verification
This creates a new kind of bottleneck. Not generation, but getting from “mostly right” to something an engineer can be confident in and actually sign off on.
In practice, that gap is where all the work sits:
- checking assumptions
- confirming every required check was actually performed
- tracing how values were derived
- identifying where something “looks right” but isn’t
And that process is slow. Not just because it requires human judgement, but because the tools we use weren’t designed for it.
This isn’t a new issue in AEC; it existed with human-only work too. But AI content generation is amplifying it.
Because now you’re reviewing something that is:
- fast to generate
- mostly correct
- designed to look correct
- but occasionally wrong in subtle ways
So the problem isn’t generating calculations. It’s reducing the effort required for a human to understand, verify, and trust them.
How this shapes how we build AI at CalcTree
AI that supports judgement, not replaces it
These findings reinforce how we’re approaching AI in CalcTree.
AI shouldn’t behave like a black box that produces answers. It should produce high-quality work, but in a way that supports human judgement.
That means:
- doing the heavy lifting
- structuring calculations clearly
- making it easy for an engineer to review and verify
- and being transparent about how confident it is in its results
The goal isn’t to replace judgement; it’s to help engineers apply it faster while maintaining accuracy.
Design principle: minimise latency between AI and human judgement
The key idea is simple: minimise the time between AI generating work and a human understanding and verifying it.
That’s the real bottleneck: not generation or accuracy in isolation, but how quickly someone can:
- understand what was done
- validate it
- and build enough confidence in it to apply it on a real project
How CalcTree makes this practical
CalcTree is designed to reduce that gap. Instead of separating generation, review, and documentation, everything sits in one place:
- Review in context — comment directly on calculations, capture decisions alongside the logic
- Source material side-by-side — verify assumptions against inputs immediately
- Full logic visibility — trace how results were derived, not just what they are
Once verified, workflows can be turned into templates and reused so accuracy compounds over time.
Final thought
AI hasn’t removed the need for engineers. If anything, it’s made their role more focused. Because the value isn’t just producing calculations anymore; it’s knowing whether they’re right.
So the question isn’t:
Can AI generate calculations?
It’s:
Would you sign off on them?
And until the answer is yes, the focus shouldn’t be on generating more; it should be on making outputs:
- understandable
- reviewable
- and trustworthy
That’s the layer that actually unlocks adoption.