The 87% Problem: Why Traditional Security Tools Generate Noise
Traditional SAST tools have an 87% false positive rate.
We know because we ran them on 10 open-source project repos.
1,183 findings. Only 152 real vulnerabilities - 12.9%. The remaining 1,031 were false positives. That's roughly 120 alerts on average to find 15 real vulnerabilities per repo scanned.
Meanwhile, the critical vulnerabilities? Missed entirely.
Example: when we scanned NocoDB with Semgrep (SAST), it flagged 222 potential issues. 208 were false positives. And it completely missed the SQL injection in the Oracle client - the one vulnerability that could have led to a complete database compromise.
This is the problem every engineering team faces. Now with AI-generated code flooding codebases, it's getting way worse, fast.
What 45 Repositories Taught Us
Over the past few months we ran scans across 45 repos. For 10 of them - actively maintained open-source projects - we ran both traditional SAST tools and our semantic AI analysis side by side. Not toy projects - real infrastructure that teams depend on: Langfuse, Qdrant, NocoDB, Phase, Cloudreve, Weaviate, Agenta, vLLM - projects with actual users and production deployments.
The results aren't subtle.
Legacy Semgrep (SAST) scanning is mostly (87%) noise. 1,183 alerts, only 152 real, and you still have to inspect all 1,183.
The noise problem is scaling with AI-generated code, which is now produced at up to 100x the rate of human developers. In 2025, GitHub reported around 400 billion lines of AI-generated code in repos, projected to reach 4 trillion by 2030.
The time & cost of triaging the noise has become exponential, killing profitability and choking dev cycles.
Legacy scanning does nothing to fix the issues it finds (scan only).
Semantic AI Scanning finds an additional 30-40% of issues not found by legacy SAST scanning.
Automated AI alert triage and automated vulnerability remediation are the ONLY WAY FORWARD.
These SAST comparison figures are from a representative sample of 10 of the 45 repositories in our research, where we ran traditional and semantic AI tools side-by-side.
To make sure we weren't fooling ourselves, we submitted our findings to the maintainers. Of the 41 reports reviewed so far, 37 were accepted and 4 were rejected. That's a 90.24% acceptance rate - by the independent project maintainers, the people who know these codebases inside out, confirming the vulnerabilities we found were real.
NocoDB is a case in point. Semgrep flagged 222 issues. We reviewed them all: 208 false positives, 14 worth a second look.
But the critical SQL injection in the Oracle client - affecting 17 lines of code across OracleClient.ts, where user-controlled parameters were concatenated directly into SQL queries - Semgrep didn't catch it at all. The flaw required understanding how data moved between files. Pattern matching couldn't see it. Our semantic AI analysis did.
Evidence: The fix shipped in PR #12748.
The Real Cost of Noise
Let's put dollars on this.
A senior engineer costs roughly $100 an hour, fully loaded. Security triage needs experienced developers because junior engineers can't reliably tell false positives from real vulnerabilities. On average they need 10 minutes to manually triage each alert.
Each project wastes around 20 hours, or $2,000, per scan cycle investigating alerts that are 85-90% noise when using legacy SAST scanning tools like SonarQube, Snyk, Checkmarx and Aikido.
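As a rough back-of-the-envelope check on those figures (about 120 alerts per repo, 10 minutes per alert, $100 per hour - the averages above, not measurements from any particular tool):

```python
# Back-of-the-envelope triage cost per scan cycle, using this article's averages.
alerts_per_repo = 120      # average alerts per scanned repo
minutes_per_alert = 10     # senior-engineer triage time per alert
hourly_rate_usd = 100      # fully loaded senior engineer cost

triage_hours = alerts_per_repo * minutes_per_alert / 60
triage_cost = triage_hours * hourly_rate_usd

print(f"{triage_hours:.0f} hours, ${triage_cost:,.0f} per scan cycle")
# -> 20 hours, $2,000 per scan cycle
```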
But the financial cost isn't even the worst part. After a few rounds of crying wolf, developers stop trusting the tools. They start marking real issues as false positives just to clear the backlog. Critical vulnerabilities go to production because the signal is completely buried in noise.
We've seen this cycle play out at company after company:
The tool generates noise.
The team loses trust.
Alerts get ignored.
Real vulnerabilities ship.
Eventually the tool gets turned off entirely - back to zero security coverage.
AI-Generated Code Is Making This Worse
The 2025 Octoverse report from GitHub says 40% of all code merged on the platform now uses AI assistance. That's not just autocomplete suggestions - entire functions, classes, and features generated in one shot.
The security implications are ugly.
Veracode tested more than 100 LLMs across 80 coding tasks and found that 45% of AI-generated code samples failed security tests. Cross-site scripting defences failed 86% of the time. Log injection succeeded 88% of the time. And model size didn't help - bigger models weren't any safer.
This isn't theoretical. We've already seen the real-world consequences.
CVE-2025-48757 - Lovable
A security researcher crawled 1,645 Lovable-powered projects and found 303 insecure Supabase endpoints across 170 sites. No row-level security. Emails, payment data, API keys - all sitting there exposed. CVSS 8.26.
EnrichLead
Built entirely with Cursor AI, zero hand-written code. Within days, researchers found hardcoded API keys on the client side, no authentication on endpoints, and no rate limiting. Someone ran up a $14,000 bill on the founder's OpenAI account. The whole thing had to be shut down.
Tea dating app
A women-only platform where safety was the entire value proposition. Leaked 72,000 photos including 13,000 government IDs, over a million private messages, and GPS location data from an unsecured Firebase database. 59.3 GB went to 4chan. Ten lawsuits followed.
These aren't hypothetical attack scenarios. They're things that already happened. And they happened because the code looked correct on the surface. The login forms looked professional. The endpoints seemed functional. The security just wasn't there - and traditional scanning tools couldn't tell the difference.
What Actually Works
If pattern matching can't keep up, what can?
The answer is understanding what code actually does, not just what it looks like.
When we scanned NocoDB with semantic analysis, we didn't just look for string concatenation near SQL keywords. We traced data flow from HTTP request parameters through the query builder into the Oracle client. We saw that user-controlled values were being concatenated into SQL queries across 17 separate locations. We understood the impact: arbitrary SQL execution, data exfiltration, potential DBA privilege escalation.
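To make the shape of that bug concrete, here is a hedged, generic illustration in Python with sqlite3 - not NocoDB's TypeScript Oracle client - showing the vulnerable concatenation pattern next to the parameterized fix:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")

def find_user_vulnerable(email: str):
    # Vulnerable shape: user-controlled input concatenated into the query.
    # An input like "x' OR '1'='1" rewrites the query's meaning.
    query = "SELECT id FROM users WHERE email = '" + email + "'"
    return conn.execute(query).fetchall()

def find_user_safe(email: str):
    # Parameterized query: the driver treats the input as data, never as SQL.
    return conn.execute(
        "SELECT id FROM users WHERE email = ?", (email,)
    ).fetchall()
```

Pattern matchers can flag the obvious version of the first function; they struggle when the concatenation happens several hops away from where the user input enters, which is exactly the data-flow problem described above.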
The same approach found:
An unsafe torch.load() call in vLLM that could enable remote code execution through malicious checkpoint files - fixed in PR #32045
Nine separate authorization bypasses in Phase, including a double-negative logic error - if not user_id is not None - where Python's operator precedence meant the permission check never actually ran (see the sketch after this list) - fixed across PRs #722-731
SSRF vulnerabilities in Langfuse's PostHog integration - fixed in PR #11311
Sandbox escapes in Agenta's RestrictedPython implementation - fixed in release v0.77.1
Timing attack vulnerabilities in Cloudreve's HMAC validation - fixed in release 4.11.0
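That Phase double negative is worth seeing in isolation. A minimal sketch of the parsing trap, in plain Python and deliberately not the Phase code itself:

```python
# Illustrative only - not the Phase code. Shows how Python parses the
# double-negative condition that a reviewer can easily misread.
user_id = 42  # any authenticated, non-None user

# Reads as though it checks "(not user_id) is not None", but Python binds
# "is not" tighter than "not", so it actually parses as:
#     not (user_id is not None)   ->   user_id is None
condition = not user_id is not None

print(condition)                        # False for every real user_id
print(condition == (user_id is None))   # True - the two are equivalent
```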
None of these fit neat patterns. They're semantic bugs - the code compiles, runs, and looks reasonable on a quick review. Understanding them requires knowing what the code is supposed to do, not just scanning for known-bad syntax.
Across all 45 repositories, our semantic analysis found 225 vulnerabilities that traditional tools missed. Of the reports maintainers have reviewed so far, 90.24% were accepted.
The fixes are merged, the PRs are public, and you can verify every claim.
Every assessment is published with full technical details - code locations, CWEs, PR numbers, and disclosure timelines - at kolega.dev/security-wins.
The data makes the case: if your security tools can't understand what your code means, they can't protect it.