Kernel bugs hide for 2 years on average. Some hide for 20.
There are bugs in your kernel right now that won’t be found for years. I know because I analyzed 125,183 of them, every bug with a traceable Fixes: tag in the Linux kernel’s 20-year git history.
The average kernel bug lives 2.1 years before discovery. But some subsystems are far worse: CAN bus drivers average 4.2 years, SCTP networking 4.0 years. The longest-lived bug in my dataset, a buffer overflow in ethtool, sat in the kernel for 20.7 years. The one I’ll dissect in detail is a refcount leak in netfilter that lasted 19 years.
I built a tool that catches 92% of historical bugs in a held-out test set at commit time. Here’s what I learned.
| Key findings at a glance | |
|---|---|
| 125,183 | Bug-fix pairs with traceable Fixes: tags |
| 123,696 | Valid records after filtering (0 < lifetime < 27 years) |
| 2.1 years | Average time a bug hides before discovery |
| 20.7 years | Longest-lived bug (ethtool buffer overflow) |
| 0% → 69% | Bugs found within 1 year (2010 vs 2022) |
| 92.2% | Recall of VulnBERT on held-out 2024 test set |
| 1.2% | False positive rate (vs 48% for vanilla CodeBERT) |
The initial discovery
I started by mining the most recent 10,000 commits with Fixes: tags from the Linux kernel. After filtering out invalid references (commits that pointed to hashes outside the repo, malformed tags, or merge commits), I had 9,876 valid vulnerability records. For the lifetime analysis, I excluded 27 same-day fixes (bugs introduced and fixed within hours), leaving 9,849 bugs with meaningful lifetimes.
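The filtering itself is only a couple of pandas expressions. Here’s a minimal sketch (the CSV name and column names are illustrative, not the exact schema of my dataset):

```python
import pandas as pd

# Minimal sketch of the filtering step; column names are illustrative.
df = pd.read_csv("fixes_mined.csv", parse_dates=["introduced_date", "fixed_date"])
df["lifetime_days"] = (df["fixed_date"] - df["introduced_date"]).dt.days

# Drop invalid references: negative lifetimes or anything older than the git history.
valid = df[(df["lifetime_days"] >= 0) & (df["lifetime_days"] < 27 * 365)]

# Same-day fixes (introduced and fixed within hours) are excluded from lifetime stats.
lifetimes = valid[valid["lifetime_days"] > 0]

print(f"valid: {len(valid)}, with meaningful lifetimes: {len(lifetimes)}")
print(f"average lifetime: {lifetimes['lifetime_days'].mean() / 365.25:.1f} years")
```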
The results were striking:
| Metric | Value |
|---|---|
| Bugs analyzed | 9,876 |
| Average lifetime | 2.8 years |
| Median lifetime | 1.0 year |
| Maximum | 20.7 years |
Almost 20% of bugs had been hiding for 5+ years. The networking subsystem looked particularly bad at 5.1 years average. I found a refcount leak in netfilter that had been in the kernel for 19 years.

Initial findings: Half of bugs found within a year, but 20% hide for 5+ years.
But something nagged at me: my dataset only contained fixes from 2025. Was I seeing the full picture, or just the tip of the iceberg?
Going deeper: Mining the full history
I rewrote my miner to capture every Fixes: tag since Linux moved to git in 2005. Six hours later, I had 125,183 vulnerability records, a dataset roughly 12x larger than my initial one.
The numbers changed significantly:
| Metric | 2025 Only | Full History (2005-2025) |
|---|---|---|
| Bugs analyzed | 9,876 | 125,183 |
| Average lifetime | 2.8 years | 2.1 years |
| Median lifetime | 1.0 year | 0.7 years |
| 5+ year bugs | 19.4% | 13.5% |
| 10+ year bugs | 6.6% | 4.2% |

Full history: 57% of bugs found within a year. The long tail is smaller than it first appeared.
Why the difference? My initial 2025-only dataset was biased. Fixes in 2025 include:
- New bugs introduced recently and caught quickly
- Ancient bugs that finally got discovered after years of hiding
The ancient bugs skewed the average upward. When you include the full history with all the bugs that were introduced AND fixed within the same year, the average drops from 2.8 to 2.1 years.
The real story: We’re getting faster (but it’s complicated)
The most striking finding from the full dataset: bugs introduced in recent years appear to get fixed much faster.
| Year Introduced | Bugs | Avg Lifetime | % Found <1yr |
|---|---|---|---|
| 2010 | 1,033 | 9.9 years | 0% |
| 2014 | 3,991 | 3.9 years | 31% |
| 2018 | 11,334 | 1.7 years | 54% |
| 2022 | 11,090 | 0.8 years | 69% |
Bugs introduced in 2010 took nearly 10 years to find; bugs introduced in 2024 are being found in about 5 months. At first glance that looks like a 20x improvement!
But here’s the catch: this data is right-censored. Bugs introduced in 2022 can’t have a 10-year lifetime yet since we’re only in 2026. We might find more 2022 bugs in 2030 that bring the average up.
The fairer comparison is “% found within 1 year” and that IS improving: from 0% (2010) to 69% (2022). That’s real progress, likely driven by:
- Syzkaller (released 2015)
- KASAN, KMSAN, KCSAN sanitizers
- Better static analysis
- More contributors reviewing code
But there’s a backlog. When I look at just the bugs fixed in 2024-2025:
- 60% were introduced in the last 2 years (new bugs, caught quickly)
- 18% were introduced 5-10 years ago
- 6.5% were introduced 10+ years ago
We’re simultaneously catching new bugs faster AND slowly working through ~5,400 ancient bugs that have been hiding for over 5 years.
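Both of those numbers fall out of a short pandas script. A rough sketch (again, the CSV name and column names are illustrative rather than the exact schema of the published dataset):

```python
import pandas as pd

# Rough sketch; CSV name and column names are illustrative.
df = pd.read_csv("fixes_mined.csv", parse_dates=["introduced_date", "fixed_date"])
df["lifetime_years"] = (df["fixed_date"] - df["introduced_date"]).dt.days / 365.25

# "% found within 1 year" by introduction year; robust to right-censoring
# for any cohort that is at least a year old.
by_intro_year = df.groupby(df["introduced_date"].dt.year)["lifetime_years"]
pct_within_1yr = by_intro_year.apply(lambda s: (s < 1.0).mean() * 100)

# Age distribution of bugs fixed in 2024-2025 (the backlog view).
recent = df[df["fixed_date"].dt.year >= 2024]
age_buckets = pd.cut(recent["lifetime_years"],
                     bins=[0, 2, 5, 10, 100],
                     labels=["<2 years", "2-5 years", "5-10 years", "10+ years"])

print(pct_within_1yr.tail(10))
print(age_buckets.value_counts(normalize=True).mul(100).round(1))
```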
The methodology
The kernel has a convention: when a commit fixes a bug, it includes a Fixes: tag pointing to the commit that introduced the bug.
```text
commit de788b2e6227
Author: Florian Westphal <fw@strlen.de>
Date:   Fri Aug 1 17:25:08 2025 +0200

    netfilter: ctnetlink: fix refcount leak on table dump

    Fixes: d205dc40798d ("netfilter: ctnetlink: ...")
```
I wrote a miner that:
- Runs `git log --grep="Fixes:"` to find all fixing commits
- Extracts the referenced commit hash from the `Fixes:` tag
- Pulls dates from both commits
- Classifies subsystem from file paths (70+ patterns)
- Detects bug type from commit message keywords
- Calculates the lifetime
```python
import re

fixes_pattern = r'Fixes:\s*([0-9a-f]{12,40})'
match = re.search(fixes_pattern, commit_message)
if match:
    introducing_hash = match.group(1)
    lifetime_days = (fixing_date - introducing_date).days
```
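The date extraction is plain git plumbing. A minimal sketch of that step (the helper names are mine; only the `git show -s --format=%at` call is standard):

```python
import subprocess
from datetime import datetime, timezone

def author_date(commit: str) -> datetime:
    """Author timestamp of a commit, as a UTC datetime."""
    out = subprocess.run(["git", "show", "-s", "--format=%at", commit],
                         capture_output=True, text=True, check=True)
    return datetime.fromtimestamp(int(out.stdout.strip()), tz=timezone.utc)

def lifetime_days(fixing_commit: str, introducing_commit: str) -> int:
    return (author_date(fixing_commit) - author_date(introducing_commit)).days

# e.g. the netfilter leak dissected later in this post:
# lifetime_days("de788b2e6227", "d205dc40798d")  -> roughly 19 years' worth of days
```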
Dataset details:
| Parameter | Value |
|---|---|
| Kernel version | v6.19-rc3 |
| Mining date | January 6, 2026 |
| Fixes mined since | 2005-04-16 (git epoch) |
| Total records | 125,183 |
| Unique fixing commits | 119,449 |
| Unique bug-introducing authors | 9,159 |
| With CVE ID | 158 |
| With Cc: stable | 27,875 (22%) |
Coverage note: The kernel has ~448,000 commits mentioning “fix” in some form, but only ~124,000 (28%) use proper Fixes: tags. My dataset captures the well-documented bugs: the ones where maintainers traced the root cause.
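You can roughly reproduce those coverage counts with two `git log` queries (the regexes are approximate, so expect ballpark numbers rather than exact matches):

```python
import subprocess

# Ballpark reproduction of the coverage numbers; the regexes are approximate.
def count_commits(*grep_args: str) -> int:
    out = subprocess.run(["git", "log", "--oneline", *grep_args],
                         capture_output=True, text=True, check=True)
    return len(out.stdout.splitlines())

mentions_fix = count_commits("-i", "--grep=fix")           # ~448,000
fixes_tagged = count_commits("--grep=^Fixes: [0-9a-f]")    # ~124,000
print(f"{fixes_tagged / mentions_fix:.0%} of 'fix' commits carry a Fixes: tag")
```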
It varies by subsystem
Some subsystems have bugs that persist far longer than others:
| Subsystem | Bug Count | Avg Lifetime |
|---|---|---|
| drivers/can | 446 | 4.2 years |
| networking/sctp | 279 | 4.0 years |
| networking/ipv4 | 1,661 | 3.6 years |
| usb | 2,505 | 3.5 years |
| tty | 1,033 | 3.5 years |
| netfilter | 1,181 | 2.9 years |
| networking | 6,079 | 2.9 years |
| memory | 2,459 | 1.8 years |
| gpu | 5,212 | 1.4 years |
| bpf | 959 | 1.1 years |

CAN bus and SCTP bugs persist longest. BPF and GPU bugs get caught fastest.
CAN bus drivers and SCTP networking have bugs that persist longest probably because both are niche protocols with less testing coverage. GPU (especially Intel i915) and BPF bugs get caught fastest, probably thanks to dedicated fuzzing infrastructure.
Interesting finding from comparing 2025-only vs full history:
| Subsystem | 2025-only Avg | Full History Avg | Difference |
|---|---|---|---|
| networking | 5.2 years | 2.9 years | -2.3 years |
| filesystem | 3.8 years | 2.6 years | -1.2 years |
| drivers/net | 3.3 years | 2.2 years | -1.1 years |
| gpu | 1.4 years | 1.4 years | 0 years |
Networking looked terrible in the 2025-only data (5.2 years!) but is actually closer to average in the full history (2.9 years). The 2025 fixes were catching a backlog of ancient networking bugs. GPU looks the same either way, and those bugs get caught consistently fast.
Some bug types hide longer than others
Race conditions are the hardest to find, averaging 5.1 years to discovery:
| Bug Type | Count | Avg Lifetime | Median |
|---|---|---|---|
| race-condition | 1,188 | 5.1 years | 2.6 years |
| integer-overflow | 298 | 3.9 years | 2.2 years |
| use-after-free | 2,963 | 3.2 years | 1.4 years |
| memory-leak | 2,846 | 3.1 years | 1.4 years |
| buffer-overflow | 399 | 3.1 years | 1.5 years |
| refcount | 2,209 | 2.8 years | 1.3 years |
| null-deref | 4,931 | 2.2 years | 0.7 years |
| deadlock | 1,683 | 2.2 years | 0.8 years |
Why do race conditions hide so long? They’re non-deterministic and only trigger under specific timing conditions that might occur once per million executions. Even sanitizers like KCSAN can only flag races they observe.
30% of bugs are self-fixes where the same person who introduced the bug eventually fixed it. I guess code ownership matters.
Why some bugs hide longer
Less fuzzing coverage. Syzkaller excels at syscall fuzzing but struggles with stateful protocols. Fuzzing netfilter effectively requires generating valid packet sequences that traverse specific connection tracking states.
Harder to trigger. Many networking bugs require:
- Specific packet sequences
- Race conditions between concurrent flows
- Memory pressure during table operations
- Particular NUMA topologies
Older code with fewer eyes. Core networking infrastructure like nf_conntrack was written in the mid-2000s. It works, so nobody rewrites it. But “stable” means fewer developers actively reviewing.
Case study: 19 years in the kernel
One of the oldest networking bugs in my dataset was introduced in August 2006 and fixed in August 2025:
```c
// ctnetlink_dump_table() - the buggy code path
if (res < 0) {
        nf_conntrack_get(&ct->ct_general);  // increments refcount
        cb->args[1] = (unsigned long)ct;
        break;
}
```
The irony: Commit d205dc40798d was itself a fix: “[NETFILTER]: ctnetlink: fix deadlock in table dumping”. Patrick McHardy was fixing a deadlock by removing a _put() call. In doing so, he introduced a refcount leak that would persist for 19 years.
The bug: the code doesn’t check if ct == last. If the current entry is the same as the one we already saved, we’ve now incremented its refcount twice but will only decrement it once. The object never gets freed.
```c
// What should have been checked:
if (res < 0) {
        if (ct != last)  // <-- this check was missing for 19 years
                nf_conntrack_get(&ct->ct_general);
        cb->args[1] = (unsigned long)ct;
        break;
}
```
The consequence: Memory leaks accumulate. Eventually nf_conntrack_cleanup_net_list() waits forever for the refcount to hit zero. The netns teardown hangs. If you’re using containers, this blocks container cleanup indefinitely.
Why it took 19 years: You had to run conntrack_resize.sh in a loop for ~20 minutes under memory pressure. The fix commit says: “This can be reproduced by running conntrack_resize.sh selftest in a loop. It takes ~20 minutes for me on a preemptible kernel.” Nobody ran that specific test sequence for two decades.
Incomplete fixes are common
Here’s a pattern I keep seeing: someone notices undefined behavior, ships a fix, but the fix doesn’t fully close the hole.
Case study: netfilter set field validation
| Date | Commit | What happened |
|---|---|---|
| Jan 2020 | `f3a2181e16f1` | Stefano Brivio adds support for sets with multiple ranged fields. Introduces NFTA_SET_DESC_CONCAT for specifying field lengths. |
| Jan 2024 | `3ce67e3793f4` | Pablo Neira notices the code doesn’t validate that field lengths sum to the key length. Ships a fix. Commit message: “I did not manage to crash nft_set_pipapo with mismatch fields and set key length so far, but this is UB which must be disallowed.” |
| Jan 2025 | `1b9335a8000f` | Security researcher finds a bypass. The 2024 fix was incomplete; there were still code paths that could mismatch. Real fix shipped. |
The 2024 fix was an acknowledgment that something was wrong, but Pablo couldn’t find a crash, so the fix was conservative. A year later, someone found the crash.
This pattern suggests a detection opportunity: commits that say things like “this is undefined behavior” or “I couldn’t trigger this but…” are flags. The author knows something is wrong but hasn’t fully characterized the bug. These deserve extra scrutiny.
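A crude version of that heuristic is a commit-message regex. The phrase list below is my own guess at useful signals, a sketch rather than something I mined rigorously:

```python
import re

# "Author suspects a bug but couldn't trigger it" phrases; illustrative only.
HEDGING = re.compile(
    r"undefined behaviou?r"
    r"|did not manage to (crash|trigger|reproduce)"
    r"|could\s*n[o']?t (trigger|reproduce)"
    r"|in theory|theoretical(ly)?"
    r"|should (never|not) happen",
    re.IGNORECASE,
)

def needs_extra_scrutiny(commit_msg: str) -> bool:
    return bool(HEDGING.search(commit_msg))

# e.g. the 2024 nft_set_pipapo fix ("I did not manage to crash ... but this is UB")
# would be flagged for closer review.
```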
The anatomy of a long-lived bug
Looking at the bugs that survive 10+ years, I see common patterns:
1. Reference counting errors
```c
kref_get(&obj->ref);
// ... error path returns without kref_put()
```
These don’t crash immediately. They leak memory slowly. In a long-running system, you might not notice until months later when OOM killer starts firing.
2. Missing NULL checks after dereference
```c
struct foo *f = get_foo();
f->bar = 1;              // dereference happens first
if (!f) return -EINVAL;  // check comes too late
```
The compiler might optimize away the NULL check since you already dereferenced. These survive because the pointer is rarely NULL in practice.
3. Integer overflow in size calculations
```c
size_t total = n_elements * element_size;     // can overflow
buf = kmalloc(total, GFP_KERNEL);
memcpy(buf, src, n_elements * element_size);  // copies more than allocated
```
If n_elements comes from userspace, an attacker can cause allocation of a small buffer followed by a large copy.
4. Race conditions in state machines
```c
spin_lock(&lock);
if (state == READY) {
        spin_unlock(&lock);
        // window here where another thread can change state
        do_operation();  // assumes state is still READY
}
```
These require precise timing to hit. They might manifest as rare crashes that nobody can reproduce.
Can we catch these bugs automatically?
Every day a bug lives in the kernel is another day millions of devices are vulnerable. Android phones, servers, embedded systems, cloud infrastructure, all running kernel code with bugs that won’t be found for years.
I built VulnBERT, a model that predicts whether a commit introduces a vulnerability.
Model evolution:
| Model | Recall | FPR | F1 | Notes |
|---|---|---|---|---|
| Random Forest | 76.8% | 15.9% | 0.80 | Hand-crafted features only |
| CodeBERT (fine-tuned) | 89.2% | 48.1% | 0.65 | High recall, unusable FPR |
| VulnBERT | 92.2% | 1.2% | 0.95 | Best of both approaches |
The problem with vanilla CodeBERT: I first tried fine-tuning CodeBERT directly. Results: 89% recall but 48% false positive rate (measured on the same test set). Unusable: it flagged nearly half of all commits.
Why so bad? CodeBERT learns shortcuts: “big diff = dangerous”, “lots of pointers = risky”. These correlations exist in training data but don’t generalize. The model pattern-matches on surface features, not actual bug patterns.
The VulnBERT approach: Combine neural pattern recognition with human domain expertise.
┌─────────────────────────────────────────────────────────────────────┐
│ INPUT: Git Diff │
└───────────────────────────────┬─────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌───────────────────────────┐ ┌───────────────────────────────────┐
│ Chunked Diff Encoder │ │ Handcrafted Feature Extractor │
│ (CodeBERT + Attention) │ │ (51 engineered features) │
└─────────────┬─────────────┘ └─────────────────┬─────────────────┘
│ [768-dim] │ [51-dim]
└───────────────┬───────────────────┘
▼
┌───────────────────────────────┐
│ Cross-Attention Fusion │
│ "When code looks like X, │
│ feature Y matters more" │
└───────────────┬───────────────┘
▼
┌───────────────────────────────┐
│ Risk Classifier │
└───────────────────────────────┘
Three innovations that drove performance:
1. Chunked encoding for long diffs. CodeBERT’s 512-token limit truncates most kernel diffs (often 2000+ tokens). I split into chunks, encode each, then use learned attention to aggregate:
```python
import torch.nn as nn
import torch.nn.functional as F

# Learnable attention over chunks
chunk_attention = nn.Sequential(
    nn.Linear(hidden_size, hidden_size // 4),
    nn.Tanh(),
    nn.Linear(hidden_size // 4, 1)
)
attention_weights = F.softmax(chunk_attention(chunk_embeddings), dim=1)
pooled = (attention_weights * chunk_embeddings).sum(dim=1)
```
The model learns which chunks matter: the one with a spin_lock and no spin_unlock, not the boilerplate.
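For concreteness, the chunking step looks roughly like this (chunk size and stride here are illustrative, not the exact training configuration):

```python
from transformers import AutoTokenizer

# Sketch of splitting a long diff into CodeBERT-sized chunks.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
MAX_TOKENS = 510  # leave room for [CLS] and [SEP]

def chunk_diff(diff_text: str, stride: int = 384) -> list[list[int]]:
    ids = tokenizer(diff_text, add_special_tokens=False)["input_ids"]
    chunks = []
    for start in range(0, max(len(ids), 1), stride):
        window = ids[start:start + MAX_TOKENS]
        chunks.append([tokenizer.cls_token_id] + window + [tokenizer.sep_token_id])
        if start + MAX_TOKENS >= len(ids):
            break
    return chunks  # each chunk gets its own CodeBERT embedding, pooled by attention
```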
2. Feature fusion via cross-attention. Neural networks miss domain-specific patterns. I extract 51 handcrafted features using regex and AST-like analysis of the diff:
| Category | Features |
|---|---|
| Basic (4) | lines_added, lines_removed, files_changed, hunks_count |
| Memory (3) | has_kmalloc, has_kfree, has_alloc_no_free |
| Refcount (5) | has_get, has_put, get_count, put_count, unbalanced_refcount |
| Locking (5) | has_lock, has_unlock, lock_count, unlock_count, unbalanced_lock |
| Pointers (4) | has_deref, deref_count, has_null_check, has_deref_no_null_check |
| Error handling (6) | has_goto, goto_count, has_error_return, has_error_label, error_return_count, has_early_return |
| Semantic (13) | var_after_loop, iterator_modified_in_loop, list_iteration, list_del_in_loop, has_container_of, has_cast, cast_count, sizeof_type, sizeof_ptr, has_arithmetic, has_shift, has_copy, copy_count |
| Structural (11) | if_count, else_count, switch_count, case_count, loop_count, ternary_count, cyclomatic_complexity, max_nesting_depth, function_call_count, unique_functions_called, function_definitions |
The key bug-pattern features:
```python
'unbalanced_refcount': 1,      # kref_get without kref_put → leak
'unbalanced_lock': 1,          # spin_lock without spin_unlock → deadlock
'has_deref_no_null_check': 0,  # *ptr without if(!ptr) → null deref
'has_alloc_no_free': 0,        # kmalloc without kfree → memory leak
```
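Here’s roughly how the refcount features come out of a diff. This is a simplified sketch; the real extractor uses many more patterns and is more careful about context lines and macros:

```python
import re

GET_RE = re.compile(r"\b\w+_get\s*\(|\brefcount_inc\s*\(")
PUT_RE = re.compile(r"\b\w+_put\s*\(|\brefcount_dec\s*\(")

def refcount_features(diff: str) -> dict:
    added   = [l[1:] for l in diff.splitlines() if l.startswith("+") and not l.startswith("+++")]
    removed = [l[1:] for l in diff.splitlines() if l.startswith("-") and not l.startswith("---")]
    gets_added   = sum(len(GET_RE.findall(l)) for l in added)
    puts_added   = sum(len(PUT_RE.findall(l)) for l in added)
    gets_removed = sum(len(GET_RE.findall(l)) for l in removed)
    puts_removed = sum(len(PUT_RE.findall(l)) for l in removed)
    return {
        "get_count": gets_added,
        "put_count": puts_added,
        # fires when a get() is added without a put(), or when more put()s than
        # get()s are deleted (the d205dc40798d pattern)
        "unbalanced_refcount": int(gets_added > puts_added or puts_removed > gets_removed),
    }
```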
Cross-attention learns conditional relationships. When CodeBERT sees locking patterns AND unbalanced_lock=1, that’s HIGH risk. Neither signal alone is sufficient; it’s the combination.
```python
# Feature fusion via cross-attention
feature_embedding = feature_projection(handcrafted_features)  # 51 → 768
attended, _ = cross_attention(
    query=code_embedding,      # What patterns does the code have?
    key=feature_embedding,     # What do the hand-crafted features say?
    value=feature_embedding
)
fused = fusion_layer(torch.cat([code_embedding, attended], dim=-1))
```
3. Focal loss for hard examples. The training data is imbalanced: most commits are safe. Standard cross-entropy wastes gradient updates on easy examples. Focal loss:
- Standard cross-entropy loss when p=0.95 (easy example): 0.05
- Focal loss when p=0.95: 0.000125 (400x smaller)
The model focuses on ambiguous commits: the hard 5% that matter.
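Focal loss itself is only a few lines. Here’s a standard binary formulation with gamma=2, which is what produces the 400x down-weighting above; the exact constants in my training run may differ:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0):
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                       # probability assigned to the true class
    return ((1.0 - p_t) ** gamma * ce).mean()  # easy examples contribute almost nothing
```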
Impact of each component (estimated from ablation experiments):
| Component | F1 Score |
|---|---|
| CodeBERT baseline | ~76% |
| + Focal loss | ~80% |
| + Feature fusion | ~88% |
| + Contrastive learning | ~91% |
| Full VulnBERT | 95.4% |
Note: Individual component impacts are approximate; interactions between components make precise attribution difficult.
The key insight: neither neural networks nor hand-crafted rules alone achieve the best results. The combination does.
Results on temporal validation (train ≤2023, test 2024):
| Metric | Target | Result |
|---|---|---|
| Recall | 90% | 92.2% ✓ |
| FPR | <10% | 1.2% ✓ |
| Precision | — | 98.7% |
| F1 | — | 95.4% |
| AUC | — | 98.4% |
What these metrics mean:
- Recall (92.2%): Of all actual bug-introducing commits, we catch 92.2%. Missing 7.8% of bugs.
- False Positive Rate (1.2%): Of all safe commits, we incorrectly flag 1.2%. Low FPR = fewer false alarms.
- Precision (98.7%): Of commits we flag as risky, 98.7% actually are. When we raise an alarm, we’re almost always right.
- F1 (95.4%): Harmonic mean of precision and recall. Single number summarizing overall performance.
- AUC (98.4%): Area under ROC curve. Measures ranking quality—how well the model separates bugs from safe commits across all thresholds.
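If you want to compute the same metrics on your own predictions, the standard scikit-learn calls look like this (thresholding at 0.5 here; the deployed threshold may differ):

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# y_true: 1 = bug-introducing commit, 0 = safe; y_score: model risk in [0, 1]
def report(y_true, y_score, threshold: float = 0.5) -> dict:
    y_pred = [int(s >= threshold) for s in y_score]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "recall":    recall_score(y_true, y_pred),
        "fpr":       fp / (fp + tn),
        "precision": precision_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),  # threshold-free
    }
```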
The model correctly differentiates the same bug at different stages:
| Commit | Description | Risk |
|---|---|---|
| `acf44a2361b8` | Fix for UAF in xe_vfio | 12.4% LOW ✓ |
| `1f5556ec8b9e` | Introduced the UAF | 83.8% HIGH ✓ |
What the model sees: The 19-year bug
When analyzing the bug-introducing commit d205dc40798d:
```diff
- if (ct == last) {
-     nf_conntrack_put(&last->ct_general);  // removed!
- }
+ if (ct == last) {
+     last = NULL;
      continue;
  }
  if (ctnetlink_fill_info(...) < 0) {
      nf_conntrack_get(&ct->ct_general);    // still here
```
Extracted features:
| Feature | Value | Signal |
|---|---|---|
| `get_count` | 1 | nf_conntrack_get() present |
| `put_count` | 0 | nf_conntrack_put() was removed |
| `unbalanced_refcount` | 1 | Mismatch detected |
| `has_lock` | 1 | Uses read_lock_bh() |
| `list_iteration` | 1 | Uses list_for_each_prev() |
Model prediction: 72% risk (HIGH)
The unbalanced_refcount feature fires because _put() was removed but _get() remains. Classic refcount leak pattern.
Limitations
Dataset limitations:
- Only captures bugs with `Fixes:` tags (~28% of fix commits). Selection bias: well-documented bugs tend to be more serious.
- Mainline only; doesn’t include stable-branch-only fixes or vendor patches
- Subsystem classification is heuristic-based (regex on file paths)
- Bug type detection is based on keyword matching in commit messages, and many bugs end up classified as “unknown”
- Lifetime calculation uses author dates, not commit dates; rebasing can skew timestamps
- Some “bugs” may be theoretical (comments like “fix possible race” without confirmed trigger)
Model limitations:
- 92.2% recall is on a held-out 2024 test set, not a guarantee for future bugs
- Can’t catch semantic bugs (logic errors with no syntactic signal)
- Cross-function blind spots (bug spans multiple files)
- Training data bias (learns patterns from bugs that were found, novel patterns may be missed)
- False positives on intentional patterns (init/cleanup in different commits)
- Tested only on Linux kernel code, may not generalize to other codebases
Statistical limitations:
- Survivorship bias in year-over-year comparisons (recent bugs can’t have long lifetimes yet)
- Correlation ≠ causation for subsystem/bug-type lifetime differences
What this means: VulnBERT is a triage tool, not a guarantee. It catches 92% of bugs with recognizable patterns. The remaining 8% and novel bug classes still need human review and fuzzing.
What’s next
92.2% recall with 1.2% FPR is production-ready. But there’s more to do:
- RL-based exploration: Instead of static pattern matching, train an agent to explore code paths and find bugs autonomously. The current model predicts risk; an RL agent could generate triggering inputs.
- Syzkaller integration: Use fuzzer coverage as a reward signal. If the model flags a commit and Syzkaller finds a crash in that code path, that’s strong positive signal.
- Subsystem-specific models: Networking bugs have different patterns than driver bugs. A model fine-tuned on netfilter might outperform the general model on netfilter commits.
The goal isn’t to replace human reviewers but to point them at the 10% of commits most likely to be problematic, so they can focus attention where it matters.
Reproducing this
The dataset extraction uses the kernel’s Fixes: tag convention. Here’s the core logic:
```python
import re
from typing import Optional

def extract_fixes_tag(commit_msg: str) -> Optional[str]:
    """Extract the commit ID from a Fixes: tag"""
    pattern = r'Fixes:\s*([a-f0-9]{12,40})'
    match = re.search(pattern, commit_msg, re.IGNORECASE)
    return match.group(1) if match else None
```

```bash
# Mine all Fixes: tags from git history
git log --since="2005-04-16" --grep="Fixes:" --format="%H"

# For each fixing commit:
# - Extract introducing commit hash
# - Get dates from both commits
# - Calculate lifetime
# - Classify subsystem from file paths
```
Full miner code and dataset: github.com/quguanni/kernel-vuln-data
TL;DR
- 125,183 bugs analyzed from 20 years of Linux kernel git history (123,696 with valid lifetimes)
- Average bug lifetime: 2.1 years (2.8 years in 2025-only data due to survivorship bias in recent fixes)
- 0% → 69% of bugs found within 1 year (2010 vs 2022) (real improvement from better tooling)
- 13.5% of bugs hide for 5+ years (these are the dangerous ones)
- Race conditions hide longest (5.1 years average)
- VulnBERT catches 92.2% of bugs on held-out 2024 test set with only 1.2% FPR (98.4% AUC)
- Dataset: github.com/quguanni/kernel-vuln-data
If you’re working on kernel security, vulnerability detection, or ML for code analysis, I’d love to talk: jenny@pebblebed.com
