ghsa-979q-6jjc-29jh
Vulnerability from github
Published
2025-07-28 12:30
Modified
2025-07-28 12:30
Details

In the Linux kernel, the following vulnerability has been resolved:

netfilter: nf_conntrack: fix crash due to removal of uninitialised entry

A crash in conntrack was reported while trying to unlink the conntrack entry from the hash bucket list: [exception RIP: __nf_ct_delete_from_lists+172] [..] #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack] #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack] #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack] [..]

The nf_conn struct is marked as allocated from slab but appears to be in a partially initialised state:

ct hlist pointer is garbage; looks like the ct hash value (hence crash). ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected ct->timeout is 30000 (=30s), which is unexpected.

Everything else looks like normal udp conntrack entry. If we ignore ct->status and pretend its 0, the entry matches those that are newly allocated but not yet inserted into the hash: - ct hlist pointers are overloaded and store/cache the raw tuple hash - ct->timeout matches the relative time expected for a new udp flow rather than the absolute 'jiffies' value.

If it were not for the presence of IPS_CONFIRMED, __nf_conntrack_find_get() would have skipped the entry.

Theory is that we did hit following race:

cpu x cpu y cpu z found entry E found entry E E is expired nf_ct_delete() return E to rcu slab init_conntrack E is re-inited, ct->status set to 0 reply tuplehash hnnode.pprev stores hash value.

cpu y found E right before it was deleted on cpu x. E is now re-inited on cpu z. cpu y was preempted before checking for expiry and/or confirm bit.

                ->refcnt set to 1
                E now owned by skb
                ->timeout set to 30000

If cpu y were to resume now, it would observe E as expired but would skip E due to missing CONFIRMED bit.

                nf_conntrack_confirm gets called
                sets: ct->status |= CONFIRMED
                This is wrong: E is not yet added
                to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED: nf_ct_expired() -> yes (ct->timeout is 30s) confirmed bit set.

cpu y will try to delete E from the hashtable: nf_ct_delete() -> set DYING bit __nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash: cpu z still holds the table bucket lock(s) so y blocks:

        wait for spinlock held by z

                CONFIRMED is set but there is no
                guarantee ct will be added to hash:
                "chaintoolong" or "clash resolution"
                logic both skip the insert step.
                reply hnnode.pprev still stores the
                hash value.

                unlocks spinlock
                return NF_DROP
        <unblocks, then
         crashes on hlist_nulls_del_rcu pprev>

In case CPU z does insert the entry into the hashtable, cpu y will unlink E again right away but no crash occurs.

Without 'cpu y' race, 'garbage' hlist is of no consequence: ct refcnt remains at 1, eventually skb will be free'd and E gets destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.

To resolve this, move the IPS_CONFIRMED assignment after the table insertion but before the unlock.

Pablo points out that the confirm-bit-store could be reordered to happen before hlist add resp. the timeout fixup, so switch to set_bit and before_atomic memory barrier to prevent this.

It doesn't matter if other CPUs can observe a newly inserted entry right before the CONFIRMED bit was set:

Such event cannot be distinguished from above "E is the old incarnation" case: the entry will be skipped.

Also change nf_ct_should_gc() to first check the confirmed bit.

The gc sequence is: 1. Check if entry has expired, if not skip to next entry 2. Obtain a reference to the expired entry. 3. Call nf_ct_should_gc() to double-check step 1.

nf_ct_should_gc() is thus called only for entries that already failed an expiry check. After this patch, once the confirmed bit check pas ---truncated---

Show details on source website


{
  "affected": [],
  "aliases": [
    "CVE-2025-38472"
  ],
  "database_specific": {
    "cwe_ids": [],
    "github_reviewed": false,
    "github_reviewed_at": null,
    "nvd_published_at": "2025-07-28T12:15:29Z",
    "severity": null
  },
  "details": "In the Linux kernel, the following vulnerability has been resolved:\n\nnetfilter: nf_conntrack: fix crash due to removal of uninitialised entry\n\nA crash in conntrack was reported while trying to unlink the conntrack\nentry from the hash bucket list:\n    [exception RIP: __nf_ct_delete_from_lists+172]\n    [..]\n #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]\n #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]\n #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]\n    [..]\n\nThe nf_conn struct is marked as allocated from slab but appears to be in\na partially initialised state:\n\n ct hlist pointer is garbage; looks like the ct hash value\n (hence crash).\n ct-\u003estatus is equal to IPS_CONFIRMED|IPS_DYING, which is expected\n ct-\u003etimeout is 30000 (=30s), which is unexpected.\n\nEverything else looks like normal udp conntrack entry.  If we ignore\nct-\u003estatus and pretend its 0, the entry matches those that are newly\nallocated but not yet inserted into the hash:\n  - ct hlist pointers are overloaded and store/cache the raw tuple hash\n  - ct-\u003etimeout matches the relative time expected for a new udp flow\n    rather than the absolute \u0027jiffies\u0027 value.\n\nIf it were not for the presence of IPS_CONFIRMED,\n__nf_conntrack_find_get() would have skipped the entry.\n\nTheory is that we did hit following race:\n\ncpu x \t\t\tcpu y\t\t\tcpu z\n found entry E\t\tfound entry E\n E is expired\t\t\u003cpreemption\u003e\n nf_ct_delete()\n return E to rcu slab\n\t\t\t\t\tinit_conntrack\n\t\t\t\t\tE is re-inited,\n\t\t\t\t\tct-\u003estatus set to 0\n\t\t\t\t\treply tuplehash hnnode.pprev\n\t\t\t\t\tstores hash value.\n\ncpu y found E right before it was deleted on cpu x.\nE is now re-inited on cpu z.  cpu y was preempted before\nchecking for expiry and/or confirm bit.\n\n\t\t\t\t\t-\u003erefcnt set to 1\n\t\t\t\t\tE now owned by skb\n\t\t\t\t\t-\u003etimeout set to 30000\n\nIf cpu y were to resume now, it would observe E as\nexpired but would skip E due to missing CONFIRMED bit.\n\n\t\t\t\t\tnf_conntrack_confirm gets called\n\t\t\t\t\tsets: ct-\u003estatus |= CONFIRMED\n\t\t\t\t\tThis is wrong: E is not yet added\n\t\t\t\t\tto hashtable.\n\ncpu y resumes, it observes E as expired but CONFIRMED:\n\t\t\t\u003cresumes\u003e\n\t\t\tnf_ct_expired()\n\t\t\t -\u003e yes (ct-\u003etimeout is 30s)\n\t\t\tconfirmed bit set.\n\ncpu y will try to delete E from the hashtable:\n\t\t\tnf_ct_delete() -\u003e set DYING bit\n\t\t\t__nf_ct_delete_from_lists\n\nEven this scenario doesn\u0027t guarantee a crash:\ncpu z still holds the table bucket lock(s) so y blocks:\n\n\t\t\twait for spinlock held by z\n\n\t\t\t\t\tCONFIRMED is set but there is no\n\t\t\t\t\tguarantee ct will be added to hash:\n\t\t\t\t\t\"chaintoolong\" or \"clash resolution\"\n\t\t\t\t\tlogic both skip the insert step.\n\t\t\t\t\treply hnnode.pprev still stores the\n\t\t\t\t\thash value.\n\n\t\t\t\t\tunlocks spinlock\n\t\t\t\t\treturn NF_DROP\n\t\t\t\u003cunblocks, then\n\t\t\t crashes on hlist_nulls_del_rcu pprev\u003e\n\nIn case CPU z does insert the entry into the hashtable, cpu y will unlink\nE again right away but no crash occurs.\n\nWithout \u0027cpu y\u0027 race, \u0027garbage\u0027 hlist is of no consequence:\nct refcnt remains at 1, eventually skb will be free\u0027d and E gets\ndestroyed via: nf_conntrack_put -\u003e nf_conntrack_destroy -\u003e nf_ct_destroy.\n\nTo resolve this, move the IPS_CONFIRMED assignment after the table\ninsertion but before the unlock.\n\nPablo points out that the confirm-bit-store could be reordered to happen\nbefore hlist add resp. the timeout fixup, so switch to set_bit and\nbefore_atomic memory barrier to prevent this.\n\nIt doesn\u0027t matter if other CPUs can observe a newly inserted entry right\nbefore the CONFIRMED bit was set:\n\nSuch event cannot be distinguished from above \"E is the old incarnation\"\ncase: the entry will be skipped.\n\nAlso change nf_ct_should_gc() to first check the confirmed bit.\n\nThe gc sequence is:\n 1. Check if entry has expired, if not skip to next entry\n 2. Obtain a reference to the expired entry.\n 3. Call nf_ct_should_gc() to double-check step 1.\n\nnf_ct_should_gc() is thus called only for entries that already failed an\nexpiry check. After this patch, once the confirmed bit check pas\n---truncated---",
  "id": "GHSA-979q-6jjc-29jh",
  "modified": "2025-07-28T12:30:35Z",
  "published": "2025-07-28T12:30:34Z",
  "references": [
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2025-38472"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/2d72afb340657f03f7261e9243b44457a9228ac7"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/76179961c423cd698080b5e4d5583cf7f4fcdde9"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/938ce0e8422d3793fe30df2ed0e37f6bc0598379"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/a47ef874189d47f934d0809ae738886307c0ea22"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/fc38c249c622ff5e3011b8845fd49dbfd9289afc"
    }
  ],
  "schema_version": "1.4.0",
  "severity": []
}


Log in or create an account to share your comment.




Tags
Taxonomy of the tags.


Loading…

Loading…

Loading…

Sightings

Author Source Type Date

Nomenclature

  • Seen: The vulnerability was mentioned, discussed, or seen somewhere by the user.
  • Confirmed: The vulnerability is confirmed from an analyst perspective.
  • Exploited: This vulnerability was exploited and seen by the user reporting the sighting.
  • Patched: This vulnerability was successfully patched by the user reporting the sighting.
  • Not exploited: This vulnerability was not exploited or seen by the user reporting the sighting.
  • Not confirmed: The user expresses doubt about the veracity of the vulnerability.
  • Not patched: This vulnerability was not successfully patched by the user reporting the sighting.


Loading…

Loading…