ghsa-vwg7-hhf5-ff3g

Vulnerability from github

Published: 2025-04-18 15:31

Modified: 2025-05-02 09:30

Details

In the Linux kernel, the following vulnerability has been resolved:

x86/mce: use is_copy_from_user() to determine copy-from-user context

Patch series "mm/hwpoison: Fix regressions in memory failure handling", v4.

1. What am I trying to do:

This patchset resolves two critical regressions related to memory failure handling that have appeared in the upstream kernel since version 5.17, as compared to 5.10 LTS.

- copyin case: poison found in user page while kernel copying from user space
- instr case: poison found while instruction fetching in user space

2. What is the expected outcome and why

For copyin case:

The kernel can recover from poison found while it is doing get_user() or copy_from_user(), provided those call sites get an error return and the kernel returns -EFAULT to the process instead of crashing. More specifically, the MCE handler checks the fixup handler type to decide whether an in-kernel #MC can be recovered. When EX_TYPE_UACCESS is found, the PC jumps to the recovery code specified in _ASM_EXTABLE_FAULT() and -EFAULT is returned to user space.
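The lookup described above can be modelled outside the kernel. The following is a minimal Python sketch, not kernel code: the `FixupType` enum, the `ExtableEntry` class, the `handle_kernel_mc` function and the addresses are all hypothetical names invented for illustration. It only shows the shape of the decision: find the exception-table entry for the faulting instruction, and recover only when its type marks a user access.

```python
import errno
from dataclasses import dataclass
from enum import Enum, auto


class FixupType(Enum):
    UACCESS = auto()  # stands in for EX_TYPE_UACCESS
    DEFAULT = auto()  # any type the handler does not treat as a user access


@dataclass(frozen=True)
class ExtableEntry:
    fixup_ip: int          # where execution resumes after the fault
    fixup_type: FixupType  # how the handler classifies this site


def handle_kernel_mc(extable, fault_ip):
    """Return (new_ip, retval) if the in-kernel #MC is recoverable, else None."""
    entry = extable.get(fault_ip)
    if entry is None or entry.fixup_type is not FixupType.UACCESS:
        return None  # no recoverable context; a real kernel would panic here
    return entry.fixup_ip, -errno.EFAULT


# Hypothetical instruction addresses, chosen only for illustration.
extable = {
    0x1000: ExtableEntry(fixup_ip=0x1040, fixup_type=FixupType.UACCESS),
    0x2000: ExtableEntry(fixup_ip=0x2040, fixup_type=FixupType.DEFAULT),
}

recovered = handle_kernel_mc(extable, 0x1000)
not_recovered = handle_kernel_mc(extable, 0x2000)
```

In this model `recovered` carries the fixup address and the negative errno, while `not_recovered` is `None` because the entry is not tagged as a user access.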

For instr case:

If poison is found during an instruction fetch in user space, full recovery is possible: the user process takes a #PF, and Linux allocates a new page and fills it by reading from storage.
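As a rough illustration of why this case is fully recoverable, here is a toy Python model of a file-backed text page. The `PageCache` class and its method names are hypothetical and do not correspond to kernel interfaces. The point is only that the on-disk copy stays intact, so dropping the poisoned in-memory copy and faulting the page back in restores the same bytes.

```python
class PageCache:
    """Toy page cache for a file-backed, read-only text mapping."""

    def __init__(self, storage):
        self.storage = storage        # page index -> bytes on disk
        self.resident = {}            # page index -> bytes in memory
        self.reads_from_storage = 0   # how many times storage was read

    def fault_in(self, index):
        """Model a #PF: allocate a fresh page and fill it from storage."""
        self.resident[index] = bytes(self.storage[index])
        self.reads_from_storage += 1
        return self.resident[index]

    def fetch(self, index):
        """Model an instruction fetch: use the resident page or fault it in."""
        if index not in self.resident:
            return self.fault_in(index)
        return self.resident[index]

    def offline_poisoned(self, index):
        """Drop the poisoned in-memory copy; the on-disk copy is untouched."""
        self.resident.pop(index, None)


cache = PageCache({0: b"\x90\xc3"})
first = cache.fetch(0)       # initial fault-in
cache.offline_poisoned(0)    # poison detected, page dropped
second = cache.fetch(0)      # refault reads the same bytes from storage
```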

3. What actually happens and why

For copyin case: kernel panic since v5.17

Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup type for copy-from-user operations, changing it from EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. This breaks the previous EX_TYPE_UACCESS handling when poison is found in get_user() or copy_from_user().
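The regression can be illustrated with a small Python model. This is a sketch under stated assumptions, not the kernel implementation: the advisory text does not show how is_copy_from_user() works, so the address-based check below, the `USER_ADDR_LIMIT` constant and both function names are assumptions made for illustration. It contrasts a check keyed on the fixup type with a check keyed on the context of the access.

```python
from enum import Enum, auto


class FixupType(Enum):
    UACCESS = auto()     # stands in for EX_TYPE_UACCESS
    EFAULT_REG = auto()  # stands in for EX_TYPE_EFAULT_REG


# Assumed user/kernel address split (x86-64 style), for illustration only.
USER_ADDR_LIMIT = 1 << 47


def recoverable_by_fixup_type(fixup_type):
    """Type-based check: only EX_TYPE_UACCESS counts as copy-from-user."""
    return fixup_type is FixupType.UACCESS


def recoverable_by_source_address(src_addr):
    """Assumed context check: was the faulting read from a user address?"""
    return 0 <= src_addr < USER_ADDR_LIMIT


# A copy-from-user site after the fixup type was changed to EFAULT_REG,
# reading from a hypothetical user-space address.
site_type = FixupType.EFAULT_REG
user_src = 0x7F00_0000_1000

old_check = recoverable_by_fixup_type(site_type)        # False: treated as fatal
context_check = recoverable_by_source_address(user_src)  # True: recoverable
```

The type-based check misses the site because its tag changed, while a check based on the access itself is unaffected by how the site is tagged.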

For instr case: user process is killed by a SIGBUS signal due to #CMCI and #MCE race

When an uncorrected memory error is consumed, there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting an SRAR signature machine check when the data is about to be consumed.

Background: why UNcorrected errors are tied to CMCI on Intel platforms [1]

Prior to Icelake, memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill, because it is not an urgent problem when no core is on the verge of consuming that bad data. It was also found that multiple SRAO UCEs may cause nested MCE interrupts and finally become an IERR.

Hence, Intel downgraded the machine check bank signature of patrol scrub from SRAO to UCNA (Uncorrected, No Action required), and changed the signal to #CMCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UCNA signature name.

Background: why #CMCI and #MCE race when poison is consumed on Intel platforms [1]

Having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read.

Thus:

1) Core is clever and thinks address A is needed soon, issues a speculative read.

2) Core finds it is going to use address A soon after sending the read request.

3) The CMCI from the memory controller is in a race with MCE from the core that will soon try to retire the load from address A.

Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison).
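The two possible orderings can be replayed with a small Python model. It is only a sketch: the `run` function and the trace strings are hypothetical, and because the text is truncated before it explains the exact path to SIGBUS, the model stops at the page state each handler observes.

```python
def run(events):
    """Replay 'cmci' and 'mce' events in order; return the trace of actions."""
    poisoned = False
    trace = []
    for event in events:
        if event == "cmci":
            # UCNA path: Linux tries to offline the page.
            if poisoned:
                trace.append("cmci: page already offlined")
            else:
                poisoned = True
                trace.append("cmci: page offlined")
        elif event == "mce":
            # SRAR path: the core is about to consume the data.
            if poisoned:
                trace.append("mce: page already poisoned")
            else:
                poisoned = True
                trace.append("mce: page offlined")
        else:
            raise ValueError(f"unknown event: {event}")
    return trace


cmci_first = run(["cmci", "mce"])
mce_first = run(["mce", "cmci"])
```

When the CMCI wins the race, the machine check handler finds a page that has already been marked as poison, which is the situation the text describes as happening quite often.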

Why user process is killed for instr case

Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not ---truncated---

JSON

{
  "affected": [],
  "aliases": [
    "CVE-2025-39989"
  ],
  "database_specific": {
    "cwe_ids": [],
    "github_reviewed": false,
    "github_reviewed_at": null,
    "nvd_published_at": "2025-04-18T07:15:44Z",
    "severity": null
  },
  "details": "In the Linux kernel, the following vulnerability has been resolved:\n\nx86/mce: use is_copy_from_user() to determine copy-from-user context\n\nPatch series \"mm/hwpoison: Fix regressions in memory failure handling\",\nv4.\n\n## 1. What am I trying to do:\n\nThis patchset resolves two critical regressions related to memory failure\nhandling that have appeared in the upstream kernel since version 5.17, as\ncompared to 5.10 LTS.\n\n    - copyin case: poison found in user page while kernel copying from user space\n    - instr case: poison found while instruction fetching in user space\n\n## 2. What is the expected outcome and why\n\n- For copyin case:\n\nKernel can recover from poison found where kernel is doing get_user() or\ncopy_from_user() if those places get an error return and the kernel return\n-EFAULT to the process instead of crashing.  More specifily, MCE handler\nchecks the fixup handler type to decide whether an in kernel #MC can be\nrecovered.  When EX_TYPE_UACCESS is found, the PC jumps to recovery code\nspecified in _ASM_EXTABLE_FAULT() and return a -EFAULT to user space.\n\n- For instr case:\n\nIf a poison found while instruction fetching in user space, full recovery\nis possible.  User process takes #PF, Linux allocates a new page and fills\nby reading from storage.\n\n\n## 3. What actually happens and why\n\n- For copyin case: kernel panic since v5.17\n\nCommit 4c132d1d844a (\"x86/futex: Remove .fixup usage\") introduced a new\nextable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the\nextable fixup type for copy-from-user operations, changing it from\nEX_TYPE_UACCESS to EX_TYPE_EFAULT_REG.  
It breaks previous EX_TYPE_UACCESS\nhandling when posion found in get_user() or copy_from_user().\n\n- For instr case: user process is killed by a SIGBUS signal due to #CMCI\n  and #MCE race\n\nWhen an uncorrected memory error is consumed there is a race between the\nCMCI from the memory controller reporting an uncorrected error with a UCNA\nsignature, and the core reporting and SRAR signature machine check when\nthe data is about to be consumed.\n\n### Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1]\n\nPrior to Icelake memory controllers reported patrol scrub events that\ndetected a previously unseen uncorrected error in memory by signaling a\nbroadcast machine check with an SRAO (Software Recoverable Action\nOptional) signature in the machine check bank.  This was overkill because\nit\u0027s not an urgent problem that no core is on the verge of consuming that\nbad data.  It\u0027s also found that multi SRAO UCE may cause nested MCE\ninterrupts and finally become an IERR.\n\nHence, Intel downgrades the machine check bank signature of patrol scrub\nfrom SRAO to UCNA (Uncorrected, No Action required), and signal changed to\n#CMCI.  Just to add to the confusion, Linux does take an action (in\nuc_decode_notifier()) to try to offline the page despite the UC*NA*\nsignature name.\n\n### Background: why #CMCI and #MCE race when poison is consuming in\n    Intel platform [1]\n\nHaving decided that CMCI/UCNA is the best action for patrol scrub errors,\nthe memory controller uses it for reads too.  But the memory controller is\nexecuting asynchronously from the core, and can\u0027t tell the difference\nbetween a \"real\" read and a speculative read.  
So it will do CMCI/UCNA if\nan error is found in any read.\n\nThus:\n\n1) Core is clever and thinks address A is needed soon, issues a\n   speculative read.\n\n2) Core finds it is going to use address A soon after sending the read\n   request\n\n3) The CMCI from the memory controller is in a race with MCE from the\n   core that will soon try to retire the load from address A.\n\nQuite often (because speculation has got better) the CMCI from the memory\ncontroller is delivered before the core is committed to the instruction\nreading address A, so the interrupt is taken, and Linux offlines the page\n(marking it as poison).\n\n\n## Why user process is killed for instr case\n\nCommit 046545a661af (\"mm/hwpoison: fix error page recovered but reported\n\"not\n---truncated---",
  "id": "GHSA-vwg7-hhf5-ff3g",
  "modified": "2025-05-02T09:30:30Z",
  "published": "2025-04-18T15:31:38Z",
  "references": [
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2025-39989"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/0b8388e97ba6a8c033f9a8b5565af41af07f9345"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/1a15bb8303b6b104e78028b6c68f76a0d4562134"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/3e3d8169c0950a0b3cd5105f6403a78350dcac80"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/449413da90a337f343cc5a73070cbd68e92e8a54"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/5724654a084f701dc64b08d34a0e800f22f0e6e4"
    }
  ],
  "schema_version": "1.4.0",
  "severity": []
}
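The record above can be consumed programmatically. The following Python sketch parses a trimmed copy of it with the standard `json` module and extracts the CVE alias and the stable-tree commit hashes from the reference URLs. The `stable_commits` helper is a hypothetical name, and only a subset of the references is reproduced to keep the example short.

```python
import json

# Trimmed copy of the advisory record; only the fields used here are kept.
record_text = """
{
  "id": "GHSA-vwg7-hhf5-ff3g",
  "aliases": ["CVE-2025-39989"],
  "references": [
    {"type": "ADVISORY",
     "url": "https://nvd.nist.gov/vuln/detail/CVE-2025-39989"},
    {"type": "WEB",
     "url": "https://git.kernel.org/stable/c/0b8388e97ba6a8c033f9a8b5565af41af07f9345"},
    {"type": "WEB",
     "url": "https://git.kernel.org/stable/c/1a15bb8303b6b104e78028b6c68f76a0d4562134"}
  ]
}
"""

STABLE_PREFIX = "https://git.kernel.org/stable/c/"


def stable_commits(record):
    """Return the commit hashes of all stable-tree reference URLs."""
    return [
        ref["url"][len(STABLE_PREFIX):]
        for ref in record["references"]
        if ref["url"].startswith(STABLE_PREFIX)
    ]


record = json.loads(record_text)
aliases = record["aliases"]
commits = stable_commits(record)
```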

CVE-2025-39989 (GCVE-0-2025-39989)

Vulnerability from cvelistv5

Published: 2025-04-18 07:01

Modified: 2025-05-26 05:25

Severity ?


References

URL

Tags

	https://git.kernel.org/stable/c/5724654a084f701dc64b08d34a0e800f22f0e6e4
	https://git.kernel.org/stable/c/3e3d8169c0950a0b3cd5105f6403a78350dcac80
	https://git.kernel.org/stable/c/449413da90a337f343cc5a73070cbd68e92e8a54
	https://git.kernel.org/stable/c/0b8388e97ba6a8c033f9a8b5565af41af07f9345
	https://git.kernel.org/stable/c/1a15bb8303b6b104e78028b6c68f76a0d4562134

Impacted products

Vendor

Product

Version

Linux

Version: 4c132d1d844a53fc4e4b5c34e36ef10d6124b783
Version: 4c132d1d844a53fc4e4b5c34e36ef10d6124b783
Version: 4c132d1d844a53fc4e4b5c34e36ef10d6124b783
Version: 4c132d1d844a53fc4e4b5c34e36ef10d6124b783
Version: 4c132d1d844a53fc4e4b5c34e36ef10d6124b783
Version: 88eded8104d2ca0429703755dd250f8cbecc1447