ghsa-v9hh-vwqw-fc9h
Vulnerability from GitHub
In the Linux kernel, the following vulnerability has been resolved:
btrfs: adjust subpage bit start based on sectorsize
When running machines with a 64k page size and a 16k nodesize we started seeing tree log corruption in production. This turned out to be because we sometimes failed to write out dirty blocks, so this in fact affects all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the number of sectors that an EB has. So in this case, we have a 64k pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4 bits for every node. With a 64k page size we end up with 4 nodes per page.
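The arithmetic in this paragraph can be checked with a minimal sketch. The helper names below are hypothetical and are not kernel functions; they only restate the divisions implied by the text.

```c
#include <stdint.h>

/* Number of bitmap bits (sectors) covered by one extent buffer. */
static uint32_t sectors_per_node(uint32_t nodesize, uint32_t sectorsize)
{
    return nodesize / sectorsize;
}

/* Number of extent buffers that fit in one page (folio). */
static uint32_t nodes_per_page(uint32_t pagesize, uint32_t nodesize)
{
    return pagesize / nodesize;
}

/* Total bitmap bits for one page: one bit per sector. */
static uint32_t bits_per_page(uint32_t pagesize, uint32_t sectorsize)
{
    return pagesize / sectorsize;
}
```

With a 64k page, 16k nodesize and 4k sectorsize these give 4 bits per node, 4 nodes per page and 16 bits per page.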
To make this easier this is how everything looks
[0         16k       32k       48k     ] logical address
[0         4         8         12      ] radix tree offset
[               64k page               ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | | | | | | | | | | | | | | ] bitmap
Now we base all of our addressing on fs_info->sectorsize_bits, so, as you can see above, an eb->start of 16k turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start += sectors_per_node, because if we start at bit 0, the next bit for the next eb is 4, to correspond to eb->start 16k.
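As a rough illustration of why the bit for the next eb is 4, the sketch below computes the radix tree index of the n-th extent buffer in a folio by shifting its start address by sectorsize_bits. The function name and its parameters are assumptions made for this example, not kernel API.

```c
#include <stdint.h>

/*
 * Hypothetical helper: radix tree index of the n-th extent buffer in a
 * folio, following the text's rule that addressing is based on
 * sectorsize_bits.
 */
static uint64_t eb_radix_index(uint64_t folio_start, unsigned int n,
                               uint32_t nodesize, unsigned int sectorsize_bits)
{
    uint64_t eb_start = folio_start + (uint64_t)n * nodesize;

    return eb_start >> sectorsize_bits;
}
```

For folio_start 0, a 16k nodesize and sectorsize_bits of 12, the four extent buffers land on indexes 0, 4, 8 and 12, matching the diagram.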
However, if our range is clean, we will do bit_start++, which puts us out of alignment with our radix tree entries.
In our case, assume that the first time we check the bitmap the block is not dirty. We increment bit_start so it is now 1, then loop around and check again. This time it is dirty, and we compute the start using the following equation:
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we look up index 1 in the radix tree and find no eb. Worse, we are now at bit_start == 1, so bit_start += sectors_per_node yields 5. If that eb is dirty we hit the same problem: we look up an offset that is not populated in the radix tree, and we end up skipping the writeout of dirty extent buffers.
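The miscalculation can be reproduced with a small sketch. It assumes a folio starting at address 0 that is fully populated with extent buffers, so an eb exists only at indexes that are multiples of sectors_per_node. Both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Radix index looked up when bit_start is treated as the start of an eb. */
static uint64_t lookup_index(uint64_t folio_start, unsigned int bit_start,
                             uint32_t sectorsize, unsigned int sectorsize_bits)
{
    uint64_t start = folio_start + (uint64_t)bit_start * sectorsize;

    return start >> sectorsize_bits;
}

/*
 * Model assumption: folio_start is 0 and every node slot holds an eb,
 * so only multiples of sectors_per_node are populated in the radix tree.
 */
static bool index_has_eb(uint64_t index, unsigned int sectors_per_node)
{
    return index % sectors_per_node == 0;
}
```

With bit_start values of 1 and 5 the lookups target indexes 1 and 5, neither of which holds an eb in this model, while an aligned bit_start of 4 does.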
The best fix for this is to not use sectorsize_bits to address nodes, but that's a larger change. Since this is a filesystem corruption problem, fix it simply by always using sectors_per_node to increment the start bit.
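To compare the two increments, here is a simplified, single-threaded model of the scan. It is a sketch and not the kernel code: the function name, the 16-bit bitmap, and the two-snapshot way of modelling an eb that becomes dirty between the first and second check are all assumptions made for illustration.

```c
#include <stdint.h>

#define SECTORS_PER_PAGE 16u /* 64k page / 4k sector */
#define SECTORS_PER_NODE 4u  /* 16k node / 4k sector */

/*
 * first_look is the dirty bitmap seen by the first check, later_looks the
 * bitmap seen by every following check. clean_step is how far bit_start
 * advances over a clean bit: 1 models the old behaviour,
 * SECTORS_PER_NODE models the fix. Returns a mask with bit n set when
 * the eb at radix index n was found and written.
 */
static uint16_t scan_folio(uint16_t first_look, uint16_t later_looks,
                           unsigned int clean_step)
{
    uint16_t written = 0;
    unsigned int bit_start = 0;
    int first = 1;

    while (bit_start < SECTORS_PER_PAGE) {
        uint16_t dirty = first ? first_look : later_looks;
        unsigned int index;

        first = 0;
        if (!(dirty & (1u << bit_start))) {
            bit_start += clean_step;
            continue;
        }
        index = bit_start;            /* folio_start is 0 in this model */
        bit_start += SECTORS_PER_NODE;
        if (index % SECTORS_PER_NODE != 0)
            continue;                 /* radix lookup misses, eb skipped */
        written |= (uint16_t)(1u << index);
    }
    return written;
}
```

In this model, if the first check sees a clean bitmap and later checks see the first two extent buffers dirty (0x00FF), a clean step of 1 writes nothing, while a clean step of SECTORS_PER_NODE writes the eb at index 4. The eb at index 0 was clean when it was checked, so it is left for a later pass in both cases.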
{
  "affected": [],
  "aliases": ["CVE-2025-37931"],
  "database_specific": {
    "cwe_ids": [],
    "github_reviewed": false,
    "github_reviewed_at": null,
    "nvd_published_at": "2025-05-20T16:15:29Z",
    "severity": null
  },
  "id": "GHSA-v9hh-vwqw-fc9h",
  "modified": "2025-05-20T18:30:55Z",
  "published": "2025-05-20T18:30:55Z",
  "references": [
    { "type": "ADVISORY", "url": "https://nvd.nist.gov/vuln/detail/CVE-2025-37931" },
    { "type": "WEB", "url": "https://git.kernel.org/stable/c/396f4002710030ea1cfd4c789ebaf0a6969ab34f" },
    { "type": "WEB", "url": "https://git.kernel.org/stable/c/b80db09b614cb7edec5bada1bc7c7b0eb3b453ea" },
    { "type": "WEB", "url": "https://git.kernel.org/stable/c/e08e49d986f82c30f42ad0ed43ebbede1e1e3739" }
  ],
  "schema_version": "1.4.0",
  "severity": []
}