Clustered reference count islands to reduce BackupRefPtr memory overhead


LeMay, Michael

Apr 30, 2023, 8:15:45 AM
Hi,

Slide 27 of the "MiraclePtr status update" deck from the Memory Safety Summit last year [1] shows that a majority of the memory overhead from BackupRefPtr is due to the padding needed to add a ref count at the end of the previous slot for each allocation. That slide was hidden, which I assume was just due to time limitations, but please let me know if there was some other reason.

Assuming that the data on that slide is still valid and reasonably up-to-date, I have a clarifying question about the graph showing wasted memory with BackupRefPtr enabled: Are the ref counts themselves included in the "waste" that is shown, or are they counted as "requested memory?"

Regardless, I'm wondering whether a potential alternative approach of clustering ref counts into "islands" of contiguous ref counts has been evaluated. It would look something like this:

+---------+---------+---------+---------++-----------+-----------+-----------+-----------++---------+--
| Alloc 0 | Alloc 1 | Alloc 2 | Alloc 3 || Ref Cnt 0 | Ref Cnt 1 | Ref Cnt 2 | Ref Cnt 3 || Alloc 4 | ...
+---------+---------+---------+---------++-----------+-----------+-----------+-----------++---------+--

I illustrated a cluster of four ref counts, since they're each four bytes and PartitionAlloc uses 16-byte alignment. This would avoid imposing any additional padding overheads.

On the other hand, I realize that this has a variety of drawbacks compared to the current approach, such as:
1. Some previous slots already have enough padding to fully contain the ref count for the next slot, and this approach would not reclaim that space.
2. Computing the location of the ref count for each allocation would become more complicated.
3. Computing the bounds of each slot for bounds-checking [2-4] would become more complicated.
4. Cache locality between an allocation and its associated ref count could drop for some allocations. Conversely, reducing padding overheads would help to improve cacheability overall.

Best regards,
Michael LeMay

[1] "MiraclePtr status update"; https://docs.google.com/presentation/d/1beobMsVRHJZ5Q3GeKsRSxq60SMQ-Y7GxJBrWPelVooc/edit#slide=id.g1a54e9dd755_2_78
[2] "Implement operator + and - for raw_ptr<>"; https://bugs.chromium.org/p/chromium/issues/detail?id=1073933#c327
[3] "Poison end-of-allocation BackupRefPtrs"; https://bugs.chromium.org/p/chromium/issues/detail?id=1073933#c336
[4] "Tighten BackupRefPtr OOB poisoning based on types"; https://bugs.chromium.org/p/chromium/issues/detail?id=1073933#c338

Bartek Nowierski

Apr 30, 2023, 2:42:39 PM
to LeMay, Michael, Benoit Lize, Keishi Hattori, [email protected]
Hi Michael,
Thanks for your continued interest in MiraclePtr!

The slide wording is somewhat imprecise and hence could be interpreted in different ways. Judging from the rest of your email you probably understood it the way we intended, but I wanted to clarify just in case. PA operates on slots of pre-defined sizes, e.g. 224, 256, 320. Slots of the same size are grouped next to each other into slot spans (multiple system pages). If a request of size e.g. 231 comes in, we use a slot of size 256 to satisfy it, and 25B of padding is left over... and a ref-count easily fits in that padding. When a request of size e.g. 253 comes in, we don't have enough padding for the ref-count, so we have to use a slot size of 320 instead. In other words, for many requested allocation sizes we can squeeze in a ref-count for free, but when we can't it costs a lot. Given that the request size distribution isn't uniform, we hit the latter case disproportionately often.

| Are the ref counts themselves included in the "waste" that is shown, or are they counted as "requested memory?"

I don't know, but that's a good question. @Keishi Hattori or @Benoit Lize may know.

Thanks for suggesting an idea. We considered a solution that groups ref-counts together, though not exactly the island idea like yours. As you noted, it'd make it harder to calculate the ref-count location, but also the slot start for regular allocation requests. It'd also throw off the algorithm that tightly packs slots into multi-page slot spans. More importantly, it'd break AlignedAlloc, which relies on power-of-2-sized slots being power-of-2-aligned. There's also more to cache locality than you mention: if multiple ref-counts sit next to each other, they'll share a cacheline, causing it to ping-pong between processors. (We kinda have that problem already, to a much lesser extent, since the ref-count shares a cacheline with the previous slot.)


Thanks,
Bartek


--
You received this message because you are subscribed to the Google Groups "memory-safety-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/memory-safety-dev/CO1PR11MB477090C2133499B43BC593A9FA699%40CO1PR11MB4770.namprd11.prod.outlook.com.
For more options, visit https://groups.google.com/a/chromium.org/d/optout.

LeMay, Michael

May 1, 2023, 5:37:20 PM
to Bartek Nowierski, Benoit Lize, Keishi Hattori, [email protected]

| As you noted, it'd make it harder to calculate ref-count location, but also slot start for regular allocation requests. It'd also throw off the algorithm that tightly packs into multi-page slot spans. More importantly, it'd also throw off AlignedAlloc which relies on power-of-2-sized slots to be power-of-2-aligned.


What if the ref counts were packed into reserved slots at intervals that depend on the slot size? This would avoid the need to change any slot boundaries. For example, for 16-byte slots, every fifth slot could store ref counts for the preceding four slots. For 32-byte slots, every ninth slot could store ref counts for the preceding eight slots, and so forth. Of course, this would further increase the distance between a slot and its corresponding ref count in some cases, which would further reduce cache locality between data and metadata. However, some ref counts would still fit in the same cachelines as their associated data. Many ref counts would also at least fit on the same pages as their associated data.


There would probably also be some slot size threshold beyond which it would be preferable to store the ref counts in the bitmaps region of the super page to avoid needing to reserve a huge slot for ref counts.


| If there are multiple ref-count next to each, they'll share a cacheline, causing it to ping-pong between processors.


Could the memory savings from clustering ref counts help to motivate potentially enhancing PA to group allocations for a thread in a contiguous portion of each bucket when possible?


Thanks,

Michael

Chris Palmer

May 11, 2023, 12:51:29 AM
to LeMay, Michael, Bartek Nowierski, Benoit Lize, Keishi Hattori, [email protected]
Is it worth revisiting the pre-defined slot sizes, such as making them more granular?

In my idle moments I have dreamt of dynamically determining slot sizes (rounded up to the nearest multiple of 16), rather than pre-defining them. No idea how feasible that would be, nor if it would be a net win.
