Re: [chromium-dev] Debugging bad Callbacks

274 views
Skip to first unread message

Daniel Cheng

unread,
May 23, 2024, 12:13:05 AM5/23/24
to Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, [email protected]
I think this was less problematic before we implemented BRP, as we typically wouldn't crash before we invoked the bound functor. It's worth considering if there's some way to improve the debuggability of these reports, as I do agree it's not great. I'm just hoping we can do it without blowing up memory use or code size too much.

For example, if we somehow knew the concrete type of the object, would that help? Could we base::debug::Alias the remains of the object on the stack? I'm guessing even the vtable would be gone by the (currently irreversible) zapping though...

Daniel

On Wed, 22 May 2024 at 14:08, Ian Barkley-Yeung <[email protected]> wrote:
Yes, I can add a "just for this crash" debugging parameter. 

My concern is that I've seen this type of issue -- there's a callback, it's bad in some way, we have no idea which callback it is -- before. Each time, we have to add "just for this crash" debugging code and then get it into a dev release and then wait until we get enough adoption of the new dev release before we can even start fixing the real bug, increasing the chances that we ship something buggy to the end user. Perhaps because I'm on a stability gardening rotation, I see more of the "we don't know who to assign this to" problems than other people do. Do other stability gardener types see similar issues?


On Wednesday, May 22, 2024 at 1:52:02 PM UTC-7 Joe Mason wrote:
BindStateBase is VERY performance (and size) sensitive, since so many callbacks are created and passed around. If there's really common debug info missing, it might be worth adding more, but it'll be a heavy lift.

For this particular crash, since you know that the callback is passed to SetOnAborted, you could save some debug info (eg. a base::debug::StackTrace and/or base::debug:TaskTrace) in a member variable alongside the callback, instead of augmenting the callback. Then use base::Alias on those member variables in the stack frame that calls "on_aborted_.Run()". 

On Wed, May 22, 2024 at 4:30 PM Joe Mason <[email protected]> wrote:
+scheduler-dev, where the experts on task tracing hang out.

On Wed, May 22, 2024 at 4:21 PM Ian Barkley-Yeung <[email protected]> wrote:
We're trying to research a bug (https://g-issues.chromium.org/issues/335902543) where a OnceCallback has a bad pointer and is crashing.

The issue is that the call site doesn't tell us who owns the callback. It's one of those common places that many pieces of code hook into, so it could be any one of dozens of callbacks that has the problem. 

I've tried examining the stack variables using lldb, but unfortunately, I don't get any useful information about which function the callback will eventually resolve to. Minidumps don't save heap memory, so the contents of the BindState are not recorded.

First question: Does anyone have any ideas on how to debug this crash further? Is there a way I'm not seeing to figure out which piece of code has the issue with the invalid pointer?

Second question: Assuming no one has better ideas on debugging the problem.... I'm thinking of trying to improve the debuggability of these types of issues. (This isn't the first one I've seen.)

In particular, I'd like to move the RETURN_ADDRESS() macro from location.cc into a header, and then have base::BindOnce and base::BindRepeating store the results of RETURN_ADDRESS() into a const void * in BindStateBase, call it BindStateBase::creation_address_. Then, in base::OnceCallback::Run and base::RepeatingCallback::Run, copy the creation_address_ into a local variable and use base::debug::Alias to ensure it's visible on the stack in a minidump. It's annoying but I can then manually symbolize the creation address to the creation function.

Thoughts?

(Side note: Why not a full Location object? Because the normal trick of having a defaulted parameter Location loc = FROM_HERE at the end of a function's parameter list doesn't work with base::BindOnce / base::BindRepeating since they use variadic templates and I don't think you can have a defaulted parameter in a variadic template function. And I'm not willing to do an LSC to change every single call to base::BindOnce/base::BindRepeating to have a FROM_HERE at the beginning. If someone with more template knowledge than me knows of a clever way to accomplish this without requiring every call site to change, please let me know!)

--
--
Chromium Developers mailing list: [email protected]
View archives, change email options, or unsubscribe:
https://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/16e70588-f302-4607-9c0c-9030e452df59n%40chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/scheduler-dev/2628b2c4-a4c3-433d-a8b5-ada611d25687n%40chromium.org.

Daniel Cheng

unread,
May 23, 2024, 1:20:47 AM5/23/24
to Ian Barkley-Yeung, Chromium-dev, Dave Tapuska, Joe Mason, scheduler-dev, [email protected]
I don't really want to alias the entire BindState, and we certainly can't do it on every invocation. It could be very large, and it would introduce more bloat in the BindState specializations, which already take up a non-trivial % of the Chrome binary.

I was suggesting memcpying the dangling object so we could at least figure out what type of object it was. However, that doesn't work today because PartitionAlloc zaps an object and overwrites the entire thing with a pattern, *including* the vtable pointers. The pattern was intentionally chosen to try to maximize the chances of crashing with bad data/bad deref when using freed objects.

The other problem here, of course, is that the object has already been destroyed, so the vtable entries would point to the base class vtables. So we'd have to somehow save them before the dtor started running.

Daniel

On Wed, 22 May 2024 at 15:15, Ian Barkley-Yeung <[email protected]> wrote:
> The call stack shows AnimateGroupedChildExpandedCollapse:$_0 on the stack, and those pass raw_ptr... so you'll get the dangling ptr checks.

I'm not sure I believe it though. go/crash marks those frames as having been code-folded (ICF). So any callback with a similar signature (a raw pointer and a weak pointer) will go through that code address. Is it possible that's the source of the problem? Yes. Is it certain? No. And I've had other bugs from previous stability gardening rotations where I believed those code-folded signatures and assigned the crash to the wrong team -- it wasted a week. 

For example, if we somehow knew the concrete type of the object, would that help? Could we base::debug::Alias the remains of the object on the stack? I'm guessing even the vtable would be gone by the (currently irreversible) zapping though...

My understanding is that vtable pointers point to read-only data sections in the ELF? So it might be able to understand a vtable pointer.

I don't think BindState is copyable, but yes, one could memcpy the whole thing into a local char array of the correct size in one of the Invoker functions. I think it would make more sense to only copy the BindState::functor_ which gives you address the callback wanted to call. But that also does impose some cost on each call to Once/RepeatingCallback::Run. (And it's even more difficult to interpret.) Still, it should also work.

On Wednesday, May 22, 2024 at 2:39:51 PM UTC-7 Dave Tapuska wrote:
Have you looked at:


The call stack shows AnimateGroupedChildExpandedCollapse:$_0 on the stack, and those pass raw_ptr... so you'll get the dangling ptr checks.

dave.

Arthur Sonzogni

unread,
May 23, 2024, 4:53:45 PM5/23/24
to [email protected], Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, [email protected], Daniel Cheng
I strongly agree with what Dave found in StackTrace. Thanks!
If the WeakPtr to the parent is gone, it means the pointer to the child is dangling.

I am going to propose a patch fixing this.

What caused the author to use this unfortunate detour is because we don't have a `View::GetWeakPtr()` function. Maybe we should? Given the importance of this class, I'm guessing there must be some push back?
Arthur @arthursonzogni


On Thu, May 23, 2024 at 12:36 AM Dave Tapuska <[email protected]> wrote:
Right but I thought the signature would need to include an AshNotificationView (so I'd expect the closures in this method (and the other 5 in this class to match)).. Either way this code is funky..

The "if (parent)" would never execute because it owns collapsed_summary_view_ and main_view_ which would fail the dangling raw_ptr checks, I think those should actually be removed as args on the callbacks and just access them directly when "if (parent)" is true. 

dave.

On Wed, May 22, 2024 at 6:15 PM Ian Barkley-Yeung <[email protected]> wrote:
> The call stack shows AnimateGroupedChildExpandedCollapse:$_0 on the stack, and those pass raw_ptr... so you'll get the dangling ptr checks.

I'm not sure I believe it though. go/crash marks those frames as having been code-folded (ICF). So any callback with a similar signature (a raw pointer and a weak pointer) will go through that code address. Is it possible that's the source of the problem? Yes. Is it certain? No. And I've had other bugs from previous stability gardening rotations where I believed those code-folded signatures and assigned the crash to the wrong team -- it wasted a week. 

For example, if we somehow knew the concrete type of the object, would that help? Could we base::debug::Alias the remains of the object on the stack? I'm guessing even the vtable would be gone by the (currently irreversible) zapping though...

My understanding is that vtable pointers point to read-only data sections in the ELF? So it might be able to understand a vtable pointer.

I don't think BindState is copyable, but yes, one could memcpy the whole thing into a local char array of the correct size in one of the Invoker functions. I think it would make more sense to only copy the BindState::functor_ which gives you address the callback wanted to call. But that also does impose some cost on each call to Once/RepeatingCallback::Run. (And it's even more difficult to interpret.) Still, it should also work.

On Wednesday, May 22, 2024 at 2:39:51 PM UTC-7 Dave Tapuska wrote:
Have you looked at:


The call stack shows AnimateGroupedChildExpandedCollapse:$_0 on the stack, and those pass raw_ptr... so you'll get the dangling ptr checks.

dave.

On Wed, May 22, 2024 at 5:13 PM Daniel Cheng <[email protected]> wrote:

--
--
Chromium Developers mailing list: [email protected]
View archives, change email options, or unsubscribe:
https://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
Reply all
Reply to author
Forward
0 new messages