8000 Fix deadlock on activity onDestroy by sagebind · Pull Request #94 · rust-mobile/android-activity · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix deadlock on activity onDestroy #94

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 30, 2023

Conversation

sagebind
Copy link
Contributor

Fix a deadlock that occurs when an activity is destroyed without process termination, such as when an activity is destroyed and recreated due to a configuration change.

The deadlock occurs because notify_destroyed blocks until destroyed is set to true. This only occurs when WaitableNativeActivityState is dropped, but the WaitableNativeActivityState instance is the very thing being used to await for destruction, resulting in a deadlock.

This doesn't really make much sense, so instead, wait for the Rust-side main thread to terminate to ensure cleanup can proceed.

@ardocrat
Copy link
8000 ardocrat commented Jul 19, 2023

For me its happening when Android Service is running.

Fix a deadlock that occurs when an activity is destroyed without process
termination, such as when an activity is destroyed and recreated due to
a configuration change.

The deadlock occurs because `notify_destroyed` blocks until `destroyed`
is set to `true`. This only occurs when `WaitableNativeActivityState` is
dropped, but the `WaitableNativeActivityState` instance is the very
thing being used to await for destruction, resulting in a deadlock.

Instead of waiting for the `WaitableNativeActivityState` to be dropped
we now wait until the native `android_main` thread has stopped.

So we can tell the difference between the thread not running because it
hasn't started or because it has finished (in case `android_main`
returns immediately) this replaces the `running` boolean with a
tri-state enum.

Co-authored-by: Robert Bragg <robert@sixbynine.org>
@rib rib force-pushed the fix-deadlock-on-ondestroy branch from 37625ec to a9e91f4 Compare July 21, 2023 19:12
@rib
Copy link
Member
rib commented Jul 21, 2023

Ah, right, oops!

Your fix generally looked good to me but one notable thing was that it looked like it could theoretically enable a different deadlock if the the spawned thread for android_main returned immediately then the wait for running to get set could block the Java main thread indefinitely.

I replaced running with a tristate enum to avoid that possibility.

Although it's redundant I've currently kept the destroyed field which is now set immediately when onDestroyed is called from the java main thread.

It would be great if you're able to double check that this updated PR works for you @sagebind and @ardocrat

@rib rib marked this pull request as ready for review July 21, 2023 19:18
@rib rib requested a review from MarijnS95 July 21, 2023 19:21
@sagebind
Copy link
Contributor Author

@rib Good catch, thanks. Your patch does look better, and I just tested it and it works just as well in my app. I wasn't totally confident in my implementation since I'm not familiar with the codebase (and new to Android in general), which is why I started with a draft PR.

@rib
Copy link
Member
rib commented Jul 22, 2023

Thanks for testing @sagebind

It'd be good to make a patch release with this fix, and to do that before landing the GameActivity 2.0.2 update

@ardocrat
Copy link

Sadly not fixed at my case.

At logs I can see deadlock happened at pthread_cond_wait(&android_app->cond, &android_app->mutex); of android_app_free function from android_native_app_glue.c file and no logs from Rust glue.rs file.

So as temporary dirty fix I am just killing my process after some time to relaunch application later without problems :)

image

@rib
Copy link
Member
rib commented Jul 26, 2023

@ardocrat ah okey, if you're seeing a reference to android_native_app_glue.c that implies you're using the game-activity backend which wouldn't be affected by the fix here currently which only touches the native-activity backend.

I guess there must be a similar bug in the game-activity backend.

@rib
Copy link
Member
rib commented Jul 26, 2023

@ardocrat I can't currently see an equivalent bug in the game-activity backend.

In the game-activity backend it blocks in android_app_free (where you see the deadlock) waiting for the native (Rust) thread to respond to onDestroy with:

    pthread_mutex_lock(&android_app->mutex);
    android_app_write_cmd(android_app, APP_CMD_DESTROY);
    while (!android_app->destroyed) {
        pthread_cond_wait(&android_app->cond, &android_app->mutex);
    }
    pthread_mutex_unlock(&android_app->mutex);

and the thing that sets android_app->destroyed is the android_app_destroy function which should get called by android_app_entry immediately after _rust_glue_entry (i.e. after the app's android_main returns):

static void* android_app_entry(void* param) {
    // <snip>

    _rust_glue_entry(android_app); // (calls android_main)

    android_app_destroy(android_app); // <- should set app->destroyed = 1 to unblock android_app_free
    return NULL;
}
static void android_app_destroy(struct android_app* android_app) {
    LOGV("android_app_destroy!");
    free_saved_state(android_app);
    pthread_mutex_lock(&android_app->mutex);

    AConfiguration_delete(android_app->config);
    android_app->destroyed = 1;
    pthread_cond_broadcast(&android_app->cond);
    pthread_mutex_unlock(&android_app->mutex);
    // Can't touch android_app object after this.
}

Are you maybe able to add a bit of logging to confirm that your applications android_main is returning?

@ardocrat
Copy link
ardocrat commented Jul 26, 2023

Thank you @rib, I am using eframe, I forgot to call frame.close() to unblock android_main, but another problem happened here, eframe.close() method is leading to call of Activity's onPause() method and its impossible to call finish() after this, cause Activity is at background, moreover its blocking app thread for 5 seconds and if I am opening app between moment of exit and this timeout (5 seconds) I am getting ANR popup to force stop the app. At logs I am seeing:

18:07:29.004  D  ************** mainWorkCallback *********
18:07:29.004  D  mainWorkCallback: cmd=1
18:07:29.004  V  android_app_destroy!
18:07:29.031  D  onPause_native
18:07:29.031  V  Pause: 0x722f923370
18:07:34.222  I  Thread[6,tid=8884,WaitingInMainSignalCatcherLoop,Thread*=0x72af9196f0,peer=0x137c0110,"Signal Catcher"]: reacting to signal 3
18:07:34.222  I  
18:07:34.742  I  Wrote stack traces to tombstoned
---------------------------- PROCESS ENDED (8874) for package com.example.android ----------------------------

I guess this problem is related to eframe's close() method, it is pausing Activity by calling this method at same time on app destroy.

@rib
Copy link
Member
rib commented Jul 26, 2023

ah, curious.

The android lifecycle model (https://developer.android.com/guide/components/activities/activity-lifecycle) will go through onPause and onStop before reaching onDestroy.

From your log I see that it looks like android_app_destroy is called before onPause_native

The android_app isn't really in a usable state after android_app_destroy.

The comment at the end of android_app_destroy actually says // Can't touch android_app object after this though since it doesn't ever technically actually free() the android_app we should at least still be able to check app->destroyed

In the NativeActivity backend we have try_with_waitable_activity_ref utility that generalizes upgrading a weak reference for the state that may get dropped when the app exits so any Activity notifications afterwards become safe noops.< 8000 /p>

It looks like the GameActivity glue is missing these kinds of checks to consider that the app may have already destroyed when an Activity native method (like onPause_native) gets called.

We can hopefully address this with a check for app->destroyed in GameActivity.cpp:onPause_native, and should probably have similar checks for other native callbacks too.

@ardocrat
Copy link

@rib At this log I tried to manually finish it before calling frame.close(). if I am not calling Activity::finish() manually, this call is coming after android_main, but leads to same problem, cause Activity can not be finished at background. eframe puts Activity at pause state after calling close().

@rib
Copy link
Member
rib commented Jul 26, 2023

looking at the eframe implementation for close() it seems to just set a flag that results in Winit setting control_flow::Exit which should cause the winit event loop to exit.

Changing exactly where you finish() won't necessarily make much difference since GameActivity_finish just sends an async message to the java main thread.

My guess at the moment is that calling eframe.close() is working as expected and resulting in your android_main then returning normally which will result in calling GameActivity_finish (when sends an async message to main thread) and then android_app_destroy gets calls which leaves the android_app in an inconsistent state.

The async finish message can then trigger onPause -> onStop -> onDestroy lifecycle events but by the time onPause runs then android_app is already in an inconsistent state, and (more importantly) the native thread for android_main has exit.

because onPause_native doesn't recognize that the app has already been destroyed it is then going to try and send a message from the java main thread to your rust/native thread (which won't work because the Rust thread has already exit).

I think actually looking at this a little closer we can't address this directly in onPause_native because the GameActivity/NativeCode class doesn't actually know anything about the implementation details of android_app in android_native_app_glue.c (it just calls the callbacks provided).

I think the right fix here is to probably make android_app->destroyed be a value we read with atomic ops (or consistently read with the android_app->mutex held) and then each android_app Java callback (like onPause) should start with a check of android_app->destroyed and immediately return if its set.

It does give me some renewed motivation to think about starting to port GameActivity to Rust though. This same issue almost certainly used to exist in the old NativeActivity backend too when it was written in C.

@rib
Copy link
Member
rib commented Jul 26, 2023

@ardocrat I wonder if you could try out this local patch to android_app_set_activity_state:

 static void android_app_set_activity_state(struct android_app* android_app,
                                            int8_t cmd) {
     pthread_mutex_lock(&android_app->mutex);
-    android_app_write_cmd(android_app, cmd);
-    while (android_app->activityState != cmd) {
-        pthread_cond_wait(&android_app->cond, &android_app->mutex);
+    if (!android_app->destroyed) {
+        android_app_write_cmd(android_app, cmd);
+        while (android_app->activityState != cmd) {
+            pthread_cond_wait(&android_app->cond, &android_app->mutex);
+        }
     }
     pthread_mutex_unlock(&android_app->mutex);
 }

that doesn't account for all callbacks but should make onPause and onStop gracefully return if the app has already exit.

You might also need this change for android_app_free to make onDestroy also act gracefully:

 static void android_app_free(struct android_app* android_app) {
     pthread_mutex_lock(&android_app->mutex);
+
+    if (!android_app->destroyed) {
+        pthread_mutex_unlock(&android_app->mutex);
+        return;
+    }
+
     android_app_write_cmd(android_app, APP_CMD_DESTROY);
     while (!android_app->destroyed) {
         pthread_cond_wait(&android_app->cond, &android_app->mutex);

@rib
Copy link
Member
rib commented Jul 26, 2023

I can also aim to push a branch that you can try out with those fixes and more for the other callbacks, but just in the middle of some other android-activity changes atm and don't want to fully switch context just yet.

It's probably also worth noting that there's this PR that's looking to update the GameActivity backend and I'll also look at porting these changes to that branch if they help: #88

@ardocrat
Copy link
ardocrat commented Jul 26, 2023

It moved further, now I can see onSurfaceDestroyed_native and NativeWindowDestroyed events at logs, but sadly only onPause was called from Activity lifecycle and it was not finished, so when I am opening back the app, I can see only black screen, without ANR (Application Not Responding) popup now.

image

@rib
Copy link
Member
rib commented Jul 27, 2023

ah, okey, yeah I though you might get away with testing by just handling android_app_set_activity_state but yeah, android_app_set_window is another api that's called by one of the java callbacks that needs a guard. I made the changes locally but will try and share a branch soon.

For android_app_set window I added:

@@ -308,6 +308,10 @@ static void android_app_set_window(struct android_app* android_app,
                                    ANativeWindow* window) {
     LOGV("android_app_set_window called");
     pthread_mutex_lock(&android_app->mutex);
+    if (!android_app->destroyed) {
+        pthread_mutex_unlock(&android_app->mutex);
+        return;
+    }
     if (android_app->pendingWindow != NULL) {
         android_app_write_cmd(android_app, APP_CMD_TERM_WINDOW);
     }

@rib
Copy link
Member
rib commented Jul 30, 2023

@ardocrat I've just created this issue to track the game-activity deadlock: #97

I'll treat it as a separate issue for now, so we can merge this fix as is first.

@rib rib merged commit a84a7b5 into rust-mobile:main Jul 30, 2023
@znakeeye
Copy link

You should have filed an issue at google. Many of us are using GameActivity via prefab so we cannot patch the android_native_app_glue.c file :(

Anyway, I filed an issue. Please upvote 👍
https://issuetracker.google.com/issues/387030986

MarijnS95 8000 added a commit that referenced this pull request Mar 12, 2025
…unning

We see that some Android callbacks like `onStart()` deadlock,
specifically when returning out of the main thread before running any
event loop (but likely also whenver terminating the event loop), because
they don't check if the thread is still even running and are otherwise
guaranteed receive an `activity_state` update or other state change to
unblock themselves.

This is a followup to [#94] which only concerned itself with a deadlock
caused by a destructor not running because that very object was kept
alive to poll on the `destroyed` field that destructor was supposed to
set, but its new `thread_state` can be reused to disable these condvar
waits when the "sending" thread has disappeard.

Separately, that PR mentions `Activity` recreates because of
configuration changes which isn't supported anyway because `Activity` is
still wrongly assumed to be a global singleton.

[#94]: https://togithub.com/rust-mobile/android-activity/pull/94
MarijnS95 added a commit to Traverse-Research/android-activity that referenced this pull request Mar 12, 2025
…unning

We see that some Android callbacks like `onStart()` deadlock,
specifically when returning out of the main thread before running any
event loop (but likely also whenver terminating the event loop), because
they don't check if the thread is still even running and are otherwise
guaranteed receive an `activity_state` update or other state change to
unblock themselves.

This is a followup to [rust-mobile#94] which only concerned itself with a deadlock
caused by a destructor not running because that very object was kept
alive to poll on the `destroyed` field that destructor was supposed to
set, but its new `thread_state` can be reused to disable these condvar
waits when the "sending" thread has disappeard.

Separately, that PR mentions `Activity` recreates because of
configuration changes which isn't supported anyway because `Activity` is
still wrongly assumed to be a global singleton.

[rust-mobile#94]: https://togithub.com/rust-mobile/android-activity/pull/94
MarijnS95 added a commit to Traverse-Research/android-activity that referenced this pull request Mar 12, 2025
…unning

We see that some Android callbacks like `onStart()` deadlock,
specifically when returning out of the main thread before running
any event loop (but likely also whenever terminating the event loop),
because they don't check if the thread is still even running and are
otherwise guaranteed receive an `activity_state` update or other state
change to unblock themselves.

This is a followup to [rust-mobile#94] which only concerned itself with a deadlock
caused by a destructor not running because that very object was kept
alive to poll on the `destroyed` field that destructor was supposed to
set, but its new `thread_state` can be reused to disable these condvar
waits when the "sending" thread has disappeared.

Separately, that PR mentions `Activity` recreates because of
configuration changes which isn't supported anyway because `Activity` is
still wrongly assumed to be a global singleton.

[rust-mobile#94]: https://togithub.com/rust-mobile/android-activity/pull/94
MarijnS95 added a commit that referenced this pull request Mar 12, 2025
…unning

We see that some Android callbacks like `onStart()` deadlock,
specifically when returning out of the main thread before running
any event loop (but likely also whenever terminating the event loop),
because they don't check if the thread is still even running and are
otherwise guaranteed receive an `activity_state` update or other state
change to unblock themselves.

This is a followup to [#94] which only concerned itself with a deadlock
caused by a destructor not running because that very object was kept
alive to poll on the `destroyed` field that destructor was supposed to
set, but its new `thread_state` can be reused to disable these condvar
waits when the "sending" thread has disappeared.

Separately, that PR mentions `Activity` recreates because of
configuration changes which isn't supported anyway because `Activity` is
still wrongly assumed to be a global singleton.

[#94]: https://togithub.com/rust-mobile/android-activity/pull/94
@MarijnS95
Copy link
Member

You should have filed an issue at google. Many of us are using GameActivity via prefab so we cannot patch the android_native_app_glue.c file :(

@znakeeye nobody here is obliged to do anything, especially given how cumbersome it is to report upstream issues. We (especially Rib) has spent enough effort as it is to get and keep everything working especially for Rust, a thank-you would have been nicer instead 😉

@rib
Copy link
Member
rib commented Mar 12, 2025

You should have filed an issue at google. Many of us are using GameActivity via prefab so we cannot patch the android_native_app_glue.c file :(

Anyway, I filed an issue. Please upvote 👍 https://issuetracker.google.com/issues/387030986

just fwiw, I have filed various upstream issues for GameActivity and sometimes they don't really get looked at for a long time so maybe I was a bit lazy at the time with not creating an upstream issue. Thanks for filing the issue though.

I'm not currently sure how it can work to use a prefab build of the native GameActivity code. android-activity has a number of patches to the native GameActivity code for integration and since we vendor a copy of their native code then if you use their prefab I would expect that to pull in two (incompatible) copies of the native integration.

I guess it's lucky the application would link with duplicate GameActivity_onCreate entrypoint symbols if using the prefab and it also sounds lucky that it somehow works at runtime. If the prefab entrypoint is used then I suppose that means it will skip over calling _rust_glue_entry and go straight to android_main which would bypass some logging and JNI initialization as well as bypass the catch_unwind() that should stop an app being able to to panic over the C/C++ FFI boundary.

In general, it's currently only really expected that the game-activity backend will work with the 2.0.2 release, and the assumption would be that the native side of GameActivity is implemented by android-activity/game-activity-csrc which is automatically built into android-activity and includes a number of integration patches (ref: https://github.com/rust-mobile/android-activity/pull/88/commits)

@rib
Copy link
Member
rib commented Mar 12, 2025

More specifically; if the prefab integration for GameActivity would be used then it would bypass this _rust_glue_entry code and jump straight to running android_main:

pub unsafe extern "C" fn _rust_glue_entry(native_app: *mut ffi::android_app) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0