Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assertion failed, "self._bus is None", bluez backend on Linux #641

Closed
egnor opened this issue Sep 21, 2021 · 7 comments · Fixed by #663
Closed

assertion failed, "self._bus is None", bluez backend on Linux #641

egnor opened this issue Sep 21, 2021 · 7 comments · Fixed by #663
Labels
Backend: BlueZ Issues and PRs relating to the BlueZ backend bug Something isn't working

Comments

@egnor
Copy link

egnor commented Sep 21, 2021

  • bleak version: (github head)
  • Python version: 3.7.3
  • Operating System: Raspberry Pi OS "buster"
  • BlueZ version (bluetoothctl -v) in case of Linux: 5.61

Description

Running a stress test that consists of repeated discovery/connect/transmit cycles, after a couple hours I see this message, which seems to happen when closing a connection (BleakClient's __aexit__):

Traceback (most recent call last):
  ... various uninteresting bits of my code elided ...
  File "/home/pi/source/nametag/nametag/bluetooth.py", line 174, in __aexit__
    await self._exits.aclose()
  File "/usr/lib/python3.7/contextlib.py", line 593, in aclose
    await self.__aexit__(None, None, None)
  File "/usr/lib/python3.7/contextlib.py", line 652, in __aexit__
    raise exc_details[1]
  File "/usr/lib/python3.7/contextlib.py", line 635, in __aexit__
    cb_suppress = await cb(*exc_details)
  File "/usr/lib/python3.7/contextlib.py", line 539, in _exit_wrapper
    return await cm_exit(cm, exc_type, exc, tb)
  File "/home/pi/.local/share/virtualenvs/nametag-Zg6otAYu/src/bleak/bleak/backends/client.py", line 65, in __aexit__
    await self.disconnect()
  File "/home/pi/.local/share/virtualenvs/nametag-Zg6otAYu/src/bleak/bleak/backends/bluezdbus/client.py", line 480, in disconnect
    assert self._bus is None
AssertionError

What I Did

I'm talking to a BTLE device (a cheap digital nametag). Unfortunately this is not reproducible on demand, though it usually crops up (and given that it's an assertion error is fairly damaging).

Discussion

Here is the code in question with some commentary added:

    async def disconnect(self) -> bool:
        """Disconnect from the specified GATT server.
        Returns:
            Boolean representing if device is disconnected.
        Raises:
            BleakDBusError: If there was a D-Bus error
            asyncio.TimeoutError if the device was not disconnected within 10 seconds
        """
        logger.debug(f"Disconnecting ({self._device_path})")

        if self._bus is None:
            # No connection exists. Either one hasn't been created or
            # we have already called disconnect and closed the D-Bus
            # connection.
            logger.debug(f"already disconnected ({self._device_path})")
            return True

        if self._disconnecting_event:
            # another call to disconnect() is already in progress
            logger.debug(f"already in progress ({self._device_path})")
            await asyncio.wait_for(self._disconnecting_event.wait(), timeout=10)
        elif self.is_connected:  # <== IF THIS IS FALSE, THE ASSERTION WILL FAIL
            self._disconnecting_event = asyncio.Event()
            try:
                # Try to disconnect the actual device/peripheral
                reply = await self._bus.call(
                    Message(
                        destination=defs.BLUEZ_SERVICE,
                        path=self._device_path,
                        interface=defs.DEVICE_INTERFACE,
                        member="Disconnect",
                    )
                )
                assert_reply(reply)
                await asyncio.wait_for(self._disconnecting_event.wait(), timeout=10)
            finally:
                self._disconnecting_event = None

        # sanity check to make sure _cleanup_all() was triggered by the
        # "PropertiesChanged" signal handler and that it completed successfully
        assert self._bus is None   # <== THE ASSERTION IN QUESTION

        return True

Since this function returns early if self._bus is already None, if the "if self._disconnecting_event:" and "elif self.is_connected" conditions both fall through, the assertion is guaranteed to fail!

Specifically, this could happen with a server side disconnection, with a race between

  • self._properties["Connected"] being set False
  • the call to disconnect()
  • the _cleanup_all() task spawned by the property change handler

This would all be a lot more bulletproof if _cleanup_all() were synchronous so there was no intermediate state with is_connected false but _bus still non-None. This appears to be there to support _remove_signal_handlers which requires an async call to remove_match.

Assuming that all has to be that way, I'd propose

  • establish the invariant that _disconnecting_event is triggered (and removed) exactly when self._bus is set to None
  • rename it, perhaps, to _bus_removal_event to drive that point home
  • change the logic in disconnect() to check for the event (and pile on it), then check for self._bus instead of checking is_connected?
@dlech dlech added Backend: BlueZ Issues and PRs relating to the BlueZ backend bug Something isn't working labels Sep 21, 2021
@dlech
Copy link
Collaborator

dlech commented Oct 7, 2021

Great analysis of the issue, thanks for that.

* bleak version: (github head)

I'm guessing that #621 didn't fix the issue then?

Looking at the bigger picture, I've been thinking that we should probably put locks on all methods that interact with hardware. This would prevent re-entrancy issues for users with parallel tasks as well (e.g. trying to connect to a device at the same time in two separate tasks). I think these sorts of issues are a problem on all backends, not just BlueZ.

dlech added a commit that referenced this issue Oct 7, 2021
When the following sequence of events occured, `assert self._bus is None`
would be hit.

- A "Connected": False property change signal is sent over D-Bus.
- Bleak handles the event and calls the `_cleanup_all()` method.
- The `_cleanup_all()` sends a D-Bus message to unsubscribe from signals and awaits it.
- While the previous method is still awaiting, the disconnect method is called.
- This method hits an assertation because the connected property is false
  but there is still a D-Bus bus connection since `_cleanup_all()` is still using it.

This fixes the issue by ensuring that we are not in the middle of `_cleanup_all()`
before proceeding with the `disconnect()` method.

Fixes #641.
dlech added a commit that referenced this issue Oct 7, 2021
When the following sequence of events occured, `assert self._bus is None`
would be hit.

- A "Connected": False property change signal is sent over D-Bus.
- Bleak handles the event and calls the `_cleanup_all()` method.
- The `_cleanup_all()` sends a D-Bus message to unsubscribe from signals and awaits it.
- While the previous method is still awaiting, the disconnect method is called.
- This method hits an assertation because the connected property is false
  but there is still a D-Bus bus connection since `_cleanup_all()` is still using it.

This fixes the issue by making `_cleanup_all()` a syncronous method.
Since we are closing the bus connection, we should not need to remove
signal watchers first.

Fixes #641.
@dlech
Copy link
Collaborator

dlech commented Oct 7, 2021

Hi @egnor. I've make a couple of pull requests with possible solutions. Could you please run your stress test on them?

@egnor
Copy link
Author

egnor commented Oct 8, 2021

Alas, as described in #631, this particular project moved to using a different stack, so it's not trivial for me to test back with bleak. I could dig into my git version history, but... I also just shipped off all my devices to a field deployment. (I'll get them back in a few weeks.)

So probably just assume it's fixed until you see more complaints (from me or anyone else)? I think more synchronization here is absolutely a good move for all the reasons you mention.

(Oh and yes, #621 hadn't helped -- I'd moved to head specifically to try that.)

Thanks for all your work on this!! 🙏

@egnor egnor closed this as completed Oct 8, 2021
dlech added a commit that referenced this issue Oct 8, 2021
When the following sequence of events occured, `assert self._bus is None`
would be hit.

- A "Connected": False property change signal is sent over D-Bus.
- Bleak handles the event and calls the `_cleanup_all()` method.
- The `_cleanup_all()` sends a D-Bus message to unsubscribe from signals and awaits it.
- While the previous method is still awaiting, the disconnect method is called.
- This method hits an assertation because the connected property is false
  but there is still a D-Bus bus connection since `_cleanup_all()` is still using it.

This fixes the issue by making `_cleanup_all()` a syncronous method.
Since we are closing the bus connection, we should not need to remove
signal watchers first.

Fixes #641.
dlech added a commit that referenced this issue Oct 8, 2021
When the following sequence of events occured, `assert self._bus is None`
would be hit.

- A "Connected": False property change signal is sent over D-Bus.
- Bleak handles the event and calls the `_cleanup_all()` method.
- The `_cleanup_all()` sends a D-Bus message to unsubscribe from signals and awaits it.
- While the previous method is still awaiting, the disconnect method is called.
- This method hits an assertation because the connected property is false
  but there is still a D-Bus bus connection since `_cleanup_all()` is still using it.

This fixes the issue by making `_cleanup_all()` a syncronous method.
Since we are closing the bus connection, we should not need to remove
signal watchers first.

Fixes #641.
dlech added a commit that referenced this issue Oct 8, 2021
When the following sequence of events occured, `assert self._bus is None`
would be hit.

- A "Connected": False property change signal is sent over D-Bus.
- Bleak handles the event and calls the `_cleanup_all()` method.
- The `_cleanup_all()` sends a D-Bus message to unsubscribe from signals and awaits it.
- While the previous method is still awaiting, the disconnect method is called.
- This method hits an assertation because the connected property is false
  but there is still a D-Bus bus connection since `_cleanup_all()` is still using it.

This fixes the issue by making `_cleanup_all()` a syncronous method.
Since we are closing the bus connection, we should not need to remove
signal watchers first.

Fixes #641.
dlech added a commit that referenced this issue Oct 8, 2021
When the following sequence of events occured, `assert self._bus is None`
would be hit.

- A "Connected": False property change signal is sent over D-Bus.
- Bleak handles the event and calls the `_cleanup_all()` method.
- The `_cleanup_all()` sends a D-Bus message to unsubscribe from signals and awaits it.
- While the previous method is still awaiting, the disconnect method is called.
- This method hits an assertation because the connected property is false
  but there is still a D-Bus bus connection since `_cleanup_all()` is still using it.

This fixes the issue by making `_cleanup_all()` a syncronous method.
Since we are closing the bus connection, we should not need to remove
signal watchers first.

Fixes #641.
dlech added a commit that referenced this issue Oct 8, 2021
When the following sequence of events occured, `assert self._bus is None`
would be hit.

- A "Connected": False property change signal is sent over D-Bus.
- Bleak handles the event and calls the `_cleanup_all()` method.
- The `_cleanup_all()` sends a D-Bus message to unsubscribe from signals and awaits it.
- While the previous method is still awaiting, the disconnect method is called.
- This method hits an assertation because the connected property is false
  but there is still a D-Bus bus connection since `_cleanup_all()` is still using it.

This fixes the issue by making `_cleanup_all()` a syncronous method.
Since we are closing the bus connection, we should not need to remove
signal watchers first.

Fixes #641.
dlech added a commit that referenced this issue Oct 17, 2021
Added
-----

* Allow 16-bit UUID string arguments to ``get_service()`` and ``get_characteristic()``.
* Added ``register_uuids()`` to augment the uuid-to-description mapping.
* Added support for Python 3.10.
* Added ``force_indicate`` keyword argument for WinRT backend client's ``start_notify`` method. Fixes #526.
* Added python-for-android backend.

Changed
-------

* Changed from ``winrt`` dependency to ``bleak-winrt``.
* Improved error when connecting to device fails in WinRT backend. Fixes #647.
* Changed examples to use ``asyncio.run()``.
* Changed the default notify method for the WinRT backend from ``Indicate`` to ``Notify``.
* Refactored GATT error handling in WinRT backend.
* Changed Windows Bluetooth packet capture instructions. Fixes #653.
* Replaced usage of deprecated ``@abc.abstractproperty``.
* Use ``asyncio.get_running_loop()`` instead of ``asyncio.get_event_loop()``.

Removed
-------

* Removed ``dotnet`` backend.
* Dropped support for Python 3.6.
* Removed ``use_cached`` kwarg from ``BleakClient`` ``connect()`` and ``get_services()`` methods. Fixes #646.

Fixed
-----

* Fixed unused timeout in the implementation of BleakScanner's ``find_device_by_address()`` function.
* Fixed BleakClient ignoring the ``adapter`` kwarg. Fixes #607.
* Fixed writing descriptors in WinRT backend. Fixes #615.
* Fixed race on disconnect and cleanup of BlueZ matches when device disconnects early. Fixes #603.
* Fixed memory leaks on Windows.
* Fixed protocol error code descriptions on WinRT backend. Fixes #532.
* Fixed race condition hitting assentation in BlueZ ``disconnect()`` method. Fixes #641.
* Fixed enumerating services on a device with HID service on WinRT backend. Fixes #599.
* Fixed subprocess running to check BlueZ version each time a client is created. Fixes #602.
* Fixed exception when discovering services after reconnecting in CoreBluetooth backend.
@dlech dlech mentioned this issue Oct 17, 2021
@alexanderadam
Copy link

This issue still happens when syncing with omblepy (i.e. here, here, here, here, here, here, here, here) and people kinda got used to it.

But I guess it would be great if this would be fixed and some point.

I'm not a Python developer but I can also reproduce this easily in case I can help.

@dlech
Copy link
Collaborator

dlech commented Sep 29, 2024

This issue is closed, so I would suggest starting a new issue with a reproducible test case.

@alexanderadam
Copy link

Thank you for the fast response.

Can you elaborate on the test case?

As mentioned before, I'm not a Python developer, so I can't contribute any code which would make any sense.
But I'm a omblepy user, a tool that uses bleak. And the disconnect method reliably reproduces that state for all of its users.

So I can reproduce the issue by using a device that I own with the tool that uses bleak.

@dlech
Copy link
Collaborator

dlech commented Sep 29, 2024

If you can reproduce the error with one of the example Bleak scripts or if you can create a minimum version of the code from the tool to reproduce the issue, that would be useful. Also, any logs of error stack traces, debug messages and Bluetooth packet could be useful (see troubleshooting page in the docs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backend: BlueZ Issues and PRs relating to the BlueZ backend bug Something isn't working
Projects
None yet
3 participants