Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcos_semaphore_wait_timeout() and RTC adjusting #56

Closed
mtlynch opened this issue Nov 12, 2020 · 17 comments
Closed

vcos_semaphore_wait_timeout() and RTC adjusting #56

mtlynch opened this issue Nov 12, 2020 · 17 comments
Assignees
Labels
resolution:fixed Fixed type:bug Something isn't working

Comments

@mtlynch
Copy link
Contributor

mtlynch commented Nov 12, 2020

I noticed that when I use OMX encoding, uStreamer switches to CPU encoding on any OMX encoding failure:

Nov 12 15:00:35 tinypilot ustreamer[576]: -- INFO  [18.174    stream] -- Creating pool with 3 workers ...
Nov 12 15:00:35 tinypilot ustreamer[576]: -- INFO  [18.174    stream] -- Capturing ...
Nov 12 15:01:07 tinypilot ustreamer[576]: -- ERROR [36.695  worker-2] -- Can't wait VCOS semaphore: EAGAIN (timeout)
Nov 12 15:01:07 tinypilot ustreamer[576]: -- INFO  [36.695  worker-2] -- Error while compressing buffer, falling back to CPU
Nov 12 15:01:07 tinypilot ustreamer[576]: -- INFO  [36.717    stream] -- Destroying workers pool ...

Relevant code is here:

https://github.com/pikvm/ustreamer/blob/d9b91a1d5f9f7cad8aaf031e6ef5cda59c52a431/src/encoder.c#L218L239

The permanent switch to CPU surprised me, since EAGAIN seems like only a transient error. I tried patching the code to make the CPU encoding temporary (just for the current buffer), and it seems to work as expected.

Would you be open to making uStreamer wait longer before deciding to switch permanently to CPU-only encoding?

Proposed solutions

A. Only temporarily fall back to CPU

The simplest solution is to make the fallbacks always temporary. I put together a quick implementation of this solution:

https://github.com/pikvm/ustreamer/compare/master...mtlynch:only-fallback-temporarily?expand=1

B. Fall back permanently after N consecutive failed attempts

A more advanced version of (A) would be to maintain a count of successful buffer encodings and only fail over to CPU if there have been N consecutive failed encodings with OMX.

C. Fall back temporarily after EAGAIN, permanently after EINVAL and other errors

The other way to come at it would be to distinguish between fatal and non-fatal errors from the OMX encoder. We could treat EAGAIN as non-fatal (fall back to CPU just for this buffer) and errors like EINVAL as fatal (permanently switch over to CPU).

@mdevaev
Copy link
Member

mdevaev commented Nov 12, 2020

Actually, this is a fatal error that should not occur under normal conditions. If this happens, it means something is going very wrong. There may be something wrong with your kernel or GPU configuration. Show me uname -a, /boot/cmdline and /boot/config.txt

@mtlynch
Copy link
Contributor Author

mtlynch commented Nov 12, 2020

Sure thing. It's pretty much just stock Raspberry Pi OS with the tc358473 and dwc modules loaded.

mike@tinypilot:~ $ uname -a
Linux tinypilot 5.4.72-v7l+ #1356 SMP Thu Oct 22 13:57:51 BST 2020 armv7l GNU/Linux
mike@tinypilot:~ $ cat /boot/cmdline.txt
console=serial0,115200 console=tty1 root=PARTUUID=29136d46-02 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait
mike@tinypilot:~ $ cat /boot/config.txt
# For more options and information see
# http://rpf.io/configtxt
# Some settings may impact device functionality. See link above for details

# uncomment if you get no picture on HDMI for a default "safe" mode
#hdmi_safe=1

# uncomment this if your display has a black border of unused pixels visible
# and your display can output without overscan
#disable_overscan=1

# uncomment the following to adjust overscan. Use positive numbers if console
# goes off screen, and negative if there is too much border
#overscan_left=16
#overscan_right=16
#overscan_top=16
#overscan_bottom=16

# uncomment to force a console size. By default it will be display's size minus
# overscan.
#framebuffer_width=1280
#framebuffer_height=720

# uncomment if hdmi display is not detected and composite is being output
#hdmi_force_hotplug=1

# uncomment to force a specific HDMI mode (this will force VGA)
#hdmi_group=1
#hdmi_mode=1

# uncomment to force a HDMI mode rather than DVI. This can make audio work in
# DMT (computer monitor) modes
#hdmi_drive=2

# uncomment to increase signal to HDMI, if you have interference, blanking, or
# no display
#config_hdmi_boost=4

# uncomment for composite PAL
#sdtv_mode=2

#uncomment to overclock the arm. 700 MHz is the default.
#arm_freq=800

# Uncomment some or all of these to enable the optional hardware interfaces
#dtparam=i2c_arm=on
#dtparam=i2s=on
#dtparam=spi=on

# Uncomment this to enable infrared communication.
#dtoverlay=gpio-ir,gpio_pin=17
#dtoverlay=gpio-ir-tx,gpio_pin=18

# Additional overlays and parameters are documented /boot/overlays/README

# Enable audio (loads snd_bcm2835)
dtparam=audio=on

[pi4]
# Enable DRM VC4 V3D driver on top of the dispmanx display stack
dtoverlay=vc4-fkms-v3d
max_framebuffers=2

[all]
#dtoverlay=vc4-fkms-v3d
dtoverlay=tc358743
dtoverlay=dwc2

@mdevaev
Copy link
Member

mdevaev commented Nov 12, 2020

Add gpu_mem=256 to config.txt and cma=128M to cmdline.

The name EAGAIN may be misleading. This may occur when waiting for a semaphore that indicates the end of encoding. Three seconds to encode a single image is a very, very long time. Therefore, reaching the timeout means that the hardware has entered a strange state and is unusable. Further decisions should be made by the user. So I would prefer to leave this logic unchanged.

PS: plz combine PRs about typos into one.

@mtlynch
Copy link
Contributor Author

mtlynch commented Nov 12, 2020

Add gpu_mem=256 to config.txt and cma=128M to cmdline.

Will do, thanks!

Therefore, reaching the timeout means that the hardware has entered a strange state and is unusable. Further decisions should be made by the user. So I would prefer to leave this logic unchanged.

Fair enough.

PS: plz combine PRs about typos into one.

No problem. Done: #57

@mtlynch mtlynch closed this as completed Nov 12, 2020
@mdevaev mdevaev self-assigned this Nov 12, 2020
@mtlynch
Copy link
Contributor Author

mtlynch commented Nov 13, 2020

Sorry, I'm actually still seeing this even with the changes to /boot/config.txt and /boot/cmdline.txt. I captured a more verbose log:

https://gist.github.com/mtlynch/31e4c000767ed25b45c62bea6d3c62c0

Here's just the failing worker:

Nov 12 23:43:35 tinypilot ustreamer[575]: -- VERB  [37.373  worker-2] -- Compressed new JPEG: size=153415, time=0.015, worker=2, buffer=3
Nov 12 23:43:35 tinypilot ustreamer[575]: -- DEBUG [37.373  worker-2] -- Worker 2 waiting for a new job ...
Nov 12 23:43:35 tinypilot ustreamer[575]: -- DEBUG [37.405  worker-2] -- Worker 2 compressing JPEG from buffer 1 ...
Nov 12 23:43:35 tinypilot ustreamer[575]: -- VERB  [37.405  worker-2] -- Compressing buffer 1 using OMX
Nov 12 23:43:48 tinypilot ustreamer[575]: -- ERROR [37.415  worker-2] -- Can't wait VCOS semaphore: EAGAIN (timeout)
Nov 12 23:43:48 tinypilot ustreamer[575]: -- INFO  [37.415  worker-2] -- Error while compressing buffer, falling back to CPU
Nov 12 23:43:48 tinypilot ustreamer[575]: -- DEBUG [37.415  worker-2] -- Releasing device buffer index=1 ...
Nov 12 23:43:48 tinypilot ustreamer[575]: -- VERB  [37.415  worker-2] -- Compression failed: worker=2, buffer=1
Nov 12 23:43:48 tinypilot ustreamer[575]: -- DEBUG [37.415  worker-2] -- Worker 2 waiting for a new job ...
Nov 12 23:43:48 tinypilot ustreamer[575]: -- DEBUG [37.438  worker-2] -- Bye-bye (worker 2)

I'm confused by the timing. According to the time of day time, it's 13 seconds between the start of compression and EAGAIN (23:43:35 to 23:43:48), but according to the ustreamer count, it's 10 milliseconds (37.405 to 37.415)? But neither of those matches the 3 seconds I'd expect as the semaphore wait.

@mtlynch mtlynch reopened this Nov 13, 2020
@mdevaev mdevaev added type:bug Something isn't working and removed resolution:fixed Fixed type:question User question labels Nov 13, 2020
@mdevaev
Copy link
Member

mdevaev commented Nov 13, 2020

Okay, let's get this straight. There are no miracles, it can't be that everything works for me, but for you it breaks.

  • How long does it take from the start of ustreamer to the first fail? Does it happen almost instantly or does it work for a while?

  • Give me the entire list of ustreamer parameters used.

@mdevaev
Copy link
Member

mdevaev commented Nov 13, 2020

Sorry, I didn't notice your message about the time in my sleep. I finally woke up and realized the problem. What if the system clock is weird? Are you running ntp?

@LeeNX
Copy link

LeeNX commented Nov 13, 2020

Sorry, I'm actually still seeing this even with the changes to /boot/config.txt and /boot/cmdline.txt.
@mtlynch Not to be silly, but did you reboot you R-Pi after making changes to the above files?

I wonder if there is a way to check if these values are active in the OS vs looking at the boot files on the file system? Time to Google that.

@mdevaev
Copy link
Member

mdevaev commented Nov 13, 2020

The ustreamer clock is a monotonically increasing kernel timer. I trust this value a little more than the rest of the numbers.

I think I understand. For some reason, your OS changes the system clock.

Here the code. vcos_semaphore_wait_timeout() uses clock_gettime(CLOCK_REALTIME). This is a regular system RTC that is subject to jumps. Ustreamer uses CLOCK_MONOTONIC[_RAW|_FAST] so its values don't change in a strange way.

In turn sem_timedwait() also relies on the RTC value and sometimes this this causes problems.

In my opinion, using these functions are broken by design, but it is unlikely that anyone will change this so as not to break the existing code. In our case, checking after EAGAIN that the required amount of time has actually passed would help. But this will not save you from the situation if your clock is moved to the past.

For temporary fix uncomment this line and comment the whole switch. Check that everything works after the fix. You should also deal with your NTP.

In the meantime, I'll figure out how best to fix it.

@mdevaev mdevaev changed the title uStreamer aggressively gives up on OMX encoding vcos_semaphore_wait_timeout() and RTC adjusting Nov 13, 2020
@mdevaev
Copy link
Member

mdevaev commented Nov 13, 2020

raspberrypi/userland#658

mtlynch added a commit to tiny-pilot/ansible-role-ustreamer that referenced this issue Nov 13, 2020
If the system clock changes significantly while uStreamer is running with the OMX encoder, it breaks the encoder because it thinks a 3 second semaphore timeout has occurred when it's actually just a time jump from network time adjustment. See:

pikvm/ustreamer#56 (comment)

This fix adds a time-sync.target dependency to the uStreamer service to ensure that uStreamer doesn't start until the system has completed its network time sync during boot.
@mtlynch
Copy link
Contributor Author

mtlynch commented Nov 13, 2020

Thanks for digging into this @mdevaev!

I'm surprised this isn't more common. Shouldn't it occur any time the Pi adjusts its time by more than 3s when uStreamer is running with OMX encoding?

I run uStreamer as a systemd service that auto-runs at boot, so if others are starting uStreamer later after boot, then maybe it's more rare for significant time adjustments to happen while uStreamer is running.

One way to consistently repro:

  1. Disconnect your network from the Internet (alternatively: add routing rules that block the Pi's access to NTP servers)
  2. Shut off your Pi
  3. Wait 30 seconds
  4. Boot the Pi up again
  5. Run uStreamer with --encoder=omx
  6. Reconnect Internet
  7. Wait ~30 seconds for the Pi to sync its clock with NTP.

The Pi's clock will be behind real world time because it fell behind while the device was powered off. When the Pi regains Internet access and updates its clock, it will hit the Can't wait VCOS semaphore: EAGAIN (timeout) error because of the sudden time jump forward.

I considered fixing this on my end by adding a dependency on time-sync.target, but that feels pretty kludgy and still doesn't mitigate the error in scenarios where the device can't access NTP at boot time but regains access after uStreamer is already running.

For temporary fix uncomment this line and comment the whole switch. Check that everything works after the fix. You should also deal with your NTP.

That does work, thanks!

In the meantime, I'll figure out how best to fix it.

Please let me know if I can help out in this investigation or contributing code to the fix.

I wonder if there is a way to check if these values are active in the OS vs looking at the boot files on the file system? Time to Google that.

@LeeNX - I'm not sure how to check the cma setting at runtime, but you can verify the GPU setting like this:

$ sudo vcgencmd get_mem gpu
gpu=256M

@mdevaev
Copy link
Member

mdevaev commented Nov 13, 2020

@mtlynch It's hard to say what this is about. Maybe your NTP server periodically lies and gives the wrong time, and then fixes it. Perhaps the Pi clock is somehow lagging behind. Time problems are quite common, but this is usually taken into account. Like everyone knows that sem_timedwait() works on RTC. The VCOS implementation is simply flawed, and its API is misleading that it uses relative timestamps. Still, VCOS functions are not used very often, and are almost never used outside of the Pi.

I made another fix that slightly increases the load on the processor (just a couple of percent), but saves the use of timeout. Try it please: https://github.com/pikvm/ustreamer/tree/semwait

@mtlynch
Copy link
Contributor Author

mtlynch commented Nov 13, 2020

It's hard to say what this is about. Maybe your NTP server periodically lies and gives the wrong time, and then fixes it. Perhaps the Pi clock is somehow lagging behind.

Since the Pi has no RTC, isn't it expected that it always lags behind after being powered off? It has no way of knowing how much time has passed while it was powered down until it syncs with NTP again.

I made another fix that slightly increases the load on the processor (just a couple of percent), but saves the use of timeout. Try it please: https://github.com/pikvm/ustreamer/tree/semwait

Fix worked for me. I repro'ed the delayed access to NTP method, and uStreamer produced no errors and continued encoding with OMX. CPU load remains ~14-17% on Pi 4B 2GB.

@mdevaev
Copy link
Member

mdevaev commented Nov 13, 2020

Since the Pi has no RTC, isn't it expected that it always lags behind after being powered off? It has no way of knowing how much time has passed while it was powered down until it syncs with NTP again.

Well, yes, that's right. In your case, NTP is started at a later date. Pi-KVM is designed differently, so it is not affected by this problem when booting.

Fix worked for me

Nice. Before making a final decision, I will wait for a response from the RPi developers. They may want to make a fix to their code.

@mdevaev
Copy link
Member

mdevaev commented Nov 13, 2020

Although I have never encountered a hang on this semaphore. There are three possible reasons: incorrect code, empty input data, or a hardware error. The first is already debugged, the second I filter, and the third I haven't seen yet, even considering the entire user base. Probably a good enough solution would be to use the function without timeout (infinite waiting). Anyway I am waiting for a response from RPi's devs.

PS: Many, many thanks :)

@mdevaev
Copy link
Member

mdevaev commented Nov 26, 2020

No response has been received. I will enable the default implementation without timeout. The timeout can be enabled by a flag during compilation.

@mtlynch
Copy link
Contributor Author

mtlynch commented Nov 27, 2020

Thanks! I bumped the bug to see if adding extra context about repro steps might help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
resolution:fixed Fixed type:bug Something isn't working
Development

No branches or pull requests

3 participants