Forum: >>> Magnum BBS <<<

How I tripped over sched_rt_runtime_us

From Grant Edwards@21:1/5 to All on Wed Aug 26 17:35:04 2015

How I Tripped Over sched_rt_runtime_us

or

Don't Try to Control What You Can't Measure

I recently spent almost two weeks banging my head on the wall trying
to figure out why some software I was working on was failing. There
were a number of SCHED_FIFO tasks triggered by timers. They ran with
periods of typically 4,8,10 and 20 ms. Each woke up based on a
periodic timerfd and ran for a few hundred microseconds. Most times
when I would start the program, everything ran fine -- and would run
fine indefinitely.

But, about 1 time out of 20 maybe, I would start up the programs, and
they just wouldn't work right. Hardware devices would repeatedly
time-out because they weren't being serviced. There would be gaps in
message streams, etc.

After poking about for a few days, I determined that a thread that was
supposed to be waking up every 4ms (for example), would run fine for
"a while" and then just "not run" for many tens of ms. Sometimes it
just didn't wake up, and sometimes it got suspended while it was
running. When running, it wasn't doing anything that could block.
When it got suspended while running, the location where it got
suspended was pretty much random.

After instrumenting code a bit more, I determined that the most common
failure state resulted in all my SCHED_FIFO threads being suspended
for 50ms once every second. _Exactly_ 50ms, _exactly_ once every
second.

Changing to SCHED_RR didn't change anything. Rearranging the
priorities didn't change anything. However when I changed one of the
threads to SCHED_OTHER, it started running without interruption.

The SCHED_OTHER thread was running _while_ SCHED_FIFO threads were
being suspended "against their will".

I had read about the real-time throttling feature in the kernel, but
it was only supposed to kick in when realtime tasks were using up 95%
of CPU. I had just measured my CPU usage, and the realtime tasks were
barely using 5% -- not anywhere close to the 95% limit.

But still, SCHED_FIFO threads being suspended for 50ms every 1s while SCHED_OTHER threads continue to run. That sure smells like somebody is intentionally blocking all realtime tasks 5% of the time.

Then it dawned on my that (at least on the ARM9), the mechanism used
by the kernel to measure per-thread CPU usage is based on the
assumption that tasks execute _randomly_ and their execution times are
not correlated with kernel timers.

Mine _are_ controlled by kernel timers.

If you're running threads triggered by timers, then the kernel's CPU
usage numbers are complete garbage[*]. I knew this, and in the past had
to work out my own CPU usage measurement schemes because of it.

_If_ a task is a realtime task, _and_ it happens to be in phase with
the CPU usage sampling, then the realtime usage looks like it's 100%
and the sched_rt_runtime_us limiter kicks in and locks out all
realtime threads for 50ms once every 1s. In this case, that breaks
all sorts of seemingly unrelated stuff.

The moral to the story: if you're are using an architecture with
without _real_ per-thread CPU usage measurements and you are
triggering realtime tasks using timers, you _must_ disable the
realtime limiter by setting sched_rt_runtime_us to -1 because the
limiter is being fed CPU measurements that are, in effect, random.

* If you have a thread that uses 2% of the CPU time, but it's
synchronized with the CPU usage sampling, then it will usually
appear to use either 0% or 100% depending on the phase difference
between the sampling and the task's execution. If the phase
difference is _just_ right you will see a number that varies wildly
between 0 and 100.

--
Grant Edwards grant.b.edwards Yow! I want the presidency
at so bad I can already taste
gmail.com the hors d'oeuvres.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Cronus
  Mon Apr 22 20:57:16 2024
  from Provo, Ut via SSH
- Keyop
  Mon Apr 22 13:23:46 2024
  from Huddersfield, West Yorkshire via SSH
- Keyop
  Tue Apr 23 10:35:46 2024
  from Huddersfield, West Yorkshire via SSH
- Keyop
  Tue Apr 23 10:35:37 2024
  from Huddersfield, West Yorkshire via SSH

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	295
Nodes:	16 (2 / 14)
Uptime:	03:23:43
Calls:	6,642
Calls today:	2
Files:	12,190
Messages:	5,325,645

How I tripped over sched_rt_runtime_us

Who's Online

Recent Visitors

System Info