LinuxQuestions.org


nyquist09 10-09-2023 10:14 AM

Huge latency on pselect
 
I have the following setup on Linux (low-latency kernel):

  • A process reads serial data from a serial device /dev/ttySX using pselect.
  • Data comes in at a stable rate of 400 Hz.
  • To optimize the latency of that process I took the following measures:
      • The reading thread is pinned to a core (using affinity) and no other tasks are allowed to run on that core. This is done via cgroups/cpuset.
      • The reading thread has an RT priority of 49 (just below some of the IRQ threads) with the SCHED_FIFO policy.
      • The IRQ corresponding to /dev/ttyS4 is pinned to that same core, and the IRQ thread also runs on that core. This was done to further reduce latency.
  • Fully loading the system with stress --cpu XX --io XX does not affect the latency of the reads; they still come in nicely at 400 Hz.
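
A minimal sketch of the measures listed above, written in Python for brevity. It assumes Linux; the function names are made up for illustration, and Python's select.select wraps select(2) rather than pselect(2), so it only approximates the original C code. Isolating the core via cgroups/cpuset and pinning the IRQ are system configuration and are not shown.

```python
import os
import select


def configure_realtime(cpu, priority=49):
    """Pin the calling thread to `cpu` and switch it to SCHED_FIFO.

    Returns False instead of raising when the kernel refuses (for example
    when the caller lacks CAP_SYS_NICE). If pinning succeeds but the policy
    change fails, the new affinity stays in effect.
    """
    try:
        os.sched_setaffinity(0, {cpu})  # pid 0 means the calling thread
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    return True


def wait_readable(fd, timeout_s):
    """Block until `fd` has data to read or the timeout expires."""
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return bool(readable)
```

A reader loop would call configure_realtime once at startup and then alternate between wait_readable and os.read on the serial device's file descriptor.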

The problem I experience:

There is another offending user space process which uses a lot of resources. When this one runs, it can cause huge latency spikes on my serial read thread. The delay can be 100 ms or more, even though serial data from the hardware arrives every 2.5 ms.
I don't know too much about that other offending user space process, except that it is using the regular 'nice' scheduler and it spawns a lot of threads.
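
One way to quantify such spikes is to timestamp every read and look at the gaps. The sketch below is an illustration only: the function name and the default limit of twice the nominal period are assumptions, not something from the original setup.

```python
PERIOD_NS = 2_500_000  # nominal spacing at 400 Hz, in nanoseconds


def find_late_reads(timestamps_ns, limit_ns=2 * PERIOD_NS):
    """Return (index, gap_ns) for each read that arrived more than
    `limit_ns` after the previous one."""
    late = []
    for i in range(1, len(timestamps_ns)):
        gap = timestamps_ns[i] - timestamps_ns[i - 1]
        if gap > limit_ns:
            late.append((i, gap))
    return late
```

The timestamps could be collected with time.monotonic_ns() right after each read returns. Integer nanoseconds avoid floating-point rounding when comparing against the limit.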


My questions:
  • Any ideas or approaches for debugging this? Maybe using ftrace/ptrace, but I am not quite sure where to start.
  • Any ideas what could cause such behavior? A delay of several tens of milliseconds seems like a solvable problem from user space.
  • I assume that some sort of kernel process is involved in waking user space programs that wait on select. Where can I find information about that? My guess is that this process somehow does not have the right priority. Since my program only waits on one device, maybe select is not the best approach and a simple read would yield better results?

At this point, I am happy for any hints/ideas, thanks!
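
For the ftrace route mentioned in the questions, a possible starting point is to enable the scheduler events through tracefs. This is a sketch under assumptions: it needs root, the function name is made up, and tracefs is taken to be mounted at /sys/kernel/tracing (older systems expose it at /sys/kernel/debug/tracing).

```python
import os


def enable_sched_tracing(tracefs="/sys/kernel/tracing"):
    """Enable the sched_switch and sched_wakeup events, then start tracing.

    Returns the list of control files that were written.
    """
    control_files = [
        "events/sched/sched_switch/enable",
        "events/sched/sched_wakeup/enable",
        "tracing_on",
    ]
    written = []
    for rel in control_files:
        path = os.path.join(tracefs, rel)
        with open(path, "w") as f:
            f.write("1")
        written.append(path)
    return written
```

After reproducing a latency spike, the recorded events can be read from the trace file in the same directory. They show which task ran on each CPU and when each task was woken.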

syg00 10-11-2023 02:53 AM

Do you happen to be pinning your process to CPU0? If so, pick another one - I like to stay away from the processors that the kernel has to run on in early boot as (to my simple mind) it's likely to reschedule there all the time. Might be nothing, but easy to implement as a test.

nyquist09 10-19-2023 12:27 PM

Quote:

Originally Posted by syg00 (Post 6458156)
Do you happen to be pinning your process to CPU0? If so, pick another one - I like to stay away from the processors that the kernel has to run on in early boot as (to my simple mind) it's likely to reschedule there all the time. Might be nothing, but easy to implement as a test.

Thanks. No, I am not using CPU0.

In fact, I figured out what the problem was by tracing scheduler events with ftrace:

The tty driver relies on an unbound kernel worker to push data to the user via a workqueue. That kworker is scheduled with 'SCHED_OTHER'. This is kind of a strange situation, because I prioritized the tty IRQ and the receiving application both with SCHED_FIFO, but I have this SCHED_OTHER kworker in between, which is clearly the weakest link in the chain. There used to be a low_latency flag for tty, but it apparently got removed because it was buggy.
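
The scheduling policy of the unbound kworkers can also be checked from user space. The sketch below is an illustration only: the function names are made up, and it relies on the assumption that unbound workers show up in /proc with names starting with "kworker/u".

```python
import os

POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}


def policy_name(policy):
    """Translate a policy number into its symbolic name."""
    policy &= ~os.SCHED_RESET_ON_FORK  # drop the optional flag bit
    return POLICY_NAMES.get(policy, f"unknown({policy})")


def unbound_kworker_policies():
    """Return (pid, name, policy) for each visible unbound kworker."""
    rows = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join("/proc", entry, "comm")) as f:
                comm = f.read().strip()
            if comm.startswith("kworker/u"):
                pid = int(entry)
                rows.append((pid, comm, policy_name(os.sched_getscheduler(pid))))
        except OSError:
            continue  # the task exited or is not accessible
    return rows
```

On a system like the one described, every row would be expected to report SCHED_OTHER, which matches what the ftrace output showed.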

Well, in any case I understand the problem now and I am evaluating options to overcome it.


All times are GMT -5.