omp fails after dist upgrade
Linux geeves 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
package: libomp-dev 1:14.0-55.7-deb12u1
Code:
// oomptest.c
Quote:
The original program performs Gaussian elimination on a very large, very sparse linear-algebra matrix. The matrix consists of two flat files: one for the left-hand side of the equation and one for the right-hand side. Each element in those files is a structure with six degrees of freedom. Thus, what is actually being passed to the omp pragma are the leftmost structure in each equation and the rightmost structure in each equation. As compiled, the program runs without fault under valgrind, runs successfully if "number of threads" is reduced to one, but fails for more than one thread. |
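For reference, the data layout described in the post above might look roughly like this. The names (dof_node, push_node) and field choices are my own guesses, not taken from the original program; this is a sketch of "linked list of sparse elements, six degrees of freedom each":

```c
#include <stdlib.h>

#define NDOF 6  /* three orthogonal + three rotational degrees of freedom */

/* One sparse-matrix element: only nonzero entries are stored, chained
 * into a linked list per equation (one list for each side). */
struct dof_node {
    int vid;                /* variable/column id */
    float value[NDOF];      /* the six degrees of freedom */
    struct dof_node *next;
};

/* Prepend a node to a list; a minimal constructor for this sketch. */
struct dof_node *push_node(struct dof_node *head, int vid) {
    struct dof_node *n = calloc(1, sizeof *n);
    if (!n) return head;    /* on allocation failure, leave list unchanged */
    n->vid = vid;
    n->next = head;
    return n;
}
```

Under this layout, "the leftmost structure in each equation" would simply be the head pointer of each per-equation list.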
The test program fails on my machine with gcc 13.2.0. The behavior is not deterministic: some runs end in segfaults. There appears to be a race problem.
I use pthreads and can't help much with OpenMP. Ed |
Code:
void subroutine(float *subject, float *object, int index){
Code:
subroutine(&v0[1], &v1[0], id);
Code:
void subroutine(float *subject, float *object, int index){
|
So your solution to a buggy omp library is to avoid omp entirely?
I submit that a better solution is to change the number of threads passed to the omp pragma from 5 to one. That will work, since the omp library apparently does work for single threads. For my application, that means solving the problem in single-thread chunks instead of multi-thread chunks. When the library eventually does get fixed, I can simply change the '1' back to "nthreads".
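The workaround described here is just the num_threads clause on the parallel construct. A minimal sketch (the function and variable names are mine; compiled without -fopenmp, the pragma is ignored and the code degrades to serial execution):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Report how many threads actually run inside a region requested with
 * num_threads(requested); returns 1 when OpenMP is not enabled at all. */
int threads_in_region(int requested) {
    int seen = 1;               /* serial fallback */
#ifdef _OPENMP
    #pragma omp parallel num_threads(requested)
    {
        #pragma omp single
        seen = omp_get_num_threads();
    }
#else
    (void)requested;
#endif
    return seen;
}
```

Swapping `num_threads(nthreads)` for `num_threads(1)` serializes the region without touching any other code, which is why it hides (rather than fixes) a bug that depends on the thread id.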
|
Quote:
You have a bug in your program; the omp library is fine. It just happens that the bug is hidden when you are only using one thread. It is not even a threading bug; it's just the arithmetic you are doing on the thread ID number. My solution is to put 0 instead of id or index in one of the highlighted places (choose one of the places, not both). |
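The "put 0 in one of the highlighted places" fix can be shown in isolation. The helpers below are reconstructed from the code fragments quoted earlier in the thread, so the names are approximate; the key is that v1 is indexed in the caller and then again inside the callee:

```c
/* Callee: indexes its 'object' argument by 'index' -- the second
 * indexing step. */
static float read_object(const float *object, int index) {
    return object[index];
}

/* Buggy call shape: v1 is already offset by id, so the callee actually
 * reads v1[2*id]. With id == 0 (one thread) both offsets are 0 and the
 * bug is invisible; with more threads it walks off the intended data. */
float buggy_call(const float *v1, int id) {
    return read_object(&v1[id], id);
}

/* Fixed: index exactly once, by putting 0 in one of the two places. */
float fixed_call(const float *v1, int id) {
    return read_object(&v1[0], id);
}
```

This also matches the earlier observation that the program is not deterministic: reading past the intended element may return garbage, or fault, depending on what happens to be in memory.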
That would work if v0 and v1 were not both pointers into a linked list (the left side of the equation).
I tried to keep the program snippet simple. In practice, I am also similarly passing two pointers into a different linked list (the right side of the equation).
|
You have a bug in your minimal sample program, so it is plausible that you have bugs in the actual application as well.
|
Please explain to me why I have a bug in my sample program.
My sample program works correctly for id == 1, 3, 5. For id == 1, 3, 5 it reads from both the v0 array and the v1 array correctly. As I said in my initial posting, this code has worked correctly for the past ten years, having been recompiled at least once for each distribution upgrade. The only thing that has changed is the library version. In that program, I am passing the addresses of two structures in one linked list of structures, and the addresses of two structures in a parallel linked list of structures. My (poor old) CPU will handle up to 8 simultaneous threads. Just to see what would happen, I changed line 23 to read:
Code:
subroutine(&v0[id%2], &v1[id], id);
I only got one correct answer.
|
The v1 array is being indexed twice: once in the caller and again in the callee (as object). You need to index it only once.
The buggy code works only for id == 0. Does the real program have the same bug? Ed |
Quote:
Code:
#include <stdlib.h>
Code:
line xx id=0
|
Just a piece of advice. I don't know if your original code is similar to the one you show here.
In the code you show, all threads modify v1 at the same time. This may be logically OK, but in hardware terms, if several cores write into the same memory block you may have a "false sharing" performance problem: each core holds a copy of the block in a cache line, and when one core modifies its copy the other cores stall until the caches are synchronized. How bad this is depends on the number of cache levels and the number of processors/cores you are using. Better explained: https://en.wikipedia.org/wiki/False_sharing https://parallelcomputing2017.wordpr...false-sharing/ https://cpp-today.blogspot.com/2008/...its-again.html |
Quote: (ntubski)
Then you have made it too simple. Please try again using linked lists.
Code:
// parallel_test.c
In this example, newnode.value is a single floating-point variable. In practice, "value" is six-dimensional (three orthogonal and three rotational). |
What do you expect to see?
Code:
$ ./parallel_test
|
(vid)value (vid)value (vid)value (vid)value
line1:  (3)3345.225586 (4)-410.425629 (5)-2900.000000 (6)-34.799999
head[0] (3)-410.425629 (4)-410.425629 (new node value = 0)
head[1] (3)-2900.000000 (5)3345.225586 (7)-410.425629 (8)-34.799999
head[2] (3)-34.799999 (6)3345.225586 (8)-2900.000000 (9)-410.425629

The object is to reduce the head[id] (vid == 3) value to zero (Gaussian elimination).

Expected results:

For head[0], v = 3345.225586 / -410.425629 (= -8.15063)
(3) = (-410.4 * -8.1506) - 3345.2 = 0;
(4) = (-410.4 * -8.1506) - (-410.4) = -2934.8;
(5) = ((blank -> 0) * -8.1506) - (-29000) = 29000;
(6) = ((blank -> 0) * -8.1506) - (-34.8) = 34.8;

For head[1], v = 3345.225586 / -29000 (= -0.11535)
(3) = (-29000 * -0.11535) - 3345.2 = 0;
(4) = ((blank -> 0) * -0.11535) - (-410.4) = 410.4;
(5) = (3345.2 * -0.11535) - (-29000) = 28614;
(6) = ((blank -> 0) * -0.11535) - (-34.8) = 34.8;
(7) = (-410.4 * -0.11535) - (blank -> 0) = 47.34;
(8) = (-34.8 * -0.11535) - (blank -> 0) = 4.014;

For head[2], v = 3345.225586 / -34.8 (= -96.127)
(3) = (-34.8 * -96.127) - 3345.2 = 0;
(4) = ((blank -> 0) * -96.127) - (-410.4) = 410.4;
(5) = ((blank -> 0) * -96.127) - (-29000) = 29000;
(6) = (3345.2 * -96.127) - (-34.8) = 321529;
(7) = ((blank -> 0) * -96.127) - (blank -> 0) = 0;
(8) = (-29000 * -96.127) - (blank -> 0) = 2787683;
(9) = (-410.4 * -96.127) - (blank -> 0) = 39451;

Actual results:

(3)-410.425629 (4)-410.425629 (6)0.000000
(3)-nan (5)-inf (7)inf (8)inf
(3)-34.799999 (6)3345.225586 (8)-2900.000000 (9)-410.425629

For Gaussian forward elimination, nodes head[0], head[1] and head[2] would be freed from the matrix:

line1 = end_of_line1->next;
loop until (line1 == NULL);
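The arithmetic being walked through above is a standard forward-elimination step. A dense sketch of one such step, with a "blank" coefficient treated as 0 as in the notation above, and with column 0 standing in for the vid == 3 pivot; my sign convention is the textbook row -= factor * pivot_row, which may differ from the hand calculations quoted:

```c
/* Eliminate column 0 of 'row' against 'pivot_row'. Assumes
 * pivot_row[0] != 0; sparse "blank" entries are represented as 0.0.
 * After the call, row[0] is (numerically) zero. */
void eliminate_step(double *row, const double *pivot_row, int n) {
    double factor = row[0] / pivot_row[0];
    for (int j = 0; j < n; j++)
        row[j] -= factor * pivot_row[j];
}
```

Applied to the line1 and head[0] values shown above, the head[0] pivot entry does come out as zero; whether the remaining coefficients match the hand-computed expectations depends on the sign convention the real program uses.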
|