LinuxQuestions.org - GCC update to another version



- - GCC update to another version (https://www.linuxquestions.org/questions/slackware-14/gcc-update-to-another-version-4175733291/)

selfprogrammed

01-28-2024 06:00 PM

GCC update to another version

I have got to the point where I am questioning the bug-iness of the gcc 11.2 distributed with Slackware 15.0.

There is this very large program that is very buggy. It should not be that buggy.
I have been working on it for months trying to diagnose the problem.

I have made changes like:

Code:

// constructor
obj2::obj2(): ptr1(NULL), bval(false), ptr2(NULL) { }

// I changed that to:

obj2::obj2() {
 ptr1 = NULL;
 ptr2 = NULL;
 bval = false;
 }

This has cured that particular bug.

I can find no reasoning for why that should make a difference, but it definitely has worked.
The program previously would crash every couple of hours, taking work with it, and has stopped doing that entirely. I have made numerous other patches, instrumentation, and other changes, that have not affected these problems at all. It is not a matter of just recompiling.

In another place I have made another change like this, with it having notable effect upon that program bug.
Did not fix it entirely, but it is like a different bug now.

Still have a mass of other bug-iness that also defies understanding.
That style of constructor is all over the program, and it is large.

// ----

I am compiling for target i686 (am running a Quad Athlon).
The program uses CMake.

I ran a memory tester for an entire night, and got no errors.

// ----

This is gcc 11.2.
A year ago, they released gcc 11.4, a bug-fix release.
Any chance that gcc 11.4 could be released as a patch upgrade for Slack 15.0?

This is a system used for business purposes, developing programs, and I cannot be testing unstable compilers. I need a stable release version of gcc. I am hoping that such an upgrade just might fix some of this bug-iness.

It is also possible that the program is just that bad, or that style of constructor has those kind of problems. But I cannot find any evidence of that either.

Have compiled a massive amount of slackbuilds using this compiler, and have not noticed anything like this with other programs. So I am confused. Getting desperate to try something, and upgrading GCC is a way to test if that makes a difference.

Trying to put another GCC on the side, and getting this slackbuild to compile with that does not look like an easy task.

jailbait

01-28-2024 06:16 PM

You should report the bugs you find to the gcc project:

https://gcc.gnu.org/bugs/

Lockywolf

01-28-2024 08:21 PM

Quote:

Originally Posted by selfprogrammed (Post 6479908)

Code:

// constructor
obj2::obj2(): ptr1(NULL), bval(false), ptr2(NULL) { }

// I changed that to:

obj2::obj2() {
 ptr1 = NULL;
 ptr2 = NULL;
 bval = false;
 }

>I can find no reasoning for why that should make a difference, but it definitely has worked.

I am afraid you would have to look at the disassembly to find out where the problem lies. I am not that well versed in C++ standardese, but I don't think that those constructors are identical. I think there is some difference when those are called for base/derived classes.

In any case, if your code is fixed by 11.4, you can compile it yourself from scratch, looking at Slackware's gcc SlackBuild, just change the prefix into /opt/gcc-11.4/, or something like that. You can also have a look at the gcc-5.SlackBuild on SlackBuilds.Org. That script does install the older gcc into /opt/.

volkerdi

01-28-2024 10:00 PM

Quote:

Originally Posted by selfprogrammed (Post 6479908)

In several cases my fix for issues like this has been to use clang.

BrunoLafleur

01-29-2024 03:59 AM

Quote:

Originally Posted by selfprogrammed (Post 6479908)

Code:

// constructor
obj2::obj2(): ptr1(NULL), bval(false), ptr2(NULL) { }

// I changed that to:

obj2::obj2() {
 ptr1 = NULL;
 ptr2 = NULL;
 bval = false;
 }

For that sort of bug, what does the class definition look like? In what order are the member variables declared, and are they all initialized?

The example is too incomplete to give any ideas.

pan64

01-29-2024 04:41 AM

You have told us nothing about your code. How do you know it is buggy?
You need to compile it with -Wall and check all the warnings and errors.
As already mentioned, we don't know how that class was declared. Also, what kind of bug was fixed by that modification?
If you want to file a bug report, you need to show or describe exactly how the bug can be reproduced (that means a working example).

drumz

01-29-2024 08:25 AM

As said above, install the version you want in /opt. I have a few for various reasons:

Code:

# ls /opt | grep gcc
gcc-10.2.0/
gcc-13.2.0/
gcc-8.4.0/

Here's my do_build.sh script for 13.2.0. Yes, I'm a bad boy and don't create a package. But everything is contained in /opt/gcc-13.2.0, so I don't feel too bad. Obviously build script is inspired by Slackware's build script.

Code:

#!/bin/sh

set -e

srcdir=../gcc-13.2.0
destdir=/opt/gcc-13.2.0

SLKCFLAGS="-O2 -fPIC"
LIBDIRSUFFIX="64"
LIB_ARCH=amd64

TARGET=x86_64-slackware-linux

NUMJOBS=" -j 8 "

#tar xvf gcc-13.2.0.tar.xz
#cd gcc-13.2.0
#zcat ../patches/gcc-no_fixincludes.diff.gz | patch -p1
#cd ..

mkdir build
cd build

GCC_ARCHOPTS="--disable-multilib"

CFLAGS="$SLKCFLAGS" \
CXXFLAGS="$SLKCFLAGS" \
"$srcdir/configure" \
  --prefix=$destdir \
  --libdir=$destdir/lib$LIBDIRSUFFIX \
  --enable-shared \
  --enable-bootstrap \
  --enable-languages=ada,c,c++,d,fortran,go,lto,m2,objc,obj-c++ \
  --enable-threads=posix \
  --enable-checking=release \
  --enable-objc-gc \
  --with-system-zlib \
  --enable-libstdcxx-dual-abi \
  --with-default-libstdcxx-abi=new \
  --disable-libstdcxx-pch \
  --disable-libunwind-exceptions \
  --enable-__cxa_atexit \
  --disable-libssp \
  --enable-gnu-unique-object \
  --enable-plugin \
  --enable-lto \
  --disable-install-libiberty \
  --disable-werror \
  --with-gnu-ld \
  --with-isl \
  --verbose \
  --with-arch-directory=$LIB_ARCH \
  --disable-gtktest \
  --enable-clocale=gnu \
  $GCC_ARCHOPTS \
  --target=${TARGET} \
  --build=${TARGET} \
  --host=${TARGET} || exit 1

make $NUMJOBS bootstrap || exit 1
( cd gcc || exit
  make $NUMJOBS gnatlib GNATLIBCFLAGS="$SLKCFLAGS" || exit 1

  CFLAGS="$SLKCFLAGS" \
          CXXFLAGS="$SLKCFLAGS" \
          make $NUMJOBS gnattools || exit 1
)
make info || exit 1

#make $NUMJOBS check  || exit 1

make install || exit 1

selfprogrammed

01-30-2024 01:01 AM

Well, it was possible that users had seen this before and knew of such a bug in GCC, but I guess that is not going to be the case.

My first thought was to compile it with clang. Some of my users use clang (and FreeBSD, and NetBSD, etc), and I have had to augment a program to support CLANG too.
But that was why I mentioned that it uses CMAKE.

Do you have an easy and simple way to convince a slackbuild and CMAKE to accept another version of the GCC compiler (or clang), without having to rewrite and/or debug that effort too?
I expect it is just another option to pass to CMAKE, but I have not used such before. I have tried to configure CMAKE before, and it did not go well. I expect I will have to dive into the CMAKE docs again.
If I try to change the Makefile, CMAKE will just recreate it. The slackbuilds erasing and recreating everything does not help much either.
The chance is that I will just add to my workload. From my previous work, it might do the same exact thing, as clang does mostly what GCC does.

I have been using Slackware since the 90's, and have not seen the GCC updated very often. I just want to put in a word that there is a bug fix for this version, and I would like to get that in an official Slackware package, if possible. Not needing to go to 12 or 13, as they probably have other enhancements, and other possible bugs. But I would like the official Slackware GCC to be the last to have all of the bug fixes for that major version.

As to the actual program. I purposely did not give much detail. That class is moderate, but is not understandable without seeing a dozen more structs and classes.
(see slackbuilds Voxelands, mapblock_mesh.cpp, MeshMakeData constructor)

It will crash often, and with explicit error messages. I have it instrumented by now and have been trying to identify exactly how the data gets screwed, but cannot find anything.
I also have canaries in place and they detect nothing.
That my "fix" actually cured that particular problem makes me uneasy, as it should not have done so.
It is also possible and likely that the original constructor was at fault, and I just cannot see it. A strange warning message did go away too, but I was making many other changes too.

I have found some other problems in the code, such as their use of an Exception when a particular function would fail.
They had an identical function that did not throw the Exception but returned NULL instead.
After disabling the Exception throwing version, and making everything use the explicit NULL test, the Exception error messages have stopped.
I did find two places where they called the exception throwing version, but did not have the try/catch, and I fixed those.

Had one bug where a destructor would segfault, due to a C++ iterator on a vector generating bad ptrs. It was putting out a ptr value of 4, and such.
I stopped that one by guarding the iterator with a test on the vector being empty. There are too many obscure details on these std:: operators that are gotchas.
I did not write this code, just trying to rescue it.

It still crashes, often, but not the same errors now.

I will eventually submit patches to the actual maintainers of this program. From past experience, it is not likely the maintainer will accept them.
(One project was an exception to that. They accepted my patches. And that was how I ended up as the main programmer for a free software project. It's a trap I tell you.)

Thank you for your attention.

pan64

01-30-2024 01:18 AM

Crashes and segfaults are usually memory problems; you can use valgrind to catch them (for example).
You can simply specify the compiler for CMAKE, but without details we can only guess how.
https://stackoverflow.com/questions/...piler-in-cmake
Additionally you can use static code analyzers which can find a lot of problems too.

BrunoLafleur

01-30-2024 04:38 AM

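For the specific, incomplete example you gave at the beginning: the order of the class members in the constructor's initializer list should be the same as in the declaration of the class. Members are always initialized in declaration order, so if the list is written in another order and one initializer depends on another member, it can read a value that is not yet initialized, and the program could segfault. With the -Wreorder option, or -Wall, the compiler should emit a warning.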

If the class members are initialized in the body of the constructor, the order doesn't matter (you do what you want in the body).

Some old code also doesn't always initialize every member of every class. It relies on defaults from the compiler, which may have changed with the latest versions of the standard.

Since you didn't write this code from scratch, it may not be the compiler that is buggy but the code itself. So, as said above, you could try tools like valgrind to find faulty areas. I use valgrind a lot, even on code that runs for months without ever segfaulting.

For valgrind it is better to compile with the -g option, so that it reports the source lines where problems are detected.

GazL

01-30-2024 09:25 AM

Quote:

Originally Posted by selfprogrammed (Post 6479908)

I have made changes like:

Code:

// constructor
obj2::obj2(): ptr1(NULL), bval(false), ptr2(NULL) { }

// I changed that to:

obj2::obj2() {
 ptr1 = NULL;
 ptr2 = NULL;
 bval = false;
 }

This has cured that particular bug.

I can find no reasoning for why that should make a difference, but it definitely has worked.

I'm only a novice at C++ (I much prefer C), but as I understand it the difference is as follows...

The first approach calls the constructors that take an argument for each of the members in the initialisation list. The second approach calls the default constructor of each member in the class definition and then assigns values afterwards when the containing class's constructor is run.

It likely won't matter for fundamental types, but the constructors for nested class objects could potentially end up doing different things.

the3dfxdude

01-30-2024 09:30 AM

Quote:

Originally Posted by BrunoLafleur (Post 6480217)

I saw that earlier in the example, but...

Code:

      -Wreorder (C++ and Objective-C++ only)
          Warn when the order of member initializers given in the code does
          not match the order in which they must be executed.  For instance:

                  struct A {
                    int i;
                    int j;
                    A(): j (0), i (1) { }
                  };

          The compiler rearranges the member initializers for "i" and "j" to
          match the declaration order of the members, emitting a warning to
          that effect.  This warning is enabled by -Wall.

Please tell me that gcc cannot still introduce a bug, if they can detect it, and give a warning if you want. I don't see why they would allow buggy code, unless there is more to it than just this.

GazL

01-30-2024 09:44 AM

Quote:

Originally Posted by the3dfxdude (Post 6480273)

Please tell me that gcc cannot still introduce a bug, if they can detect it, and give a warning if you want. I don't see why they would allow buggy code, unless there is more to it than just this.

It sounds more along the lines of -Wparentheses or -Wmisleading-indentation, where it's just a "Hey, are you sure you got this right?" type of thing.

BrunoLafleur

01-30-2024 10:17 AM

Quote:

Originally Posted by the3dfxdude (Post 6480273)

I saw that earlier in the example, but...

Code:

      -Wreorder (C++ and Objective-C++ only)
          Warn when the order of member initializers given in the code does
          not match the order in which they must be executed.  For instance:

                  struct A {
                    int i;
                    int j;
                    A(): j (0), i (1) { }
                  };

          The compiler rearranges the member initializers for "i" and "j" to
          match the declaration order of the members, emitting a warning to
          that effect.  This warning is enabled by -Wall.

Please tell me that gcc cannot still introduce a bug, if they can detect it, and give a warning if you want. I don't see why they would allow buggy code, unless there is more to it than just this.

It can be more, because initialization can be dynamic and not only from constants. One member can depend on other members, as defined by the programmer. If the written order is not respected and the compiler changes it, some initializers can rely on values that are not yet initialized, for example.

Also, some defaults in the C++ standard have changed, and that can also lead to uninitialized values for some or all members of a class or its subclasses.

Valgrind, for example, detects such errors at run time.

the3dfxdude

01-31-2024 02:31 PM

Quote:

Originally Posted by BrunoLafleur (Post 6480286)

Maybe I should not say a bug as much here to not be confused with when the programmer won't heed the warning that is the issue. Yes the programmer should know when to use list initializers or not use them or change the ordering in the definition. But I was looking for the bug, as in what was reported. I can't see the bug in the Voxeland MeshMakeData constructor initializer list, as it appears to be correctly ordered in what needs to be initialized. Maybe there is something different in addition to this that was more like the generic example in the beginning? Or the only other thing I can think of, is with the constructor and methods in the header, is this an optimizing compiler issue? Because I think the gcc compiler here won't trigger the reorder warning, so how would the programmer know of a buggy constructor for that reason? So I guess try compiling with the affected gcc in "-O0 -Wall" and maybe "-Wall" with llvm and look for problems if the programmer didn't use -Wall before and also determining if there is badly optimized code causing a problem?

selfprogrammed

01-31-2024 04:17 PM

Would be interesting if anyone else was able to compile Voxelands, and run it for more than 12 hours without getting one of these segfaults (it takes a while). It is addictive.

I tried Valgrind (I think it was this program). It tried for about 10 minutes then came back that the program was doing something that it does not support, then it gave up. All the exceptions alone might have thrown it off.

I have looked at CMAKE docs about how to specify your choice of compiler. Any CMAKE supported way to specify that, so far eludes me. Trying to do this within a Slackbuild script that wants to rebuild EVERYTHING does not help. I have modified that slackbuild script into a build debug version that can get around that problem.

Amazing that this thing will compile without warnings. It saves its weirdness for run-time.
I did notice that it specifies the optimization, and then respecifies differently for debugging. I wonder if the debugging and optimization combination is adding to the confusion.
That would be another CMAKE issue that probably is difficult to override.

That this many programmers cannot see what the problem is with this, and can come up with so many possibilities, indicates that it is mostly a C++ language problem. Trying CLANG just to avoid the issue is not a real solution, but it might give indications, as CLANG has its own error detection.

The solution is to avoid the questionable constructor practices, and put in ones that are safe and sure. The compiler can be trusted to see the duplicated effort and optimize it out. Do not really need the programmer playing tricks just to avoid assigning to a field after it was default initialized.
As I already have put this much effort into it, this should get fixed (instead of just using CLANG, even if it worked better).

I was wondering about the fix where putting a test for an empty vector, before using an iterator on it. I would think that those std:: operators were more robust than that. Anyone sure about whether iterators need protection against empty vector lists ?

I would still like to have an upgrade on GCC though. Getting GCC up to 11.4 may be a small step, but it avoids whatever they did that required jumping to 12.0.
If I have to do this myself, I am likely going to have to replace the slack package entirely. My code is not having these problems because I avoid these kind of iffy constructs.
Having an alternative version in /opt will not help if I cannot use it in place of the installed GCC with these problem slackbuilds. I really need some way to switch out the compiler that the whole slackbuild sees.

rkelsen

01-31-2024 11:18 PM

Quote:

Originally Posted by selfprogrammed (Post 6480182)

I will eventually submit patches to the actual maintainers of this program. From past experience, it is not likely the maintainer will accept them.

You're probably right about that:

Quote:

25 Jul 2018 at 7:09pm
"I've decided to just let Voxelands die, it was fun while it lasted, but it's just become a chore lately.

I'm still working on other games, one of which 'The Void' will be available on itch.io soonish."

https://forum.minetest.net/viewtopic.php?t=20448

The last update to the project was in May 2018.

BrunoLafleur

02-01-2024 01:20 AM

Quote:

I tried Valgrind (I think it was this program). It tried for about 10 minutes then came back that the program was doing something that it does not support, then it gave up. All the exceptions alone might have thrown it off.

Valgrind is very time consuming. It can take a hundred times the normal execution time. But it can give very good information at the very beginning of a memory problem. Otherwise, with gdb the bugs may be visible soon enough to be trapped, and it is quicker.

BrunoLafleur

02-01-2024 01:27 AM

For cmake something like :

cmake -DCMAKE_C_COMPILER=... -DCMAKE_CXX_COMPILER=...
added aside the usuals -DCMAKE_CXX_FLAGS and -DCMAKE_C_FLAGS

pan64

02-01-2024 01:29 AM

Quote:

Originally Posted by selfprogrammed (Post 6480545)

I have looked at CMAKE docs about how to specify your choice of compiler. Any CMAKE supported way to specify that, so far eludes me. Trying to do this within a Slackbuild script that wants to rebuild EVERYTHING does not help. I have modified that slackbuild script into a build debug version that can get around that problem.

I don't know what it is, but again there should be a variable CC to specify the compiler. It should work on all cmake and make based builds.

Quote:

Originally Posted by selfprogrammed (Post 6480545)

Amazing that this thing will compile without warnings. It saves its weirdness for run-time.
I did notice that it specifies the optimization, and then respecifies differently for debugging. I wonder if the debugging and optimization combination is adding to the confusion.

Those are mostly incompatible with each other. Optimization may (and will) change the code, so the result can no longer be mapped back to the source lines. You can still use a low-level debugger on the running code, if you wish, but you will not be able to follow the source code. It does not depend on the tool you use, but on the level of optimization.

Quote:

Originally Posted by selfprogrammed (Post 6480545)

That would be another CMAKE issue that probably is difficult to override.

That this many programmers cannot see what the problem with this, and can come up so many possibilities, indicates that it is mostly a C++ language problem. Trying CLANG just to avoid the issue is not a real solution, but it might give indications as CLANG has it own error detection.

The solution is to avoid the questionable constructor practices, and put in ones that are safe and sure. The compiler can be trusted to see the duplicated effort and optimize it out. Do not really need the programmer playing tricks just to avoid assigning to a field after it was default initialized.
As I already have put this much effort into it, this should get fixed (instead of just using CLANG, even if it worked better).

I was wondering about the fix where putting a test for an empty vector, before using an iterator on it. I would think that those std:: operators were more robust than that. Anyone sure about whether iterators need protection against empty vector lists ?

I would still like to have an upgrade on GCC though. Getting GCC up to 11.4 may be a small step, but it avoids whatever they did that required jumping to 12.0.
If I have to do this myself, I am likely going to have to replace the slack package entirely. My code is not having these problems because I avoid these kind of iffy constructs.
Having a alternative version in /opt will not help if I cannot use it in place of the installed GCC with these problem slackbuilds. I really need some way to switch out the compiler that the whole slackbuild sees.

What you can do is use different tools to identify problems, including more compilers (like clang) and static code analysers like cppcheck. You also need to check all the reported warnings (implicit type casts, type mismatches and other similar issues).

ponce

02-01-2024 05:54 AM

Quote:

Originally Posted by selfprogrammed (Post 6480182)

Do you have an easy and simple way to convince slackbuild and CMAKE to accept another version of GCC compiler (or clang), without having to rewrite and/or debug that effort too.
I expect it is just another option to pass to CMAKE, but I have not used such before. I have tried to configure CMAKE before, and it did not go well. I expect I will have to dive into the CMAKE docs again.
If I try to change the Makefile, CMAKE will just recreate it. The slackbuilds erasing and recreating everything does not help much either.
The chance is that I will just add to my workload. From my previous work, it might do the same exact thing, as clang does mostly what GCC does.

usually just exporting two environment variables is enough (before any configure/cmake/whatever invocation): when I want to force clang, for example, I export these

Code:

export CC=clang
export CXX=clang++

but if compilers are forced somewhere else as a parameter (like in the examples Bruno gave above), that might override what is specified in the environment...

selfprogrammed

02-01-2024 10:06 PM

Thank you Bruno, that may be the secret incantation that I was looking for. I wonder where you found that. I must not be getting far enough into the CMAKE docs.

I am familiar with the CC, and CXX, but I was afraid that CMAKE might just ignore or override them. CMAKE might honor environment variables, I don't know. You did not say if that worked in general or specifically did work with CMAKE. I will give up on them and will keep an eye out for those in the CMAKE docs. I do not think that I am done with those CMAKE docs.

Thank you all for your comments. I will make what use of them that I can.
This may not be immediate, as I have two other major Linux projects to work on, and they may actually produce income so I really should get some work done on them.

Voxelands: that project has been abandoned and resurrected more times than I can keep track of. That it has such a high burnout rate is not surprising considering what their code looks like. I have the itch for wholesale renaming of functions at minimum, and a few other re-arrangements.
These are the more reachable bugs.

It runs threads, that seem to do something that may be database management. What is it doing with this separate MeshMake structure. This of course is perfect for out-of-sync updates and accesses causing spontaneous mystery bugs.

My last fix was putting a try/catch around a particular getBlock call. That stopped that. It ran better for a while (several hours), then started throwing the same exception from the handler for point and click on a block, but this time because of an unguarded getNode call.
Why it repeatedly cannot find specific blocks is still a mystery (the exact same block repeatedly, but different for each run). How it treats blocks it cannot find (are they AIR, should they be automatically created?) is not documented, and I have not found anything systemic in the code that would give me a clue.

pan64

02-02-2024 12:52 AM

Quote:

Originally Posted by selfprogrammed (Post 6480781)

Thank you Bruno, that may be the secret incantation that I was looking for. I wonder where you found that. I must not be getting far enough into the CMAKE docs.

https://stackoverflow.com/questions/...g-assimp-for-i
Here is a page about that super secret, but anyway, it is pretty well documented. Here is the official: https://cmake.org/cmake/help/latest/..._COMPILER.html

BrunoLafleur

02-02-2024 03:08 AM

Quote:

Originally Posted by selfprogrammed (Post 6480781)

What version of voxelands do you use ? Are your changes somewhere ? Maybe I could try to play with it and see if I can find some bugs ? If I find some time. I see there is a sbo package for voxelands. Is it uptodate with the version you use ?

selfprogrammed

02-05-2024 12:32 AM

I tried to use clang.
I used the CMAKE options specified by BRUNO.
It compiled, and I got some additional warning messages, about several things not directly relevant.
The package that was made would not run. Something is missing. The package is too small. That is all I know at this time.

If you want to join in the fun, here is my current work snapshot.
Almost all of the debugging code is inside #ifdef DEBUG.

Voxelands 1709, as downloaded at slackbuilds.

The link or URL you must send to the recipient(s) of your file(s) is:
http://www.fileconvoy.com/dfl.php?id...66473d16e58211

The file(s) that can be retrieved with the above link is (are):

voxelands-v1709_debug_01.diff.tar.bz2 (24.323 KB)

The file(s) will be available on the server for the next 10 days.

In some strange way I am making it more stable, at least. It will run for an hour, and then crash several times in the next 20 minutes. It must have something to do with what actions are being done.
Most of the crashes now are failing to destroy a vector owned by one of the Mesh structures. It often starts with the QueuedMeshUpdate destructor.
Have not been able to catch what it is that is corrupted.

C++ fights me in every way with checking ptrs for validity. It will not let me do any void* comparisons, like setting a valid range for the heap and checking if the ptr is in that range.

pan64

02-05-2024 12:52 AM

Quote:

Originally Posted by selfprogrammed (Post 6481429)

C++ fights me in every way with checking ptrs for validity. It will not let me do any void* comparisons, like setting a valid range for the heap and checking if the ptr is in that range.

I just don't understand it. Probably you can show an example.
There are several different ways to check pointers, but in general a pointer can point to anywhere and can be anything on that location. Especially if the memory is overwritten for any reason.

drumz

02-05-2024 07:45 AM

Quote:

Originally Posted by selfprogrammed (Post 6481429)

C++ fights me in every way with checking ptrs for validity. It will not let me do any void* comparisons, like setting a valid range for the heap and checking if the ptr is in that range.

I think if you're trying to do tricks like this, you need an older compiler.

selfprogrammed

02-07-2024 08:15 PM

To check ptr for validity you have to check that it is within some bounds.
I tried to get a valid bounds by getting the ADDR of an early item allocated off the stack, and the MAX of other items allocated off the stack as the other limit.
The code is in the tar file that I provided 2 posts up.

The C++ compiler would not let me cast them to void, nor would it let me compare them to any ptr in that structure.
Please note that radically changing the compiler compile settings would disrupt the entire program that I am trying to debug.

It is not even a trick. It is just trying to treat a ptr to something allocated as a ptr to a memory area. How do malloc and new get away with doing this?
Does the compiler give malloc some special rules?

Yes, this new compiler is a problem, see original post.

I was hoping to see if Bruno got the same results I did, or if his machine behaves significantly different.

selfprogrammed

02-07-2024 11:15 PM

Got another segfault, in a place this time where I could examine variables.
I get about 5 to 8 segfaults per 2 hour session, usually grouped together.
I work on this computer all day long, and do not see this behavior with other programs.
This is the first time I have seen this one.

InventoryList inventory.cpp:1453 segfault

Code:

(gdb) l inventory.cpp:1453

const s32 Inventory::getListIndex(const std::string &name) const
1451    {
1452            for (u32 i=0; i<m_lists.size(); i++) {
1453                    if (m_lists[i]->getName() == name)
1454                            return i;
1455            }
1456            return -1;
1457    }

(gdb) p m_lists
$3 = {data = 0x0, allocated = 0, used = 60, allocator = {_vptr.irrAllocator = 0xc0373e0},
  strategy = (irr::core::ALLOC_STRATEGY_DOUBLE | unknown: 0xc), free_when_destroyed = false,
  is_sorted = false}

(gdb) p m_lists.size()
$4 = 60

Code:

// Declaration
class InventoryList
{
public:
        InventoryList(std::string name, u32 size);
        ~InventoryList();
        void clearItems();
        void serialize(std::ostream &os) const;
        void deSerialize(std::istream &is);

        InventoryList(const InventoryList &other);
        InventoryList & operator = (const InventoryList &other);

        const std::string &getName() const;
        u32 getSize();
        // Count used slots
        u32 getUsedSlots();
        u32 getFreeSlots();

        // set specific nodes only allowed in inventory
        void addAllowed(content_t c) {m_allowed[c] = true;}
        void clearAllowed() {m_allowed.clear();}

        // set specific nodes not allowed in inventory
        void addDenied(content_t c) {m_denied[c] = true;}
        void clearDenied() {m_denied.clear();}

        // whether an item is allowed in inventory
        bool isAllowed(content_t c)
        {
                if (m_allowed.size() > 0)
                        return m_allowed[c];
                return !m_denied[c];
        }
        bool isAllowed(InventoryItem *item) {return isAllowed(item->getContent());}

        // set whether items can be stacked (more than one per slot)
        void setStackable(bool s=true) {m_stackable = s;}
        bool getStackable() {return m_stackable;}

        /*bool getDirty(){ return m_dirty; }
        void setDirty(bool dirty=true){ m_dirty = dirty; }*/

        // Get pointer to item
        const InventoryItem * getItem(u32 i) const;
        InventoryItem * getItem(u32 i);
        // Returns old item (or NULL). Parameter can be NULL.
        InventoryItem * changeItem(u32 i, InventoryItem *newitem);
        // Delete item
        void deleteItem(u32 i);

        // Adds an item to a suitable place. Returns leftover item.
        // If all went into the list, returns NULL.
        InventoryItem * addItem(InventoryItem *newitem);

        // If possible, adds item to given slot.
        // If cannot be added at all, returns the item back.
        // If can be added partly, decremented item is returned back.
        // If can be added fully, NULL is returned.
        InventoryItem * addItem(u32 i, InventoryItem *newitem);

        // Updates item type/count/wear
        void updateItem(u32 i, content_t type, u16 wear_count, u16 data);

        // Checks whether the item could be added to the given slot

        bool itemFits(const u32 i, const InventoryItem *newitem);



        // Checks whether there is room for a given item

        bool roomForItem(const InventoryItem *item);



        // Checks whether there is room for a given item after it has been cooked

        bool roomForCookedItem(const InventoryItem *item);



        // Checks whether there is room for a given item after it has been crushed

        bool roomForCrushedItem(const InventoryItem *item);



        // Takes some items from a slot.

        // If there are not enough, takes as many as it can.

        // Returns NULL if couldn't take any.

        InventoryItem * takeItem(u32 i, u32 count);



        // find a stack containing an item

        InventoryItem *findItem(content_t item, u16 *item_i = NULL);



        // Decrements amount of every material item

        void decrementMaterials(u16 count);



        void print(std::ostream &o);



        void addDiff(u32 index, InventoryItem *item) {m_diff.add(m_name,index,item);}

        InventoryDiff &getDiff() {return m_diff;}



private:

        core::array<InventoryItem*> m_items;

        u32 m_size;

        std::string m_name;

        std::map<content_t,bool> m_allowed;

        std::map<content_t,bool> m_denied;

        bool m_stackable;

        InventoryDiff m_diff;

};

Note: there is no explicit constructor call for m_lists. The default constructor was probably run. At least the data and allocated fields were set to NULL and 0.
Note: m_lists.size() just returns the used field.
Note: The loop has no guard, it just relies upon size().
Note: This core: may be part of the irrlicht library, however I have seen the similar kinds of problems with a std: vector.
Note: this is the first time, in weeks of running this program, that I have seen this particular segfault.

Possibilities:
1. the library is not initializing the m_lists correctly
2. the user is required to check for empty list before looping over content, or using size(). Don't know how.
3. the user is required to initialize this item explicitly.
4. the compiler is optimizing away something that it should not.
5. Something is writing random values.

Comments based on your experience with C++ constructs and the C++ compiler please.
Please note that I cannot make this segfault happen again, so I cannot "try something" and see immediate results.
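Whatever the cause turns out to be, the gdb dump above shows fields that contradict each other: data = 0x0 and allocated = 0, but used = 60. The strategy field also carries unknown bits (0xc), which hints that more than one field was disturbed. A minimal sketch of a consistency check that could be called before such a loop follows. The struct is a hypothetical stand-in for the three fields gdb printed, not the Irrlicht class itself, and the function name is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the bookkeeping fields gdb printed for m_lists.
// This is NOT the real irr::core::array, only the three values of interest.
struct ArrayFields {
    const void* data;       // pointer to the element storage
    unsigned    allocated;  // capacity, in elements
    unsigned    used;       // element count, the value size() returns
};

// Returns true when the three fields do not contradict each other.
inline bool fields_consistent(const ArrayFields& a)
{
    if (a.used > a.allocated)
        return false;   // more elements claimed than storage exists for
    if (a.data == NULL && a.allocated != 0)
        return false;   // storage claimed, but no pointer to it
    return true;
}
```

A check like this does not repair anything. It only turns a segfault into a report at the point where the bad state is first seen, which may be closer to the real cause than the crash site.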

pan64

02-08-2024 02:17 AM

valgrind looks for exactly this kind of error.

selfprogrammed

02-08-2024 10:49 PM

As previously reported, valgrind was tried, it ran for 10 minutes then gave up and exited.
I am not sure what it could tell me that is not already obvious.
Valgrind does not have any special insight into the library, its proper usage, or the compiler optimization.

If used is left holding a nonzero value when there is no data, and size() only returns that value, of course it is going to segfault.
Perhaps it is the library that is responsible, ... maybe.
Maybe this works on other compilers, or other programs, or it requires some rules for how the user can call the functions.

I find it hard to believe that this library could be this way, and not have errors all over the place. Something is wrong, but I am not sure exactly what.

I am still mistrusting the compiler, as it is capable of producing such an inconsistency by optimizing.
It could be removing the setting of used to zero because it thought it was already zeroed by some other initialization.

The above segfault was an unusual one, in that it involved a structure that I could look at.
Most of the segfaults are similar, but involve deallocation of a std::vector that has mysteriously gone bad.
Those segfaults occur deep in the std: implementation libraries, where I have not found any relevant information that I can print out besides "info stack".
What I notice is the stack trace contains a deallocation of several mesh structures, and the automatic deallocation of MeshData fails, for reasons it does not tell me (except that there was segfault in an allocator). What it is doing calling an allocator in order to deallocate, I do not know. This was intended to be hidden implementation, and they did a good job on the hiding part, and a terrible job in the reliability part.

So, has anyone got any good ideas on how to check up on a std::vector?
When I try to print one using GDB, I only get some standardized format, not a view of the internal fields.
It will print out values for indices[1222] even when indices has length 0 capacity 0.
How it does that, when capacity is 0 (no allocated vector data memory?), is probably why the segfaults are so erratic.

Anyone taking the position that std::vector is well written, and cannot possibly be at fault in its design, will then have to explain how this keeps happening.
Then the possibility that the compiler is at fault will be the only other explanation. I already realized that before I posted anything.
Also, reporting a suspected GCC bug will do NOTHING (IMExp), unless it can be documented, it is reproducible, and you are running the latest bleeding edge version.
From my previous experiences, the version of GCC offered by Slackware, would be considered too old to be supported.

selfprogrammed

02-09-2024 12:46 AM

I have, in desperation, looked at the std::vector implementation. Desperation, because no one should try to read that, as it makes APL look compact and reasonable.

It is in "/usr/include/c++/11.2.0/bits/stl_vector.h".

Most everything is protected, and convoluted on top of that.

Notable is that there are multiple patches to std::vector in these header files, that are enabled by

Code:

#if __cplusplus >= 201103L

I suspect those patches would not be enabled by 11.2.0, but would be enabled by 11.4.0.

The patches change the std::vector implementation significantly. There are so many, that I have to point you at the header file, read it for yourself.
Constructors and initializers are heavily affected by the patches. This is exactly where the problem I am having is manifesting.
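Which side of those #if blocks gets compiled depends on the value of __cplusplus, and that value follows the language standard in effect (for example as selected with -std=). A minimal sketch for checking it on a given setup follows. The function names are made up for illustration, and the same file can be built with each compiler and each set of flags to compare.

```cpp
#include <cassert>
#include <cstdio>

// Name of the language standard this translation unit is being compiled as,
// derived only from the standard __cplusplus macro.
inline const char* standard_name()
{
#if __cplusplus >= 202002L
    return "C++20 or later";
#elif __cplusplus >= 201703L
    return "C++17";
#elif __cplusplus >= 201402L
    return "C++14";
#elif __cplusplus >= 201103L
    return "C++11";
#else
    return "C++98/03";
#endif
}

// Prints the raw macro value next to the readable name.
inline void print_standard()
{
    std::printf("__cplusplus = %ldL (%s)\n",
                static_cast<long>(__cplusplus), standard_name());
}
```

Calling print_standard() from a small main(), built with the same flags the project uses, shows whether the 201103L branches in stl_vector.h are active for that build.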

I have discovered the following peeks into the structure.
Given user declared std::vector v1
GDB commands that work.

Code:

p  v1._M_impl

p  v1->_M_impl

p  v1->_M_impl._M_start

p  v1->_M_impl._M_finish

p  v1->_M_impl._M_end_of_storage

I have copied parts of the stl header here to discuss it.

Code:

      // [23.2.4.2] capacity

      /**  Returns the number of elements in the %vector.  */

      size_type

      size() const _GLIBCXX_NOEXCEPT

      { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }

So size() = _M_finish - _M_start

The clear() function calls _M_erase_at_end(), a function that is repeatedly in the stack dump when it segfaults.

Code:

    void

      clear() _GLIBCXX_NOEXCEPT

      { _M_erase_at_end(this->_M_impl._M_start); }

Code:

      // Called by erase(q1,q2), clear(), resize(), _M_fill_assign,

      // _M_assign_aux.

      void

      _M_erase_at_end(pointer __pos) _GLIBCXX_NOEXCEPT

      {

        if (size_type __n = this->_M_impl._M_finish - __pos)

          {

            std::_Destroy(__pos, this->_M_impl._M_finish,

                          _M_get_Tp_allocator());

            this->_M_impl._M_finish = __pos;

            _GLIBCXX_ASAN_ANNOTATE_SHRINK(__n);

          }

      }

It calls Destroy, another function that appears in the segfault stack dumps.
It is guarded by an expression that translates to "if( n = size() )".
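The effect of clear() described above can be observed from the outside without touching _M_impl. This is only a sketch with made-up names: it records size and capacity before and after clear(), plus the distance end() - begin(), which is the iterator-level view of _M_finish - _M_start.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Values observed around a clear() call (hypothetical helper type).
struct ClearReport {
    std::size_t size_before, size_after;
    std::size_t capacity_before, capacity_after;
    std::size_t distance_after;   // end() - begin() after clear()
};

inline ClearReport clear_demo()
{
    std::vector<int> v(5, 42);    // five elements, all 42
    ClearReport r;
    r.size_before     = v.size();
    r.capacity_before = v.capacity();

    v.clear();                    // destroys the elements, keeps the storage

    r.size_after     = v.size();
    r.capacity_after = v.capacity();
    r.distance_after = static_cast<std::size_t>(v.end() - v.begin());
    return r;
}
```

The size drops to zero while the capacity stays as it was, which matches the header: clear() only moves _M_finish back to _M_start and leaves the allocation alone.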

Code:

      /**

      *  The dtor only erases the elements, and note that if the

      *  elements themselves are pointers, the pointed-to memory is

      *  not touched in any way.  Managing the pointer is the user's

      *  responsibility.

      */

      ~vector() _GLIBCXX_NOEXCEPT

      {

        std::_Destroy(this->_M_impl._M_start, this->_M_impl._M_finish,

                      _M_get_Tp_allocator());

        _GLIBCXX_ASAN_ANNOTATE_BEFORE_DEALLOC;

      }

_Destroy is ONLY CALLED by the destructor and by _M_erase_at_end, and both of them call it with the second param = _M_finish,
but with a different first param.
So _Destroy cannot be doing the deallocation using the first param.

I found _Destroy in "/usr/include/c++/11.2.0/bits/alloc_traits.h".
The alloc_traits name also appears in the segfault stack dump.

Code:

  /**

  * Destroy a range of objects using the supplied allocator.  For

  * non-default allocators we do not optimize away invocation of

  * destroy() even if _Tp has a trivial destructor.

  */



  template<typename _ForwardIterator, typename _Allocator>

    void

    _Destroy(_ForwardIterator __first, _ForwardIterator __last,

            _Allocator& __alloc)

    {

      for (; __first != __last; ++__first)

#if __cplusplus < 201103L

        __alloc.destroy(std::__addressof(*__first));

#else

        allocator_traits<_Allocator>::destroy(__alloc,

                                              std::__addressof(*__first));

#endif

    }

It appears to destroy the vector as an array, calling alloc.destroy for each element.
I do not see where it calls the deallocation for the array memory, except that it keeps passing around this Allocator.
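On where the array memory is released: as far as I can tell, in libstdc++ the storage is freed by the destructor of the base class (_Vector_base), which runs after ~vector() has destroyed the elements. The sketch below does not rely on that detail. It only counts calls from the outside, using a hypothetical counting allocator and element type, to show that element destruction and storage deallocation are two separate steps.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Global counters, updated by the hypothetical allocator and element below.
static int g_allocs = 0, g_deallocs = 0, g_dtors = 0;

struct Elem {
    int v;
    Elem() : v(0) {}
    ~Elem() { ++g_dtors; }               // counts element destruction
};

// Minimal C++11 allocator that counts storage requests and releases.
template <typename T>
struct CountingAlloc {
    typedef T value_type;
    CountingAlloc() {}
    template <typename U> CountingAlloc(const CountingAlloc<U>&) {}
    T* allocate(std::size_t n) {
        ++g_allocs;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ++g_deallocs; ::operator delete(p); }
};
template <typename T, typename U>
bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

struct Counts { int allocs, deallocs, dtors; };

inline Counts run_counting_demo()
{
    g_allocs = g_deallocs = g_dtors = 0;
    {
        std::vector<Elem, CountingAlloc<Elem> > v;
        v.reserve(4);        // one allocation of storage
        v.emplace_back();    // elements are constructed in place,
        v.emplace_back();    // so no temporaries are destroyed
        v.emplace_back();
    }                        // vector destroyed here
    Counts c = { g_allocs, g_deallocs, g_dtors };
    return c;
}
```

After the vector goes out of scope, the counters show three element destructors and exactly one deallocation of the storage block.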

I assume that _M_start is the ptr to the allocated vector data.
Altering _M_start in any way would make deallocation fail due to the heap allocation header being stored immediately before it (AFAIK).

The only way that I can see that _Destroy could be segfaulting is if one of the ptrs (_M_start, _M_finish) had been corrupted, or if the memory page allocation had moved so as to make it invalid.

That is all I can tell from this examination.
Maybe it provides some info that gives someone else some revelation.

henca

02-09-2024 01:08 AM

Quote:

Originally Posted by selfprogrammed (Post 6482374)

As previously reported, valgrind was tried, it ran for 10 minutes then gave up and exited.

What message did it give at exit? Something like "too many errors, go fix your program"? If so, there is a flag to valgrind which does not cause it to stop even after a huge amount of errors.

Quote:

Originally Posted by selfprogrammed (Post 6482374)

involve deallocation of a std::vector that has mysteriously gone bad.

That "mysteriously gone bad" stuff is exactly what valgrind is good at finding. Studying a core file after a segfault, you might see which variables have become broken, but gdb will at that time not be able to tell when or how those variables got broken. For that you will need something like the rr debugger, which records the entire run and allows you to step forwards and backwards in the run.

Quote:

Originally Posted by selfprogrammed (Post 6482374)

When I try to print one using GDB, I only get some standardized format, not a view of the internal fields.
It will print out values for indices[1222] even when indices has length 0 capacity 0.
How it does that, when capacity is 0 (no allocated vector data memory?), is probably why the segfaults are so erratic.

To more easily study the contents in variables and what pointers point to you might want to try ddd which is a frontend for gdb and is included in Slackware.

regards Henrik

pan64

02-09-2024 02:35 AM

Quote:

Originally Posted by selfprogrammed (Post 6482374)

As previously reported, valgrind was tried, it ran for 10 minutes then gave up and exited.

That means you actually did not try it, you gave up. valgrind needs time to do its job, so be patient. If you have compiled your code in debug mode and without optimization it can tell you exactly which line/variable/value caused this issue.

Another possibility can be cppcheck, which can identify bad coding practices, that may lead to similar issues.

selfprogrammed

02-09-2024 03:40 AM

This is what I had to do to access the std::vector to check it.

Code:

#ifdef DEBUG_VECTOR_ALLOC

#include "common.h"

namespace tststd

{

  template< typename _Tp, typename _Alloc = std::allocator<_Tp> >

    struct vector : public std::vector< _Tp, _Alloc >

  {

    public:

      _Tp vpeek0, vpeek1;

        

      void check_vector( const char * who )

      {

          const char * s;



          if( this->_M_impl._M_start )

          {

              if( this->_M_impl._M_finish < this->_M_impl._M_start )  goto corrupted;

              if( this->_M_impl._M_end_of_storage < this->_M_impl._M_start )  goto corrupted;

              if( this->_M_impl._M_end_of_storage < this->_M_impl._M_finish )  goto corrupted;

              // test for segfault

              vpeek0 = *this->_M_impl._M_start;

              if( this->_M_impl._M_finish - this->_M_impl._M_start > 1 )

              {

                  vpeek1 = *(this->_M_impl._M_finish - 1);

              }

          }

          else

          {

              if( this->_M_impl._M_finish

                  || this->_M_impl._M_end_of_storage

                ) {

                  s = "not clean";

                  goto dump_content;

              }

          }

          return;

          

corrupted:

          s = "corrupt";

      

dump_content:          

          vlprintf( CN_DEBUG, "Vector %s: %s (%p,%p,%p) size = %i\n", who, s,

                    this->_M_impl._M_start, this->_M_impl._M_finish, this->_M_impl._M_end_of_storage, this->size() );

      }

      



  };

};



#endif
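The wrapper above depends on libstdc++ internals (_M_impl and friends). A less thorough check is possible through the public interface only, which avoids deriving from std::vector. This is a sketch under that assumption; the function name is made up, and it cannot see everything the pointer-level check can.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sanity check that uses only public members of std::vector.
// A corrupted _M_finish or _M_end_of_storage usually shows up here as a
// size() or capacity() that makes no sense.
template <typename T, typename A>
bool vector_looks_sane(const std::vector<T, A>& v)
{
    if (v.size() > v.capacity())
        return false;   // finish past end_of_storage, or wrapped-around size
    if (v.capacity() > v.max_size())
        return false;   // capacity larger than any vector could hold
    if (v.capacity() != 0 && v.data() == NULL)
        return false;   // storage claimed, but no pointer to it
    return true;
}
```

Because size() is unsigned, a _M_finish that has ended up below _M_start appears as a huge size, which the first test catches in most cases.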

selfprogrammed

02-09-2024 03:59 AM

Yes I did try valgrind, and it did give up on its own.

I now have a new problem. Ever since I tried to compile with clang, the compiles have generated a smaller binary that will not run.
The previous binary was 29024397 bytes (29M), and the clang binary was 22787472 bytes (22M).
The clang binary would not run.
I went back to compiling for GCC, and it still generates a small binary (22M) that will not run.

This last time I got an error message.

Code:

Starting program: /usr/bin/voxelands 

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib/libthread_db.so.1".

safestack CHECK failed: /tmp/llvm-13.0.0.src/projects/compiler-rt/lib/safestack/safestack.cpp:95 MAP_FAILED != addr



Program received signal SIGABRT, Aborted.

0xb6c0da1a in raise () from /lib/libc.so.6

I have tried deleting two CMakeFiles directories and letting it recompile everything again.
It still compiled into a binary of 22M, that will not run.
After careful checking of compile dates, I did several compiles on that same day. All the binaries created before the clang experiment are 29M in size.
The clang binary and every GCC binary compiled after the clang experiment is 22M, and will not run.
I had made a separate build script for the clang experiment. All the changes that I made to the build script to compile with clang were done in the Clang version of that debug script.

So I can find nothing that was changed in the debug build script used for GCC.
I will keep looking. I wonder if any of the Clang users know of some trap like this.

selfprogrammed

02-09-2024 04:15 AM

I got a running binary again.

If you do this CLANG experiment, afterwards you must remove:
the files CMakeCache.txt,
and directories CMakeFiles, src/CMakeFiles.

When the build script is modified for clang, it has these two explicit CMAKE options that explicitly specify the clang compiler.
If you try to switch back to the normal build script, there are no explicit option lines, so CMAKE goes to the cache and uses the value from the last compile.
The CMAKE display during compile, does not reveal that it has done this.
If it was anywhere in the printout, it went by so fast that I would not have had time to even focus my eyes upon the lines.

It would be safer to use an entirely separate copy of the voxelands directory for the clang experiment, so to keep the two setups truly separate.
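Since CMake does not make it obvious which compiler it picked up from the cache, one option is to let the binary report it. A minimal sketch using predefined compiler macros follows; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Short name of the compiler that built this translation unit.
inline const char* compiler_name()
{
#if defined(__clang__)
    return "clang";     // tested first, because clang also defines __GNUC__
#elif defined(__GNUC__)
    return "gcc";
#else
    return "unknown";
#endif
}

// Prints the compiler name and, where available, its version string.
inline void print_compiler()
{
#if defined(__VERSION__)
    std::printf("built with %s, version string: %s\n",
                compiler_name(), __VERSION__);
#else
    std::printf("built with %s\n", compiler_name());
#endif
}
```

Calling print_compiler() once at startup would have shown immediately that the "GCC" build after the clang experiment was still a clang build.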

pan64

02-11-2024 09:08 AM

Quote:

Originally Posted by selfprogrammed (Post 6482398)

It would be safer to use an entirely separate copy of the voxelands directory for the clang experiment, so to keep the two setups truly separate.

That is your choice. Create an environment for gcc, another one for clang. That's all.
You didn't answer, was there any error output when valgrind stopped?

selfprogrammed

02-13-2024 01:21 PM

I was going to try it again, and found I already had a log.
"Valgrind's memory management: out of memory"
"Whatever the reason, Valgrind cannot continue. Sorry."

Code:

==2441== Memcheck, a memory error detector

==2441== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.

==2441== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info

==2441== Command: voxelands

==2441== Parent PID: 2440

==2441== 

--2441-- core    : 1,228,931,072/1,228,931,072 max/curr mmap'd, 0/2 unsplit/split sb unmmap'd,    959,172,292/  959,172,292 max/curr,    19281722/1660987460 totalloc-blocks/bytes,    20408867 searches 4 rzB

--2441-- dinfo  :    16,551,936/  15,761,408 max/curr mmap'd, 5/4 unsplit/split sb unmmap'd,    15,032,928/  14,128,064 max/curr,      231099/ 101042936 totalloc-blocks/bytes,      234608 searches 4 rzB

--2441-- client  :  851,001,344/  851,001,344 max/curr mmap'd, 0/0 unsplit/split sb unmmap'd,    816,995,192/  816,995,192 max/curr,      557027/3114778704 totalloc-blocks/bytes,    1286858 searches 20 rzB

--2441-- demangle:        65,536/      65,536 max/curr mmap'd, 0/0 unsplit/split sb unmmap'd,          1,240/          912 max/curr,          24/      3760 totalloc-blocks/bytes,          23 searches 4 rzB

--2441-- ttaux  :    3,014,656/    2,584,576 max/curr mmap'd, 13/2 unsplit/split sb unmmap'd,      2,898,952/    2,466,792 max/curr,      11144/  7001072 totalloc-blocks/bytes,      11124 searches 4 rzB

--2441-- translate: 1,107,423 guest insns, 120,629 traces, 48,952 uncond chased, 1276 cond chased

--2441-- translate:            fast new/die SP updates identified: 126,297 (67.0%)/37,140 (19.7%)

--2441-- translate:  generic_known new/die SP updates identified: 16,311 (8.7%)/8,035 (4.3%)

--2441-- translate: generic_unknown SP updates identified: 694 (0.4%)

--2441-- translate: PX: SPonly 0,  UnwRegs 120,629,  AllRegs 0,  AllRegsAllInsns 0

--2441--    tt/tc: 300,144 tt lookups requiring 1,116,851 probes

--2441--    tt/tc: 199,285 fast-cache updates, 5 flushes

--2441--  transtab: new        120,629 (4,211,144 -> 79,615,373; ratio 18.9) [0 scs] avg tce size 660

--2441--  transtab: dumped    0 (0 -> ??) (sectors recycled 0)

--2441--  transtab: discarded  0 (0 -> ??)

--2441-- scheduler: 2,237,770,218 event checks.

--2441-- scheduler: 77,731,380 indir transfers, 76,696 misses (1 in 1013) ..

--2441-- scheduler: .. of which: 76,148,415 hit0, 1,270,488 hit1, 175,425 hit2, 60,356 hit3, 76,696 missed

--2441-- scheduler: 22,376/1,414,011 major/minor sched events.

--2441--    sanity: 22376 cheap, 190 expensive checks.

--2441--    exectx: 393,241 lists, 350,732 contexts (avg 0.89 per list) (avg 8.37 IP per context)

--2441--    exectx: 1,158,041 searches, 1,069,289 full compares (923 per 1000)

--2441--    exectx: 0 cmp2, 10 cmp4, 0 cmpAll

--2441--  errormgr: 5 supplist searches, 33 comparisons during search

--2441--  errormgr: 5 errlist searches, 10 comparisons during search

--2441--  memcheck: freelist: vol 19904484 length 1704

--2441--  memcheck: sanity checks: 22376 cheap, 191 expensive

--2441--  memcheck: auxmaps: 0 auxmap entries (0k, 0M) in use

--2441--  memcheck: auxmaps_L1: 0 searches, 0 cmps, ratio 0:10

--2441--  memcheck: auxmaps_L2: 0 searches, 0 nodes

--2441--  memcheck: SMs: n_issued      = 44783 (716528k, 699M)

--2441--  memcheck: SMs: n_deissued    = 31183 (498928k, 487M)

--2441--  memcheck: SMs: max_noaccess  = 65535 (1048560k, 1023M)

--2441--  memcheck: SMs: max_undefined = 1026 (16416k, 16M)

--2441--  memcheck: SMs: max_defined  = 7832 (125312k, 122M)

--2441--  memcheck: SMs: max_non_DSM  = 13600 (217600k, 212M)

--2441--  memcheck: max sec V bit nodes:    10023 (313k, 0M)

--2441--  memcheck: set_sec_vbits8 calls: 985205 (new: 215959, updates: 769246)

--2441--  memcheck: max shadow mem size:  218217k, 213M

--2441--  ocacheL1:  2,084,852,981 refs      71,613,145 misses (20,373,183 lossage)

--2441--  ocacheL1:  1,874,521,262 at 0      138,718,574 at 1

--2441--  ocacheL1:              0 at 2+      72,696,883 move-fwds

--2441--  ocacheL1:    92,274,688 sizeB      67,108,864 useful

--2441--  ocacheL2:    91,986,327 finds      67,868,287 misses

--2441--  ocacheL2:    18,169,091 adds      49,142,810 dels

--2441--  ocacheL2:    16,628,700 max nodes 16,628,700 curr nodes

--2441--  niacache:            0 refs              0 misses



host stacktrace:

==2441==    at 0x5803FBF6: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x5804B332: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x5804B4B7: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x5804B9EC: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x5804DD2E: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x5804F8BC: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x5800B832: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x58010506: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x58015231: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x580053D4: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x58005598: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x580A9354: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x580FB59F: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x580FB8E0: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

==2441==    by 0x580BCB6E: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)



sched status:

  running_tid=5



Thread 1: status = VgTs_WaitSys syscall 3 (lwpid 2441)

==2441==    at 0x5370EBA: read (in /lib/libc-2.33.so)

==2441==    by 0x52F7BFF: _IO_file_seekoff@@GLIBC_2.1 (in /lib/libc-2.33.so)

==2441==    by 0x52F3B91: fseek (in /lib/libc-2.33.so)

==2441==    by 0x81036C7: file_load (in /usr/bin/voxelands)

==2441==    by 0x83767DB: ??? (in /usr/bin/voxelands)

==2441==    by 0x8376FD4: sound_init (in /usr/bin/voxelands)

==2441==    by 0x80F623A: main (in /usr/bin/voxelands)

client stack range: [0xBEB62000 0xBEB85FFF] client SP: 0xBEB84500

valgrind stack range: [0x68179000 0x68278FFF] top usage: 7040 of 1048576



Thread 2: status = VgTs_WaitSys syscall 422 (lwpid 2446)

==2441==    at 0x4E125DF: __futex_abstimed_wait_cancelable64 (in /lib/libpthread-2.33.so)

==2441==    by 0x4E0A478: pthread_cond_wait@@GLIBC_2.3.2 (in /lib/libpthread-2.33.so)

==2441==    by 0xA3278E1: ??? (in /usr/lib/dri/r600_dri.so)

==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)

==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)

client stack range: [0xF66D000 0xFE6CFFF] client SP: 0xFE6C0C0

valgrind stack range: [0x6DDC5000 0x6DEC4FFF] top usage: 4764 of 1048576



Thread 3: status = VgTs_WaitSys syscall 422 (lwpid 2447)

==2441==    at 0x4E125DF: __futex_abstimed_wait_cancelable64 (in /lib/libpthread-2.33.so)

==2441==    by 0x4E0A478: pthread_cond_wait@@GLIBC_2.3.2 (in /lib/libpthread-2.33.so)

==2441==    by 0xA3278E1: ??? (in /usr/lib/dri/r600_dri.so)

==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)

==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)

client stack range: [0xFFBA000 0x107B9FFF] client SP: 0x107B90C0

valgrind stack range: [0x6E0D5000 0x6E1D4FFF] top usage: 3072 of 1048576



Thread 4: status = VgTs_WaitSys syscall 168 (lwpid 2453)

==2441==    at 0x5378E5A: poll (in /lib/libc-2.33.so)

==2441==    by 0x4B6D71B: ??? (in /usr/lib/libopenal.so.1.21.1)

==2441==    by 0x111C3688: pa_mainloop_poll (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x111C3ED3: pa_mainloop_iterate (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x111C3FB4: pa_mainloop_run (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x4B6F33E: ??? (in /usr/lib/libopenal.so.1.21.1)

==2441==    by 0x4FBE4E5: ??? (in /usr/lib/libstdc++.so.6.0.29)

==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)

==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)

client stack range: [0x16476000 0x16C75FFF] client SP: 0x16C751C0

valgrind stack range: [0x7E5AE000 0x7E6ADFFF] top usage: 5648 of 1048576



Thread 5: status = VgTs_Runnable (lwpid 2454)

==2441==    at 0x4040690: malloc (vg_replace_malloc.c:431)

==2441==    by 0x111DE538: pa_xmalloc (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x11A7D963: pa_memblock_new (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)

==2441==    by 0x111CE1F7: pa_stream_begin_write (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x4B6C975: ??? (in /usr/lib/libopenal.so.1.21.1)

==2441==    by 0x111CDADF: ??? (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x11A86FC7: pa_pdispatch_run (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)

==2441==    by 0x111A8DDC: ??? (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x11A8A300: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)

==2441==    by 0x11A8DEB4: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)

==2441==    by 0x11A8E3B7: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)

==2441==    by 0x11A8EF52: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)

==2441==    by 0x111C3AD5: pa_mainloop_dispatch (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x111C3EE1: pa_mainloop_iterate (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x111C3FB4: pa_mainloop_run (in /usr/lib/libpulse.so.0.24.0)

==2441==    by 0x4B6F33E: ??? (in /usr/lib/libopenal.so.1.21.1)

==2441==    by 0x4FBE4E5: ??? (in /usr/lib/libstdc++.so.6.0.29)

==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)

==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)

client stack range: [0x1F014000 0x1F813FFF] client SP: 0x1F812E40

valgrind stack range: [0x7F502000 0x7F601FFF] top usage: 5648 of 1048576



Thread 6: status = VgTs_WaitSys syscall 422 (lwpid 2455)

==2441==    at 0x4E125DF: __futex_abstimed_wait_cancelable64 (in /lib/libpthread-2.33.so)

==2441==    by 0x4E0D566: do_futex_wait.constprop.0 (in /lib/libpthread-2.33.so)

==2441==    by 0x4E0D60E: __new_sem_wait_slow64.constprop.0 (in /lib/libpthread-2.33.so)

==2441==    by 0x4B866A1: ??? (in /usr/lib/libopenal.so.1.21.1)

==2441==    by 0x4B01694: ??? (in /usr/lib/libopenal.so.1.21.1)

==2441==    by 0x4FBE4E5: ??? (in /usr/lib/libstdc++.so.6.0.29)

==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)

==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)

client stack range: [0x33863000 0x34062FFF] client SP: 0x34062060

valgrind stack range: [0x847CF000 0x848CEFFF] top usage: 3072 of 1048576



==2441== 

==2441==    Valgrind's memory management: out of memory: Invalid argument

==2441==        newSuperblock's request for 4194304 bytes failed.

==2441==        2,691,284,992 bytes have already been mmap-ed ANONYMOUS.

==2441==    Valgrind cannot continue.  Sorry.

==2441== 

==2441==    There are several possible reasons for this.

==2441==    - You have some kind of memory limit in place.  Look at the

==2441==      output of 'ulimit -a'.  Is there a limit on the size of

==2441==      virtual memory or address space?

==2441==    - You have run out of swap space.

==2441==    - You have some policy enabled that denies memory to be

==2441==      executable (for example selinux deny_execmem) that causes

==2441==      mmap to fail with Permission denied.

==2441==    - Valgrind has a bug.  If you think this is the case or you are

==2441==    not sure, please let us know and we'll try to fix it.

==2441==    Please note that programs can take substantially more memory than

==2441==    normal when running under Valgrind tools, eg. up to twice or

==2441==    more, depending on the tool.  On a 64-bit machine, Valgrind

==2441==    should be able to make use of up 32GB memory.  On a 32-bit

==2441==    machine, Valgrind should be able to use all the memory available

==2441==    to a single process, up to 4GB if that's how you have your

==2441==    kernel configured.  Most 32-bit Linux setups allow a maximum of

==2441==    3GB per process.

==2441== 

==2441==    Whatever the reason, Valgrind cannot continue.  Sorry.

selfprogrammed

02-13-2024 01:34 PM

I have been adding more canary guards and more vector checks to the destructor that segfaults most often.
As I instrument the object, its behavior changes.

I did actually have one of the canary guards trigger, and it presented me with an object that looked like it had been overwritten with some table. It looked like it was overlaid with an int array with many almost consecutive values.
It could have been a wild object ptr, or an overwrite from something else. I could not determine more.

It is getting more stable and runs longer. It will be difficult if the debugging code makes the bugs hide better.
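For anyone following along, this is roughly what a canary guard looks like. It is a sketch of the general technique with hypothetical names, not the guards used in the actual program. The canaries are volatile so that the optimizer does not drop the stores made in the destructor.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical marker values; any unlikely bit patterns will do.
static const uint32_t CANARY_LIVE = 0xC0FFEE11u;
static const uint32_t CANARY_DEAD = 0xDEADDEADu;

struct GuardedMesh {
    volatile uint32_t head;        // canary in front of the payload
    int               payload[8];  // stands in for the real members
    volatile uint32_t tail;        // canary behind the payload

    GuardedMesh() : head(CANARY_LIVE), payload(), tail(CANARY_LIVE) {}

    // Mark the object dead, so that a later access through a stale pointer
    // can be recognised by its canary values.
    ~GuardedMesh() { head = CANARY_DEAD; tail = CANARY_DEAD; }

    // True while both canaries still hold the live value.
    bool alive() const { return head == CANARY_LIVE && tail == CANARY_LIVE; }
};
```

A failing alive() with both canaries equal to the dead value suggests a stale pointer. Any other value suggests that something wrote over the object.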

pan64

02-14-2024 01:01 AM

Quote:

Originally Posted by selfprogrammed (Post 6483366)

This is just shooting in the dark; you can only shift the problem that way, not solve it. Is there any way to add more memory (or swap) to the system?
By the way, not having enough RAM is probably enough to kill your process.

selfprogrammed

02-15-2024 10:42 AM

I have installed many slackbuilds.
During that, I have had multiple programs compiling on this system simultaneously, while running other editors and logins, and never had it even touch the swap.
No process other than valgrind has ever run out of memory.

I must suspect that valgrind did something unusual.
This voxelands program uses multiple tasks, and swaps a large mesh structure with another construct. I suspect that valgrind tried to track that, and could not cope with it.

Rather than trying to change my system, it would be more instructive if someone who considers their system to be superior for this would test the program and tell me whether it produces the same behavior on their system. Perhaps, if their memory is large enough, they can even run valgrind on it.

The instrumented checks have tracked this down to a single vector that is repeatedly failing in its destructor. It is a std::vector, so it is difficult to get inside and identify why it is segfaulting during the destructor.

Note: I have added a clean to the destructor, to empty the vectors with explicit commands, so that this does not occur during the default destructor execution.
It does not segfault during that. However, it will segfault later at the end of the MeshData destructor, during a specific vector default destructor.
Note: That vector content is supplied by the IrrLicht library.
Note: Sometimes the vector has negative size, which means the internal pointers have been corrupted.
Note: There may still be more than one fault acting here.

Possibilities:
1. corruption of the vector allocator data (something internal to the vector).
2. the pointer to the MeshData may have been corrupted.
3. maybe that MeshData was released long ago, and this is a stale pointer.
4. The compiler can still not be excluded. The symptoms change too much when I have only added some debugging code that should not have such an effect.
It runs longer between faults, sometimes for hours, and it now mostly segfaults during a specific vector destructor.
Those changes would alter what the default constructors, default destructors, and optimizers have to cope with, but I have not altered what the code is doing with this MeshData.
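
A "negative" size can be checked for before the destructor runs. The following is only a minimal sketch of such a plausibility check; `vector_looks_sane` and `plausible_max` are hypothetical names, not part of voxelands or of the instrumentation described above. Reading a vector that really has been overwritten is undefined behaviour, so a check like this can hint at damage but cannot prove the vector is healthy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from voxelands): heuristic plausibility check
// for a std::vector whose internal pointers may have been overwritten.
template <typename T>
bool vector_looks_sane(const std::vector<T>& v, std::size_t plausible_max)
{
    // size() is computed from the internal begin/end pointers, so a
    // "negative" size shows up as a value larger than the capacity
    // or as an absurdly large unsigned number.
    if (v.size() > v.capacity()) return false;
    if (v.size() > plausible_max) return false;
    // A non-empty vector must own a buffer.
    if (!v.empty() && v.data() == nullptr) return false;
    return true;
}
```

Calling it as `vector_looks_sane(mesh_vector, 100000)` at the start of the explicit destructor (the limit is an assumed upper bound for the data) gives a place to set a breakpoint before the default destructor touches the allocator.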

BrunoLafleur

02-15-2024 11:04 AM

Quote:

Originally Posted by selfprogrammed (Post 6483801)

4. The compiler can still not be excluded. The symptoms change too radically when I have only added some debugging code that should not have such an effect.

It is usually a symptom of an earlier write outside array boundaries or a write to already freed memory. It is normal that adding new code in the area where you see the problem changes the result a lot, because it changes the memory layout. The stray write will then corrupt data in other places.

If valgrind can be set up to see something, it will probably find the source of the problem at the same place even in your instrumented code, because that is the root cause.

I didn't have time to test for now. I don't know if my computers have enough memory but I will see.

selfprogrammed

03-04-2024 05:34 PM

I was getting some other code done and committed, and could not test this for a while.

I have a derived vector class that I have instrumented with a canary and a data copy.
Because voxelands uses the vector swap operation, I also had to implement a swap_debug function to swap the debugging information.
I reconfigured the canary and added a tracker, on each of the 3 classes that figure in this.

I have seen two faults with the new instrumentation. That is within about 2 hours play time.

It will not fault where I can see why. It faults in the default destructors at the end of the explicit destructor. I have added code to empty the lists before that, so the only thing left there is releasing the vector memory allocation.
It does not like something with the allocator; in one case, for example, it complained that the size had changed.
Note that the stl vector swap exchanges the allocated vector data with another vector. It is supposed to also swap the allocator information if needed.
Of course the swapped vector data are not likely to be the same size.
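
What the swap does to the two vectors can be sketched as follows. This is an illustration using plain `std::vector<int>` with the default allocator, not the voxelands mesh types; `swap_exchanges_buffers` is a hypothetical name.

```cpp
#include <vector>

// Returns true if swap() exchanged the two heap buffers
// instead of copying elements between them.
bool swap_exchanges_buffers()
{
    std::vector<int> a(16, 1);    // 16 elements
    std::vector<int> b(1000, 2);  // deliberately a different size
    const int* a_buf = a.data();
    const int* b_buf = b.data();

    a.swap(b);  // only the internal pointers change hands

    return a.data() == b_buf && b.data() == a_buf
        && a.size() == 1000 && b.size() == 16;
}
```

Because the swap rewrites the internal pointers of both vectors, any other thread that reads, resizes, or destroys either vector has to hold the same lock as the thread doing the swap.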

The tracker records the last 16 operations. I did confirm that the last operation the tracker recorded was the vector swap.
I had a little trouble with that as the function that prints out the tracker info will hang if invoked from the debugger at that point. It seems that the data structure is locked, and it will hang in a spin lock. Impossible to recover from.

This would be the fault of the compiler stl implementation not handling the vector swap operator correctly. That is my best guess right now.
I doubt many programs ever use the stl vector swap operation.

Need to test this on something different like clang. My last attempt to compile voxelands using clang generated an unusable package, and I do not know what went wrong.

Addendum:
I am familiar with wild writes. That is why the instrumentation includes canary values and a copy of the first vector element, which is repeatedly checked against the actual vector.
The canary is marked with a unique value during deallocation, so that the canary checks will detect double deallocation, or writing to a deallocated entity.
They have found nothing, other than that swap operation violates their assumptions, thus the swap_debug operation to also swap the debug information.
The faults I got were usually segfaults. I think the instrumentation somehow has stopped those. Now I get double deallocation suspected, and at least one wrong deallocation size.
I do not think that Valgrind could instrument this any more completely.
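
The canary scheme described above can be sketched roughly as follows. The class name, the magic values, and the `retire()` method are assumptions made for illustration; this is not the code from the voxelands patch.

```cpp
#include <cstdint>

// Hypothetical guard object embedded in an instrumented class.
class Canary {
public:
    enum class State { Alive, Dead, Corrupt };

    Canary() : magic_(kAlive) {}
    ~Canary() { retire(); }

    // Mark the guard as deallocated so later checks can spot
    // double deallocation or use after release.
    void retire() { magic_ = kDead; }

    State check() const {
        const std::uint32_t m = magic_;  // read once
        if (m == kAlive) return State::Alive;
        if (m == kDead) return State::Dead;
        return State::Corrupt;  // overwritten by something else
    }

private:
    static constexpr std::uint32_t kAlive = 0xC0FFEE01u;
    static constexpr std::uint32_t kDead  = 0xDEADBEEFu;
    // volatile keeps the optimizer from dropping the store made
    // in the destructor as a dead store.
    volatile std::uint32_t magic_;
};
```

A check that returns `Dead` at the start of a destructor points to a double deallocation or a stale pointer, while `Corrupt` points to an overwrite.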

selfprogrammed

03-04-2024 06:03 PM

This is a voxelands (1709) diff, of the latest patches and instrumentation that I have applied to voxelands. Some of the patches do make segfaults go away.

https://filebin.net/ezdpdagwu81vbinc

Uploaded file: voxelands-v1709.00_debug_02.diff (text/plain; charset=utf-8, 101 kB)

selfprogrammed

03-20-2024 06:18 AM

I am now up to the 49th version of modified code, and still cannot identify exactly what is happening. It could still be a compiler issue, or possibly a strange coding error. I have been over that code so many times that I think I would have found a coding error by now.

It is still consistent in where it faults and how it faults. I have added more instrumentation to check possible faults.
The one deallocation fault is still there, and is still happening around 2 times per 2 hours (it will not occur sooner), with an error message of "corrupted or double deallocation" when trying to free some particular vectors. These vectors are subject to a swap operation, from another thread, protected by a Lock.

I have detected that another vector segfault occurred after several layers of instrumentation had verified the object repeatedly. Examination revealed an object with random data. It appears that the "this" ptr was corrupted in the middle of the function, so I have added instrumentation to detect that.
Of course, the latest runs do not detect anything yet.
That is why I must still consider that this may be a compiler fault.

I have obtained a copy of gcc 12.3. I just have to figure out how to install it without compromising the existing GCC package.

kgha

03-20-2024 06:22 AM

Quote:

Originally Posted by selfprogrammed (Post 6490776)

I just have to figure out how to install it without compromising the existing GCC package.

See https://gcc.gnu.org/faq.html#multiple

selfprogrammed

03-22-2024 06:04 AM

I know this is going on and on and is becoming an exercise in discovering what the voxelands programmers did.
I did a test of the memory allocation.
I have replaced several of the std::vector uses with a derived version with some instrumentation in it.
I added to that a test array allocation of a small array of bytes.

The program now faults on those allocations and deallocations, in about 2 seconds, with the same kind of messages I was getting before.
There is very little that I can see that could go wrong with this allocation and deallocation. I even NULL the ptr after delete.
The difference between this and the actual vector is that I deallocate and reallocate with every length change, so as to find allocation problems much sooner.
The size of the byte array is the same as the length of the vector, about 16 bytes.

This program creates a structure that has new data for the database. It uses a thread to do the actual update, and that thread does the deallocation of the update data, which was originally created in the client program.

In this environment (Linux), are stack allocations and deallocations thread safe, and can memory be deallocated in a different thread than the one that allocated it?
My experience is with different hardware, and it could be either valid or absolutely NOT.
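
For reference, a minimal sketch of the hand-off pattern described above, assuming the question concerns heap memory obtained with `new` (stack variables are released automatically when their function returns and are never freed by hand). Since C++11 the standard allocation functions are thread safe, and memory allocated in one thread may be released in another. `UpdateData` and `hand_off_and_free` are hypothetical names, not voxelands code.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for the update data built by the client.
struct UpdateData {
    std::vector<int> values;
};

// Allocates in the calling thread and frees in a worker thread.
// Returns the sum computed by the worker.
int hand_off_and_free()
{
    UpdateData* data = new UpdateData;  // allocated here
    data->values.assign(16, 3);

    int sum = 0;
    std::thread worker([data, &sum]() {
        for (int v : data->values) sum += v;
        delete data;  // released in the worker thread
    });
    worker.join();  // orders the worker's writes before the read below
    return sum;
}
```

What is not allowed is for both threads to touch the object without synchronization, or for the creating thread to use or delete `data` again after handing it off.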

selfprogrammed

03-28-2024 08:41 AM

I have the gcc 12.3 compiler installed. I compiled it with itself 3 times, just to make sure that the last two times were the same. It took around 5 to 6 hours of computer time. After all that, I am taking it as a confirmation that my machine does not have a physical fault.

Have not had the chance to try it out as I have found a whole new problem with the code.
It threw up another error that I had not seen before, and so I had to investigate. Now I am stuck trying to deal with it.

In another file, the program has an array of "Mesh" blocks, and is trying to use "memcpy" to copy part of the array to another place.
These contain the std::vector that is such a problem. Those std::vector have some internal allocation data structure that is faulting in the destructor.
The compiler is putting out a "warning" message about the memcpy, and says to do something else with the structure.
That is not easy to do.

From my analysis of the stack at the fault (I get about 4 or 5 of these to analyze every day), it would be entirely consistent that it had been copied using memcpy from some other source.
The internal ptrs of my debugging copies are wrong for the current instance of the data.

I expect that in some previous compiler version that an array of such classes could be copied using memcpy, and now they cannot.
With all this behind-the-back allocation and stl secret data, they are just making everything more and more fragile.
It is no wonder that it looks like the compiler is part of the problem. In a way, it is the stl implementation that comes with the compiler.
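
The memcpy problem can be illustrated with a small stand-in. `MeshBlock` and `copy_blocks` are hypothetical names used only for this sketch; the real voxelands classes differ. The point is that a class containing a std::vector is not trivially copyable, so an array of them has to be copied element by element with copy assignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a mesh block that owns a std::vector.
struct MeshBlock {
    std::vector<int> indices;
};

// memcpy is only valid for trivially copyable types; this is not one.
static_assert(!std::is_trivially_copyable<MeshBlock>::value,
              "MeshBlock owns heap memory and must not be copied with memcpy");

// Copies count blocks using copy assignment, so every destination
// block gets its own separately allocated buffer.
void copy_blocks(const MeshBlock* src, MeshBlock* dst, std::size_t count)
{
    std::copy(src, src + count, dst);
}
```

After a memcpy, the source and destination blocks would hold the same internal buffer pointer and both destructors would try to free it, which would be consistent with the "corrupted or double deallocation" messages. If the source blocks are no longer needed, `std::move` from `<algorithm>` can replace `std::copy` to transfer the buffers instead of duplicating them.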

BrunoLafleur

03-28-2024 09:02 AM

Quote:

Originally Posted by selfprogrammed (Post 6492514)

Yes, it is probably a good catch. But in C++ memcpy has never been a good way (even in very, very old compilers) to copy objects, because if objects have internal class members and/or pointers, only the pointers are copied and not the data themselves. And duplicating pointers is bad practice (because of aliasing, and because we could deallocate via one copy and forget about the other copy. Threading also adds some more complexity). Each pointer has its own semantics, which depend on the class that uses it. So for copying we must rely on constructors and destructors, even for arrays of objects.

And you are probably right in saying that the memcpy comes from a version where there was no STL. But it is not the fault of the STL but of whoever did the conversion to the STL: the fix was too quick. When porting to the STL, it is necessary to rethink the code or to rewrite it without the STL (that is possible, because we don't always need the complexity and genericity of the STL library). More specialized and simpler code is often enough.

All times are GMT -5. The time now is 01:17 AM.
