LinuxQuestions.org
Slackware: This Forum is for the discussion of Slackware Linux.

Old 02-08-2024, 10:49 PM   #31
selfprogrammed
Member
 
Registered: Jan 2010
Location: Minnesota, USA
Distribution: Slackware 13.37, 14.2, 15.0
Posts: 635

Original Poster
Rep: Reputation: 154

As previously reported, valgrind was tried; it ran for 10 minutes, then gave up and exited.
I am not sure what it could tell me that is not already obvious.
Valgrind does not have any special insight into the library, its proper usage, or the compiler's optimization.

When there is no data, yet used is left holding a value and size() simply returns that value, of course it is going to segfault.
Perhaps it is the library that is responsible ... maybe.
Maybe this works on other compilers, or in other programs, or maybe it requires some rules for how the user may call the functions.

I find it hard to believe that this library could be this way and not have errors all over the place. Something is wrong, but I am not sure exactly what.

I still mistrust the compiler, as it is capable of producing such an inconsistency by optimizing.
It could be removing the setting of used to zero because it thought it was already zeroed by some other initialization.

The above segfault was an unusual one, in that it involved a structure that I could look at.
Most of the segfaults are similar, but involve deallocation of a std::vector that has mysteriously gone bad.
Those segfaults occur deep in the std:: implementation libraries, where I have not found any relevant information to print out besides "info stack".
What I notice is that the stack trace contains a deallocation of several mesh structures, and the automatic deallocation of MeshData fails, for reasons it does not tell me (except that there was a segfault in an allocator). Why it is calling an allocator in order to deallocate, I do not know. This was intended to be a hidden implementation, and they did a good job on the hiding part and a terrible job on the reliability part.

So, has anyone got any good ideas on how to check up on a std::vector?
When I try to print one using GDB, I only get some standardized format, not a view of the internal fields.
It will print out values for indices[1222] even when indices has length 0 and capacity 0.
How it does that when the capacity is 0 (no allocated vector data memory?) is probably part of why the segfaults are so erratic.

Anyone taking the position that std::vector is well written, and cannot possibly be at fault in its design, will have to explain how this keeps happening.
Then the only other explanation left is that the compiler is at fault. I had already realized that before I posted anything.
Also, reporting a suspected GCC bug will do NOTHING (IMExp) unless it can be documented, it is reproducible, and you are running the latest bleeding-edge version.
From my previous experience, the version of GCC offered by Slackware would be considered too old to be supported.

Last edited by selfprogrammed; 02-08-2024 at 11:29 PM.
 
Old 02-09-2024, 12:46 AM   #32
selfprogrammed
I have, in desperation, looked at the std::vector implementation. Desperation, because no one should have to read that: it makes APL look compact and reasonable.

It is in "/usr/include/c++/11.2.0/bits/stl_vector.h".

Most everything is protected, and convoluted on top of that.

Notable is that there are multiple patches to std::vector in these header files that are enabled by
Code:
#if __cplusplus >= 201103L
I first suspected those patches would not be enabled by 11.2.0 but would be by 11.4.0; however, __cplusplus is set by the -std= flag given to the compiler, not by the GCC release.

The patches change the std::vector implementation significantly. There are so many that I can only point you at the header file; read it for yourself.
Constructors and initializers are heavily affected by the patches, which is exactly where the problem I am having manifests.

I have discovered the following peeks into the structure.
Given a user-declared std::vector v1, these GDB commands work.
Code:
p  v1._M_impl
p  v1->_M_impl
p  v1->_M_impl._M_start
p  v1->_M_impl._M_finish
p  v1->_M_impl._M_end_of_storage
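As an aside on the standardized output mentioned above: it comes from GDB's libstdc++ Python pretty-printers. If I recall correctly, the /r ("raw") print format bypasses them, giving the internal-field view without typing the _M_impl member paths by hand:

```shell
# inside gdb: /r prints in "raw" format, bypassing any loaded pretty-printer
(gdb) p /r v1
# or turn the printers off for the whole session
(gdb) disable pretty-printer
```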
I have copied parts of the stl header here to discuss it.

Code:
      // [23.2.4.2] capacity
      /**  Returns the number of elements in the %vector.  */
      size_type
      size() const _GLIBCXX_NOEXCEPT
      { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }
So size() = _M_finish - _M_start

The clear() function calls _M_erase_at_end(), a function that appears repeatedly in the stack dump when it segfaults.
Code:
     void
      clear() _GLIBCXX_NOEXCEPT
      { _M_erase_at_end(this->_M_impl._M_start); }

Code:
      // Called by erase(q1,q2), clear(), resize(), _M_fill_assign,
      // _M_assign_aux.
      void
      _M_erase_at_end(pointer __pos) _GLIBCXX_NOEXCEPT
      {
	if (size_type __n = this->_M_impl._M_finish - __pos)
	  {
	    std::_Destroy(__pos, this->_M_impl._M_finish,
			  _M_get_Tp_allocator());
	    this->_M_impl._M_finish = __pos;
	    _GLIBCXX_ASAN_ANNOTATE_SHRINK(__n);
	  }
      }
It calls _Destroy, another function that appears in the segfault stack dumps.
It is guarded by an expression that translates to "if( n = size() )".

Code:
      /**
       *  The dtor only erases the elements, and note that if the
       *  elements themselves are pointers, the pointed-to memory is
       *  not touched in any way.  Managing the pointer is the user's
       *  responsibility.
       */
      ~vector() _GLIBCXX_NOEXCEPT
      {
	std::_Destroy(this->_M_impl._M_start, this->_M_impl._M_finish,
		      _M_get_Tp_allocator());
	_GLIBCXX_ASAN_ANNOTATE_BEFORE_DEALLOC;
      }
std::_Destroy is ONLY CALLED by the destructor and by _M_erase_at_end, and both of them call it with the second parameter = _M_finish,
but with different first parameters.
So _Destroy cannot be doing the deallocation using the first parameter.

I found _Destroy in "/usr/include/c++/11.2.0/bits/alloc_traits.h".
The alloc_traits name also appears in the segfault stack dump.
Code:
  /**
   * Destroy a range of objects using the supplied allocator.  For
   * non-default allocators we do not optimize away invocation of
   * destroy() even if _Tp has a trivial destructor.
   */

  template<typename _ForwardIterator, typename _Allocator>
    void
    _Destroy(_ForwardIterator __first, _ForwardIterator __last,
	     _Allocator& __alloc)
    {
      for (; __first != __last; ++__first)
#if __cplusplus < 201103L
	__alloc.destroy(std::__addressof(*__first));
#else
	allocator_traits<_Allocator>::destroy(__alloc,
					      std::__addressof(*__first));
#endif
    }
It destroys the elements of the vector as an array, calling the allocator's destroy for each element.
I do not see where it calls the deallocation for the array memory itself, except that it keeps passing around this Allocator.

I assume that _M_start is the pointer to the allocated vector data.
Altering _M_start in any way would make deallocation fail, due to the heap allocation header being stored immediately before it (AFAIK).

The only way I can see that _Destroy could be segfaulting is if one of the pointers (_M_start, _M_finish) had been corrupted, or if the memory page allocation had moved so as to make it invalid.

That is all I can tell from this examination.
Maybe it provides some info that gives someone else some revelation.
 
Old 02-09-2024, 01:08 AM   #33
henca
Member
 
Registered: Aug 2007
Location: Linköping, Sweden
Distribution: Slackware
Posts: 959

Rep: Reputation: 649
Quote:
Originally Posted by selfprogrammed View Post
As previously reported, valgrind was tried, it ran for 10 minutes then gave up and exited.
What message did it give at exit? Something like "too many errors, go fix your program"? If so, there is a flag to valgrind (--error-limit=no) which keeps it from stopping even after a huge number of errors.


Quote:
Originally Posted by selfprogrammed View Post
involve deallocation of a std::vector that has mysteriously gone bad.
That "mysteriously gone bad" stuff is exactly what valgrind is good at finding. Studying a core file after a segfault, you might see which variables have become broken, but at that point gdb cannot tell you when or how they got broken. For that you would need something like the rr debugger, which records the entire run and lets you step forwards and backwards through it.

Quote:
Originally Posted by selfprogrammed View Post
When I try to print one using GDB, I only get some standardized format, not a view of the internal fields.
It will print out values for indices[1222] even when indices has length 0 capacity 0.
How it does that, when capacity is 0 (no allocated vector data memory?), is probably why the segfaults are so erratic.
To more easily study the contents of variables and what pointers point to, you might want to try ddd, which is a frontend for gdb and is included in Slackware.

regards Henrik
 
Old 02-09-2024, 02:35 AM   #34
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,855

Rep: Reputation: 7311
Quote:
Originally Posted by selfprogrammed View Post
As previously reported, valgrind was tried, it ran for 10 minutes then gave up and exited.
That means you did not actually try it; you gave up. valgrind needs time to do its job, so be patient. If you have compiled your code in debug mode and without optimization, it can tell you exactly which line/variable/value caused the issue.

Another possibility is cppcheck, which can identify bad coding practices that may lead to issues like this.
 
1 members found this post helpful.
Old 02-09-2024, 03:40 AM   #35
selfprogrammed
This is what I had to do to access the std::vector to check it.

Code:
#ifdef DEBUG_VECTOR_ALLOC
#include "common.h"
namespace tststd
{
   template< typename _Tp, typename _Alloc = std::allocator<_Tp> >
     struct vector : public std::vector< _Tp, _Alloc >
  {
    public:
       _Tp vpeek0, vpeek1;
	 
       void check_vector( const char * who )
       {
	   const char * s;

	   if( this->_M_impl._M_start )
	   {
	       if( this->_M_impl._M_finish < this->_M_impl._M_start )  goto corrupted;
	       if( this->_M_impl._M_end_of_storage < this->_M_impl._M_start )  goto corrupted;
	       if( this->_M_impl._M_end_of_storage < this->_M_impl._M_finish )  goto corrupted;
	       // test for segfault
	       vpeek0 = *this->_M_impl._M_start;
	       if( this->_M_impl._M_finish - this->_M_impl._M_start > 1 )
	       {
		   vpeek1 = *(this->_M_impl._M_finish - 1);
	       }
	   }
	   else
	   {
	       if( this->_M_impl._M_finish
		   || this->_M_impl._M_end_of_storage
		 ) {
		   s = "not clean";
		   goto dump_content;
	       }
	   }
	   return;
	   
corrupted:
	   s = "corrupt";
       
dump_content:	   
	   vlprintf( CN_DEBUG, "Vector %s: %s (%p,%p,%p) size = %i\n", who, s,
		     this->_M_impl._M_start, this->_M_impl._M_finish, this->_M_impl._M_end_of_storage, this->size() );
       }
       

   };
};

#endif
 
Old 02-09-2024, 03:59 AM   #36
selfprogrammed
Yes, I did try valgrind, and it did give up on its own.


I now have a new problem. Ever since I tried to compile with clang, the builds have generated a smaller binary that will not run.
The previous binary was 29024397 bytes (29M); the clang binary was 22787472 bytes (22M).
The clang binary would not run.
I went back to compiling with GCC, and it still generates a small binary (22M) that will not run.

This last time I got an error message.
Code:
Starting program: /usr/bin/voxelands 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
safestack CHECK failed: /tmp/llvm-13.0.0.src/projects/compiler-rt/lib/safestack/safestack.cpp:95 MAP_FAILED != addr

Program received signal SIGABRT, Aborted.
0xb6c0da1a in raise () from /lib/libc.so.6
I have tried deleting the two CMakeFiles directories and letting it recompile everything again.
It still compiled into a 22M binary that will not run.
After carefully checking compile dates: I did several compiles on that same day, and all the binaries created before the clang experiment are 29M in size.
The clang binary, and every GCC binary compiled after the clang experiment, is 22M and will not run.
I had made a separate build script for the clang experiment; all the changes made to compile with clang were done in that clang version of the debug script.

So I can find nothing that was changed in the debug build script used for GCC.
I will keep looking. I wonder if any Clang users know of some trap like this.
 
Old 02-09-2024, 04:15 AM   #37
selfprogrammed
I got a running binary again.

If you do this CLANG experiment, afterwards you must remove:
the file CMakeCache.txt,
and the directories CMakeFiles and src/CMakeFiles.

When the build script is modified for clang, it has two explicit CMake options that specify the clang compiler.
If you then switch back to the normal build script, there are no explicit option lines, so CMake goes to its cache and reuses the compiler values from the last configure.
The CMake output during the build does not reveal that it has done this.
If it was anywhere in the printout, it went by too fast for me even to focus my eyes on the lines.

It would be safer to use an entirely separate copy of the voxelands directory for the clang experiment, so as to keep the two setups truly separate.
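With a recent CMake (3.13 or later), the two caches can also be kept apart with separate out-of-source build directories, one per compiler; the directory names below are just examples:

```shell
# GCC build; its cache lives in build-gcc/CMakeCache.txt
cmake -S . -B build-gcc
cmake --build build-gcc

# clang build; its cache lives in build-clang/CMakeCache.txt
cmake -S . -B build-clang \
      -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
cmake --build build-clang
```

Each directory keeps its own CMakeCache.txt, so switching compilers can never silently reuse the other's cached settings.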
 
Old 02-11-2024, 09:08 AM   #38
pan64
Quote:
Originally Posted by selfprogrammed View Post

It would be safer to use an entirely separate copy of the voxelands directory for the clang experiment, so as to keep the two setups truly separate.
That is your choice. Create one environment for gcc and another for clang; that's all.
You didn't answer: was there any error output when valgrind stopped?

Last edited by pan64; 02-11-2024 at 09:09 AM.
 
Old 02-13-2024, 01:21 PM   #39
selfprogrammed
I was going to try it again, and found I already had a log.
"Valgrind's memory management: out of memory"
"Whatever the reason, Valgrind cannot continue. Sorry."

Code:
==2441== Memcheck, a memory error detector
==2441== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==2441== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==2441== Command: voxelands
==2441== Parent PID: 2440
==2441== 
--2441-- core    : 1,228,931,072/1,228,931,072 max/curr mmap'd, 0/2 unsplit/split sb unmmap'd,    959,172,292/  959,172,292 max/curr,    19281722/1660987460 totalloc-blocks/bytes,    20408867 searches 4 rzB
--2441-- dinfo   :    16,551,936/   15,761,408 max/curr mmap'd, 5/4 unsplit/split sb unmmap'd,     15,032,928/   14,128,064 max/curr,      231099/ 101042936 totalloc-blocks/bytes,      234608 searches 4 rzB
--2441-- client  :   851,001,344/  851,001,344 max/curr mmap'd, 0/0 unsplit/split sb unmmap'd,    816,995,192/  816,995,192 max/curr,      557027/3114778704 totalloc-blocks/bytes,     1286858 searches 20 rzB
--2441-- demangle:        65,536/       65,536 max/curr mmap'd, 0/0 unsplit/split sb unmmap'd,          1,240/          912 max/curr,          24/      3760 totalloc-blocks/bytes,          23 searches 4 rzB
--2441-- ttaux   :     3,014,656/    2,584,576 max/curr mmap'd, 13/2 unsplit/split sb unmmap'd,      2,898,952/    2,466,792 max/curr,       11144/   7001072 totalloc-blocks/bytes,       11124 searches 4 rzB
--2441-- translate: 1,107,423 guest insns, 120,629 traces, 48,952 uncond chased, 1276 cond chased
--2441-- translate:            fast new/die SP updates identified: 126,297 (67.0%)/37,140 (19.7%)
--2441-- translate:   generic_known new/die SP updates identified: 16,311 (8.7%)/8,035 (4.3%)
--2441-- translate: generic_unknown SP updates identified: 694 (0.4%)
--2441-- translate: PX: SPonly 0,  UnwRegs 120,629,  AllRegs 0,  AllRegsAllInsns 0
--2441--     tt/tc: 300,144 tt lookups requiring 1,116,851 probes
--2441--     tt/tc: 199,285 fast-cache updates, 5 flushes
--2441--  transtab: new        120,629 (4,211,144 -> 79,615,373; ratio 18.9) [0 scs] avg tce size 660
--2441--  transtab: dumped     0 (0 -> ??) (sectors recycled 0)
--2441--  transtab: discarded  0 (0 -> ??)
--2441-- scheduler: 2,237,770,218 event checks.
--2441-- scheduler: 77,731,380 indir transfers, 76,696 misses (1 in 1013) ..
--2441-- scheduler: .. of which: 76,148,415 hit0, 1,270,488 hit1, 175,425 hit2, 60,356 hit3, 76,696 missed
--2441-- scheduler: 22,376/1,414,011 major/minor sched events.
--2441--    sanity: 22376 cheap, 190 expensive checks.
--2441--    exectx: 393,241 lists, 350,732 contexts (avg 0.89 per list) (avg 8.37 IP per context)
--2441--    exectx: 1,158,041 searches, 1,069,289 full compares (923 per 1000)
--2441--    exectx: 0 cmp2, 10 cmp4, 0 cmpAll
--2441--  errormgr: 5 supplist searches, 33 comparisons during search
--2441--  errormgr: 5 errlist searches, 10 comparisons during search
--2441--  memcheck: freelist: vol 19904484 length 1704
--2441--  memcheck: sanity checks: 22376 cheap, 191 expensive
--2441--  memcheck: auxmaps: 0 auxmap entries (0k, 0M) in use
--2441--  memcheck: auxmaps_L1: 0 searches, 0 cmps, ratio 0:10
--2441--  memcheck: auxmaps_L2: 0 searches, 0 nodes
--2441--  memcheck: SMs: n_issued      = 44783 (716528k, 699M)
--2441--  memcheck: SMs: n_deissued    = 31183 (498928k, 487M)
--2441--  memcheck: SMs: max_noaccess  = 65535 (1048560k, 1023M)
--2441--  memcheck: SMs: max_undefined = 1026 (16416k, 16M)
--2441--  memcheck: SMs: max_defined   = 7832 (125312k, 122M)
--2441--  memcheck: SMs: max_non_DSM   = 13600 (217600k, 212M)
--2441--  memcheck: max sec V bit nodes:    10023 (313k, 0M)
--2441--  memcheck: set_sec_vbits8 calls: 985205 (new: 215959, updates: 769246)
--2441--  memcheck: max shadow mem size:   218217k, 213M
--2441--  ocacheL1:  2,084,852,981 refs       71,613,145 misses (20,373,183 lossage)
--2441--  ocacheL1:  1,874,521,262 at 0      138,718,574 at 1
--2441--  ocacheL1:              0 at 2+      72,696,883 move-fwds
--2441--  ocacheL1:     92,274,688 sizeB      67,108,864 useful
--2441--  ocacheL2:     91,986,327 finds      67,868,287 misses
--2441--  ocacheL2:     18,169,091 adds       49,142,810 dels
--2441--  ocacheL2:    16,628,700 max nodes 16,628,700 curr nodes
--2441--  niacache:            0 refs              0 misses

host stacktrace:
==2441==    at 0x5803FBF6: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x5804B332: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x5804B4B7: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x5804B9EC: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x5804DD2E: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x5804F8BC: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x5800B832: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x58010506: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x58015231: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x580053D4: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x58005598: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x580A9354: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x580FB59F: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x580FB8E0: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)
==2441==    by 0x580BCB6E: ??? (in /usr/libexec/valgrind/memcheck-x86-linux)

sched status:
  running_tid=5

Thread 1: status = VgTs_WaitSys syscall 3 (lwpid 2441)
==2441==    at 0x5370EBA: read (in /lib/libc-2.33.so)
==2441==    by 0x52F7BFF: _IO_file_seekoff@@GLIBC_2.1 (in /lib/libc-2.33.so)
==2441==    by 0x52F3B91: fseek (in /lib/libc-2.33.so)
==2441==    by 0x81036C7: file_load (in /usr/bin/voxelands)
==2441==    by 0x83767DB: ??? (in /usr/bin/voxelands)
==2441==    by 0x8376FD4: sound_init (in /usr/bin/voxelands)
==2441==    by 0x80F623A: main (in /usr/bin/voxelands)
client stack range: [0xBEB62000 0xBEB85FFF] client SP: 0xBEB84500
valgrind stack range: [0x68179000 0x68278FFF] top usage: 7040 of 1048576

Thread 2: status = VgTs_WaitSys syscall 422 (lwpid 2446)
==2441==    at 0x4E125DF: __futex_abstimed_wait_cancelable64 (in /lib/libpthread-2.33.so)
==2441==    by 0x4E0A478: pthread_cond_wait@@GLIBC_2.3.2 (in /lib/libpthread-2.33.so)
==2441==    by 0xA3278E1: ??? (in /usr/lib/dri/r600_dri.so)
==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)
==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)
client stack range: [0xF66D000 0xFE6CFFF] client SP: 0xFE6C0C0
valgrind stack range: [0x6DDC5000 0x6DEC4FFF] top usage: 4764 of 1048576

Thread 3: status = VgTs_WaitSys syscall 422 (lwpid 2447)
==2441==    at 0x4E125DF: __futex_abstimed_wait_cancelable64 (in /lib/libpthread-2.33.so)
==2441==    by 0x4E0A478: pthread_cond_wait@@GLIBC_2.3.2 (in /lib/libpthread-2.33.so)
==2441==    by 0xA3278E1: ??? (in /usr/lib/dri/r600_dri.so)
==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)
==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)
client stack range: [0xFFBA000 0x107B9FFF] client SP: 0x107B90C0
valgrind stack range: [0x6E0D5000 0x6E1D4FFF] top usage: 3072 of 1048576

Thread 4: status = VgTs_WaitSys syscall 168 (lwpid 2453)
==2441==    at 0x5378E5A: poll (in /lib/libc-2.33.so)
==2441==    by 0x4B6D71B: ??? (in /usr/lib/libopenal.so.1.21.1)
==2441==    by 0x111C3688: pa_mainloop_poll (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x111C3ED3: pa_mainloop_iterate (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x111C3FB4: pa_mainloop_run (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x4B6F33E: ??? (in /usr/lib/libopenal.so.1.21.1)
==2441==    by 0x4FBE4E5: ??? (in /usr/lib/libstdc++.so.6.0.29)
==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)
==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)
client stack range: [0x16476000 0x16C75FFF] client SP: 0x16C751C0
valgrind stack range: [0x7E5AE000 0x7E6ADFFF] top usage: 5648 of 1048576

Thread 5: status = VgTs_Runnable (lwpid 2454)
==2441==    at 0x4040690: malloc (vg_replace_malloc.c:431)
==2441==    by 0x111DE538: pa_xmalloc (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x11A7D963: pa_memblock_new (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)
==2441==    by 0x111CE1F7: pa_stream_begin_write (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x4B6C975: ??? (in /usr/lib/libopenal.so.1.21.1)
==2441==    by 0x111CDADF: ??? (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x11A86FC7: pa_pdispatch_run (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)
==2441==    by 0x111A8DDC: ??? (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x11A8A300: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)
==2441==    by 0x11A8DEB4: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)
==2441==    by 0x11A8E3B7: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)
==2441==    by 0x11A8EF52: ??? (in /usr/lib/pulseaudio/libpulsecommon-15.0.so)
==2441==    by 0x111C3AD5: pa_mainloop_dispatch (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x111C3EE1: pa_mainloop_iterate (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x111C3FB4: pa_mainloop_run (in /usr/lib/libpulse.so.0.24.0)
==2441==    by 0x4B6F33E: ??? (in /usr/lib/libopenal.so.1.21.1)
==2441==    by 0x4FBE4E5: ??? (in /usr/lib/libstdc++.so.6.0.29)
==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)
==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)
client stack range: [0x1F014000 0x1F813FFF] client SP: 0x1F812E40
valgrind stack range: [0x7F502000 0x7F601FFF] top usage: 5648 of 1048576

Thread 6: status = VgTs_WaitSys syscall 422 (lwpid 2455)
==2441==    at 0x4E125DF: __futex_abstimed_wait_cancelable64 (in /lib/libpthread-2.33.so)
==2441==    by 0x4E0D566: do_futex_wait.constprop.0 (in /lib/libpthread-2.33.so)
==2441==    by 0x4E0D60E: __new_sem_wait_slow64.constprop.0 (in /lib/libpthread-2.33.so)
==2441==    by 0x4B866A1: ??? (in /usr/lib/libopenal.so.1.21.1)
==2441==    by 0x4B01694: ??? (in /usr/lib/libopenal.so.1.21.1)
==2441==    by 0x4FBE4E5: ??? (in /usr/lib/libstdc++.so.6.0.29)
==2441==    by 0x4E03327: start_thread (in /lib/libpthread-2.33.so)
==2441==    by 0x5385F05: clone (in /lib/libc-2.33.so)
client stack range: [0x33863000 0x34062FFF] client SP: 0x34062060
valgrind stack range: [0x847CF000 0x848CEFFF] top usage: 3072 of 1048576

==2441== 
==2441==     Valgrind's memory management: out of memory: Invalid argument
==2441==        newSuperblock's request for 4194304 bytes failed.
==2441==        2,691,284,992 bytes have already been mmap-ed ANONYMOUS.
==2441==     Valgrind cannot continue.  Sorry.
==2441== 
==2441==     There are several possible reasons for this.
==2441==     - You have some kind of memory limit in place.  Look at the
==2441==       output of 'ulimit -a'.  Is there a limit on the size of
==2441==       virtual memory or address space?
==2441==     - You have run out of swap space.
==2441==     - You have some policy enabled that denies memory to be
==2441==       executable (for example selinux deny_execmem) that causes
==2441==       mmap to fail with Permission denied.
==2441==     - Valgrind has a bug.  If you think this is the case or you are
==2441==     not sure, please let us know and we'll try to fix it.
==2441==     Please note that programs can take substantially more memory than
==2441==     normal when running under Valgrind tools, eg. up to twice or
==2441==     more, depending on the tool.  On a 64-bit machine, Valgrind
==2441==     should be able to make use of up 32GB memory.  On a 32-bit
==2441==     machine, Valgrind should be able to use all the memory available
==2441==     to a single process, up to 4GB if that's how you have your
==2441==     kernel configured.  Most 32-bit Linux setups allow a maximum of
==2441==     3GB per process.
==2441== 
==2441==     Whatever the reason, Valgrind cannot continue.  Sorry.

Last edited by selfprogrammed; 02-13-2024 at 01:24 PM.
 
Old 02-13-2024, 01:34 PM   #40
selfprogrammed
I have been adding more canary guards and more vector checks to the destructor that segfaults most often.
As I instrument the object, its behavior changes.

I did actually have one of the canary guards trigger, and it presented me with an object that looked as if it had been overwritten with some table: it appeared to be overlaid with an int array of many nearly consecutive values.
It could have been a wild object pointer, or an overwrite from something else. I could not determine more.

It is getting more stable and runs longer. It will be difficult if the debugging code makes the bugs hide better.
 
Old 02-14-2024, 01:01 AM   #41
pan64
Quote:
Originally Posted by selfprogrammed View Post
I have been adding more canary guards and more vector checks to the destructor that segfaults most often.
As I instrument the object, its behavior changes.

I did actually have one of the canary guards trigger, and it presented me with an object that looked as if it had been overwritten with some table: it appeared to be overlaid with an int array of many nearly consecutive values.
It could have been a wild object pointer, or an overwrite from something else. I could not determine more.

It is getting more stable and runs longer. It will be difficult if the debugging code makes the bugs hide better.
This is just shooting in the dark; that way you can only shift the problem, not solve it. Is there any way to add more memory (or swap) to the system?
By the way, not having enough RAM is probably enough to kill your process.
 
Old 02-15-2024, 10:42 AM   #42
selfprogrammed
I have installed many slackbuilds.
During that, I have had multiple programs compiling on this system simultaneously, while running other editors and logins, and it never even touched the swap.
No process other than valgrind has ever run out of memory.

I have to suspect that valgrind did something unusual.
This voxelands program uses multiple tasks, and it swaps a large mesh structure with another construct. I suspect that valgrind tried to track that and could not cope with it.

Rather than changing my system, it would be more instructive if someone who considers their system better suited for this would test the program and tell whether it produces the same behavior there. Perhaps, if their memory is large enough, they can even run valgrind on it.

The instrumented checks have tracked this down to a single vector that repeatedly fails in its destructor. It is a std::vector, so it is difficult to get inside it and identify why it segfaults during the destructor.


Note: I have added a cleanup to the destructor, to empty the vectors with explicit commands, so that this does not happen during the default destructor execution.
It does not segfault during that. However, it will segfault later, at the end of the MeshData destructor, during a specific vector's default destructor.
Note: That vector's content is supplied by the IrrLicht library.
Note: Sometimes the vector has a negative size, which means the internal pointers have been corrupted.
Note: There may still be more than one fault acting here.

Possibilities:
1. Corruption of the vector allocator data (something internal to the vector).
2. The pointer to the MeshData may have been corrupted.
3. Maybe that MeshData was released long ago, and this is a stale pointer.
4. The compiler still cannot be excluded. The symptoms change too much when all I have added is some debugging code that should not have such an effect.
It runs longer between faults, sometimes for hours, and it now mostly segfaults during one specific vector destructor.
Those changes would alter what the default constructors, default destructors, and optimizer have to cope with, but I have not altered what the code does with this MeshData.
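Since valgrind ran out of memory on this program, a lighter-weight way to test possibilities 1-3 is AddressSanitizer, which both GCC and clang support; it reports heap-use-after-free and out-of-bounds writes at the moment of the bad access rather than at the later destructor crash. The flags below are the usual ones, to be adapted to the project's build scripts:

```shell
# add to both the compile flags and the link flags, then run normally
g++ -g -O1 -fno-omit-frame-pointer -fsanitize=address -c file.cpp
g++ -fsanitize=address ... -o voxelands
```

The memory overhead is much lower than valgrind's, so it may survive where memcheck could not.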

Last edited by selfprogrammed; 02-15-2024 at 11:05 AM.
 
Old 02-15-2024, 11:04 AM   #43
BrunoLafleur
Member
 
Registered: Apr 2020
Location: France
Distribution: Slackware
Posts: 401

Rep: Reputation: 367Reputation: 367Reputation: 367Reputation: 367
Quote:
Originally Posted by selfprogrammed View Post
4. The compiler can still not be excluded. The symptoms change too radically when I have only added some debugging code that should not have such an effect.
It is usually a symptom of an earlier write outside array boundaries, or of a write into memory that has already been freed. It is normal that the results change a lot when you add new code in the area where you see the problem, because that changes the memory map: the same bad write then corrupts data in other places.

If valgrind can be set up to see anything, it will probably find the source of the problem at the same place even in your instrumented code, because that will be the root cause.

I didn't have time to test for now. I don't know if my computers have enough memory but I will see.

Last edited by BrunoLafleur; 02-15-2024 at 11:06 AM.
 
Old 03-04-2024, 05:34 PM   #44
selfprogrammed
Member
 
Registered: Jan 2010
Location: Minnesota, USA
Distribution: Slackware 13.37, 14.2, 15.0
Posts: 635

Original Poster
Rep: Reputation: 154Reputation: 154
I was getting some other code done and committed, and could not test this for a while.

I have a derived vector class that I have instrumented with a canary and a data copy.
Because voxelands uses the vector swap operation, I also had to implement a swap_debug function to swap the debugging information.
I reconfigured the canary and added a tracker to each of the 3 classes that figure in this.

I have seen two faults with the new instrumentation. That is within about 2 hours play time.

It will not fault anywhere I can see why. It faults in the default destructors that run at the end of the explicit destructor. I have added code to empty the lists before that, so the only thing left at that point is releasing the vector memory allocation.
It does not like something about the allocator; in one case it complained that the size had changed.
Note that the STL vector swap exchanges the allocated vector data with another vector. It is supposed to also swap the allocator information if needed.
Of course the two swapped vectors are not likely to be the same size.

The tracker records the last 16 operations. I did confirm that the last operation the tracker recorded was the vector swap.
I had a little trouble with that, as the function that prints out the tracker info will hang if invoked from the debugger at that point. It seems that the data structure is locked, and it will hang in a spin lock, which is impossible to recover from.

This would be the fault of the compiler's STL implementation not handling the vector swap operation correctly. That is my best guess right now.
I doubt many programs ever use the STL vector swap operation.

I need to test this on something different, like clang. My last attempt to compile voxelands using clang generated an unusable package, and I do not know what went wrong.

Addendum:
I am familiar with wild writes. That is why the instrumentation includes canary values and a copy of the first vector element, which is repeatedly checked against the actual vector.
The canary is marked with a unique value during deallocation, so that the canary checks will detect double deallocation, or writing to a deallocated entity.
They have found nothing, other than that the swap operation violates their assumptions; hence the swap_debug operation to also swap the debug information.
The faults I got were usually segfaults. I think the instrumentation has somehow stopped those. Now I get suspected double deallocations, and at least one wrong deallocation size.
I do not think that Valgrind could instrument this any more completely.

Last edited by selfprogrammed; 03-05-2024 at 06:31 PM.
 
Old 03-04-2024, 06:03 PM   #45
selfprogrammed
Member
 
Registered: Jan 2010
Location: Minnesota, USA
Distribution: Slackware 13.37, 14.2, 15.0
Posts: 635

Original Poster
Rep: Reputation: 154Reputation: 154
This is a voxelands (1709) diff of the latest patches and instrumentation that I have applied to voxelands. Some of the patches do make segfaults go away.

https://filebin.net/ezdpdagwu81vbinc

Filebin: voxelands-v1709.00_debug_02.diff (text/plain; charset=utf-8), 101 kB. The bin ezdpdagwu81vbinc expires 6 days after upload.
 
  

