LinuxQuestions.org


ehereth 07-23-2019 04:28 PM

NVIDIA driver working on CentOS 7.2, but not for all users
 
Good day LQ!!

I have a confusing issue that I've wasted most of a day trying to debug.

I have a server (really, a cluster of servers, but I do not think that is relevant to the question) with an NVIDIA P100 installed in it. We have groups of researchers who run GPU-enabled codes on these servers. One group has been running NAMD on them successfully. Recently, however, they added a few users, and some of those users are unable to run the code, getting errors like:

... CUDA driver version is insufficient for CUDA runtime version

However, other users are able to run the code (using the GPU) without any trouble.

Now, I've looked at length at pretty much anything that I can think of that might be different about these users:
  1. The users are using the exact same executable, options, and inputs
  2. Their environments are functionally identical ($SHELL, $PATH, $LD_LIBRARY_PATH, etc.); the only differences are user-specific variables such as $HOME
  3. Their permissions/groups are correct
  4. Running modinfo nvidia results in the exact same output for each user (the version of the driver is 361.93.03)
  5. The permissions of /dev/nvidia* are such that all users can see/use them
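
For anyone repeating these checks, here is a minimal sketch of how the items above could be captured in one snapshot per user and then diffed. The function name snapshot_gpu_env and the list of excluded variables are assumptions made for illustration; they are not taken from the original setup.

```shell
#!/bin/sh
# Sketch: print the per-user state from the checklist above in a stable
# order, so that snapshots taken as two different users can be diffed.
# snapshot_gpu_env is a hypothetical helper name, not part of any tool.
snapshot_gpu_env() {
    echo "== environment (user-specific variables excluded) =="
    # Drop variables that are expected to differ between users.
    env | grep -Ev '^(HOME|USER|LOGNAME|MAIL|PWD|OLDPWD|SSH_[A-Z_]*|XDG_[A-Z_]*)=' | sort

    echo "== groups =="
    id -Gn

    echo "== device nodes =="
    ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* nodes visible"

    echo "== driver version =="
    modinfo -F version nvidia 2>/dev/null || echo "modinfo could not query the nvidia module"
}
```

Run as each user, for example `snapshot_gpu_env > /tmp/gpu-env.$USER`, and then compare two of the resulting files with diff. A snapshot like this only covers the items in the list, so an empty diff does not rule out other per-user differences such as files in the home directory.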

I've simplified their use case as much as I can. They normally access these servers through a job scheduler (SGE, which can be complicated and confusing), but I've logged into one of the target servers as several of the users and can verify that some of them can run the code directly on the server and others cannot.

I'm at a loss and have run out of ideas; I would very much appreciate any help that might point me to the reason why certain users cannot use the GPUs while others can. Please give me ideas!

Thank you all very much for your support and time!

ehereth 07-24-2019 05:45 PM

ping
 
All, I'm not trying to be a pain, but I really need help with this. Does anybody have any ideas or perhaps a recommendation about where else I might post this question?

scasey 07-24-2019 06:32 PM

Compare the .bashrc / .profile / .bash_profile, etc. files?

ehereth 07-25-2019 08:37 AM

scasey, thank you for your reply. While your suggestion didn't directly fix my problem, it did help me discover that the application we're trying to run loads a hidden dot file, if one exists, which I'd completely forgotten about. Once all users have this file, they can run the application.
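
For anyone chasing a similar per-user difference, a rough sketch of one way to spot it: list the hidden entries that exist in a working user's home directory but not in a failing user's. The function name missing_dotfiles is hypothetical, and the sketch assumes you can read both home directories (for example as root).

```shell
#!/bin/sh
# Sketch: print the names of hidden entries (dot files or dot directories)
# that exist in the first home directory but are absent from the second.
# missing_dotfiles is a hypothetical helper name.
missing_dotfiles() {
    ref_home=$1     # home directory of a user for whom the application works
    other_home=$2   # home directory of a user for whom it fails
    # The pattern .[!.]* matches hidden names while skipping "." and "..".
    for path in "$ref_home"/.[!.]*; do
        [ -e "$path" ] || continue      # pattern matched nothing
        name=${path##*/}
        [ -e "$other_home/$name" ] || printf '%s\n' "$name"
    done
}
```

Called as `missing_dotfiles /home/working_user /home/failing_user` (both paths are placeholders), it prints one candidate name per line. It only compares top-level names, not file contents, so a file that exists for both users but differs would still need a diff.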

The application did nothing at all to hint that this was the problem and the errors didn't indicate anything helpful either. Very frustrating and a crappy way to waste time!

Thanks again for helping me find this solution!

Cheers!


All times are GMT -5.