NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

hi all,

I recently installed Linux Mint 21 MATE on a new Razer Blade 15 (2022).
I installed the package `linux-oem-22.04` in order to easily upgrade from kernel 5.15 to 5.17 and solve my wireless issue.
I then installed the latest proprietary NVIDIA driver on it, `nvidia-driver-525`, precisely version 525.60.11.
Right after installing the NVIDIA driver, I had to wrestle with this nasty "out of memory" error at boot time:
https://bugs.launchpad.net/ubuntu/+sour ... ug/1970402
and was able to solve it by following comment 25.
I was then able to boot successfully a fresh 5.17 kernel with NVIDIA driver.

The problem is that a few minutes after booting up the laptop, the GPU dies.
This happens with both PRIME profiles I tried. With `on-demand`, the default, I can thankfully still use the desktop, although any `nvidia`-related command errors out. With `nvidia`, the system freezes completely and needs a hard reboot.
I have not yet tried connecting an external monitor, but that will be my normal setup once I fix these errors.

The system logs report the nasty "GPU has fallen off the bus" error, which is often described to be related to power supply issues or thermals.
Power supply should not be the problem since this is a factory-built laptop from a reputable brand, not a self-assembled hack job of a desktop with a poor PSU.
Thermals are not to blame either, as this consistently happens a few minutes (say, five) after booting, without any usage whatsoever, definitely not after a heavy computational or gaming session.

I read that one could try and set the persistence mode on the GPU to avoid an automatic switch-off by typing:

Code: Select all

sudo nvidia-smi -pm 1
and that this command is deprecated, and that one should instead enable the systemd service named `nvidia-persistenced`.
In my case, the service was already enabled and running even as I was having these issues.
I noticed that the service itself was running with the parameter `--no-persistence-mode`, so I figured that might be the problem and modified the service file to run with `--persistence-mode` instead.
That had no effect on the error, and the GPU still "falls off the bus" after a few minutes.
Finally, since I am running with PRIME profile `on-demand`, I can see that X is successfully loaded on the GPU by running `nvidia-smi` right as I get the desktop, but before the GPU dies out.
In other words, it's not like the GPU gets switched off because nothing is using it, say, after having completed some CUDA computations -- X is using it!
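
For reference, the persistence checks described above could be scripted roughly as follows. This is only a sketch: `unit_disables_persistence` and `current_persistence_mode` are hypothetical helper names, and the first one assumes the flag appears literally in the unit text printed by `systemctl cat nvidia-persistenced`, as it did in the service file mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads systemd unit text on stdin and succeeds
# if the daemon is started with --no-persistence-mode.
unit_disables_persistence() {
    grep -q -e '--no-persistence-mode'
}

# Hypothetical helper: ask the driver for the GPU's current
# persistence setting (requires a responsive GPU).
current_persistence_mode() {
    nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
}

# Example, to be run on the affected machine:
#   systemctl cat nvidia-persistenced | unit_disables_persistence \
#       && echo "service starts with persistence disabled"
#   current_persistence_mode
```

Checking the unit text and the driver's own report separately helps tell a service misconfiguration apart from a setting that is simply not being honoured.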

Any help is appreciated, and I am happy to share any logs with you knowledgeable gurus. Cheers!

---

nvidia logs at boot, before crash:

Code: Select all

$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU (UUID: GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7)

$ nvidia-smi 
Thu Dec  8 10:13:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8    10W /  N/A |      5MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1839      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
after crash:

Code: Select all

$ nvidia-smi 
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
further info, after crash:

Code: Select all

System:
  Kernel: 5.17.0-1021-oem x86_64 bits: 64 compiler: gcc v: 11.3.0
    Desktop: MATE 1.26.0 info: mate-panel wm: marco 1.26.0 vt: 7
    dm: LightDM 1.30.0 Distro: Linux Mint 21 Vanessa base: Ubuntu 22.04 jammy
Machine:
  Type: Laptop System: Razer product: Blade 15 (2022) - RZ09-0421 v: 8.04
    serial: <superuser required> Chassis: type: 10 serial: <superuser required>
  Mobo: Razer model: CH580 v: 4 serial: <superuser required> UEFI: Razer
    v: 1.08 date: 02/16/2022
CPU:
  Info: 14-core (6-mt/8-st) model: 12th Gen Intel Core i7-12800H bits: 64
    type: MST AMCP smt: enabled arch: Alder Lake rev: 3 cache: L1: 1.2 MiB
    L2: 11.5 MiB L3: 24 MiB
  Speed (MHz): avg: 534 high: 699 min/max: 400/4800:3700 cores: 1: 510
    2: 441 3: 499 4: 548 5: 552 6: 681 7: 490 8: 467 9: 469 10: 447 11: 435
    12: 445 13: 615 14: 633 15: 608 16: 530 17: 552 18: 496 19: 580 20: 699
    bogomips: 112127
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Intel Alder Lake-P Integrated Graphics vendor: Razer USA
    driver: i915 v: kernel ports: active: eDP-1 empty: none bus-ID: 00:02.0
    chip-ID: 8086:46a6 class-ID: 0300
  Device-2: NVIDIA GA104 [Geforce RTX 3070 Ti Laptop GPU] driver: nvidia
    v: 525.60.11 pcie: speed: Unknown lanes: 63 ports: active: none
    empty: DP-1, DP-2, DP-3, HDMI-A-1 bus-ID: 01:00.0 chip-ID: 10de:24a0
    class-ID: 0300
  Device-3: IMC Networks Integrated RGB Camera type: USB driver: uvcvideo
    bus-ID: 1-2:2 chip-ID: 13d3:5279 class-ID: 0e02 serial: <filter>
  Display: x11 server: X.Org v: 1.21.1.3 compositor: marco v: 1.26.0
    driver: X: loaded: modesetting,nvidia unloaded: fbdev,nouveau,vesa
    gpu: i915 display-ID: :0.0 screens: 1
  Screen-1: 0 s-res: 1920x1080 s-dpi: 98 s-size: 499x280mm (19.6x11.0")
    s-diag: 572mm (22.5")
  Monitor-1: eDP-1 model: TL156VDXP02-0 res: 1920x1080 hz: 60 dpi: 142
    size: 344x194mm (13.5x7.6") diag: 395mm (15.5") modes: 1920x1080
  OpenGL: renderer: Mesa Intel Graphics (ADL GT2) v: 4.6 Mesa 22.0.5
    direct render: Yes
---

EDIT:
The NVIDIA GPU remains "on the bus" if the NVIDIA Settings PowerMizer mode is set to "Maximum Performance".
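
The same workaround can probably be applied from a terminal instead of the NVIDIA Settings GUI. A minimal sketch, assuming the `GPUPowerMizerMode` attribute of `nvidia-settings`, GPU index 0, and the value mapping noted in the comments; `powermizer_assignment` is a hypothetical helper that only builds the assignment string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a mode name to an nvidia-settings assignment.
# Assumed values: 0 = Adaptive, 1 = Prefer Maximum Performance, 2 = Auto.
powermizer_assignment() {
    case "$1" in
        adaptive) echo '[gpu:0]/GPUPowerMizerMode=0' ;;
        max)      echo '[gpu:0]/GPUPowerMizerMode=1' ;;
        auto)     echo '[gpu:0]/GPUPowerMizerMode=2' ;;
        *)        echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Example (needs a running X session; may have to be reapplied after login):
#   nvidia-settings -a "$(powermizer_assignment max)"
```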

Post by SMG »

piramiday wrote: Thu Dec 08, 2022 11:17 am I recently installed Linux Mint 21 MATE on a new Razer Blade 15 (2022).
Do you have the most recent BIOS/UEFI installed?
piramiday wrote: Thu Dec 08, 2022 11:17 am The system logs report the nasty "GPU has fallen off the bus" error, which is often described to be related to power supply issues or thermals.
Actually, I've helped others who had that error because of a Nvidia driver-related issue.

You have already made so many changes it might be hard to tell what might be happening because of the changes you made versus what has happened because of the original issue.

Did you install CUDA or CUDA-tools?

What is the output of

Code: Select all

journalctl -b | grep -i "drm\|nvidia\|01:00.0"

Post by piramiday »

SMG wrote: Fri Dec 09, 2022 9:34 pm Do you have the most recent BIOS/UEFI installed?
Yes, AFAIK. The laptop is brand new and the original Razer product page does not mention any firmware or BIOS upgrade yet, apart from a firmware for OLED panels which is not my case.
SMG wrote: Fri Dec 09, 2022 9:34 pm Actually, I've helped others who had that error because of a Nvidia driver-related issue.
Do you think that trying with an older nvidia driver might make sense? say, 520 as opposed to 525?
SMG wrote: Fri Dec 09, 2022 9:34 pm You have already made so many changes it might be hard to tell what might be happening because of the changes you made versus what has happened because of the original issue.
No, not really.
Apart from needing kernel 5.17+ to fix my wireless issues, I have not made any change.
I reverted the persistence mode on/off in the system service after I noticed it had no effect, and I tried the various boot options on-the-fly so my grub config is clean.
The only lingering change is the initrd config needed to get rid of the pesky "out of memory" error, without which I would not even boot the machine.
SMG wrote: Fri Dec 09, 2022 9:34 pm Did you install CUDA or CUDA-tools?
no, not yet, did not install anything apart from normal stuff, like web browsers and email clients.
SMG wrote: Fri Dec 09, 2022 9:34 pm What is the output of
before the crash:

Code: Select all

Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: [10de:24a0] type 00 class 0x030000
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: reg 0x10: [mem 0x83000000-0x83ffffff]
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: reg 0x14: [mem 0x6000000000-0x63ffffffff 64bit pref]
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: reg 0x1c: [mem 0x6400000000-0x6401ffffff 64bit pref]
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: reg 0x24: [io  0x3000-0x307f]
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: reg 0x30: [mem 0x84000000-0x8407ffff pref]
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: PME# supported from D0 D3hot
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:00:01.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: vgaarb: bridge control possible
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
Dec 09 20:54:26 blade kernel: pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
Dec 09 20:54:26 blade kernel: pci 0000:01:00.0: Adding to iommu group 15
Dec 09 20:54:26 blade kernel: ACPI: bus type drm_connector registered
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adlp_dmc_ver2_16.bin (v2.16)
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] HuC authenticated
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] GuC submission enabled
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] GuC RC: enabled
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 09 20:54:26 blade kernel: nvidia: loading out-of-tree module taints kernel.
Dec 09 20:54:26 blade kernel: nvidia: module license 'NVIDIA' taints kernel.
Dec 09 20:54:26 blade kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Dec 09 20:54:26 blade kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 511
Dec 09 20:54:26 blade kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
Dec 09 20:54:26 blade kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Dec 09 20:54:26 blade kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.60.11  Wed Nov 23 23:04:03 UTC 2022
Dec 09 20:54:26 blade kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.60.11  Wed Nov 23 22:49:17 UTC 2022
Dec 09 20:54:26 blade kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Dec 09 20:54:26 blade kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Dec 09 20:54:26 blade kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: fbcon: i915drmfb (fb0) is primary device
Dec 09 20:54:26 blade kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 09 20:54:26 blade systemd[1]: Starting Load Kernel Module drm...
Dec 09 20:54:26 blade systemd[1]: modprobe@drm.service: Deactivated successfully.
Dec 09 20:54:26 blade systemd[1]: Finished Load Kernel Module drm.
Dec 09 20:54:26 blade kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Dec 09 20:54:26 blade kernel: nvidia-uvm: Loaded the UVM driver, major device number 507.
Dec 09 20:54:26 blade kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input14
Dec 09 20:54:26 blade kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input15
Dec 09 20:54:26 blade kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input16
Dec 09 20:54:26 blade kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input17
Dec 09 20:54:26 blade kernel: input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input18
Dec 09 20:54:26 blade kernel: input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input19
Dec 09 20:54:26 blade kernel: input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input20
Dec 09 20:54:27 blade audit[1282]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1282 comm="apparmor_parser"
Dec 09 20:54:27 blade audit[1282]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1282 comm="apparmor_parser"
Dec 09 20:54:27 blade kernel: audit: type=1400 audit(1670637267.944:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1282 comm="apparmor_parser"
Dec 09 20:54:27 blade kernel: audit: type=1400 audit(1670637267.944:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1282 comm="apparmor_parser"
Dec 09 20:54:28 blade systemd[1]: Starting NVIDIA Persistence Daemon...
Dec 09 20:54:28 blade nvidia-persistenced[1386]: Verbose syslog connection opened
Dec 09 20:54:28 blade nvidia-persistenced[1386]: Now running with user ID 128 and group ID 138
Dec 09 20:54:28 blade nvidia-persistenced[1386]: Started (1386)
Dec 09 20:54:28 blade nvidia-persistenced[1386]: device 0000:01:00.0 - registered
Dec 09 20:54:28 blade nvidia-persistenced[1386]: Local RPC services initialized
Dec 09 20:54:28 blade systemd[1]: Started NVIDIA Persistence Daemon.
Dec 09 20:54:28 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
[ previous line repeated many many times ]
and after the crash just the error:

Code: Select all

Dec 09 20:57:41 blade kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded.
thanks, SMG, for taking the time to troubleshoot with me! cheers.
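
Since the excerpt above is dominated by one repeated message, a small filter can make such output easier to post and read. This is a sketch only: `collapse_repeats` is a hypothetical helper, and it assumes the default short journal format in which the first four space-separated fields are month, day, time, and hostname.

```shell
#!/usr/bin/env bash
# Hypothetical filter: strip the "Mon DD HH:MM:SS host " prefix from
# journal lines on stdin, then collapse consecutive identical messages
# into a single line prefixed with a repeat count.
collapse_repeats() {
    cut -d' ' -f5- | uniq -c | sed -E 's/^ *([0-9]+) /\1x /'
}

# Example:
#   journalctl -b | grep -i "drm\|nvidia\|01:00.0" | collapse_repeats
```

Dropping the timestamps is what lets `uniq` merge lines logged in different seconds; the trade-off is that the exact times are lost, so keep the raw output around as well.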

Post by SMG »

piramiday wrote: Fri Dec 09, 2022 10:06 pm Do you think that trying with an older nvidia driver might make sense? say, 520 as opposed to 525?
You could try it and see. There is a bug in earlier versions that can affect 3000 series GPUs, so you likely would not want to try anything older than the Nvidia-520. (Make sure you do not use the driver with "open" in its name.)
piramiday wrote: Fri Dec 09, 2022 10:06 pm I reverted the persistence mode on/off in the system service after I noticed it had no effect, and I tried the various boot options on-the-fly so my grub config is clean.
I did not realize you had undone those. Sorry if I misread what you wrote.

Unfortunately, the fix for the out-of-memory is necessary. I do not know if there are plans to fix that bug, so consider it permanent for the present time.

This does not look good. What info I could find seems to indicate it might be a problem with the EDID? Considering you only have the laptop screen, that does not sound good. Do you know if Windows works okay on this hardware?

Code: Select all

Dec 09 20:54:26 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
[ previous line repeated 33 more times ]
Dec 09 20:54:28 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
[ previous line repeated many many times ]
piramiday wrote: Fri Dec 09, 2022 10:06 pm and after the crash just the error:

Code: Select all

Dec 09 20:57:41 blade kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded.
Somewhere in the journal there should be more lines that belong with these and give "the rest of the information".

Let's see if we can get the rest with just this:

Code: Select all

journalctl -b | grep -i "NVRM"
rather than having to search many sequential lines of dmesg output.

Post by piramiday »

SMG wrote: Fri Dec 09, 2022 10:28 pm You could try it and see. There is a bug in earlier versions that can affect 3000 series GPUs, so you likely would not want to try anything older than the Nvidia-520. (Make sure you do not use the driver with "open" in its name.)
okay, I will try and see what happens.
EDIT: nope, there is no nvidia 520 choice in my driver manager, only 515 and older. too bad.
I did read enough about the `open` driver that my excitement for its novelty disappeared right away, I will stay clear of it.
SMG wrote: Fri Dec 09, 2022 10:28 pm I did not realize you had undone those. Sorry if I misread what you wrote.
no problem, thanks for helping out!
my stream-of-consciousness troubleshooting steps possibly were not as clear as they could have been. :D
SMG wrote: Fri Dec 09, 2022 10:28 pm This does not look good. What info I could find seems to indicate it might be a problem with the EDID? Considering you only have the laptop screen, that does not sound good.
As I briefly mentioned, I tried with an external monitor directly to the HDMI and it failed instantly.
In the next couple of days I will try with an external monitor connected through a hub to the USB-C port, since I read that that might be able to offload on the Intel GPU.
If I don't get a working external monitor connection, I am in deep trouble.
SMG wrote: Fri Dec 09, 2022 10:28 pm Do you know if Windows works okay on this hardware?
Windows works fine, but I do not use it (duh).
I can try and leave the laptop running W11 later to verify that the NVIDIA card will keep running even after the 5ish minutes that it takes for the problem to manifest itself with Linux.
SMG wrote: Fri Dec 09, 2022 10:28 pm Let's see if we can get the rest with just this:
no further information, there.
the first line comes from the successful boot, and the rest are the same error:

Code: Select all

$ journalctl -b | grep -i nvrm
Dec 10 13:30:02 blade kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.60.11  Wed Nov 23 23:04:03 UTC 2022
Dec 10 13:34:08 blade kernel: NVRM: GPU at PCI:0000:01:00: GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7
Dec 10 13:34:08 blade kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Dec 10 13:34:08 blade kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
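
The Xid line is the most specific part of this output: it carries the PCI address and a numeric event code (79 here, alongside the "fallen off the bus" text). A rough sketch for pulling just those two values out of a longer log follows; `extract_xids` is a hypothetical helper and assumes the line layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical filter: print "<pci-address> <xid-code>" for every
# NVRM Xid line on stdin; all other lines are ignored.
extract_xids() {
    sed -n -E 's/.*NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): ([0-9]+),.*/\1 \2/p'
}

# Example:
#   journalctl -b -k | extract_xids
```

Having the bare code makes it easier to search for other reports of the same event.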

Post by SMG »

piramiday wrote: Sat Dec 10, 2022 2:46 pm
SMG wrote: Fri Dec 09, 2022 10:28 pm This does not look good. What info I could find seems to indicate it might be a problem with the EDID? Considering you only have the laptop screen, that does not sound good.
As I briefly mentioned, I tried with an external monitor directly to the HDMI and it failed instantly.
Was that output from when you had the external monitor attached? Maybe those lines were from it.
piramiday wrote: Sat Dec 10, 2022 2:46 pm In the next couple of days I will try with an external monitor connected through a hub to the USB-C port, since I read that that might be able to offload on the Intel GPU.
The inxi output seems to indicate Intel only runs the laptop screen.

Device-1: Intel Alder Lake-P Integrated Graphics vendor: Razer USA
driver: i915 v: kernel ports: active: eDP-1 empty: none bus-ID: 00:02.0
chip-ID: 8086:46a6 class-ID: 0300
Device-2: NVIDIA GA104 [Geforce RTX 3070 Ti Laptop GPU] driver: nvidia
v: 525.60.11 pcie: speed: Unknown lanes: 63 ports: active: none
empty: DP-1, DP-2, DP-3, HDMI-A-1
bus-ID: 01:00.0 chip-ID: 10de:24a0
class-ID: 0300
piramiday wrote: Sat Dec 10, 2022 2:46 pm Windows works fine, but I do not use it (duh).
I can try and leave the laptop running W11 later to verify that the NVIDIA card will keep running even after the 5ish minutes that it takes for the problem to manifest itself with Linux.
I would recommend doing that considering the computer is new. Especially if it was sold to you with Windows on it, they are going to take any complaints more seriously if they happen on Windows.
piramiday wrote: Sat Dec 10, 2022 2:46 pm the first line comes from the successful boot, and the rest are the same error:

Code: Select all

$ journalctl -b | grep -i nvrm
Dec 10 13:30:02 blade kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.60.11  Wed Nov 23 23:04:03 UTC 2022
Dec 10 13:34:08 blade kernel: NVRM: GPU at PCI:0000:01:00: GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7
Dec 10 13:34:08 blade kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Dec 10 13:34:08 blade kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Except it is missing the other lines which were in the prior output.

Code: Select all

                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded
I was wondering what the line before the two I listed indicates.

journalctl can be queried by time. Maybe this might get the information?

Code: Select all

journalctl --since "2022-12-10 13:34:00" --until "2022-12-10 13:35:00"
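
If the crash time changes from boot to boot, the window for a query like the one above can be computed from the timestamp on the error line. A minimal sketch, assuming GNU `date`; `journal_window` is a hypothetical helper, not part of journalctl.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given a "YYYY-MM-DD HH:MM:SS" timestamp and a
# margin in seconds, print the window start and end on two lines.
# TZ=UTC is used only so the arithmetic is plain wall-clock arithmetic;
# the output is in the same local time as the input.
journal_window() {
    local stamp=$1 margin=$2 epoch
    epoch=$(TZ=UTC date -d "$stamp" +%s) || return 1
    TZ=UTC date -d "@$((epoch - margin))" '+%Y-%m-%d %H:%M:%S'
    TZ=UTC date -d "@$((epoch + margin))" '+%Y-%m-%d %H:%M:%S'
}

# Example:
#   { read -r start; read -r end; } < <(journal_window "2022-12-10 13:34:08" 30)
#   journalctl --since "$start" --until "$end"
```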

Post by piramiday »

SMG wrote: Sat Dec 10, 2022 3:39 pm I would recommend doing that considering the computer is new. Especially if it was sold to you with Windows on it, they are going to take any complaints more seriously if they happen on Windows.
from a first test, leaving it idle on the W11 desktop for half an hour, it works fine.
if it did not, I would find many people complaining about such an issue, I guess.
my gut tells me it is related to Linux and, obviously, NVIDIA drivers.
SMG wrote: Sat Dec 10, 2022 3:39 pm Except it is missing the other lines which were in the prior output.
I'm not sure I'm following, because we have been grepping for different things.
Here is the result for either `drm`, or `nvrm`, or `nvidia`:

Code: Select all

journalctl -b | grep -i -e drm -e nvrm -e nvidia
Dec 10 14:54:58 blade kernel: ACPI: bus type drm_connector registered
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adlp_dmc_ver2_16.bin (v2.16)
Dec 10 14:54:58 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
   [ line repeated many times! ]
Dec 10 14:54:58 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] HuC authenticated
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] GuC submission enabled
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] GuC RC: enabled
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] Protected Xe Path (PXP) protected content support initialized
Dec 10 14:54:58 blade kernel: nvidia: loading out-of-tree module taints kernel.
Dec 10 14:54:58 blade kernel: nvidia: module license 'NVIDIA' taints kernel.
Dec 10 14:54:58 blade kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Dec 10 14:54:58 blade kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 511
Dec 10 14:54:58 blade kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
Dec 10 14:54:58 blade kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Dec 10 14:54:58 blade kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.60.11  Wed Nov 23 23:04:03 UTC 2022
Dec 10 14:54:58 blade kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.60.11  Wed Nov 23 22:49:17 UTC 2022
Dec 10 14:54:58 blade kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Dec 10 14:54:58 blade kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Dec 10 14:54:58 blade kernel: [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 0
Dec 10 14:54:58 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
   [ line repeated many times! ]
Dec 10 14:54:58 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 10 14:54:58 blade kernel: fbcon: i915drmfb (fb0) is primary device
Dec 10 14:54:58 blade kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Dec 10 14:54:58 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
   [ line repeated many times! ]
Dec 10 14:54:58 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 10 14:54:58 blade systemd[1]: Starting Load Kernel Module drm...
Dec 10 14:54:58 blade systemd[1]: modprobe@drm.service: Deactivated successfully.
Dec 10 14:54:58 blade systemd[1]: Finished Load Kernel Module drm.
Dec 10 14:54:58 blade kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Dec 10 14:54:58 blade kernel: nvidia-uvm: Loaded the UVM driver, major device number 508.
Dec 10 14:54:58 blade kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input17
Dec 10 14:54:58 blade kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input18
Dec 10 14:54:58 blade kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input19
Dec 10 14:54:58 blade kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input20
Dec 10 14:54:58 blade kernel: input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input21
Dec 10 14:54:58 blade kernel: input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input22
Dec 10 14:54:58 blade kernel: input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input23
Dec 10 14:55:00 blade audit[1288]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1288 comm="apparmor_parser"
Dec 10 14:55:00 blade audit[1288]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1288 comm="apparmor_parser"
Dec 10 14:55:00 blade kernel: audit: type=1400 audit(1670702100.056:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1288 comm="apparmor_parser"
Dec 10 14:55:00 blade kernel: audit: type=1400 audit(1670702100.056:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1288 comm="apparmor_parser"
Dec 10 14:55:00 blade systemd[1]: Starting NVIDIA Persistence Daemon...
Dec 10 14:55:00 blade nvidia-persistenced[1410]: Verbose syslog connection opened
Dec 10 14:55:00 blade nvidia-persistenced[1410]: Now running with user ID 128 and group ID 138
Dec 10 14:55:00 blade nvidia-persistenced[1410]: Started (1410)
Dec 10 14:55:00 blade nvidia-persistenced[1410]: device 0000:01:00.0 - registered
Dec 10 14:55:00 blade nvidia-persistenced[1410]: Local RPC services initialized
Dec 10 14:55:00 blade systemd[1]: Started NVIDIA Persistence Daemon.
Dec 10 14:55:00 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
   [ line repeated many many MANY times! ]
Dec 10 14:55:03 blade kernel: [drm] DisplayID checksum invalid, remainder is 20
Dec 10 14:58:54 blade kernel: NVRM: GPU at PCI:0000:01:00: GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7
Dec 10 14:58:54 blade kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Dec 10 14:58:54 blade kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Dec 10 14:58:54 blade kernel: NVRM: A GPU crash dump has been created. If possible, please run
                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded.
EDIT: another boot, this time with a couple minutes of working external monitor via HDMI (!) produced exactly the same log as above.
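For anyone who wants to pull these events out of a long journal instead of scrolling through it, here is a minimal Python sketch. It assumes the kernel lines look like the `NVRM: Xid (...)` line above; the name `find_xid_events` and the regular expression are my own and not part of any NVIDIA tool.

```python
import re

# Matches kernel lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<rest>.*)"
)


def find_xid_events(log_text):
    """Return a list of (device, xid_number, remainder) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("dev"), int(match.group("xid")), match.group("rest").strip())
            )
    return events


if __name__ == "__main__":
    # Sample lines copied from the log above.
    sample = (
        "Dec 10 14:58:54 blade kernel: NVRM: GPU at PCI:0000:01:00: "
        "GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7\n"
        "Dec 10 14:58:54 blade kernel: NVRM: Xid (PCI:0000:01:00): 79, "
        "pid='<unknown>', name=<unknown>, GPU has fallen off the bus.\n"
        "Dec 10 14:58:54 blade kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.\n"
    )
    for device, xid, rest in find_xid_events(sample):
        print(device, xid, rest)
```

Only the line that carries an Xid number is reported; the other two NVRM lines are skipped on purpose.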
SMG
Level 25
Level 25
Posts: 31990
Joined: Sun Jul 26, 2020 6:15 pm
Location: USA

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by SMG »

piramiday wrote: Sat Dec 10, 2022 4:04 pm I'm not sure I'm following, because we have been grepping for different things.
Here is the result for either `drm`, or `nvrm`, or `nvidia`:
This found the missing line (which has nvrm in it so I don't know why it didn't show up in the prior grep :? ). All three of these are one sentence. The prior output only had lines 2 and 3.

Code: Select all

Dec 10 14:58:54 blade kernel: NVRM: A GPU crash dump has been created. If possible, please run
                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded.
I have no idea if that file might be helpful, but maybe it will.

I see the DisplayID checksum invalid error is still there. :(

I know there was a bug in the Nvidia-515 driver for some 3000 series GPU connections which was fixed in the Nvidia-520, so if you want to try different drivers I would suggest either the Nvidia-520 (experimental branch) or the Nvidia-510 (stable branch). I would think both of those would show as options in Driver Manager.
A woman typing on a laptop with LM20.3 Cinnamon.
piramiday
Level 2
Level 2
Posts: 70
Joined: Tue Jun 25, 2013 10:07 am

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

SMG wrote: Sat Dec 10, 2022 4:30 pm This found the missing line (which has nvrm in it so I don't know why it didn't show up in the prior grep :? ). All three of these are one sentence. The prior output only had lines 2 and 3.
oh, I see, maybe it was just a poor copy/paste into the forum textbox on my part.
SMG wrote: Sat Dec 10, 2022 4:30 pm I have no idea if that file might be helpful, but maybe it will.
I also posted to the NVIDIA developer forum including that file, but I have not seen any reply just yet.
the gzip file is loaded here: https://transfer.sh/JphnSL/nvidia-bug-report.log.gz
SMG wrote: Sat Dec 10, 2022 4:30 pm I see the DisplayID checksum invalid error is still there. :(
yes, unfortunately it is. the only notable thing about the embedded display is that it's a 360 Hz FHD display which of course in Linux only runs at 60 Hz.
SMG wrote: Sat Dec 10, 2022 4:30 pm I know there was a bug in the Nvidia-515 driver for some 3000 series GPU connections which was fixed in the Nvidia-520, so if you want to try different drivers I would suggest either the Nvidia-520 (experimental branch) or the Nvidia-510 (stable branch). I would think both of those would show as options in Driver Manager.
I tried sudo-apt installing the 520 driver but it installed just that package, without installing anything else and without removing the 525, so clearly not the way to go.
Any suggestions there?

I also just tried having `nvidia-xconfig` create a xorg.conf file, but after the reboot I did not even get the desktop.
So I deleted it and went back to not having any xorg.conf file under /etc/X11.

I am also trying to keep the external monitor working long enough to see whether I can disable the internal display entirely but, again, the GPU crashed within about 5 seconds of plugging in the HDMI cable.
SMG
Level 25
Level 25
Posts: 31990
Joined: Sun Jul 26, 2020 6:15 pm
Location: USA

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by SMG »

piramiday wrote: Sat Dec 10, 2022 4:38 pm I also posted to the NVIDIA developer forum including that file, but I have not seen any reply just yet.
the gzip file is loaded here: https://transfer.sh/JphnSL/nvidia-bug-report.log.gz
I will check this a little later.
piramiday wrote: Sat Dec 10, 2022 4:38 pm yes, unfortunately it is. the only notable thing about the embedded display is that it's a 360 Hz FHD display which of course in Linux only runs at 60 Hz.
What is the output of xrandr?
piramiday wrote: Sat Dec 10, 2022 4:38 pm I tried sudo-apt installing the 520 driver but it installed just that package, without installing anything else and without removing the 525, so clearly not the way to go.
Any suggestions there?
Please just use Driver Manager. I have no idea what you may have done by installing one driver on top of the other. (And sudo is not needed to install drivers. :| ) If you use Driver Manager to switch drivers then the uninstalling is taken care of for you.

How did you originally install the Nvidia-525 driver? Maybe that is related to your issue.

Please provide the output of

Code: Select all

dpkg -l | grep -i nvidia
piramiday wrote: Sat Dec 10, 2022 4:38 pm I also just tried having `nvidia-xconfig` create a xorg.conf file, but after the reboot I did not even get the desktop.
So I deleted it and went back to not having any xorg.conf file under /etc/X11.
When you run nvidia-xconfig, it assumes you only have an Nvidia GPU installed. That is why you could not get to the desktop. You completely shut out Intel and Intel runs your laptop monitor.
piramiday
Level 2
Level 2
Posts: 70
Joined: Tue Jun 25, 2013 10:07 am

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

SMG wrote: Sat Dec 10, 2022 4:57 pm What is the output of xrandr?
after the nvidia gpu has crashed:

Code: Select all

Screen 0: minimum 320 x 200, current 1920 x 1080, maximum 16384 x 16384
eDP-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 344mm x 194mm
   1920x1080     60.05*+
DP-1-0 disconnected (normal left inverted right x axis y axis)
DP-1-1 disconnected (normal left inverted right x axis y axis)
DP-1-2 disconnected (normal left inverted right x axis y axis)
DP-1-3 disconnected (normal left inverted right x axis y axis)
HDMI-1-0 disconnected (normal left inverted right x axis y axis)
DP-1-4 disconnected (normal left inverted right x axis y axis)
DP-1-5 disconnected (normal left inverted right x axis y axis)
SMG wrote: Sat Dec 10, 2022 4:57 pm Please just use Driver Manager. I have no idea what you may have done by installing one driver on top of the other. (And sudo is not needed to install drivers. :| ) If you use Driver Manager to switch drivers then the uninstalling is taken care of for you.
of course, this is the first time that it seems it installed any package without first uninstalling everything else.
anyway, I reverted it back right away.
SMG wrote: Sat Dec 10, 2022 4:57 pm How did you originally install the Nvidia-525 driver? Maybe that is related to your issue.
that was done through the driver manager, no funny stuff there.
SMG wrote: Sat Dec 10, 2022 4:57 pm Please provide the output of

Code: Select all

$ dpkg -l | grep -i nvidia
ii  libnvidia-cfg1-525:amd64                   525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-525                       525.60.11-0ubuntu0.22.04.1                  all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-525:amd64                525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA libcompute package
ii  libnvidia-compute-525:i386                 525.60.11-0ubuntu0.22.04.1                  i386         NVIDIA libcompute package
ii  libnvidia-decode-525:amd64                 525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-525:i386                  525.60.11-0ubuntu0.22.04.1                  i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64               1:1.1.9-1.1                                 amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-525:amd64                 525.60.11-0ubuntu0.22.04.1                  amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-525:i386                  525.60.11-0ubuntu0.22.04.1                  i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-525:amd64                  525.60.11-0ubuntu0.22.04.1                  amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-525:amd64                   525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-525:i386                    525.60.11-0ubuntu0.22.04.1                  i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-525:amd64                     525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-525:i386                      525.60.11-0ubuntu0.22.04.1                  i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  nvidia-compute-utils-525                   525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA compute utilities
ii  nvidia-dkms-525                            525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA DKMS package
ii  nvidia-driver-525                          525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-525                   525.60.11-0ubuntu0.22.04.1                  amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-525                   525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA kernel source package
ii  nvidia-prime                               0.8.17.1                                    all          Tools to enable NVIDIA's Prime
ii  nvidia-prime-applet                        1.3.4                                       all          An applet for NVIDIA Prime
ii  nvidia-settings                            510.47.03-0ubuntu1                          amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-525                           525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                    0.18.2                                      all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-525              525.60.11-0ubuntu0.22.04.1                  amd64        NVIDIA binary Xorg driver
SMG wrote: Sat Dec 10, 2022 4:57 pm When you run nvidia-xconfig, it assumes you only have an Nvidia GPU installed. That is why you could not get to the desktop. You completely shut out Intel and Intel runs your laptop monitor.
well, kind of, in that my nvidia gpu is capable of running both the internal display and the external monitor, although for just a few minutes.
I would have expected the same behavior as when I run with `prime-select nvidia`, that is, working for a few minutes and then display freezing, rather than `prime-select on-demand` which is what I am currently working on.
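A side note on the package list above: everything is at 525.60.11 except `nvidia-settings`, which is at 510.47.03. As far as I know that package is versioned separately from the driver in the Ubuntu repositories, so I am assuming it is harmless. To check for leftovers from a half-installed second driver, the packages can be grouped by driver branch. A minimal Python sketch, assuming the usual `dpkg -l` column layout; `branches_in_dpkg_output` is a made-up helper name, and it only looks at package names that end in a three-digit branch such as `-525`.

```python
import re
from collections import defaultdict

# Names such as "nvidia-driver-525" or "libnvidia-gl-525:i386" end in the driver branch.
BRANCH_RE = re.compile(r"-(\d{3})(?::\w+)?$")


def branches_in_dpkg_output(dpkg_text):
    """Map driver branch (e.g. '525') to the sorted list of installed package names."""
    branches = defaultdict(list)
    for line in dpkg_text.splitlines():
        fields = line.split()
        # Installed packages start with the status "ii"; the name is the second column.
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name = fields[1]
        match = BRANCH_RE.search(name)
        if "nvidia" in name and match:
            branches[match.group(1)].append(name)
    return {branch: sorted(names) for branch, names in branches.items()}


if __name__ == "__main__":
    # Three lines copied from the dpkg output above.
    sample = (
        "ii  nvidia-driver-525      525.60.11-0ubuntu0.22.04.1  amd64  NVIDIA driver metapackage\n"
        "ii  nvidia-settings        510.47.03-0ubuntu1          amd64  Tool for configuring the NVIDIA graphics driver\n"
        "ii  libnvidia-gl-525:i386  525.60.11-0ubuntu0.22.04.1  i386   NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD\n"
    )
    for branch, names in branches_in_dpkg_output(sample).items():
        print(branch, names)
```

If more than one branch shows up in the result, two drivers are installed side by side. For the list above I would expect only 525.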
SMG
Level 25
Level 25
Posts: 31990
Joined: Sun Jul 26, 2020 6:15 pm
Location: USA

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by SMG »

piramiday wrote: Sat Dec 10, 2022 5:06 pm
SMG wrote: Sat Dec 10, 2022 4:57 pm When you run nvidia-xconfig, it assumes you only have an Nvidia GPU installed. That is why you could not get to the desktop. You completely shut out Intel and Intel runs your laptop monitor.
well, kind of, in that my nvidia gpu is capable of running both the internal display and the external monitor, although for just a few minutes.
I do not know what criteria you are using for this, but the xrandr --verbose information in the Nvidia Bug Report Log (NBRL) appears to indicate the laptop screen is only controlled by Intel. There are no line items for

Code: Select all

	PRIME Synchronization: 1 
		supported: 0, 1
for eDP-1, but there are for all the other ports.

There is only the one modeline both in xrandr and the Xorg log. (I do not recall seeing something like that in the past.) Here it is from Xorg log.

Code: Select all

[    22.937] (II) modeset(0): Printing probed modes for output eDP-1
[    22.937] (II) modeset(0): Modeline "1920x1080"x60.1  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
Considering you say the specs for your laptop do not match this, and the fact Linux-based distros can do more than 60Hz displays, this display seems to be key to the problem.
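For what it is worth, the 60 Hz figure follows directly from the numbers in that modeline: refresh rate = pixel clock / (horizontal total × vertical total). A quick Python check that uses only the values from the modeline above (the function name is mine):

```python
def refresh_rate_hz(pixel_clock_mhz, h_total, v_total):
    """Vertical refresh rate implied by a modeline's pixel clock and total timings."""
    return pixel_clock_mhz * 1_000_000 / (h_total * v_total)


if __name__ == "__main__":
    # Modeline "1920x1080"  133.44  1920 1963 1995 2000  1080 1103 1108 1111
    # The pixel clock is 133.44 MHz, h_total is 2000 and v_total is 1111.
    rate = refresh_rate_hz(133.44, 2000, 1111)
    print(f"{rate:.2f} Hz")
```

That works out to about 60.05 Hz, which matches what xrandr reports for eDP-1. So the 60 Hz is what this one advertised mode describes, and not something the desktop is choosing from several options.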

And I think the drm Display ID errors apply to both Intel and Nvidia based on where they were in the log and the multitude of them (but that is a guess on my part).

Within the NBRL are the following lines which may be when you had a monitor attached? I'm not completely sure because there were several boot logs attached and I don't know if the conditions were the same for all of them. The last lines only showed in one of the boot logs.

Code: Select all

Dec 09 09:51:09 blade nvidia-persistenced[1406]: Local RPC services initialized
Dec 09 09:52:16 blade kernel: NVRM: GPU at PCI:0000:01:00: GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7
Dec 09 09:52:16 blade kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Dec 09 09:52:16 blade kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Dec 09 09:52:16 blade kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
Dec 09 09:52:16 blade kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
Dec 09 09:52:16 blade kernel: nvidia-modeset: ERROR: GPU:0: Failed detecting connected display devices
There was also this information.

Code: Select all

/usr/bin/lspci -d "10de:*" -v -xxx

01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [Geforce RTX 3070 Ti Laptop GPU] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
This topic on the Nvidia developer's forum, "unknown header type 7f!!", indicates:
This ‘unknown header type 7f!!’ happens when PCIe device falls off the bus.
If something is happening to the PCIe link after some time? like pex_rst toggle or cutting refclk etc…because of which link is getting disconnected.
We already know it is falling off the bus.
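The all-`ff` dump above is the same symptom seen from the PCI side: reads from a device that no longer responds come back as all ones. Here is a small Python sketch that checks an `lspci -xxx` style dump for this. `config_space_is_all_ff` is a hypothetical helper, and it assumes the `NN: xx xx ...` row format shown above.

```python
import re

# Rows of an `lspci -xxx` dump look like "00: ff ff ff ... ff" (offset, then 16 bytes).
ROW_RE = re.compile(r"^([0-9a-f]{2}): ((?:[0-9a-f]{2} ?){16})\s*$")


def config_space_is_all_ff(dump_text):
    """Return True if the dump has at least one hex row and every byte in it is 0xff."""
    data = []
    for line in dump_text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            data.extend(int(byte, 16) for byte in match.group(2).split())
    return bool(data) and all(byte == 0xFF for byte in data)


if __name__ == "__main__":
    # Two rows in the same shape as the dump above.
    dead = "00: " + " ".join(["ff"] * 16) + "\n10: " + " ".join(["ff"] * 16)
    # Made-up row for contrast: only the first two bytes (vendor ID 10de) are meaningful.
    alive = "00: de 10 " + " ".join(["00"] * 14)
    print(config_space_is_all_ff(dead))
    print(config_space_is_all_ff(alive))
```

Header lines such as `!!! Unknown header type 7f` are ignored because they do not match the row pattern.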

This line in one of the logs surprised me.

Code: Select all

[    0.916858] Low-power S0 idle used by default for system suspend
It would seem to indicate there will be essentially no power savings to suspending your computer. S0 state is usually the normal wake state. I would guess S0 idle is just a notch below full power.
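If it helps, the kernel exposes which suspend variant it will use in `/sys/power/mem_sleep`, with the active one in square brackets (for example `[s2idle] deep`). Here is a minimal Python sketch that reads it. The helper name is mine, and whether a `deep` option is offered at all depends on the firmware, so I am not assuming it exists on this laptop.

```python
from pathlib import Path


def parse_mem_sleep(contents):
    """Return (active_mode, available_modes) from the text of /sys/power/mem_sleep."""
    active = None
    modes = []
    for token in contents.split():
        # The currently selected mode is wrapped in square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return active, modes


if __name__ == "__main__":
    path = Path("/sys/power/mem_sleep")
    if path.exists():
        print(parse_mem_sleep(path.read_text()))
    else:
        print("no /sys/power/mem_sleep on this system")
```

If the output lists only `s2idle`, then suspend-to-idle is the only variant the firmware offers, which would fit the log line above.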

Code: Select all

[    8.213220] ACPI: video: [Firmware Bug]: ACPI(PEGP) defines _DOD but not _DOS
The topic "Is an ACPI GPE storm normal in an Ubuntu session?" (#4 of answer 1) mentions several kernel parameters that might address the above. However, given the errors about the display IDs, the newness of your hardware, and the age of that link, I'm not sure those are appropriate settings for your situation. I'm not sure they would help, and they might cause worse problems.

I am not used to reading NBRLs, but this part with question marks for Video BIOS seems odd.

Code: Select all

*** /proc/driver/nvidia/./gpus/0000:01:00.0/information
*** ls: -r--r--r-- 1 root root 0 2022-12-09 09:51:09.100000440 -0500 /proc/driver/nvidia/./gpus/0000:01:00.0/information
Model: 		 NVIDIA GeForce RTX 3070 Ti Laptop GPU
IRQ:   		 172
GPU UUID: 	 GPU-d7e3314f-0671-9225-6b48-39bfc97fc3c7
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCI
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:01:00.0
Device Minor: 	 0
GPU Excluded:	 No
This:

Code: Select all

nvidia-settings -q all:
ERROR: An internal driver error occurred
and this:

Code: Select all

/usr/bin/nvidia-smi --query

Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
are not really that helpful, but they don't give me hope that an older driver might help. However, you are more than welcome to prove my suspicions incorrect and try the Nvidia-510 or 520.

The only other suggestion I have at this point is to wait and see if you get a response on your Nvidia developer forum topic.
piramiday
Level 2
Level 2
Posts: 70
Joined: Tue Jun 25, 2013 10:07 am

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

Thank you, SMG, I will take some time to read through your post and reply as soon as I can.

I got the suggestion from the NVIDIA forums to try and set the PowerMizer setting to Maximum Performance, and indeed that seems to keep the GPU "on the bus".
The solution is wonky and I still have trouble understanding it, but I will take another look at the logs and report back here.
Also, I now have problems with suspend, etc, that I did not see when my GPU was falling off the bus...
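For reference, the same setting can also be applied from a script instead of the nvidia-settings GUI. The sketch below assumes the `GPUPowerMizerMode` attribute of `nvidia-settings`, where, as far as I know, the value 1 corresponds to "Prefer Maximum Performance". The helper name and the GPU index 0 are my own assumptions, so check them against your own system before relying on this.

```python
import shutil
import subprocess


def powermizer_command(gpu_index=0, mode=1):
    """Build the nvidia-settings call that assigns GPUPowerMizerMode on one GPU."""
    return ["nvidia-settings", "-a", f"[gpu:{gpu_index}]/GPUPowerMizerMode={mode}"]


if __name__ == "__main__":
    cmd = powermizer_command()
    if shutil.which("nvidia-settings"):
        # Needs a running X session with the NVIDIA driver loaded.
        subprocess.run(cmd, check=False)
    else:
        print("nvidia-settings not found; would have run:", " ".join(cmd))
```

This only changes how the setting is applied. It does not explain why keeping the GPU at maximum performance stops it from falling off the bus.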
SMG
Level 25
Level 25
Posts: 31990
Joined: Sun Jul 26, 2020 6:15 pm
Location: USA

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by SMG »

piramiday wrote: Mon Dec 12, 2022 12:24 pm The solution is wonky and I still have trouble understanding it
If you provide the link to the topic, perhaps some of us here can help with understanding the solution.
piramiday wrote: Mon Dec 12, 2022 12:24 pm Also, I now have problems with suspend, etc, that I did not see when my GPU was falling off the bus...
Suspend usually works differently when the graphics drivers are properly loaded as compared to when they are not.
piramiday
Level 2
Level 2
Posts: 70
Joined: Tue Jun 25, 2013 10:07 am

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

SMG wrote: Mon Dec 12, 2022 12:35 pm If you provide the link to the topic, perhaps some of us here can help with understanding the solution.
sorry, of course! https://forums.developer.nvidia.com/t/r ... 022/236678
SMG wrote: Mon Dec 12, 2022 12:35 pm Suspend usually works differently when the graphics drivers are properly loaded as compared to when they are not.
I know, that is an entirely different can of worms. :)
piramiday
Level 2
Level 2
Posts: 70
Joined: Tue Jun 25, 2013 10:07 am

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

SMG wrote: Sat Dec 10, 2022 11:17 pm I do not know what criteria you are using for this, but the xrandr --verbose information in the Nvidia Bug Report Log (NBRL) appears to indicate the laptop screen is only controlled by Intel.
That was my reasoning: when I had `nvidia-prime` set to `nvidia`, that is, only use the nvidia GPU, any GPU crash would freeze the screen and laptop, but when I had `nvidia-prime` set to `on-demand` the system would stay alive thanks to the internal GPU.
If it was the internal GPU responsible for running the integrated display in both cases, then why was there this difference, crash vs. no crash, depending on whether the nvidia GPU was loaded?
You are correct about the `xrandr` lines, even now with a "working" nvidia GPU I do not see `PRIME Synchronization` appearing under eDP-1.
SMG wrote: Sat Dec 10, 2022 11:17 pm There is only the one modeline both in xrandr and the Xorg log. (I do not recall seeing something like that in the past.) Here it is from Xorg log.
Here is what I currently see, using my nvidia GPU thanks to the PowerMizer hackaround.

Code: Select all

$ journalctl -b | grep -i modeset
Dec 13 08:20:21 blade kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.60.11  Wed Nov 23 22:49:17 UTC 2022
Dec 13 08:29:18 blade kernel:  parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic libcrc32c xor raid6_pq zstd_compress dm_crypt dm_mirror dm_region_hash dm_log r8153_ecm cdc_ether usbnet r8152 mii joydev input_leds hid_generic usbhid hid nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) i915 i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec aesni_intel rc_core crypto_simd cryptd nvme thunderbolt drm xhci_pci nvme_core xhci_pci_renesas video pinctrl_tigerlake mac_hid

$ grep -i modeset /var/log/Xorg.0.log
[    43.866] (==) Matched modesetting as autoconfigured driver 2
[    43.879] (II) LoadModule: "modesetting"
[    43.879] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[    43.879] (II) Module modesetting: vendor="X.Org Foundation"
[    43.881] (II) modesetting: Driver for Modesetting Kernel Drivers: kms
[    43.911] (II) modeset(0): using drv /dev/dri/card0
[    43.915] (II) modeset(0): Creating default Display subsection in Screen section
[    43.915] (==) modeset(0): Depth 24, (==) framebuffer bpp 32
[    43.915] (==) modeset(0): RGB weight 888
[    43.915] (==) modeset(0): Default visual is TrueColor
[    44.054] (II) modeset(0): glamor X acceleration enabled on Mesa Intel(R) Graphics (ADL GT2)
[    44.054] (II) modeset(0): glamor initialized
[    44.054] (==) modeset(0): VariableRefresh: disabled
[    44.054] (==) modeset(0): AsyncFlipSecondaries: disabled
[    44.056] (II) modeset(0): Output eDP-1 has no monitor section
[    44.058] (II) modeset(0): EDID for output eDP-1
[    44.058] (II) modeset(0): Manufacturer: TMX  Model: 1560  Serial#: 0
[    44.058] (II) modeset(0): Year: 2021  Week: 44
[    44.058] (II) modeset(0): EDID Version: 1.4
[    44.058] (II) modeset(0): Digital Display Input
[    44.058] (II) modeset(0): 8 bits per channel
[    44.058] (II) modeset(0): Digital interface is DisplayPort
[    44.058] (II) modeset(0): Max Image Size [cm]: horiz.: 34  vert.: 19
[    44.058] (II) modeset(0): Gamma: 2.20
[    44.058] (II) modeset(0): No DPMS capabilities specified
[    44.058] (II) modeset(0): Supported color encodings: RGB 4:4:4 
[    44.058] (II) modeset(0): Default color space is primary color space
[    44.058] (II) modeset(0): First detailed timing is preferred mode
[    44.058] (II) modeset(0): Preferred mode is native pixel format and refresh rate
[    44.059] (II) modeset(0): Display is continuous-frequency
[    44.059] (II) modeset(0): redX: 0.640 redY: 0.330   greenX: 0.300 greenY: 0.600
[    44.059] (II) modeset(0): blueX: 0.150 blueY: 0.060   whiteX: 0.312 whiteY: 0.329
[    44.059] (II) modeset(0): Manufacturer's mask: 0
[    44.059] (II) modeset(0): Supported detailed timing:
[    44.059] (II) modeset(0): clock: 133.4 MHz   Image Size:  344 x 194 mm
[    44.059] (II) modeset(0): h_active: 1920  h_sync: 1963  h_sync_end 1995 h_blank_end 2000 h_border: 0
[    44.059] (II) modeset(0): v_active: 1080  v_sync: 1103  v_sync_end 1108 v_blanking: 1111 v_border: 0
[    44.059] (II) modeset(0): Ranges: V min: 48 V max: 360 Hz, H min: 410 H max: 410 kHz, PixClock max 825 MHz
[    44.059] (II) modeset(0): Monitor name: TL156VDXP02-0
[    44.059] (II) modeset(0): Number of EDID sections to follow: 1
[    44.059] (II) modeset(0): EDID (in hex):
[    44.059] (II) modeset(0): 	00ffffffffffff0051b8601500000000
[    44.059] (II) modeset(0): 	2c1f0104a522137807ee91a3544c9926
[    44.059] (II) modeset(0): 	0f505400000001010101010101010101
[    44.059] (II) modeset(0): 	0101010101012034805070381f402b20
[    44.059] (II) modeset(0): 	750458c210000018000000fd0e30699b
[    44.059] (II) modeset(0): 	9b52000a20202020202000000010000a
[    44.059] (II) modeset(0): 	202020202020202020202020000000fc
[    44.059] (II) modeset(0): 	00544c3135365644585030322d3001a0
[    44.059] (II) modeset(0): 	7013790000030128773801847f074f00
[    44.059] (II) modeset(0): 	2a001f0037041e001600040000000000
[    44.059] (II) modeset(0): 	00000000000000000000000000000000
[    44.059] (II) modeset(0): 	00000000000000000000000000000000
[    44.059] (II) modeset(0): 	00000000000000000000000000000000
[    44.059] (II) modeset(0): 	00000000000000000000000000000000
[    44.059] (II) modeset(0): 	00000000000000000000000000000000
[    44.059] (II) modeset(0): 	0000000000000000000000000000977c
[    44.059] (II) modeset(0): Printing probed modes for output eDP-1
[    44.059] (II) modeset(0): Modeline "1920x1080"x60.1  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    44.059] (II) modeset(0): Output eDP-1 connected
[    44.059] (II) modeset(0): Using exact sizes for initial modes
[    44.059] (II) modeset(0): Output eDP-1 using initial mode 1920x1080 +0+0
[    44.059] (==) modeset(0): Using gamma correction (1.0, 1.0, 1.0)
[    44.059] (==) modeset(0): DPI set to (96, 96)
[    44.358] (==) modeset(0): Backing store enabled
[    44.358] (==) modeset(0): Silken mouse enabled
[    44.428] (II) modeset(0): Initializing kms color map for depth 24, 8 bpc.
[    44.428] (==) modeset(0): DPMS enabled
[    44.428] (II) modeset(0): [DRI2] Setup complete
[    44.428] (II) modeset(0): [DRI2]   DRI driver: iris
[    44.428] (II) modeset(0): [DRI2]   VDPAU driver: va_gl
[    44.507] (II) modeset(0): Damage tracking initialized
[    44.507] (II) modeset(0): Setting screen physical size to 508 x 285
[    45.310] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    45.310] (II) modeset(0): Using EDID range info for horizontal sync
[    45.310] (II) modeset(0): Using EDID range info for vertical refresh
[    45.310] (II) modeset(0): Printing DDC gathered Modelines:
[    45.310] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    45.326] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    45.326] (II) modeset(0): Using hsync ranges from config file
[    45.326] (II) modeset(0): Using vrefresh ranges from config file
[    45.326] (II) modeset(0): Printing DDC gathered Modelines:
[    45.326] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    45.343] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    45.343] (II) modeset(0): Using hsync ranges from config file
[    45.343] (II) modeset(0): Using vrefresh ranges from config file
[    45.343] (II) modeset(0): Printing DDC gathered Modelines:
[    45.343] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    45.360] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    45.360] (II) modeset(0): Using hsync ranges from config file
[    45.360] (II) modeset(0): Using vrefresh ranges from config file
[    45.360] (II) modeset(0): Printing DDC gathered Modelines:
[    45.360] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    45.376] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    45.376] (II) modeset(0): Using hsync ranges from config file
[    45.377] (II) modeset(0): Using vrefresh ranges from config file
[    45.377] (II) modeset(0): Printing DDC gathered Modelines:
[    45.377] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    45.384] (II) modeset(0): Allocate new frame buffer 3840x2160 stride
[    45.760] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    45.760] (II) modeset(0): Using hsync ranges from config file
[    45.760] (II) modeset(0): Using vrefresh ranges from config file
[    45.760] (II) modeset(0): Printing DDC gathered Modelines:
[    45.760] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    45.776] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    45.776] (II) modeset(0): Using hsync ranges from config file
[    45.776] (II) modeset(0): Using vrefresh ranges from config file
[    45.776] (II) modeset(0): Printing DDC gathered Modelines:
[    45.776] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    47.184] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    47.184] (II) modeset(0): Using hsync ranges from config file
[    47.184] (II) modeset(0): Using vrefresh ranges from config file
[    47.184] (II) modeset(0): Printing DDC gathered Modelines:
[    47.184] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    47.201] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    47.201] (II) modeset(0): Using hsync ranges from config file
[    47.201] (II) modeset(0): Using vrefresh ranges from config file
[    47.201] (II) modeset(0): Printing DDC gathered Modelines:
[    47.201] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    47.218] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    47.218] (II) modeset(0): Using hsync ranges from config file
[    47.218] (II) modeset(0): Using vrefresh ranges from config file
[    47.218] (II) modeset(0): Printing DDC gathered Modelines:
[    47.218] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    47.234] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    47.234] (II) modeset(0): Using hsync ranges from config file
[    47.234] (II) modeset(0): Using vrefresh ranges from config file
[    47.235] (II) modeset(0): Printing DDC gathered Modelines:
[    47.235] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    47.251] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    47.251] (II) modeset(0): Using hsync ranges from config file
[    47.251] (II) modeset(0): Using vrefresh ranges from config file
[    47.251] (II) modeset(0): Printing DDC gathered Modelines:
[    47.251] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    47.268] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    47.268] (II) modeset(0): Using hsync ranges from config file
[    47.268] (II) modeset(0): Using vrefresh ranges from config file
[    47.268] (II) modeset(0): Printing DDC gathered Modelines:
[    47.268] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    47.489] (II) modeset(0): Allocate new frame buffer 4480x1768 stride
[    48.083] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    48.083] (II) modeset(0): Using hsync ranges from config file
[    48.083] (II) modeset(0): Using vrefresh ranges from config file
[    48.083] (II) modeset(0): Printing DDC gathered Modelines:
[    48.084] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    48.100] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    48.100] (II) modeset(0): Using hsync ranges from config file
[    48.100] (II) modeset(0): Using vrefresh ranges from config file
[    48.100] (II) modeset(0): Printing DDC gathered Modelines:
[    48.100] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    48.250] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    48.250] (II) modeset(0): Using hsync ranges from config file
[    48.250] (II) modeset(0): Using vrefresh ranges from config file
[    48.250] (II) modeset(0): Printing DDC gathered Modelines:
[    48.250] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    48.283] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    48.283] (II) modeset(0): Using hsync ranges from config file
[    48.283] (II) modeset(0): Using vrefresh ranges from config file
[    48.283] (II) modeset(0): Printing DDC gathered Modelines:
[    48.283] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    48.317] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    48.317] (II) modeset(0): Using hsync ranges from config file
[    48.317] (II) modeset(0): Using vrefresh ranges from config file
[    48.317] (II) modeset(0): Printing DDC gathered Modelines:
[    48.317] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    48.333] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    48.333] (II) modeset(0): Using hsync ranges from config file
[    48.333] (II) modeset(0): Using vrefresh ranges from config file
[    48.333] (II) modeset(0): Printing DDC gathered Modelines:
[    48.333] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    48.576] (II) modeset(0): Allocate new frame buffer 1920x1080 stride
[    50.540] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    50.540] (II) modeset(0): Using hsync ranges from config file
[    50.540] (II) modeset(0): Using vrefresh ranges from config file
[    50.540] (II) modeset(0): Printing DDC gathered Modelines:
[    50.540] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    50.573] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    50.573] (II) modeset(0): Using hsync ranges from config file
[    50.573] (II) modeset(0): Using vrefresh ranges from config file
[    50.573] (II) modeset(0): Printing DDC gathered Modelines:
[    50.573] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    50.590] (II) modeset(0): EDID vendor "TMX", prod id 5472
[    50.590] (II) modeset(0): Using hsync ranges from config file
[    50.590] (II) modeset(0): Using vrefresh ranges from config file
[    50.590] (II) modeset(0): Printing DDC gathered Modelines:
[    50.590] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[    50.607] (II) modeset(0): Allocate new frame buffer 4480x1768 stride
[   985.836] (II) modeset(0): EDID vendor "TMX", prod id 5472
[   985.836] (II) modeset(0): Using hsync ranges from config file
[   985.836] (II) modeset(0): Using vrefresh ranges from config file
[   985.836] (II) modeset(0): Printing DDC gathered Modelines:
[   985.837] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[   985.853] (II) modeset(0): EDID vendor "TMX", prod id 5472
[   985.853] (II) modeset(0): Using hsync ranges from config file
[   985.853] (II) modeset(0): Using vrefresh ranges from config file
[   985.853] (II) modeset(0): Printing DDC gathered Modelines:
[   985.853] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[   993.080] (II) modeset(0): EDID vendor "TMX", prod id 5472
[   993.080] (II) modeset(0): Using hsync ranges from config file
[   993.080] (II) modeset(0): Using vrefresh ranges from config file
[   993.080] (II) modeset(0): Printing DDC gathered Modelines:
[   993.080] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[   993.096] (II) modeset(0): EDID vendor "TMX", prod id 5472
[   993.097] (II) modeset(0): Using hsync ranges from config file
[   993.097] (II) modeset(0): Using vrefresh ranges from config file
[   993.097] (II) modeset(0): Printing DDC gathered Modelines:
[   993.097] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[  1005.585] (II) modeset(0): EDID vendor "TMX", prod id 5472
[  1005.585] (II) modeset(0): Using hsync ranges from config file
[  1005.585] (II) modeset(0): Using vrefresh ranges from config file
[  1005.585] (II) modeset(0): Printing DDC gathered Modelines:
[  1005.585] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[  1005.602] (II) modeset(0): EDID vendor "TMX", prod id 5472
[  1005.602] (II) modeset(0): Using hsync ranges from config file
[  1005.602] (II) modeset(0): Using vrefresh ranges from config file
[  1005.602] (II) modeset(0): Printing DDC gathered Modelines:
[  1005.602] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[  1043.568] (II) modeset(0): EDID vendor "TMX", prod id 5472
[  1043.568] (II) modeset(0): Using hsync ranges from config file
[  1043.568] (II) modeset(0): Using vrefresh ranges from config file
[  1043.568] (II) modeset(0): Printing DDC gathered Modelines:
[  1043.568] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
[  1043.584] (II) modeset(0): EDID vendor "TMX", prod id 5472
[  1043.585] (II) modeset(0): Using hsync ranges from config file
[  1043.585] (II) modeset(0): Using vrefresh ranges from config file
[  1043.585] (II) modeset(0): Printing DDC gathered Modelines:
[  1043.585] (II) modeset(0): Modeline "1920x1080"x0.0  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
SMG wrote: Sat Dec 10, 2022 11:17 pm Considering you say the specs for your laptop do not match this, and the fact Linux-based distros can do more than 60Hz displays, this display seems to be key to the problem.
Full specs are here: https://mysupport.razer.com/app/answers ... /a_id/5901
Specifically, I have the model with RTX 3070 Ti and 360 Hz FHD panel.
I am also using a 180 Hz external monitor, and the Linux Mint Displays settings only offer 60 Hz for it, too.
SMG wrote: Sat Dec 10, 2022 11:17 pm Within the NBRL are the following lines which may be when you had a monitor attached? I'm not completely sure because there were several boot logs attached and I don't know if the conditions were the same for all of them. The last lines only showed in one of the boot logs.
I don't know exactly what the nvidia debug script collects, but I have been trying everything over the past week, so it is entirely possible that boots with the external monitor attached are in there along with boots in which the GPU crashed on its own.
I'm sorry, but as far as I have seen so far the external monitor plays no part in this bug.
SMG wrote: Sat Dec 10, 2022 11:17 pm There was also this information.

Code: Select all

/usr/bin/lspci -d "10de:*" -v -xxx
The current output (again, using the PowerMizer trick so that both eDP and external HDMI work) does not show "unknown header type", so that error was indeed due to the GPU falling off the bus:

Code: Select all

$ lspci -d "10de:*" -v -xxx
01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [Geforce RTX 3070 Ti Laptop GPU] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Razer USA Ltd. Device 201b
	Flags: bus master, fast devsel, latency 0, IRQ 172, IOMMU group 15
	Memory at 83000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 6000000000 (64-bit, prefetchable) [size=16G]
	Memory at 6400000000 (64-bit, prefetchable) [size=32M]
	I/O ports at 3000 [size=128]
	Expansion ROM at 84000000 [virtual] [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
00: de 10 a0 24 07 04 10 00 a1 00 00 03 00 00 80 00
10: 00 00 00 83 0c 00 00 00 60 00 00 00 0c 00 00 00
20: 64 00 00 00 01 30 00 00 00 00 00 00 58 1a 1b 20
30: 00 00 00 00 60 00 00 00 00 00 00 00 ff 01 00 00

01:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
	Subsystem: Razer USA Ltd. GA104 High Definition Audio Controller
	Flags: bus master, fast devsel, latency 0, IRQ 17, IOMMU group 15
	Memory at 84080000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
00: de 10 8b 22 06 00 10 00 a1 00 03 04 10 00 80 00
10: 00 00 08 84 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 58 1a 1b 20
30: 00 00 00 00 60 00 00 00 00 00 00 00 ff 02 00 00
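For anyone wanting to check quickly whether the GPU is still reachable, the two symptoms mentioned in this thread can be grepped for. This is only a rough sketch: the helper names `lspci_header_broken` and `gpu_fell_off_bus` are made up for illustration, and the exact wording of the messages may differ between driver versions.

```shell
# Both helpers read text on stdin, so they can be fed from any source.

# Succeeds if lspci output contains the "unknown header type" symptom.
lspci_header_broken() {
    grep -qi 'unknown header type'
}

# Succeeds if kernel log text contains the "fallen off the bus" message.
gpu_fell_off_bus() {
    grep -qi 'fallen off the bus'
}

# Example (reading the kernel log may require sudo):
#   lspci -d "10de:*" -v | lspci_header_broken && echo "GPU config space unreadable"
#   journalctl -k -b | gpu_fell_off_bus && echo "GPU fell off the bus during this boot"
```

When the GPU is healthy, as in the output above, neither check should report anything.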
SMG wrote: Sat Dec 10, 2022 11:17 pm It would seem to indicate there will be essentially no power savings to suspending your computer. S0 state is usually the normal wake state. I would guess S0 idle is just a notch below full power.
I have yet to try and debug the sleep issues... sigh.
Did you comment on this with respect to the GPU problem, instead?
SMG wrote: Sat Dec 10, 2022 11:17 pm are not really that helpful, but they don't give me hope that an older driver might help.
I believe those are not helpful because the nvidia debug script is just fetching information that, in the very specific case of the GPU falling off the bus, cannot even be queried!
In normal situations, I guess, you can still query `nvidia-settings` and so on to fetch info, but here the communication with the GPU is lost entirely.
So, based on what I understand of this, it is normal that those lines are inconclusive.

It goes without saying that you are extremely helpful, SMG, and I thank you wholeheartedly.
I will try any further suggestion you have in between working days. :roll:
For now, I will not mark the topic as SOLVED but I will edit my initial post to suggest the temporary workaround.
I am very happy to continue this technical discussion, but even if you have exhausted all troubleshooting avenues, I will report back for every new NVIDIA driver I see.
In case I do install CUDA, which was my original plan, I think I will be bumped to a downgraded NVIDIA version, and then the troubles might start again. :(
SMG
Level 25
Posts: 31990
Joined: Sun Jul 26, 2020 6:15 pm
Location: USA

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by SMG »

piramiday wrote: Tue Dec 13, 2022 10:08 am That was due to my reasoning, since when I had `nvidia-prime` set to `nvidia`, that is, only use the nvidia GPU, any GPU crash would freeze the screen and laptop, but when I had `nvidia-prime` set to `on-demand` it would remain alive thanks to the internal GPU.
If it was the internal GPU responsible for running the integrated display in both cases, then why was there this difference, crash vs. no crash, depending on whether the nvidia GPU was loaded?
I do not know how your laptop's hardware is wired so I do not have any guesses to explain what you saw. While there are usual ways laptop GPUs are wired, there are no standards so it can be different for each manufacturer. Additionally, they might have information in the firmware which is also a factor in displaying the graphics.
piramiday wrote: Tue Dec 13, 2022 10:08 am
SMG wrote: Sat Dec 10, 2022 11:17 pm There is only the one modeline both in xrandr and the Xorg log. (I do not recall seeing something like that in the past.) Here it is from Xorg log.
Here is what I currently see, using my nvidia GPU thanks to the PowerMizer hackaround.

Code: Select all

$ journalctl -b | grep -i modeset

$ grep -i modeset /var/log/Xorg.0.log
My comment mentioned modeline not modeset. Is there a reason you were searching for modeset?

Here is the Modeline. There is only one in this output.

Code: Select all

[    44.059] (II) modeset(0): Modeline "1920x1080"x60.1  133.44  1920 1963 1995 2000  1080 1103 1108 1111 -hsync -vsync (66.7 kHz eP)
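To see how many distinct modelines the log really contains, something like the following could be used. It is only a sketch: `unique_modelines` is a made-up helper name, and it assumes the log lines look like the one above, with a bracketed timestamp at the start.

```shell
# List distinct Modeline entries from an Xorg log, ignoring timestamps.
# $1: path to the log, e.g. /var/log/Xorg.0.log
unique_modelines() {
    # Match 'Modeline "' so the "Printing DDC gathered Modelines:" lines are skipped.
    grep 'Modeline "' "$1" |
        sed -E 's/^\[[^]]*\] //' |   # drop the leading "[    44.059] " timestamp
        sort | uniq -c               # count identical entries
}

# Example:
#   unique_modelines /var/log/Xorg.0.log
```

Each output line is a count followed by the modeline text, so a single resolution repeated many times shows up as one or two lines with large counts.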
piramiday wrote: Tue Dec 13, 2022 10:08 am
SMG wrote: Sat Dec 10, 2022 11:17 pm It would seem to indicate there will be essentially no power savings to suspending your computer. S0 state is usually the normal wake state. I would guess S0 idle is just a notch below full power.
I have yet to try and debug the sleep issues... sigh.
Did you comment on this with respect to the GPU problem, instead?
I am not aware of it having anything to do with the GPU problem. I just mentioned it because I noticed it in the logs and wanted to make you aware of it.
piramiday wrote: Tue Dec 13, 2022 10:08 am For now, I will not mark the topic as SOLVED but I will edit my initial post to suggest the temporary workaround.
And the only problem you are having right now is the below? (I copied the below from the topic on the Nvidia forum.)
BadWolf84 wrote: So far the only solution for me is to open nvidia-settings and in the PowerMizer section change Preferred Mode to Prefer Maximum Performance. As long as it stays there it won't crash. Not the best solution but it works for now.
As a first test, it does not seem that the PowerMizer setting survives a reboot.
I also tried from the command line to type:

Code: Select all

nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
but when I opened Nvidia settings immediately after that, the PowerMizer setting was still on “Auto”.
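Rather than reopening the GUI, the current value can also be read back from the command line. The sketch below assumes the usual numbering of the `GPUPowerMizerMode` attribute (0, 1, 2) as I understand it from the NV-CONTROL headers; `powermizer_label` is just an illustrative helper name, and any other value is reported as-is.

```shell
# Map the numeric GPUPowerMizerMode value to the label shown in the GUI.
# The 0/1/2 mapping is an assumption based on the NV-CONTROL headers.
powermizer_label() {
    case "$1" in
        0) echo "Adaptive" ;;
        1) echo "Prefer Maximum Performance" ;;
        2) echo "Auto" ;;
        *) echo "Other ($1)" ;;
    esac
}

# Example, on a machine with the NVIDIA driver loaded
# (-t prints only the value, -q queries the attribute):
#   powermizer_label "$(nvidia-settings -t -q '[gpu:0]/GPUPowerMizerMode')"
```

If this still prints "Auto" right after the `-a` command, the assignment did not take effect.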
piramiday
Level 2
Posts: 70
Joined: Tue Jun 25, 2013 10:07 am

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

SMG wrote: Tue Dec 13, 2022 1:06 pm I do not know how your laptop's hardware is wired so I do not have any guesses to explain what you saw. While there are usual ways laptop GPUs are wired, there are no standards so it can be different for each manufacturer. Additionally, they might have information in the firmware which is also a factor in displaying the graphics.
oh, I have no idea either. :)
it's weird that it might come down to knowing exactly how the hardware behaves at such a low level.
SMG wrote: Tue Dec 13, 2022 1:06 pm My comment mentioned modeline not modeset. Is there a reason you were searching for modeset?
the reason was that I looked at a glance and saw the `(II) modeset(0)` header, rather than the "modeline" string.
in any case, I now have exactly the same line in the logs, no difference there.
SMG wrote: Tue Dec 13, 2022 1:06 pm I am not aware of it having anything to do with the GPU problem. I just mentioned it because I noticed it in the logs and wanted to make you aware of it.
thanks, it might have something to do with the sleep issue, which is that the laptop wakes up immediately after trying to suspend.
I will look into that next.
SMG wrote: Tue Dec 13, 2022 1:06 pm And the only problem you are having right now is the below?
nono, I was able to successfully restore the PowerMizer settings at boot through a `.desktop` file.
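For reference, a minimal sketch of what such an autostart entry could look like. The file name `nvidia-powermizer.desktop`, the `Name` value and the helper `install_powermizer_autostart` are only examples, not necessarily what I have on disk; the `Exec` line reuses the `nvidia-settings` command quoted above, and `~/.config/autostart` is assumed to be the directory the desktop session reads.

```shell
# Write an XDG autostart entry that re-applies the PowerMizer mode at login.
# $1: autostart directory (defaults to ~/.config/autostart)
install_powermizer_autostart() {
    dir="${1:-$HOME/.config/autostart}"
    mkdir -p "$dir"
    # Quoted 'EOF' keeps the content literal (no shell expansion).
    cat > "$dir/nvidia-powermizer.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=NVIDIA PowerMizer: prefer maximum performance
Exec=nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
EOF
}

# Example:
#   install_powermizer_autostart
```

The entry only runs at login, so the GPU still boots in the default mode and switches once the desktop session starts.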

The problem now is that the bug is still alive and kicking, and I have no way to tell what happens if, say, I downgrade the nvidia driver or change anything else.
I have not understood who is at fault here, or how to quickly solve the issue in case it comes up again.
In general, this does not feel like a solution but a barely functioning workaround.
What happens if I unplug the laptop from power and work on battery?
Will this "maximum performance mode", with its minimum memory transfer rate of 10+ GHz rather than 800 MHz, burn through my battery in no time?

Yesterday I also tried, after hours of Maximum performance, switching the PowerMizer mode to Adaptive, and the system was stable for several more hours.
Today, I booted into Adaptive from the beginning, and the GPU fell off the bus as usual.
Is there anything that gets done after boot time, say in the first 15 mins, that never gets executed again?
It might not be a GPU clock issue, after all. :?
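One way to narrow this down might be to log the GPU state from boot until the crash and look at the last samples. A rough sketch, assuming `nvidia-smi` is installed with the driver; `log_gpu_state` and the `NVIDIA_SMI` override are made-up names, and some fields may read "N/A" on a laptop GPU.

```shell
# Append GPU state samples to a CSV file every few seconds.
# $1: output file, $2: interval in seconds (default 5)
# NVIDIA_SMI is a hypothetical override, handy for trying this without a GPU.
log_gpu_state() {
    "${NVIDIA_SMI:-nvidia-smi}" \
        --query-gpu=timestamp,pstate,clocks.mem,clocks.gr,power.draw \
        --format=csv -l "${2:-5}" >> "$1"
}

# Example (stop with Ctrl+C):
#   log_gpu_state ~/gpu-state.csv 5
```

If the performance state or memory clock drops just before the last sample, that would point back at the clocks; if nothing changes, something else is probably going on.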
roblm
Level 15
Posts: 5939
Joined: Sun Feb 24, 2013 2:41 pm

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by roblm »

Have you tried using the Nvidia-470 driver?
piramiday
Level 2
Posts: 70
Joined: Tue Jun 25, 2013 10:07 am

Re: NVIDIA GPU falls off the bus on Razer Blade 15 2022

Post by piramiday »

roblm wrote: Wed Dec 14, 2022 12:58 pm Have you tried using the Nvidia-470 driver?
I have not tried it, not yet.
My Driver Manager lists the following as options:
- nvidia-driver-525-open
- nouveau
- nvidia-driver-470
- nvidia-driver-510
- nvidia-driver-515-open
- nvidia-driver-515
- nvidia-driver-525

Why version 470, specifically? Is it more stable?
I thought that newer drivers were better with newer cards.

It might be that if and when I install CUDA, then the 470 driver will be installed for me, since I believe that is what happened with my other laptop.