Linux Mint 20.1 Kernel Crashes, Full Hang

MCDELTAT
Level 1
Posts: 30
Joined: Thu May 26, 2016 11:54 pm
Location: California, US

Post by MCDELTAT »

Hi everyone, I was wondering if you could help me track down what's happening with my Linux Mint install. Recently I've been having random hard crashes where things already in memory continue to work fine, but it's clear something big has broken. I can't spawn any processes, can't connect from another host via SSH, can't ping the interface (so the entire networking stack is down), and can't open a terminal locally. Sometimes it happens when I've done absolutely nothing to the machine. For example, today I was working from my work laptop, went to SSH into that host, and couldn't connect.

The only change that I did recently was to move the main OS from a 512GB M.2 SSD to a 2TB M.2 SSD. I performed that with dd

Code: Select all

sudo dd status=progress if=/dev/nvme0n0 of=/dev/nvme0n1
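A clone like this can be checked after the fact by comparing the source against the same number of bytes at the start of the target. A minimal sketch, assuming GNU cmp and util-linux blockdev are installed; `verify_clone` is a hypothetical helper name and the device paths in the comment are placeholders, not the drives from this thread:

```shell
# Compare the source with the first <source size> bytes of the target.
# Works on block devices (size via blockdev) and on plain image files.
verify_clone() {
    src=$1
    dst=$2
    if [ -b "$src" ]; then
        bytes=$(blockdev --getsize64 "$src")      # block device size in bytes
    else
        bytes=$(wc -c < "$src" | tr -d ' ')       # regular file size in bytes
    fi
    # cmp -n limits the comparison to that many bytes (GNU cmp option)
    cmp -n "$bytes" "$src" "$dst"
}

# Example with placeholder device names:
#   sudo sh -c '. ./verify_clone.sh; verify_clone /dev/SOURCE /dev/TARGET'
```

It exits 0 when the target starts with an exact copy of the source; the extra space on a larger target is ignored.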


What could the issue be? How do I trace it down? I dug through some of the logs and didn't see anything. Kernel logs get flushed on every boot based on the rsyslog configs, no? Should I persist them so I have something to look at when it dies again? I'm fine with a clean install, but I would love to trace it down and learn something new.
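On the logging question: on Ubuntu-based systems rsyslog normally keeps appending to /var/log/kern.log across boots, and the systemd journal keeps earlier boots when /var/log/journal exists (assuming journald's default Storage=auto). A small sketch for checking that; `journal_persistence` is a made-up helper name:

```shell
# Report whether the systemd journal is likely kept across reboots.
# Assumes journald.conf uses the default Storage=auto, where the journal is
# written to disk only if the directory below exists.
journal_persistence() {
    dir="${1:-/var/log/journal}"
    if [ -d "$dir" ]; then
        echo persistent
    else
        echo volatile
    fi
}

# If it is persistent, kernel messages from the boot before the current one:
#   journalctl -k -b -1
# To start keeping them (takes effect from the next boot at the latest):
#   sudo mkdir -p /var/log/journal
```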

I had another crash a few hours later and managed to pull this from /var/log/kern.log.

Code: Select all

May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476406] BUG: Bad page state in process Compositor  pfn:e082be
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476417] page:fffffbacb820af80 refcount:0 mapcount:0 mapping:0800000000000000 index:0x0
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476428] general protection fault: 0000 [#1] SMP PTI
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476435] CPU: 8 PID: 11501 Comm: Compositor Tainted: P           OE     5.4.0-73-generic #82-Ubuntu
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476438] Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2401 07/12/2019
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476448] RIP: 0010:__dump_page.cold+0x224/0x280
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476454] Code: eb b5 48 8b 43 08 a8 01 74 16 48 83 e8 01 f6 40 18 01 74 11 48 c7 c6 ae 13 58 96 e9 37 fe ff ff 48 89 d8 eb e9 4d 85 ed 74 38 <49> 8b 45 00 49 8b 75 70 48 85 c0 74 37 48 8b 80 38 01 00 00 48 85
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476458] RSP: 0018:ffffa84a837bfa28 EFLAGS: 00010006
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476462] RAX: fffffbacb820af80 RBX: fffffbacb820af80 RCX: 0000000000000006
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476466] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff970e0ea178c0
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476469] RBP: ffffa84a837bfa50 R08: 00000000000004e4 R09: 0000000000000004
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476472] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff965820a9
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476475] R13: 0800000000000000 R14: 0000000000000000 R15: 0000000000000000
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476479] FS:  00007f0434e1c700(0000) GS:ffff970e0ea00000(0000) knlGS:0000000000000000
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476486] CR2: 0000337259f8d000 CR3: 0000000ff2a22006 CR4: 00000000003606e0
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476490] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476493] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476495] Call Trace:
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476504]  bad_page.cold+0x59/0xb1
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476512]  check_new_page_bad+0x67/0x80
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476518]  rmqueue+0x72e/0xf00
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476525]  ? __switch_to_asm+0x40/0x70
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476532]  get_page_from_freelist+0xb8/0x3f0
May 17 16:08:42 aaron-Mint-Rig kernel: [ 9180.476540]  __alloc_pages_nodemask+0x173/0x320
May 17 16:08:42 aaron-Mint-Rig May 17 16:10:48 aaron-Mint-Rig kernel: [    0.000000] microcode: microcode updated early to revision 0xde, date = 2020-05-25
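For anyone hunting the same signature in a long kern.log, the marker lines can be pulled out with grep. A minimal sketch; `crash_lines` is a hypothetical helper and the pattern only covers the three markers visible in the excerpt above, so other kinds of kernel faults would need more patterns:

```shell
# Print the kernel log lines that mark a crash like the one above.
# Defaults to /var/log/kern.log; pass another path to scan a rotated copy.
crash_lines() {
    grep -E 'BUG: |general protection fault|Call Trace:' "${1:-/var/log/kern.log}"
}

# Example:
#   crash_lines /var/log/kern.log
```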

Code: Select all

            ...-:::::-...                 aaron@aaron-Mint-Rig 
          .-MMMMMMMMMMMMMMM-.              -------------------- 
      .-MMMM`..-:::::::-..`MMMM-.          OS: Linux Mint 20.1 x86_64 
    .:MMMM.:MMMMMMMMMMMMMMM:.MMMM:.        Kernel: 5.4.0-73-generic 
   -MMM-M---MMMMMMMMMMMMMMMMMMM.MMM-       Uptime: 4 hours, 18 mins 
 `:MMM:MM`  :MMMM:....::-...-MMMM:MMM:`    Packages: 3184 (dpkg), 27 (flatpak), 7 (snap) 
 :MMM:MMM`  :MM:`  ``    ``  `:MMM:MMM:    Shell: bash 5.0.17 
.MMM.MMMM`  :MM.  -MM.  .MM-  `MMMM.MMM.   Resolution: 2560x1440 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   DE: Cinnamon 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM:MMM:   WM: Mutter (Muffin) 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   WM Theme: Mint-Y-Dark (Mint-Y) 
.MMM.MMMM`  :MM:--:MM:--:MM:  `MMMM.MMM.   Theme: Mint-Y [GTK2/3] 
 :MMM:MMM-  `-MMMMMMMMMMMM-`  -MMM-MMM:    Icons: Mint-Y [GTK2/3] 
  :MMM:MMM:`                `:MMM:MMM:     Terminal: gnome-terminal 
   .MMM.MMMM:--------------:MMMM.MMM.      CPU: Intel i7-8700K (12) @ 4.700GHz 
     '-MMMM.-MMMMMMMMMMMMMMM-.MMMM-'       GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 2484 
       '.-MMMM``--:::::--``MMMM-.'         Memory: 4706MiB / 64259MiB 
            '-MMMMMMMMMMMMM-'
               ``-:::::-``                                         
                                                                   

Notes on the GPU, since I suspect it might be the compositor:
ASUS RTX 3070 Dual. Driver 460.73.01
Minor overclock in GreenWithEnvy, but it fails even with everything back at stock frequencies.

I also saw some other related issues that discussed failing RAM. I'll be testing my RAM from the boot menu tonight.
mikeflan
Level 7
Posts: 1624
Joined: Sun Apr 26, 2020 9:28 am
Location: Houston, TX

Re: Linux Mint 20.1 Kernel Crashes, Full Hang

Post by mikeflan »

You sound like an advanced user to me.
Some general ideas:
Which Nvidia graphics do you have?

Code: Select all

inxi -G
Consider stepping down to the 450 driver in Driver Manager if it is an option. Might be worth a quick check.
Also, why aren't you using the 5.8 (or higher) kernel?
spamegg
Level 6
Posts: 1178
Joined: Mon Oct 28, 2019 2:34 am

Post by spamegg »

MCDELTAT wrote:
The only change that I did recently was to move the main OS from a 512GB M.2 SSD to a 2TB M.2 SSD. I performed that with dd

Code: Select all

sudo dd status=progress if=/dev/nvme0n0 of=/dev/nvme0n1
Well, you can't just move an operating system to another drive and expect it to work correctly like nothing happened. You probably broke your system irreparably. The fact that you can't even spawn processes or open up a terminal etc. shows that it has nothing to do with the GPU or whatever; very fundamental system functionality is broken. The system probably has a lot of things that depend on the drive/location, and it would be virtually impossible to track them all down.

dd is just for copying a file (man dd says "convert and copy a file"); it's not like it can create a system image/clone or anything, or restore a system like Timeshift.

Post by mikeflan »

Thank you spamegg. I should have seen that.
My myopic thinking was going to lead me down a rabbit hole. :roll:

Post by MCDELTAT »

spamegg wrote:
Tue May 18, 2021 8:06 am
Well, you can't just move an operating system to another drive and expect it to work correctly like nothing happened. You probably broke your system irreparably. The fact that you can't even spawn processes or open up a terminal etc. shows that it has nothing to do with the GPU or whatever; very fundamental system functionality is broken. The system probably has a lot of things that depend on the drive/location, and it would be virtually impossible to track them all down.

dd is just for copying a file (man dd says "convert and copy a file"); it's not like it can create a system image/clone or anything, or restore a system like Timeshift.
I've cloned systems with dd tons of times. It works at the block level, so it copies every single bit on the drive, bit for bit. My system currently works fine; all my configs and Docker containers came over fine. It's at very random intervals that it crashes and I can no longer do anything. The crashes that made me create this thread were 4 hours apart. Before this, my system might be up for weeks.

I'm actually beginning to think this is an issue with my RAM. I ran memtester last night with

Code: Select all

sudo memtester 32G 2
which only covers half of my system RAM, but it came back with errors. So I fired up memtest86 from a bootable flash drive, which runs through everything. A single loop of those tests came back with 44 errors. I'm going to test each stick individually to see what's going on and whether there's a way to force or exacerbate the behavior.
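memtester can only lock memory that is currently free, which is why a run from inside the OS covers just part of the RAM. A rough sketch for sizing the run from MemAvailable in /proc/meminfo; `memtester_size_mb` and the 1024 MB default reserve are illustrative choices, not recommendations from this thread:

```shell
# Print how many megabytes to hand to memtester: MemAvailable minus a reserve.
#   $1  reserve in MB to leave for the rest of the system (default 1024)
#   $2  meminfo file to read (default /proc/meminfo)
memtester_size_mb() {
    reserve_mb=${1:-1024}
    meminfo=${2:-/proc/meminfo}
    # MemAvailable is reported in kB, so divide by 1024 to get MB
    awk -v r="$reserve_mb" '/^MemAvailable:/ { print int($2 / 1024) - r }' "$meminfo"
}

# Example: two loops over everything that can reasonably be locked
#   sudo memtester "$(memtester_size_mb)M" 2
```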

So the error reported in kern.log might not mean that the compositor crashed on its own, but that the region of RAM the compositor went to fetch had been corrupted, and it crashed because it couldn't handle the values it got back.
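The pfn in the "Bad page state" line is a page frame number, so it can be turned into a physical address and compared with the failing addresses memtest86 prints. A minimal sketch, assuming the 4 KiB page size that x86_64 uses by default; `pfn_to_addr` is a hypothetical helper:

```shell
# Convert a hexadecimal page frame number (without the 0x prefix) into the
# physical address of the start of that page, assuming 4096-byte pages.
pfn_to_addr() {
    printf '0x%x\n' $(( 0x$1 * 4096 ))
}

# Example with the pfn from the log above:
#   pfn_to_addr e082be
```

For pfn:e082be this gives 0xe082be000, roughly 56 GiB into the physical address space. Whether that lines up with a memtest86 failure is something to check rather than assume.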

For the sake of future readers, however, here's the output of

Code: Select all

inxi -G

Code: Select all

Graphics:  Device-1: NVIDIA driver: nvidia v: 460.73.01 
           Display: x11 server: X.Org 1.20.9 driver: nvidia resolution: 2560x1440~60Hz 
           OpenGL: renderer: GeForce RTX 3070/PCIe/SSE2 v: 4.6.0 NVIDIA 460.73.01 