System lockups with Ryzen 7 1700

Post by unfa » Sun Jul 16, 2017 7:25 am

I'm running Linux Mint 18.1 KDE5 (64-bit).

My hardware is:

CPU: Ryzen 7 1700
Mobo: ASUS PRIME B350M-A, with BIOS 0604 04/06/2017

I'm not overclocking anything, and the temperatures are under control.

I have two problems.

A: Random complete system lockups
B: Random strange behaviour where htop won't run (but top runs fine).

Condition A - the system freezes and all I can do is press Alt+SysRq+B, which sometimes doesn't work either, so I have to use the hardware reset switch. I don't know how to debug this. I could run dmesg | tail over SSH from my phone to catch the last messages before the system dies, but I never know when a lockup will happen, so I'd need some kind of 24/7 monitoring, and I'm not sure how to set that up.
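One way the 24/7 capture could be approached is to stream kernel messages into a file and force every line to disk as it arrives, so that the last lines before a freeze are still there after a reset. The following is only a minimal sketch under assumptions: the function names persist_lines and follow_kernel_log and the log path are made up for illustration, and it relies on the dmesg from util-linux supporting --follow.

```python
import os
import subprocess


def persist_lines(lines, path):
    """Append each line to `path` and fsync before reading the next one."""
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line if line.endswith("\n") else line + "\n")
            log.flush()              # push Python's buffer to the OS
            os.fsync(log.fileno())   # ask the OS to push it to the disk


def follow_kernel_log(path):
    """Stream `dmesg --follow` into `path` until interrupted (hypothetical helper)."""
    proc = subprocess.Popen(
        ["dmesg", "--follow"],
        stdout=subprocess.PIPE,
        text=True,
        errors="replace",
    )
    try:
        persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

It would be started once after boot, for example with follow_kernel_log("/var/log/lockup-dmesg.log") (the path is an assumption, and reading the kernel log may require root). If the machine freezes hard enough, the final messages may still never reach the disk, so this can only improve the odds of catching something.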

Condition B - the system seems to function, but many programs never start. If I run htop, it blanks the terminal window, but nothing ever happens after that. top, however, runs without problems. When this happens I also see some other anomalous behaviour that doesn't make any sense to me. I dumped the full dmesg output and it says something like this:
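To see which programs are stuck while the system is in condition B, one option is to list processes in uninterruptible sleep (state D) by reading /proc/<pid>/stat. This is a rough sketch that assumes the hung programs are actually blocked in state D, which the post does not confirm. The names parse_stat and blocked_processes are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited, or its stat could not be decoded
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling blocked_processes() from a shell that still responds would show whether htop and the other hung programs are waiting inside the kernel.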

NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [khugepaged:111]
Full dmesg is here: - the interesting part starts at line 770.

It's basically repeating this over and over:

[37655.157050] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [khugepaged:111]
[37655.157059] Modules linked in: snd_seq_dummy snd_hrtimer nvidia_uvm(POE) uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_codec_hdmi binfmt_misc videobuf2_v4l2 videobuf2_core videodev snd_usb_audio joydev input_leds media snd_usbmidi_lib eeepc_wmi asus_wmi sparse_keymap video edac_mce_amd edac_core kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic pcbc nvidia_drm(POE) nvidia_modeset(POE) aesni_intel aes_x86_64 nvidia(POE) crypto_simd glue_helper snd_hda_intel cryptd snd_seq_midi snd_hda_codec snd_seq_midi_event snd_hda_core snd_hwdep snd_rawmidi snd_pcm snd_seq snd_seq_device snd_timer i2c_piix4 snd drm_kms_helper ccp soundcore drm fb_sys_fops syscopyarea sysfillrect sysimgblt shpchp wmi i2c_designware_platform i2c_designware_core 8250_dw
[37655.157107]  mac_hid lm78 hwmon_vid parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq dm_mirror dm_region_hash dm_log uas usb_storage hid_generic usbhid hid ahci r8168(OE) libahci gpio_amdpt fjes gpio_generic
[37655.157126] CPU: 12 PID: 111 Comm: khugepaged Tainted: P      D    OEL  4.10.0-21-lowlatency #23~16.04.1-Ubuntu
[37655.157128] Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 0604 04/06/2017
[37655.157132] task: ffff9fd70b970000 task.stack: ffffaf9001d64000
[37655.157139] RIP: 0010:native_queued_spin_lock_slowpath+0x17c/0x1a0
[37655.157141] RSP: 0018:ffffaf9001d677f8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[37655.157144] RAX: 0000000000000101 RBX: ffffe17a0ebcb9f0 RCX: 0000000000000001
[37655.157146] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffe17a0ebcb9f0
[37655.157148] RBP: ffffaf9001d677f8 R08: 0000000000000101 R09: 0000000000000000
[37655.157150] R10: 0000000000000000 R11: 0000000000000000 R12: ffffe17a055d5080
[37655.157152] R13: ffffaf9001d67878 R14: ffff9fd6af2e7490 R15: 0000000000000000
[37655.157154] FS:  0000000000000000(0000) GS:ffff9fd70e900000(0000) knlGS:0000000000000000
[37655.157156] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37655.157158] CR2: 00007f3769e34008 CR3: 0000000130209000 CR4: 00000000003406e0
[37655.157160] Call Trace:
[37655.157167]  _raw_spin_lock+0x27/0x30
[37655.157172]  __page_check_address+0xd9/0x1c0
[37655.157175]  try_to_unmap_one+0x7d/0x650
[37655.157178]  rmap_walk_anon+0xd9/0x270
[37655.157181]  rmap_walk+0x48/0x60
[37655.157184]  try_to_unmap+0x10e/0x140
[37655.157187]  ? page_remove_rmap+0x260/0x260
[37655.157190]  ? __page_set_anon_rmap+0x70/0x70
[37655.157192]  ? page_get_anon_vma+0xa0/0xa0
[37655.157195]  ? invalid_mkclean_vma+0x20/0x20
[37655.157199]  migrate_pages+0x94a/0xa50
[37655.157202]  ? __ClearPageMovable+0x10/0x10
[37655.157204]  ? isolate_freepages_block+0x3a0/0x3a0
[37655.157207]  compact_zone+0x567/0x930
[37655.157210]  compact_zone_order+0x90/0xb0
[37655.157213]  try_to_compact_pages+0x1b9/0x2e0
[37655.157217]  __alloc_pages_direct_compact+0x46/0xf0
[37655.157219]  __alloc_pages_slowpath+0x7d5/0xb10
[37655.157223]  __alloc_pages_nodemask+0x22a/0x270
[37655.157226]  khugepaged_alloc_page+0x3d/0x70
[37655.157228]  khugepaged+0xdc5/0x1fc0
[37655.157233]  ? preempt_notifier_register+0x31/0x60
[37655.157236]  ? wake_atomic_t_function+0x60/0x60
[37655.157240]  kthread+0x101/0x140
[37655.157242]  ? collapse_shmem+0xbf0/0xbf0
[37655.157245]  ? kthread_create_on_node+0x60/0x60
[37655.157248]  ret_from_fork+0x2c/0x40
[37655.157250] Code: c0 74 e6 4d 85 c9 c6 07 01 74 30 41 c7 41 08 01 00 00 00 e9 51 ff ff ff 83 fa 01 0f 84 af fe ff ff 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f3 90 4c 8b 09 4d 85 
It always seems to be CPU#12 - I wonder if my CPU is broken and maybe I should get it replaced? Or maybe this is a kernel bug?
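To check whether it really is always CPU#12, the soft lockup lines in a saved dmesg dump can be counted per CPU and task. This is a small sketch. The function name count_soft_lockups is made up, and the pattern is only matched against the message format shown above, so other kernel versions may word it differently.

```python
import re
from collections import Counter

# Matches e.g. "soft lockup - CPU#12 stuck for 23s! [khugepaged:111]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Count soft lockup reports, keyed by (cpu number, task name)."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[(int(match["cpu"]), match["task"])] += 1
    return counts
```

Feeding it the lines of the saved dump, for example count_soft_lockups(open("dmesg.txt")) with a hypothetical file name, gives one counter entry per (CPU, task) pair, which makes it easy to see whether any other core ever shows up.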

Also, the random lockups (A) happen more often than this (B) and are the bigger problem. I wonder whether the two could be related and whether this could be a hardware problem.

Anybody else had similar issues?

