Possible hardware failure - identifying the problem

Kefren
Level 4
Posts: 264
Joined: Fri Dec 10, 2021 3:45 pm
Location: Scotland

Possible hardware failure - identifying the problem

Post by Kefren »

Code:

System:    Kernel: 5.15.0-53-generic x86_64 bits: 64 compiler: N/A Desktop: Cinnamon 5.0.7 
           wm: muffin dm: LightDM Distro: Linux Mint 20.2 Uma base: Ubuntu 20.04 focal 
Machine:   Type: Desktop Mobo: Micro-Star model: MAG B550M MORTAR WIFI (MS-7C94) v: 1.0 
           serial: <filter> UEFI: American Megatrends LLC. v: 1.80 date: 07/01/2021 
CPU:       Topology: 6-Core model: AMD Ryzen 5 5600X bits: 64 type: MT MCP arch: Zen 3 
           L2 cache: 3072 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 88797 
           Speed: 3710 MHz min/max: 2200/3700 MHz Core speeds (MHz): 1: 3709 2: 3720 3: 3720 
           4: 3720 5: 3718 6: 4650 7: 3718 8: 3718 9: 3719 10: 3718 11: 3719 12: 4652 
Graphics:  Device-1: NVIDIA driver: nvidia v: 520.56.06 bus ID: 2b:00.0 chip ID: 10de:2489 
           Display: x11 server: X.Org 1.20.13 driver: nvidia 
           unloaded: fbdev,modesetting,nouveau,vesa resolution: 2560x1440~60Hz 
           OpenGL: renderer: NVIDIA GeForce RTX 3060 Ti/PCIe/SSE2 v: 4.6.0 NVIDIA 520.56.06 
           direct render: Yes 
Audio:     Device-1: NVIDIA driver: snd_hda_intel v: kernel bus ID: 2b:00.1 chip ID: 10de:228b 
           Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI driver: snd_hda_intel 
           v: kernel bus ID: 2d:00.4 chip ID: 1022:1487 
           Sound Server: ALSA v: k5.15.0-53-generic 
Network:   Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel bus ID: 29:00.0 
           chip ID: 8086:2723 
           IF: wlo1 state: up mac: <filter> 
           Device-2: Realtek RTL8125 2.5GbE vendor: Micro-Star MSI driver: r8169 v: kernel 
           port: f000 bus ID: 2a:00.0 chip ID: 10ec:8125 
           IF: enp42s0 state: down mac: <filter> 
Drives:    Local Storage: total: 4.10 TiB used: 1014.42 GiB (24.1%) 
           ID-1: /dev/nvme0n1 vendor: Samsung model: MZVL2512HCJQ-00B00 size: 476.94 GiB 
           speed: 63.2 Gb/s lanes: 4 serial: <filter> 
           ID-2: /dev/nvme1n1 vendor: Samsung model: SSD 970 EVO Plus 250GB size: 232.89 GiB 
           speed: 31.6 Gb/s lanes: 4 serial: <filter> 
           ID-3: /dev/sda vendor: Seagate model: ST4000DM004-2CV104 size: 3.64 TiB speed: 6.0 Gb/s 
           serial: <filter> 
Partition: ID-1: / size: 466.95 GiB used: 63.91 GiB (13.7%) fs: ext4 dev: /dev/dm-0 
           ID-2: swap-1 size: 980.0 MiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-1 
USB:       Hub: 1-0:1 info: Full speed (or root) Hub ports: 10 rev: 2.0 chip ID: 1d6b:0002 
           Device-1: 1-5:2 info: Logic3 Afterglow Gamepad for Xbox 360 type: <vendor specific> 
           driver: xpad rev: 2.0 chip ID: 0e6f:0213 
           Device-2: 1-6:3 info: Kingsis Peripherals ZOWIE Gaming mouse type: Mouse 
           driver: hid-generic,usbhid rev: 2.0 chip ID: 1af3:0001 
           Hub: 1-7:4 info: Genesys Logic Hub ports: 4 rev: 2.0 chip ID: 05e3:0608 
           Device-3: 1-8:5 info: Micro Star MYSTIC LIGHT type: HID driver: hid-generic,usbhid 
           rev: 1.1 chip ID: 1462:7c94 
           Device-4: 1-9:6 info: Intel type: Bluetooth driver: btusb rev: 2.0 chip ID: 8087:0029 
           Hub: 2-0:1 info: Full speed (or root) Hub ports: 4 rev: 3.1 chip ID: 1d6b:0003 
           Hub: 3-0:1 info: Full speed (or root) Hub ports: 4 rev: 2.0 chip ID: 1d6b:0002 
           Hub: 3-4:2 info: HP USB2.1 Hub ports: 2 rev: 2.1 chip ID: 03f0:1647 
           Hub: 3-4.1:3 info: HP ports: 4 rev: 2.1 chip ID: 03f0:1647 
           Hub: 4-0:1 info: Full speed (or root) Hub ports: 4 rev: 3.1 chip ID: 1d6b:0003 
           Hub: 4-4:2 info: HP USB3.1 Hub ports: 1 rev: 3.1 chip ID: 03f0:0620 
           Hub: 4-4.1:3 info: HP ports: 4 rev: 3.1 chip ID: 03f0:0620 
Sensors:   System Temperatures: cpu: 29.0 C mobo: N/A gpu: nvidia temp: 38 C 
           Fan Speeds (RPM): N/A gpu: nvidia fan: 0% 
Repos:     No active apt repos in: /etc/apt/sources.list 
           Active apt repos in: /etc/apt/sources.list.d/additional-repositories.list 
           1: deb https: //dl.winehq.org/wine-builds/ubuntu/ focal main
           Active apt repos in: /etc/apt/sources.list.d/official-package-repositories.list 
           1: deb https: //mirror.cov.ukservers.com/linuxmint uma main upstream import backport
           2: deb http: //archive.ubuntu.com/ubuntu focal main restricted universe multiverse
           3: deb http: //archive.ubuntu.com/ubuntu focal-updates main restricted universe multiverse
           4: deb http: //archive.ubuntu.com/ubuntu focal-backports main restricted universe multiverse
           5: deb http: //security.ubuntu.com/ubuntu/ focal-security main restricted universe multiverse
           6: deb http: //archive.canonical.com/ubuntu/ focal partner
Info:      Processes: 315 Uptime: 9m Memory: 31.27 GiB used: 1.69 GiB (5.4%) Init: systemd v: 245 
           runlevel: 5 Compilers: gcc: 9.4.0 alt: 9 Client: Unknown python3.8 client inxi: 3.0.38 

Hi,

my PC (desktop) is both the centre of my work (self employed) and my entertainment (games, and running films etc from the browser to the TV over HDMI). I chose the components with dual-booting in mind (even though I only go into Windows about once a fortnight now!) Ever since it arrived prebuilt in August 2021 it has been rock solid, super fast, no issues.

I have a 256GB SSD for Win 10 and a 512GB SSD for Linux. The PC has a 4TB HDD in two partitions. One 2TB NTFS, which Windows uses as my data drive and desktop storage (all the things I back up). The other 2TB is ext4 and my Linux data and desktop.

c.19th November I changed my Nvidia driver to what Mint recommended as an update - the 525 open. On reboot it failed to get to a desktop, just had a flashing cursor. I had all sorts of issues until I eventually found a way to revert the drivers to 515 proprietary. The point is, I got back to a desktop by going back to an old driver, but ever since then I have had problems with updates, and things being a bit flaky. So I put that down to software (since it seemed to match the timing) and made a mental note to reinstall Mint entirely at some point, maybe upgrade to 21.

Then at the end of last week the tabs in Firefox started crashing a lot and the problem got worse. Then Firefox itself crashed, then sometimes Thunderbird. Quite a few things have become unreliable, from Software Manager to even opening some LibreOffice Writer docs. So obviously I am worried about my PC. However, other programs have been fine, e.g. I use Text Editor all the time and that hasn't crashed once in the hundreds of Firefox crashes; nor has Gimp. So the problem is selective.

I still thought it was software related, for various reasons listed at viewtopic.php?f=47&t=386511 where I tried to get to the bottom of it, but just seemed to keep finding other errors (some shown in screenshots). That's a good summary of how I got to where I am now, and what I had tried or realised.

I tried to reinstall Linux, but the Mint 21 iso from live USB just turned off my monitor. So I went back to a 20.2 USB. And then that gave all sorts of errors and refused to install, plus more errors as I quit and rebooted. So I realised that if an OS running from USB is having errors, then it tells me:

1. It isn't just the installed Linux having problems, but something else, hardware. (Bad news.)
2. The chances are that my two SSDs and HDD are not the ones that are failing, since the live Mint was running off USB. (Good news?)

Certainly, my first thought had been that the HDD might have errors, since Firefox and Thunderbird have their profiles there, so it is a logical suspicion. However, I installed Opera and Vivaldi in Mint as tests, and they were fully installed on the Linux OS SSD - and they also had tabs crash as soon as I used them. So that ruled out the HDD being the issue.

And, to make it worse, when I tried to get into Windows just now, it completely fails, even though it was fine at the weekend. Starts to load, then for a second there is a huge sad-face smiley in the top left, then it reboots. So if the Linux OS was having issues due to its SSD, it wouldn't also be affecting the separate Windows SSD and vice versa, which again suggests the SSDs and HDD are not the problem.

My PC is:

CPU - AMD Ryzen 5 5600X
GPU - RTX3060Ti DUA
HDD - 4TB Seagate BarraCuda
M2 SSD - 250GB Samsung 970 EVO PLUS (Windows 10)
M2 SSD - 512GB Samsung PM9A1 (Linux Mint 20.2)
MB - MSI MAG B550M Mortar Wifi
PSU - Corsair CV 650W 80+ Bronze
RAM - 32GB Corsair Vengeance 3600MHz DDR4 (two 16GB sticks)

My motherboard only has self-tests for the two SSDs, and when I ran the tests just now it said no errors (which is what I expected).

So, if I work on the assumption that the SSDs and HDD are fine (though the two operating systems on the SSDs are having problems caused by some other failure, and probably need reinstalling once I can identify the fault and hopefully fix it), that leaves possible culprits as being:

- Motherboard
- RAM
- CPU
- Nvidia GPU

Can Linux Mint help me identify the possible problem? If so, how? I was thinking that some of the many error messages might give a clue as to what the issue is, or some test that will help.
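
One way to look for clues in the error messages is to search the kernel log for lines that often accompany hardware trouble. The following is only a minimal sketch under stated assumptions: `scan_kernel_log` and `read_kernel_log` are hypothetical helper names, the pattern list is an assumed, non-exhaustive selection rather than an authoritative set, and it assumes `journalctl` is available (as on a systemd-based Mint install) and that the user is allowed to read the kernel journal.

```python
import re
import subprocess

# Assumed, non-exhaustive signatures that often accompany hardware faults
# in Linux kernel logs. A hit is a hint worth reading, not a diagnosis.
SUSPECT_PATTERNS = {
    "machine_check": re.compile(r"machine check|\bmce\b|hardware error", re.I),
    "memory_edac": re.compile(r"\bEDAC\b|bad page|corrupt", re.I),
    "storage_io": re.compile(r"I/O error|nvme.*timeout|\bata\d+.*(error|failed)", re.I),
    "pcie_gpu": re.compile(r"AER:|NVRM: Xid", re.I),
    "segfault": re.compile(r"segfault at|general protection", re.I),
}


def scan_kernel_log(lines):
    """Group log lines by the suspect category they match (hypothetical helper)."""
    hits = {name: [] for name in SUSPECT_PATTERNS}
    for line in lines:
        for name, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip())
    return hits


def read_kernel_log():
    """Return kernel messages from the current boot, or [] if journalctl is missing."""
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    for category, matched in scan_kernel_log(read_kernel_log()).items():
        print(f"{category}: {len(matched)} matching line(s)")
        for line in matched[:5]:  # show only the first few per category
            print("   ", line)
```

Grouping by category is a deliberate choice here: machine-check or EDAC lines would point more towards CPU, RAM or board, I/O errors towards a drive, and so on. An empty result does not prove the hardware is sound, since a faulty component can corrupt data without the kernel ever logging it.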

Obviously there are also the physical checks. In theory I could dismantle the PC and try to put the bits back together, in case something is loose (a tentative test last time it was turned off didn't reveal anything) though that is a last resort as I can't afford to replace the GPU if I mess it up somehow. So I really don't want to mess with that or the CPU if possible.

The RAM is easier to access, and since there are two sticks I could remove one, see how things work, if it is the same then swap them over to test the other. (Of course, if Linux is currently a bit broken due to the problems already mentioned, I may not actually notice any change).

That's why I'd first like to see what Linux Mint can help with in terms of testing hardware, even if it is some out-of-the-box thinking of things that might generate an error that is meaningful and reveals a memory issue or whatever.

I'm truly grateful if anyone can help me out. I love computers but am not super confident with hardware or complex software processes and may need talking through things that seem simple to someone else. Plus, the problems on the PC may make some tests harder (it took five attempts to post this before I could do so without Firefox crashing), whilst also helping identify the issue by the actual crashes, logs or errors.

The ideal outcome would be some test that identifies the possible problem area; then more tests to confirm. If I can fix or replace something easily (I could afford to replace an SSD or RAM, just about, but not the GPU if that was faulty) then hopefully I'll be able to finally reinstall Linux and have things as they used to be (and also then work on fixing Windows, which I only use for Daz Studio and a few PC games that don't work in Linux).

Re: Possible hardware failure - identifying the problem

Post by Kefren »

Additional info.

As I said, the BIOS doesn't show any errors or behave weirdly, or misreport hardware (doesn't rule it out, but may be relevant).

I just put memtest https://memtest.org/ on a USB and booted into it and ran it for 45 mins. No RAM errors reported, it just said PASS.
[Attached image: ffff.jpg]
So maybe it isn't the RAM, and just leaves CPU, GPU and MB as possible things causing all these errors? Any similar tests I can do for those?

I also eventually got into Win10 after about five attempts and a system restore that only partially succeeded. Firefox tabs crashed almost as soon as I opened them. I tried Edge for the first time, to see if that was affected, and yes - most tabs immediately crashed with an "access violation". So, whatever is causing my problems, it seems to betray itself quickest in any browser, Windows or Linux. But then Windows crashed completely and threw me out. So this is obviously a serious issue, even though I can generally do some tasks in Linux Mint despite it.

Re: Possible hardware failure - identifying the problem

Post by Kefren »

Additional.

This morning I updated the BIOS from https://www.msi.com/Motherboard/MAG-B55 ... pport#bios

[INSTALLED ON PURCHASE] AMI BIOS 7C94v18 2021-07-09 17.71 MB
Description:
- Update to AMD ComboAM4PIV2 1.2.0.3b

[INSTALLED 2022-12-06] AMI BIOS 7C94v1D 2022-08-19 18.30 MB
Description:
- Update to AGESA ComboAm4v2PI 1.2.0.7.
- Change the default setting of Secure Boot.

(Of course, the new one turned on all the Windows crud about TPM, secure boot etc, so I had to work out how to turn all that off again.)
Will see if that helps in any way.

UPDATE:

Linux is still having browser crashes (and one Cinnamon crash). But, of course, it may be that Linux has been a bit broken by failed updates, so even if a hardware issue is resolved, Linux might continue to behave like that until it is reinstalled.

So I tried booting to the live Mint 21 USB. As before, once I choose to boot into it, the monitor displays a "no signal" message. HDMI disabled. I hear the desktop load, and use a keyboard shortcut to restart the PC.

So I tried booting to the live Mint 20.2 USB (I'd rather move to Mint 21, but I'll take what I can get). It got to the live desktop, as expected. It even seemed to let me reach the install Mint setup this time, and did fail at the multimedia codecs setting. However, when I pick the partition that currently has Mint on (which I want to replace) all the buttons are greyed out and it won't let me install there. It just tells me to use the partitioning menu, but there is no menu. Weird. So I had to quit in the end. (I remember my first Mint install being simple, not like this - it's why I wonder if it would be simpler to erase the SSD completely first, all partitions, and reformat as one, so it is back like when I first got the PC).
[Attached image: IMAG5537.jpg]
Maybe relevant - when I quit the live USB, this text appears - error messages that indicate a hardware issue?
[Attached image: IMAG55238.jpg]

Re: Possible hardware failure - identifying the problem

Post by Kefren »

I took the PC apart, checked everything, reseated. Some MB screws were a bit loose so I tightened them.

Still no change. I can't install Linux on my SSD, so am stuck using a live USB. I am self-employed so am now a week behind on my work, and in an even worse situation than when this started (I had an OS I could boot into back then, and access my files!)

I tried the local PC repair people but they said the same - without parts to swap out, they wouldn't have any way of fixing it. I am very stressed, since I have no idea what to do now.
citfta
Level 2
Posts: 99
Joined: Sat Apr 02, 2022 10:02 pm
Location: Georgia, USA

Re: Possible hardware failure - identifying the problem

Post by citfta »

I don't recall if you ever tried to find a forum for your motherboard to ask for help there? If not I suggest you do that. Also I did a search for hardware diagnostic tools for Linux and found a few. Some of them you need to install and some of them you just need to make a cd or memory stick and boot them up live to use them.

I noticed in your shut down picture that it said it failed to unmount the CD. You might try disconnecting the CD and seeing if that helps. Sometimes a hardware problem can throw glitches into a system that will cause all kinds of problems that don't seem to be related to the actual hardware that is causing the problem.

Another thing you can do is to disconnect everything that you don't actually need to operate your PC. For instance you can probably unplug a cable or cables going to your USB ports. And then see what happens. Or if you have an external dongle for your wifi you could try temporarily disconnecting that and see what happens. If you have handy another mouse or keyboard you can also try swapping them out. Keyboards have been known to cause some serious issues if one or more of the keys decides to short out.

Here are some links to follow for hardware diagnostics:

https://www.ubuntupit.com/best-linux-ha ... nfo-tools/

https://askubuntu.com/questions/109935/ ... stic-tools

https://www.linux.com/news/hardware-dia ... rce-tools/

If you get your problem solved please post what you found as it may help others and also lets us know you got it fixed.

Good luck!
coffee412
Level 8
Posts: 2259
Joined: Mon Nov 12, 2012 7:38 pm
Location: I dont know

Re: Possible hardware failure - identifying the problem

Post by coffee412 »

This is what I would do:

Disconnect all drives from the MB. Then boot from a live usb and see if you're still having issues. If you are still having issues then find or buy a cheap used AMD video card and swap out the video card. Boot again from the live usb and see if you are still having issues. Keep the drives out while doing this. The AMD video card will use the kernel drivers instead of the installed nvidia drivers.

As a side note, the one NVMe drive that reports unknown partition - this should be looked at from the live usb later to rule out it has an issue.

Re: Possible hardware failure - identifying the problem

Post by Kefren »

Many thanks both of you - excellent suggestions!

I did lots more testing, with both software and hardware swaps, but it was hard to conclude anything. Also, I really don't feel confident messing with the m.2 drives as they are really fiddly to get to (one is housed under the RTX3060Ti, so that has to be removed first, but the case is mid-sized and the catch to release it is hidden beneath!)

In the end I spoke to the people who sold me the PC, they were really helpful. We did some more tests and it was clear I can't solve this alone, so it is going back to them, where they can swap parts as tests. It's a difficult time - I'm self-employed and behind on my deadline (my next novel is due out 10th Jan, I'll now have to change that with my editor and distributors, no way I'll make it when I am currently using a super old laptop that takes 20 seconds to open a tab). I am hoping it will be something they can easily fix, and the part will be under warranty or not too expensive. I am a lot more hopeful having spoken to them that things will eventually go back to normal! So at present I am boxing my PC up for repair. I'll revisit this when it comes back, hopefully working as well as it did for the first eleven months!

Best wishes and thanks again for suggestions that I'll try, or file for the future should this happen again. I would have replied sooner but it has taken me a day to install Xfce Mint on this laptop and get some of my files on it, software set up etc. I'll never take my fast PC for granted when it comes back. This laptop struggles to run Heroes of Might and Magic 3 ... :-(