Quote:
I would first go get the motherboard rev # and read the writing on the realtek chip that is on the motherboard edge behind the pcie slot next to the atx back panel. My chip is an rtl8111c. On my gigabyte board at the board's website, the official specs say it was equipped with a rtl8111c/d. Your board specs say it was equipped with a rtl8111d/e -- I guess that means yours is slightly newer, but at times, they could have had the exact same chips on different manufacturing runs. I looked at the pinout for rtl8111c & rtl8111d, and it makes sense. They share a package, which means these chips should be drop-in replacements for each other. So they intended to switch when the old chip supply dried up, I would guess. Same for your board model. However, I have no idea if internally these revisions are compatible. After all, their BIOS engineers seem to be out of sync and must have been dropping in the wrong BIOS code for a number of these boards, maybe having to do with the realtek revision swapping, or going from board design to board design with different chip manufacturers. Whatever the case, they seem to have seen the problem, but why in the world is it ok to only rely on the BIOS in this way??? I thought Linux was being designed to do things right, and it seems no one noticed the wrong phy id for a number of these boards, maybe because the chip/driver manufacturers never bothered really using it in their code? So this new method seems to be wrong to rely on as the only way. Fixes some, and breaks some. So anyway, then I would go to the maintainer for realtek.ko Johnson Leung or the phylib maintainer Andy Fleming and ask them to add a kernel param to allow you to set a phy id. I would use email to contact them.
If they ignore this some more, I would then go to the subsystem maintainer (basically where they pipe all their patches through) and tell them there is a regression that you have to keep patching for, and that you would like a kernel param to address the situation, which you can then describe to him as well. Hopefully they don't just say "send in the patch" to make you get it done, but at least that would show that they are open to accepting a patch to add a param. I have had success many years ago getting patches in by sending them directly to subsystem maintainers and getting the attention of very important individuals, but this was many years ago before git or them using bugzilla. It was a much tighter community back then, quite open. I don't know today... but it should work if they haven't gone to the dark side. Oh, if you do get a patch, or if you want to update your patch, I would try to use the methods defined for my phy_id 0x001cc912. I can't imagine it's gonna be that much different rev to rev, and also, I can't figure out why it's called a "rtl8211b" in the kernel, but somehow my network chip just works... But you can use whatever it was being detected as previously. |
Quote:
The integrated PHY is a derivative of the RTL8211b PHY (identifying as 0x001cc912, same as yours), which was also available standalone. It's not the case that the kernel expects the BIOS to initialize the PHY in a specific way, it just expects the BIOS not to break detection. Also on the OP's board the PHY later identifies as 0x001cc912, the BIOS just brings it to an invalid state initially, resulting in the PHY reporting a more or less random PHY ID value that doesn't match the Realtek numbering scheme. It seems that a certain later access to the PHY makes it recover from the initial invalid state. Hard to say in detail because neither Realtek nor Gigabyte release errata information. On a side note: If the system is so critical to the OP, and he also faced other issues due to kernel upgrades: Why not stick to a specific LTS kernel version? |
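To make "doesn't match the Realtek numbering scheme" concrete, here is a minimal sketch of how a 32-bit PHY ID splits into the standard MII identifier fields (22 vendor bits, 6 model bits, 4 revision bits). The helper names and the comparison value 0x12345678 are made up for illustration, and treating "same vendor bits as 0x001cc912" as "Realtek-style" is an assumption based only on the ID quoted in this thread, not a statement about how the kernel matches drivers.

```python
# Assumption: PHY IDs that share the vendor bits of 0x001cc912 (the ID quoted
# in this thread) are treated as "Realtek-style" for this illustration.
REALTEK_STYLE_ID = 0x001CC912


def decode_phy_id(phy_id: int) -> dict:
    """Split a 32-bit PHY ID (MII register 2 in the high half, register 3 in the low half)."""
    return {
        "oui_bits": phy_id >> 10,       # 22 vendor (OUI-derived) bits
        "model": (phy_id >> 4) & 0x3F,  # 6-bit model number
        "revision": phy_id & 0xF,       # 4-bit revision number
    }


def looks_realtek_style(phy_id: int) -> bool:
    """True if the vendor bits match those of the known-good ID from the thread."""
    return (phy_id >> 10) == (REALTEK_STYLE_ID >> 10)


if __name__ == "__main__":
    # 0x12345678 is an arbitrary stand-in for a bogus ID, not a value from the thread.
    for value in (0x001CC912, 0x12345678):
        print(f"0x{value:08x}", decode_phy_id(value), looks_realtek_style(value))
```

A bogus ID read from the wrong register bank would normally fail the vendor-bits comparison, which is why no dedicated PHY driver gets matched for it.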
For the OP: For affected users it sometimes helped to simply rmmod/modprobe module r8169 (provided r8169 is built as module).
Would be worth a try. |
He is using an LTS version stream provided by Slackware.
Oh ok. So it's not even a BIOS bug. It's just that the kernel relies on some BIOSes initializing the hardware early to get the phy id, and some don't do it as expected. But then that would suggest the kernel could do the same, except the kernel devs don't know the expected proper sequence. I've seen stuff like this before, and of course it is made worse by the lack of documentation. This also explains the "use the BIOS option to enable boot rom" advice, which forces the initialization of the hardware. If he can trigger the proper initialization later, then that means the kernel has done it, and one just needs to track down which bits were sent and when, and then rework the driver to do it. Do you even get the proper phy_id when reading it yourself after initial boot? |
Quote:
Maybe the NIC version we talk about here has some silicon bug that requires a fix or workaround in software (as part of BIOS code). We don't know because, as I said, the involved companies don't publish errata information. |
Ok? If it's a bug in silicon, that some BIOS patch over, then he needs a patch in the kernel to set which phy to use when realtek.ko is loaded. So whatever, either there is a magic sequence that can be triggered, or he is left with needing a patch in the kernel.
|
Still would be good to know whether reloading module r8169 helps on the OP's system.
Wrt the "magic sequence", would be interesting whether the following helps.
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 5c879a5c8..03a22e67c 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5141,6 +5141,10 @@ static int r8169_mdio_register(struct rtl8169_private *tp)
 	struct mii_bus *new_bus;
 	int ret;
 
+	if (tp->mac_version == RTL_GIGA_MAC_VER_25 ||
+	    tp->mac_version == RTL_GIGA_MAC_VER_26)
+		r8169_mdio_write(tp, 0x1f, 0);
+
 	new_bus = devm_mdiobus_alloc(&pdev->dev);
 	if (!new_bus)
 		return -ENOMEM;
|
Would force unloading the module and reloading the module do the same thing ?
I currently am running a patched kernel, and that recognizes the PHY, so it would not be a valid test for what you want (I think). I will have to think how to get to a version of the kernel that does not work, probably by installing the Slackware huge kernel. But then I cannot edit that source code. I would have to make a special version of the kernel just to test that. |
I'm not 100% sure, for now I'd say that both tests are independent.
The module reload test can be done with the stock Slackware kernel that doesn't work out of the box. Alternatively you could remove your patch from the code base you used to build your own kernel, and rebuild. Then again you should have a kernel that doesn't work out of the box. Next step would be to apply the patch proposed in #37 and rebuild the kernel. Maybe it works w/o reloading the module (if reloading the module helps for you at all). |
I have made linux-5.15.117t, which has the nons1 patch, and disabled the other GA-880 patch. Otherwise it is a copy of my existing linux-5.15.117 source, with same config.
My previous kernel was compiled with gcc 11.2. This kernel was compiled with gcc 12.3. Code:
diff -r -U4 linux-5.15.117/drivers/net/ethernet/realtek/r8169_main.c linux-5.15.117t/drivers/net/ethernet/realtek/r8169_main.c
There are 2 internet devices, one eth0, and the other wlan0. Code:
[ 4.893974] r8169 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
This seems to have been a success. I believe that this ought to be reported to dmesg as a quirk detected, so that the user knows, and then it can be documented that there is also a possible BIOS update. That may not be practical if this is just going to be a "Reset the thing because SOME BIOS cannot be trusted to get it done right". In which case I believe a comment in the code would be needed to protect it against those who discover this later, wonder why it is there, and may decide to take it out again because they do not know of any reason for it. |
BusinessKid asked about chip ids: I went over the board with a magnifier to get numbers before I installed it. I did not get much because of heatsinks, and few identifiable chips.
This is some of what I have found (from my hardware detect file). Code:
*** Motherboard
There are safer ways to deal with this. |
Thorough listing, but it's difficult to tell where your chips leave off and your comments begin. On the 811D realtek nic, these comments stood out to me.
1. Are you disabling ASPM?
2&4. I very much imagine that Realtek_phylib ≠ libphy. Libphy sounds generic, and part of the kernel. Realtek_phylib sounds like a pile of fixes cobbled together to cope with the inadequacies of Realtek phys. The realtek component of libphy starts with rtl 820x and the numbers go up. There's no mention of 81xx devices.
3. The phy ID gives you a search term for targeted online searches.
I would be thinking of buying a pcie or usb3 nic and blacklisting all realtek modules. If the equation "time=money" makes sense to you, that's the way to go. OTOH, if you have your teeth in this and want to see it through, I totally understand, and confess to propping up sh***y hardware myself in the past. Lastly, I had read your comments to be about an RTL8110 nic, but in your last post you had it as an RTL 811D part. Which is it actually? If it's rtl811D, we could be loading the wrong module. You see, Realtek have more different components than imagination, so there's bound to be overlap. Half a dozen modules might work badly. |
The hardware notes were added to, as more information was discovered. Due to the concerns, I have applied a magnifying glass to that part of the board more than once, and could not discover anything more than what is in the notes.
Note: The hardware was working fine before the kernel update. I did get the Ethernet working after the kernel update, and have not had any eth0 network failures other than the kernel changing the driver behavior regarding PHY id codes. There is nothing wrong with the hardware. This problem was caused by changing the kernel driver in a way that made it dependent upon the BIOS in a way that was not previously tested. Linux developers have known for a long time not to trust the BIOS on any details that Windows might not rely upon. What I know is what the drivers and BIOS report. As it is working, when patched, the disabling ASPM comment has been ignored. There are so many odd comments in the dmesg that I don't have time to investigate every odd thing that does not appear to be broken. I have not found a good explanation of what phylib is, or who supplies it. I do note that there is a libphy module and that it got used by the huge kernel. Note: /proc/modules shows that libphy is still being loaded. I do not know which phy within that is actually being used. I will have to see what PHY id is reported now, supposing I find out how to access it. Any hardware and financial suggestions are irrelevant, for so many reasons. I prefer to treat every question on Linuxquestions like that. There is value in restoring the driver beyond one user's considerations, because there are multiple users who are affected by the driver changes, and they all benefit from fixing that. |
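On the question of how to access the PHY id that is currently reported: a minimal sketch, assuming the running kernel exposes a `phy_id` attribute for each phylib device under `/sys/bus/mdio_bus/devices/` and that the attribute holds a hex string. The function name and the default path argument are choices made for this illustration. If the directory or the attribute is absent on a given kernel, the function simply returns nothing for it.

```python
from pathlib import Path


def read_phy_ids(devices_dir="/sys/bus/mdio_bus/devices"):
    """Return {device_name: phy_id} for every MDIO device that exposes phy_id.

    Assumption: each device directory contains a 'phy_id' file holding a hex
    string such as '0x001cc912'. Unreadable or malformed entries are skipped.
    """
    result = {}
    root = Path(devices_dir)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        try:
            text = (dev / "phy_id").read_text().strip()
            result[dev.name] = int(text, 16)  # base 16 accepts the '0x' prefix
        except (OSError, ValueError):
            continue
    return result


if __name__ == "__main__":
    for name, phy_id in read_phy_ids().items():
        print(f"{name}: 0x{phy_id:08x}")
```

On a patched, working kernel one would expect this to show the ID the driver bound to, which gives a value to compare against what dmesg reported on the unpatched kernel.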
Quote:
Because the proposed patch fixes the issue, I think the following is the root cause: The PHY has many more registers than the 32 which can be directly addressed on the MDIO bus. Many Realtek PHYs (including this one here) solve this by grouping registers in banks, and a write to register 0x1f selects a specific bank. Presumably the faulty BIOS programs something in the PHY and fails to reset the bank selector to the default 0. Therefore reading the PHY ID accesses registers in a different bank, returning a more or less random value. The proposed patch resets the bank selector before reading the PHY ID. Regarding the NIC version numbers: RTL8111D (I think one 1 was missing) is the version of the MAC + PHY combination. The integrated PHY is derived from the standalone PHY RTL8211B. Therefore dmesg shows different version numbers. |
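The bank-selector explanation can be modelled in a few lines. This is only a toy simulation of the presumed mechanism, not driver code: the class, the bank number 5, and the register contents of that bank are invented for illustration. Only register 0x1f as the bank selector, bank 0 as the default, and the ID 0x001cc912 come from the thread; registers 2 and 3 are the standard MII PHY identifier registers.

```python
PAGE_SELECT_REG = 0x1F           # bank selector register, as described above
MII_PHYSID1, MII_PHYSID2 = 2, 3  # standard MII PHY identifier registers


class BankedPhy:
    """Toy model of a PHY whose 32 visible registers depend on the selected bank."""

    def __init__(self, banks):
        self.banks = banks  # {bank_number: {register: 16-bit value}}
        self.page = 0       # default bank after a clean reset

    def write(self, reg, value):
        if reg == PAGE_SELECT_REG:
            self.page = value
        else:
            self.banks.setdefault(self.page, {})[reg] = value & 0xFFFF

    def read(self, reg):
        if reg == PAGE_SELECT_REG:
            return self.page
        return self.banks.get(self.page, {}).get(reg, 0)


def read_phy_id(phy):
    """Combine the two 16-bit ID registers of the currently selected bank."""
    return (phy.read(MII_PHYSID1) << 16) | phy.read(MII_PHYSID2)


if __name__ == "__main__":
    phy = BankedPhy({
        0: {MII_PHYSID1: 0x001C, MII_PHYSID2: 0xC912},  # real ID registers
        5: {MII_PHYSID1: 0xABCD, MII_PHYSID2: 0x1234},  # hypothetical other bank
    })
    phy.write(PAGE_SELECT_REG, 5)   # "BIOS" leaves a non-default bank selected
    print(f"before reset: 0x{read_phy_id(phy):08x}")
    phy.write(PAGE_SELECT_REG, 0)   # what the proposed patch does before probing
    print(f"after reset:  0x{read_phy_id(phy):08x}")
```

In this model the ID read with the wrong bank selected is whatever happens to live at registers 2 and 3 of that bank, which matches the "more or less random value" symptom.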
Quote:
Leaving the digit out on the part number makes sense to me, btw. And it's easily done. |