2024-09-05T01:44:56 It's not just you! forums.opensuse.org is down.
2024-09-05T01:47:28 It's not just you! status.opensuse.org is down.
2024-09-05T04:53:36 status.opensuse.org doesn't load and https://monitor.opensuse.org/heroes/ doesn't work either, so I can't see this chat history. Is anybody on it?
2024-09-05T06:01:25 Good morning. We seem to have some outage since 00:40 UTC
2024-09-05T06:14:37 I think the problem is with keepalived on atlas1+2
2024-09-05T06:17:33 and it came back after commenting out 2 "d-os-public" interfaces. Those caused the start to fail due to a non-existing interface.
2024-09-05T06:17:35 anikitin: Not that I have seen - I actually don't know much about those services and what could be the issue.
2024-09-05T06:19:05 anikitin: monitor.o.o is back now.
2024-09-05T06:32:36 *** teepee_ is now known as teepee
2024-09-05T06:46:01 there are two problems
2024-09-05T06:46:26 first, we are apparently still pending the maintenance update with my patch in https://bugzilla.opensuse.org/show_bug.cgi?id=1229555
2024-09-05T06:47:22 second, systemd "OnCalendar" does not seem to work as expected: atlas2 should update at 00:30, at which time it would have detected that atlas1 did not come back and blocked the execution
2024-09-05T06:48:19 instead it started the timer at 01:51, which can probably be attributed to RandomizedDelaySec=2h
2024-09-05T07:18:30 but that would mean that the outage started before atlas2 updated?
2024-09-05T07:19:39 in theory it should work like this: atlas1 updates/restarts -> VIP moves to atlas2 -> atlas1 does not come back due to $problem -> atlas2 blocks update and keeps VIP
2024-09-05T07:20:07 but here atlas2 updated while atlas1 was not done
2024-09-05T07:21:50 I made https://github.com/openSUSE/salt-formulas/pull/189 and https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/2054
2024-09-05T07:22:28 I don't fully grasp the explanation in systemd.timer(5) but assume that should make it respect our calendar time?
2024-09-05T07:23:03 there is also AccuracySec, but that is 1m by default, which does not seem too bad
2024-09-05T07:27:38 it seems that tonight they rebooted 15m apart.
2024-09-05T07:28:18 right, but the problem is they updated simultaneously
2024-09-05T07:28:37 the second machine must update and reboot only after the first one completed, so that it can check if any services on it failed
2024-09-05T07:29:09 does an update take longer than 15m?
2024-09-05T07:30:23 from what I can tell, no - it takes ~2 minutes
2024-09-05T07:56:52 https://etherpad.opensuse.org/p/20240905-timeline - so atlas2 probably took over the VIP and kept the services up, but did not block the update.
2024-09-05T08:00:32 turns out, the updates were installed 22h earlier on both of them
2024-09-05T08:05:49 and the updated packages seem unrelated to the issue. Could "buddycheck" have broken things? It was installed after the last reboot.
2024-09-05T08:13:05 commented on https://github.com/openSUSE/salt-formulas/pull/189 ; but it will probably not help with what caused today's problem.
2024-09-05T08:21:34 I still don't understand why the d-os-public interface is missing and why this has not been a problem before today. Can you shed some light on that? Were there some changes related to that?
2024-09-05T08:26:04 2024-08-23 had systemd+udev updates - before the previous reboot. So also not the cause?
2024-09-05T08:47:47 bmwiedemann1: the problem is that atlas2 installed updates while atlas1 was still installing. so at that time, there were no failing services yet.
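To make the OnCalendar / RandomizedDelaySec discussion above concrete, here is a minimal sketch of a systemd timer drop-in that would pin the job to the calendar time instead of letting a 2h random delay push it to 01:51. The unit name os-update.timer and the override path are assumptions for illustration; this is not necessarily what the linked salt-formulas PR changes.

    # Hedged sketch: assumes the update job is driven by a timer unit named
    # "os-update.timer"; adjust the name to whatever unit actually runs os-update.
    sudo mkdir -p /etc/systemd/system/os-update.timer.d
    cat <<'EOF' | sudo tee /etc/systemd/system/os-update.timer.d/override.conf
    [Timer]
    # Fire at the intended wall-clock time ...
    OnCalendar=*-*-* 00:30:00
    # ... and either drop the random spread entirely:
    RandomizedDelaySec=0
    # or keep a spread but make it stable per machine (systemd >= 247):
    # FixedRandomDelay=true
    EOF
    sudo systemctl daemon-reload

With RandomizedDelaySec left at 2h, both timers may legally fire anywhere inside overlapping two-hour windows, which matches the observed 01:51 start; FixedRandomDelay only makes each machine's offset reproducible, it does not by itself guarantee that atlas1 finishes before atlas2 starts.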
2024-09-05T08:48:13 buddycheck will prevent an update if there are failing services on the partner node. you can see how it works if timing is correct on hel{1,2}: https://paste.opensuse.org/pastes/943a021e7c63
2024-09-05T08:49:05 bmwiedemann1: the d-* interfaces do not get created automatically because wicked is broken, I linked you the bug report earlier.
2024-09-05T08:49:48 how that broken update even made it into Leap is beyond me
2024-09-05T08:51:57 acidsys: the updates happened a day earlier, so no failing services could have been detected anyway. Only the reboot activated it.
2024-09-05T08:52:48 aha, so the update was scheduled yesterday but only executed today
2024-09-05T08:53:22 it ran yesterday. Only the reboot came today.
2024-09-05T08:53:22 2024-09-04T01:54:08.002201+00:00 atlas1 os-update[31748]: Trigger reboot with rebootmgrctl reboot
2024-09-05T08:53:25 2024-09-05T00:19:59.007585+00:00 atlas1 rebootmgrd: rebootmgr: reboot triggered now!
2024-09-05T08:53:35 the reboot window is set to *-*-* 00:20:00, lasting 00h10m.
2024-09-05T08:53:55 so at the time atlas1 updated yesterday, its reboot window was already over.
2024-09-05T08:54:18 and the reboot for atlas2 was already scheduled
2024-09-05T08:57:01 I see, also from yesterday
2024-09-05T08:57:56 hmm. atlas2 logs say, "2024-09-04T01:52:28.231804+00:00 atlas2 os-update[25146]: Reboot is probably not necessary." - so I wonder why it did reboot?
2024-09-05T08:58:41 or does "2024-09-04T01:52:28.254044+00:00 atlas2 os-update[15899]: Trigger reboot with rebootmgrctl reboot" mean it did trigger it already?
2024-09-05T08:59:05 trigger means schedule with rebootmgr
2024-09-05T08:59:37 but I thought this was a note to the admin, that the admin should run that command to trigger a reboot.
2024-09-05T09:00:07 only on cluster machines
2024-09-05T09:00:29 are atlas1+2 cluster machines?
2024-09-05T09:00:53 ok, broken terminology
2024-09-05T09:01:06 cluster machines with reboot_safe=no need a manual reboot
2024-09-05T09:01:20 reboot_safe=yes gets an automatic reboot as per the reboot/maintenance window
2024-09-05T09:01:42 HA pairs like atlas{1,2} are reboot_safe=yes because automatic reboot is supposed to be safe
2024-09-05T09:02:05 hence you will find them reporting an automatic maintenance window in `rebootmgrctl get-window`
2024-09-05T09:03:24 OK. So for the update to work correctly, we need to ensure that updates are installed right before the reboot window.
2024-09-05T09:03:47 yes, and that one machine fully completes updates (incl. reboot) before the other one starts them, so buddycheck can notice a failure
2024-09-05T09:03:56 is there any reason we don't move the reboot windows 6h or 12h apart?
2024-09-05T09:04:42 I can't think of a reason on the spot
2024-09-05T09:05:05 because with just 15m the timings get hard to meet, because we don't know how long an update will take
2024-09-05T09:07:29 then we could have 00:05 atlas1 updates, 00:20-30 atlas1 reboot window ; 12:05 atlas2 updates ; 12:20 atlas2 reboot window
2024-09-05T09:07:41 fair, but does an update take longer than 15min? I thought the issue was that sometimes they do not start on time, hence the 15 min might become less
2024-09-05T09:08:24 that could work, if we don't mind reboot windows in the middle of the day
2024-09-05T09:08:27 usually they are faster. But we don't have 15m.
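The paste linked above shows the real buddycheck behaviour on hel{1,2}. Purely as an illustration of the idea described there (check the partner for failed units and abort the update if it is unhealthy), here is a minimal shell sketch. The partner hostname and the use of plain ssh plus systemctl is-system-running are assumptions; the actual buddycheck in the salt-formulas repo may work differently.

    #!/bin/sh
    # Illustrative only - not the real buddycheck. Abort an update run if the
    # HA partner is unreachable or reports failed systemd units.
    PARTNER=atlas1   # assumed partner hostname for this sketch

    state=$(ssh -o ConnectTimeout=5 "$PARTNER" systemctl is-system-running 2>/dev/null)
    if [ "$state" != "running" ]; then
        echo "partner $PARTNER is '$state' - skipping update on this node" >&2
        exit 1
    fi
    echo "partner $PARTNER healthy - proceeding with update"

As discussed above, such a check only helps if the partner has already finished its own update and reboot; if both nodes update in the same window there is simply nothing failed to detect yet.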
2024-09-05T09:08:58 at least for atlas2 it is not a problem, because it is only the fallback and no outage should be visible
2024-09-05T09:09:06 right
2024-09-05T09:09:27 the same would apply for any $host with index > 1
2024-09-05T09:09:47 and would we notice that atlas1 had a failure within 12h?
2024-09-05T09:10:17 depending on the failure, it will alert in monitoring
2024-09-05T09:10:45 that could be an additional benefit
2024-09-05T09:10:46 for systemd health I have an alerting rule in TODO, it is rather simple
2024-09-05T09:11:02 but haproxy alerts already exist, which cover atlas
2024-09-05T09:18:13 currently the formula for the update hour in pillar/common/update.sls is (simplified): "$host_index * 15 / 60". With your example that would be *360 instead of *15?
2024-09-05T09:37:35 this is the hour portion. It could be $host_index * 6
2024-09-05T09:38:11 unless we have clusters with >3 nodes
2024-09-05T09:39:47 * >4
2024-09-05T09:42:13 and the minute portion could be a constant :20
2024-09-05T09:43:40 and if we have larger clusters, we could use ($host_index * 6) % 24 and use a minutes value of (20 + $host_index * 7) % 60
2024-09-05T09:45:06 that should result in 00:20, 06:27, 12:34, 18:41, 00:48, 06:55, 12:02 ...
2024-09-05T09:48:19 instead of 7, using 11, 19 or 23 should produce nicely distributed times as well. 13 and 17 are a bit too close to 15, so that 4*13 would nearly wrap around to a full hour.
2024-09-05T09:50:45 Hmm. 8*7 is 56, also not ideal, but then we probably don't have 8-node clusters anyway.
2024-09-05T10:01:06 we do have narwal* which go up to 8 :-/
2024-09-05T10:08:55 so we would need some overflow handling. I think that was a reason for smaller intervals.
2024-09-05T10:09:35 also I realize it will cause "large" clusters to take multiple days to all get updated
2024-09-05T10:10:18 it would still be finished within 19h - the 5th node updates at the same hour as the 1st node
2024-09-05T10:10:54 then the best should be for minutes to be (20 + $host_index * 11) % 60
2024-09-05T10:13:16 you mean every 4th? ( index * 6 ) % 24 will for example yield 0 for every 4th
2024-09-05T10:14:32 ah I guess everything >=4 will wrap around
2024-09-05T10:15:07 in case indexes only start at 1 it might mean that node 4 will be updated before 1, but that's not really a problem
2024-09-05T10:28:46 yes, should be fine.
2024-09-05T10:37:55 I will try it later
2024-09-05T11:18:42 status.o.o issue solved as well, I forgot to remove an httpd include which no longer exists
2024-09-05T15:06:15 late discourse update
2024-09-05T15:08:31 * matrix-o-o malcolmlewis go makes coffee
2024-09-05T15:08:52 * matrix-o-o malcolmlewis * goes and
2024-09-05T15:11:30 and the maxmind update failed again, like the previous times with no descriptive error message (just "Zlib::BufError: buffer error (Zlib::BufError)"), commented out the rake task
2024-09-05T15:12:00 and done
2024-09-05T15:13:09 one would need to put some printf debugging into the rake file to check what its problem with the archive is; with curl it works fine
2024-09-05T17:57:11 When I went to look at applying these updates to my sandbox, I'm seeing some issues pop up around rubygem-bigdecimal and rubygem-fast_xs:
2024-09-05T17:57:36 Thought I might see if I can dupe the maxmind update issue to see if I could see a way to address it.
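Going back to the staggering arithmetic discussed above: a quick way to sanity-check the proposed formula ((index * 6) % 24 for the hour, (20 + index * 11) % 60 for the minute) is simply to print the resulting windows. The loop below assumes host indexes 1..8, matching the largest (narwal*) cluster; shift to 0..7 if the pillar counts from zero.

    # Sketch: print the staggered update times produced by the proposed formula.
    # Assumes host indexes 1..8; adjust the range if indexes start at 0.
    for i in 1 2 3 4 5 6 7 8; do
        printf 'node %d -> %02d:%02d\n' "$i" $(( (i * 6) % 24 )) $(( (20 + i * 11) % 60 ))
    done

For indexes 1..8 this yields 06:31, 12:42, 18:53, 00:04, 06:15, 12:26, 18:37 and 00:48, so no two nodes share an hour:minute slot, and node 4 indeed lands earlier in the day than node 1, as noted in the wrap-around discussion above.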
2024-09-05T17:58:24 (I guess I should have mentioned this is a forums sandbox for those who are unaware - not a general support question, but related to @acidsys' comments about the forums software update)
2024-09-05T18:13:24 Looks like it might be related to having previously pulled stuff from "devel:languages:ruby", and I see that on prod, that repo is no longer present. Going to try to clean things up so I match that.
2024-09-05T18:35:30 I think I've got it figured out - we'll see after the reboot.
2024-09-05T18:39:16 hi hendersj. with the discourse stack I found it best to pull everything ruby* from home:darix:apps so the versions align
2024-09-05T18:46:53 Yeah, that's what seems to be working now. I've aligned my repos with prod, and that's helped. Seems I had at least one package installed that didn't need to be there (or maybe used to be needed and isn't now) that was causing some issues.
2024-09-05T19:09:33 Tried the maxminddb:get rake task here, and it seemed to work fine (complained about not being in a git repo, but the downloads seemed to be fine)
2024-09-05T19:10:18 hm maybe trying it manually on production would yield something useful
2024-09-05T19:10:26 I only observed it through discourse-update
2024-09-05T19:10:35 which does the maxmind stuff as part of assets:precompile
2024-09-05T19:10:45 Ah, that makes sense. Let me check my logs
2024-09-05T19:12:51 It was skipped on my system as it apparently downloaded at 1:15 AM GMT-7 here.
2024-09-05T19:51:41 *** teepee_ is now known as teepee
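Since the Zlib::BufError only shows up via discourse-update / assets:precompile, the "try it manually on production" idea above could start with running the task by itself with rake's stack traces enabled. A hedged sketch: the checkout path is an assumption, only the maxminddb:get task name and the assets:precompile context come from the log; --trace and RAILS_ENV are standard rake/Rails knobs.

    # Sketch: run the MaxMind download step by itself with full rake traces.
    cd /path/to/discourse            # wherever the forums Discourse checkout lives
    RAILS_ENV=production bundle exec rake maxminddb:get --trace

If the task still fails with Zlib::BufError, --trace at least shows which download or extraction call raised it, which gives roughly the information the "printf debugging" suggestion above was after, without editing the rake file.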