2020-03-23T01:12:04 -heroes-bot- PROBLEM: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS CRITICAL: DB postgres total locks: 61 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T02:22:03 -heroes-bot- RECOVERY: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS OK: DB postgres total=36 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T03:09:49 *** okurz_ is now known as okurz
2020-03-23T12:08:31 *** Eighth_Doctor is now known as Conan_Kudo
2020-03-23T12:09:15 *** Conan_Kudo is now known as Eighth_Doctor
2020-03-23T12:57:56 kl_eisbaer: speaking of domains, I kinda wonder if we could also get our hands on opensu.se, it is owned by *somebody* in the community (I have no clue who), and is unused nowadays, but would be good for registry and paste for short links
2020-03-23T12:59:13 basically, how does openSUSE ask SUSE to get us domains that we need 😛
2020-03-23T13:16:09 lcp: this is indeed a very good question ... :-)
2020-03-23T13:16:46 eh, alright, I will ask the board I guess
2020-03-23T13:17:45 lcp: let me ask the IT guys at SUSE how they want to do it. At the moment, they are busy with migrating everything from MF-IT into their area - but as we saved them some work with the wiki and forum migration, I will ask them to pay us back by ordering the domains ;-)
2020-03-23T13:18:04 lcp: => so you ask the board and I will ask SUSE-IT - perfect :-)
2020-03-23T13:18:18 alright, that would be great
2020-03-23T14:44:04 -heroes-bot- PROBLEM: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS CRITICAL: DB postgres total locks: 57 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T15:01:11 -heroes-bot- PROBLEM: PostgreSQL standby on mirrordb1.infra.opensuse.org - POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB mb_opensuse2 (host:mirrordb2) 87795384 and 1 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PostgreSQL%20standby
2020-03-23T15:11:11 -heroes-bot- RECOVERY: PostgreSQL standby on mirrordb1.infra.opensuse.org - POSTGRES_HOT_STANDBY_DELAY OK: DB mb_opensuse2 (host:mirrordb2) 1377368 and 0 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PostgreSQL%20standby
2020-03-23T16:24:35 -heroes-bot- PROBLEM: MySQL WSREP recv on galera1.infra.opensuse.org - CRIT wsrep_local_recv_queue_avg = 1.308790 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=galera1.infra.opensuse.org&service=MySQL%20WSREP%20recv
2020-03-23T16:34:04 -heroes-bot- RECOVERY: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS OK: DB postgres total=45 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T17:24:34 -heroes-bot- PROBLEM: HTTP wiki on riesling.infra.opensuse.org - CRITICAL - Socket timeout after 10 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
2020-03-23T17:34:25 -heroes-bot- RECOVERY: HTTP wiki on riesling.infra.opensuse.org - HTTP OK: HTTP/1.1 301 Moved Permanently - 401 bytes in 0.070 second response time ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
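The POSTGRES_LOCKS alerts above come from an Icinga check that counts entries in pg_locks. A minimal sketch of how to look at the same numbers by hand on mirrordb1, assuming shell access and a local postgres superuser; the exact query the check plugin runs may differ:

    # count current locks per database, roughly what the POSTGRES_LOCKS check reports
    sudo -u postgres psql -c "SELECT d.datname, count(*) AS locks
                              FROM pg_locks l JOIN pg_database d ON d.oid = l.database
                              GROUP BY d.datname ORDER BY locks DESC;"

    # list lock requests that are currently waiting
    sudo -u postgres psql -c "SELECT pid, locktype, mode FROM pg_locks WHERE NOT granted;"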
2020-03-23T17:50:27 -heroes-bot- PROBLEM: MySQL WSREP recv on galera3.infra.opensuse.org - CRIT wsrep_local_recv_queue_avg = 1.703761 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=galera3.infra.opensuse.org&service=MySQL%20WSREP%20recv
2020-03-23T17:55:06 another gitlab upgrade....
2020-03-23T18:08:34 -heroes-bot- PROBLEM: HTTP wiki on riesling.infra.opensuse.org - CRITICAL - Socket timeout after 10 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
2020-03-23T18:27:02 the redmine instance on progress.o.o produces some internal errors, 500; seems not to be specific to the ticket update I want to write
2020-03-23T18:28:08 the wiki also has problems - wild guess: database issues?
2020-03-23T18:29:10 I guess. "ActiveRecord::StatementInvalid (Mysql2::Error: Field 'id' doesn't have a default value: INSERT INTO `journals` (`created_on`, `journalized_id`, `journalized_type`, `notes`, `private_notes`, `user_id`) VALUES ('2020-03-23 18:28:36', 63772, 'Issue', '/etc/openqa/workers.ini on seattle10.arch defines QEMURAM which overrides any values set by openQA or by manually specified test variables', 0, 17668)):"
2020-03-23T18:29:16 what can we do to check?
2020-03-23T18:30:08 just accessing https://progress.opensuse.org/diary_entries fails for me
2020-03-23T18:30:27 -heroes-bot- PROBLEM: MySQL WSREP recv on galera2.infra.opensuse.org - CRIT wsrep_local_recv_queue_avg = 382.573212 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=galera2.infra.opensuse.org&service=MySQL%20WSREP%20recv
2020-03-23T18:33:48 "PROBLEM: MySQL WSREP", I don't know what that means but that does not sound good
2020-03-23T18:35:08 indeed, and that's probably what causes the problems we see
2020-03-23T18:35:20 kl_eisbaer: are you around and can you check what's going on?
2020-03-23T18:38:06 I can log in to the proxy (anna) but would not know how to continue from there, nor do I have root
2020-03-23T18:38:29 the problem is probably in the galera cluster
2020-03-23T18:38:53 on galera1, messages like these started some minutes ago:
2020-03-23T18:38:57 Mar 22 18:24:59 galera1 systemd[1]: mariadb.service: Got notification message from PID 24497, but reception only permitted for main PID 1437
2020-03-23T18:39:50 well, I can also log in there, e.g. galera3, but again, I don't know about root
2020-03-23T18:40:07 so I also can't read logfiles
2020-03-23T18:40:37 I can use sudo at least on galera1 (looks like galera2 and galera3 need a highstate to fix sudo)
2020-03-23T18:41:17 the problem with galera is that it's a bit "sensitive", so I'm not sure if I should dare to restart mysql on one of the servers, or if this causes more problems than it fixes...
2020-03-23T18:42:17 (using the salt cmd.run backdoor, I see a few (but fewer) messages like I mentioned for galera1 on galera2. galera3 doesn't show this message.)
2020-03-23T18:44:20 looks like kl_eisbaer is logged in on all galera machines, so maybe we should give him some time ;-)
2020-03-23T18:59:24 I don't see him logged in anymore, and it's not fixed though :(
2020-03-23T19:00:13 AFAIK he's the .202 IP logged in as root
2020-03-23T19:00:25 ah, root :)
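For context on the "MySQL WSREP recv" alerts: wsrep_local_recv_queue_avg is a Galera status counter, and a value well above zero means the node receives replicated write-sets faster than it can apply them. A quick way to look at cluster health from one of the galera nodes, assuming sudo and a local mysql client; this is a sketch, not the exact query the monitoring check runs:

    # cluster size, primary component membership, sync state, queue and flow-control counters
    sudo mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
      ('wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment',
       'wsrep_local_recv_queue_avg','wsrep_local_send_queue_avg','wsrep_flow_control_paused');"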
2020-03-23T19:05:20 cboltz: I did some changes on riesling ...
2020-03-23T19:05:42 cboltz: profile memcached /usr/sbin/memcached flags=(complain) { ... }
2020-03-23T19:05:51 ^^ removed complain
2020-03-23T19:06:02 and also increased the configured memory to 4096
2020-03-23T19:06:15 good to know - and as usual: that's salted ;-)
2020-03-23T19:06:29 more important question: any idea what's wrong with galera?
2020-03-23T19:06:34 cboltz: sad to hear that you run such a service in complain mode...
2020-03-23T19:06:38 kl_eisbaer: we were discussing potential database problems as the wiki "has problems" and progress.o.o seems borked
2020-03-23T19:06:45 galera is currently migrating the tables ...
2020-03-23T19:07:13 I do it one table after the other - but it slows down the database, as expected :-/
2020-03-23T19:07:24 can that explain that e.g. https://progress.opensuse.org/diary_entries shows "internal error" and updating tickets does not work? Is the database read-only?
2020-03-23T19:07:29 I can stop the migration, but the script is nearly done
2020-03-23T19:07:41 okurz: it's not read-only
2020-03-23T19:07:45 hm
2020-03-23T19:07:50 okurz: the "problem" is that I'm doing it on galera2
2020-03-23T19:07:52 what's the ETA for the migration to finish?
2020-03-23T19:07:58 which is also the machine used by haproxy for writes
2020-03-23T19:08:24 ...and as long as the mysqld process is working in general (haproxy tests this), it is still used.
2020-03-23T19:08:28 and I think it's not critical as long as the problem is gone before the Asian day starts :)
2020-03-23T19:08:34 I can switch haproxy manually to use another node
2020-03-23T19:08:50 I expect it to be finished in the next 2 hours
2020-03-23T19:09:28 cboltz: just in case you wonder why I enabled apparmor for memcached: https://github.com/memcached/memcached/issues/629
2020-03-23T19:09:39 I'd prefer to have working wikis and progress ;-) so yes, please switch haproxy
2020-03-23T19:09:47 yes, I've seen that (via blog.fefe.de)
2020-03-23T19:09:58 ok
2020-03-23T19:10:10 (aka nightmare delivery service ;-)
2020-03-23T19:10:12 omw to anna
2020-03-23T19:11:13 switched
2020-03-23T19:11:45 hm: might not be a good time to restart memcached while switching the database node
2020-03-23T19:12:25 should I expect progress.o.o to work again?
2020-03-23T19:12:35 indeed, better keep the cache ;-)
2020-03-23T19:13:50 ok: I'll kill the script for now
2020-03-23T19:15:39 FYI: https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/360 should cover your memcached changes
2020-03-23T19:16:04 hm: script stopped, haproxy migrated to galera3, apache on riesling restarted
2020-03-23T19:16:20 but I still see 403
2020-03-23T19:16:32 btw: the apache has no webmaster email address configured
2020-03-23T19:16:42 ah: back now
2020-03-23T19:16:49 yes :-)
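The haproxy switch mentioned above is done on the proxy hosts (anna/elsa). A sketch of how backend state can be inspected and a database node drained through the haproxy admin socket, assuming a stats socket is enabled; the socket path and the backend/server names ("galera", "galera2") are illustrative and the real configuration may differ:

    # show which galera backend servers haproxy currently considers UP
    echo "show stat" | sudo socat stdio /var/run/haproxy/admin.sock | awk -F, '$1=="galera" {print $2, $18}'

    # take one node out of rotation, and put it back later
    echo "disable server galera/galera2" | sudo socat stdio /var/run/haproxy/admin.sock
    echo "enable server galera/galera2"  | sudo socat stdio /var/run/haproxy/admin.sock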
2020-03-23T19:18:54 sorry: this was worse than expected
2020-03-23T19:19:00 still no progress on progress though ;)
2020-03-23T19:19:19 progress is ruby sh*t
2020-03-23T19:19:25 this means the app needs a kick
2020-03-23T19:19:28 as usual
2020-03-23T19:19:40 I can try to restart redmine
2020-03-23T19:19:54 done already
2020-03-23T19:20:28 can't say that this helped
2020-03-23T19:20:40 need to wait for haproxy to catch up - for me it's up
2020-03-23T19:20:57 yes, it's up but still showing the same symptoms
2020-03-23T19:21:07 okurz: what exactly does not work?
2020-03-23T19:21:27 do `tail -f /srv/www/vhosts/redmine/log/production.log` and try to access e.g. https://progress.opensuse.org/diary_entries when logged in
2020-03-23T19:21:35 something about "ActiveRecord::StatementInvalid (Mysql2::Error: Field 'id' doesn't have a default value: INSERT INTO `journals` (`created_on`, `journalized_id`, `journalized_type`, `notes`, `private_notes`, `user_id`) VALUES ('2020-03-23 18:28:36', 63772, 'Issue', '/etc/openqa/workers.ini on seattle10.arch defines QEMURAM which overrides any values set by openQA or by manually specified test variables', 0, 17668)):"
2020-03-23T19:22:37 I also get an internal error for https://progress.opensuse.org/projects/opensuse-admin/issues
2020-03-23T19:22:46 good, reproducible :)
2020-03-23T19:22:47 (while some other pages work, so it depends what you test)
2020-03-23T19:22:56 yes, accessing some pages is fine
2020-03-23T19:23:52 omw
2020-03-23T19:31:35 I hope that means you're trying to fix it? ;)
2020-03-23T19:38:12 at least this is what I hope ...
2020-03-23T19:38:41 and I will be grateful :)
2020-03-23T19:54:54 looks like I need to reset the DB: this means that the last 3 hours will get lost
2020-03-23T19:55:14 but this allows us to go on instead of analyzing for another 2 hours or so
2020-03-23T20:03:04 -heroes-bot- PROBLEM: HAProxy on elsa.infra.opensuse.org - HAPROXY CRITICAL - Active service galera3 is DOWN on galera proxy ! Active service redmine is DOWN on redmine proxy ! ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=elsa.infra.opensuse.org&service=HAProxy
2020-03-23T20:03:07 -heroes-bot- PROBLEM: HAProxy on anna.infra.opensuse.org - HAPROXY CRITICAL - Active service galera3 is DOWN on galera proxy ! Active service redmine is DOWN on redmine proxy ! ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=anna.infra.opensuse.org&service=HAProxy
2020-03-23T20:03:11 -heroes-bot- PROBLEM: HTTP progress on redmine.infra.opensuse.org - HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 311 bytes in 0.001 second response time ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=redmine.infra.opensuse.org&service=HTTP%20progress
2020-03-23T20:03:28 I guess that's ok
2020-03-23T20:07:26 * kl_eisbaer wonders when we can finally switch over to the upgraded redmine instance and leave the SLE11 one where it belongs ...
2020-03-23T20:09:59 Estu plans to work on it in the next few days - last time I tested, creating tickets by mail (admin@) didn't work yet
2020-03-23T20:10:23 that would be very, very welcome.
2020-03-23T20:10:33 Meanwhile this system is the oldest one in our fleet
2020-03-23T20:17:28 btw, the most recent updates of tickets from progress.o.o I have received email notifications for are from 2020-03-23 1508 UTC, so no significant data would be lost if we go three hours back … or even more :)
2020-03-23T20:44:46 -heroes-bot- PROBLEM: HTTP etherpad on etherpad.infra.opensuse.org - connect to address 192.168.47.56 and port 9001: Connection refused ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=etherpad.infra.opensuse.org&service=HTTP%20etherpad
2020-03-23T22:10:43 hmm, interesting maintenance mode message:
2020-03-23T22:11:17 Sorry, api.opensuse.org is in maintenance mode at the moment. If the situation persists, check the opensuse-buildservice@opensuse.org mailinglist or https://status.opensuse.org/ .
2020-03-23T22:11:17 especially because it's shown on etherpad.o.o ;-)
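On the repeated "Field 'id' doesn't have a default value" errors that eventually led to the database reset: with a Rails application such as redmine, this error on an `id` column is commonly a sign that the column lost its AUTO_INCREMENT attribute, which can happen when a table gets recreated during a migration. A sketch of how one might verify that before restoring, assuming the redmine schema is named `redmine` (the actual database name may differ):

    # does journals.id still carry AUTO_INCREMENT?
    sudo mysql -e "SHOW CREATE TABLE redmine.journals\G" | grep '`id`'

    # same information from information_schema
    sudo mysql -e "SELECT column_name, extra FROM information_schema.columns
                   WHERE table_schema='redmine' AND table_name='journals' AND column_name='id';"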
2020-03-23T22:40:24 -heroes-bot- PROBLEM: HTTP wiki on riesling.infra.opensuse.org - HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1381 bytes in 0.029 second response time ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
2020-03-23T22:45:47 kl_eisbaer: just wondering - when do you expect the databases to be back?
2020-03-23T22:46:38 cboltz: galera1 went out of service with filesystem corruption :-(
2020-03-23T22:47:52 :-(
2020-03-23T22:48:47 and funnily, the wikis currently bombard the remaining DBs with requests: max_connections is set to 200 - but the slots are filled up like crazy
2020-03-23T22:49:47 maybe queued requests
2020-03-23T22:49:56 I just restarted apache, let's see if it helps
2020-03-23T22:50:28 ps Zaux looks more normal again
2020-03-23T22:51:14 but en.o.o says "cannot access the database"
2020-03-23T22:54:45 cboltz: the cluster is currently not in good shape
2020-03-23T22:54:52 and I'm too tired to debug
2020-03-23T22:55:18 looking at the time, that's completely understandable
2020-03-23T22:56:31 maybe get some sleep and continue tomorrow morning?
2020-03-23T22:56:53 yep, might be better
2020-03-23T22:57:31 flush hosts; -> hangs :-/
2020-03-23T22:58:20 I'll put a notice on status.o.o that we have database issues
2020-03-23T22:58:26 yep, thanks
2020-03-23T22:58:45 any wishes/proposals for the text, or should I just write something?
2020-03-23T22:59:01 feel free to write whatever you like.
2020-03-23T22:59:05 My brain is dead atm
2020-03-23T22:59:17 ok
2020-03-23T22:59:41 then - good night, sleep well
2020-03-23T22:59:50 and good luck with fixing everything tomorrow ;-)
2020-03-23T22:59:50 Aborted_connects                              | 14068
2020-03-23T22:59:55 JFYI ...
2020-03-23T23:00:02 Connections                                   | 431853 |
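The two counters pasted at the end (Aborted_connects and Connections) are MariaDB global status variables. A sketch of how to pull them together with the max_connections limit that was said to be filling up, assuming a local mysql client on one of the remaining galera nodes:

    # configured connection limit vs. current and peak usage
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';
                   SHOW GLOBAL STATUS WHERE Variable_name IN
                     ('Threads_connected','Max_used_connections','Connections','Aborted_connects');"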