2020-03-23T01:12:04  -heroes-bot- PROBLEM: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS CRITICAL: DB postgres total locks: 61 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T02:22:03  -heroes-bot- RECOVERY: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS OK: DB postgres total=36 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T03:09:49  *** okurz_ is now known as okurz
2020-03-23T12:08:31  *** Eighth_Doctor is now known as Conan_Kudo
2020-03-23T12:09:15  *** Conan_Kudo is now known as Eighth_Doctor
2020-03-23T12:57:56  <lcp> kl_eisbaer: speaking of domains, I kinda wonder if we could also get our hands on opensu.se, it is owned by *somebody* in the community (I have no clue who), and is unused nowadays, but would be good for registry and paste for short links
2020-03-23T12:59:13  <lcp> bascially, how does openSUSE ask SUSE to get us domains that we need 😛
2020-03-23T13:16:09  <kl_eisbaer> lcp: this is indeed a very good question ... :-)
2020-03-23T13:16:46  <lcp> eh, alright, I will ask the board I guess
2020-03-23T13:17:45  <kl_eisbaer> lcp: let me ask the IT guys at SUSE, how they want to do it. At the moment, they are busy with migrating everything from MF-IT into their area - but as we saved them some work with the wiki and forum migration, I will ask them to pay us back by ordering the domains ;-)
2020-03-23T13:18:04  <kl_eisbaer> lcp: => so you ask the board and I will ask SUSE-IT - perfect :-)
2020-03-23T13:18:18  <lcp> alright, that would be great
2020-03-23T14:44:04  -heroes-bot- PROBLEM: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS CRITICAL: DB postgres total locks: 57 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T15:01:11  -heroes-bot- PROBLEM: PostgreSQL standby on mirrordb1.infra.opensuse.org - POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB mb_opensuse2 (host:mirrordb2) 87795384 and 1 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PostgreSQL%20standby
2020-03-23T15:11:11  -heroes-bot- RECOVERY: PostgreSQL standby on mirrordb1.infra.opensuse.org - POSTGRES_HOT_STANDBY_DELAY OK: DB mb_opensuse2 (host:mirrordb2) 1377368 and 0 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PostgreSQL%20standby
2020-03-23T16:24:35  -heroes-bot- PROBLEM: MySQL WSREP recv on galera1.infra.opensuse.org - CRIT wsrep_local_recv_queue_avg = 1.308790 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=galera1.infra.opensuse.org&service=MySQL%20WSREP%20recv
2020-03-23T16:34:04  -heroes-bot- RECOVERY: PSQL locks on mirrordb1.infra.opensuse.org - POSTGRES_LOCKS OK: DB postgres total=45 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=mirrordb1.infra.opensuse.org&service=PSQL%20locks
2020-03-23T17:24:34  -heroes-bot- PROBLEM: HTTP wiki on riesling.infra.opensuse.org - CRITICAL - Socket timeout after 10 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
2020-03-23T17:34:25  -heroes-bot- RECOVERY: HTTP wiki on riesling.infra.opensuse.org - HTTP OK: HTTP/1.1 301 Moved Permanently - 401 bytes in 0.070 second response time ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
2020-03-23T17:50:27  -heroes-bot- PROBLEM: MySQL WSREP recv on galera3.infra.opensuse.org - CRIT wsrep_local_recv_queue_avg = 1.703761 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=galera3.infra.opensuse.org&service=MySQL%20WSREP%20recv
2020-03-23T17:55:06  <klein> another gitlab upgrade....
2020-03-23T18:08:34  -heroes-bot- PROBLEM: HTTP wiki on riesling.infra.opensuse.org - CRITICAL - Socket timeout after 10 seconds ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
2020-03-23T18:27:02  <okurz> the redmine instance on progress.o.o produces some internal errors, 500, seems to not be specific for the ticket update I want to write
2020-03-23T18:28:08  <cboltz> the wiki also has problems - wild guess: database issues?
2020-03-23T18:29:10  <okurz> I guess. "ActiveRecord::StatementInvalid (Mysql2::Error: Field 'id' doesn't have a default value: INSERT INTO `journals` (`created_on`, `journalized_id`, `journalized_type`, `notes`, `private_notes`, `user_id`) VALUES ('2020-03-23 18:28:36', 63772, 'Issue', '/etc/openqa/workers.ini on seattle10.arch defines QEMURAM which overrides any values set by openQA or by manually specified test variables', 0, 17668)):"
2020-03-23T18:29:16  <okurz> what can we do to check?
2020-03-23T18:30:08  <okurz> just accessing https://progress.opensuse.org/diary_entries fails for me
2020-03-23T18:30:27  -heroes-bot- PROBLEM: MySQL WSREP recv on galera2.infra.opensuse.org - CRIT wsrep_local_recv_queue_avg = 382.573212 ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=galera2.infra.opensuse.org&service=MySQL%20WSREP%20recv
2020-03-23T18:33:48  <okurz> "PROBLEM: MySQL WSREP", I don't know what that means but that does not sound good
2020-03-23T18:35:08  <cboltz> indeed, and that's probably what causes the problems we see
2020-03-23T18:35:20  <cboltz> kl_eisbaer: are you around and can check what's going on?
2020-03-23T18:38:06  <okurz> I can login to proxy (anna) but would not know how to continue from there nor have root
2020-03-23T18:38:29  <cboltz> the problem is probably in the galera cluster
2020-03-23T18:38:53  <cboltz> on galera1, messages like these started some minutes ago:
2020-03-23T18:38:57  <cboltz> Mar 22 18:24:59 galera1 systemd[1]: mariadb.service: Got notification message from PID 24497, but reception only permitted for main PID 1437
2020-03-23T18:39:50  <okurz> well, I can also login there, e.g. galera3 but again, don't know about root
2020-03-23T18:40:07  <okurz> so I also can't read logfiles
2020-03-23T18:40:37  <cboltz> I can use sudo at least on galera1 (looks like galera2 and galera3 need a highstate to fix sudo)
2020-03-23T18:41:17  <cboltz> the problem with galera is that is's a bit "sensitive", so I'm not sure if I should dare to restart mysql on one of the servers, or if this causes more problems than it fixes...
2020-03-23T18:42:17  <cboltz> (using the salt cmd.run backdoor, I see a few (but less) messages like I mentioned for galera1 on galera2. galera3 doesn't show this message.)
2020-03-23T18:44:20  <cboltz> looks like kl_eisbaer is logged in on all galera machines, so maybe we should give him some time ;-)
2020-03-23T18:59:24  <okurz> I don't see him logged in anymore and not fixed though :(
2020-03-23T19:00:13  <cboltz> AFAIK he's the .202 IP logged in as root
2020-03-23T19:00:25  <okurz> ah, root :)
2020-03-23T19:05:20  <kl_eisbaer> cboltz: I did some changes on riesling ...
2020-03-23T19:05:42  <kl_eisbaer> cboltz: profile memcached /usr/sbin/memcached flags=(complain) { ... }
2020-03-23T19:05:51  <kl_eisbaer> ^^ removed complain
2020-03-23T19:06:02  <kl_eisbaer> and also enhanced the used memory to 4096
2020-03-23T19:06:15  <cboltz> good to know - and as usual: that's salted ;-)
2020-03-23T19:06:29  <cboltz> more important question: any idea what's wrong with galera?
2020-03-23T19:06:34  <kl_eisbaer> cboltz: sad to hear that you run such a service in complain mode...
2020-03-23T19:06:38  <okurz> kl_eisbaer: we were discussing potential database problems as the wiki "has problems" and progress.o.o seems borked
2020-03-23T19:06:45  <kl_eisbaer> galera is currently migrating the tables ...
2020-03-23T19:07:13  <kl_eisbaer> I do it one table after the other - but it slows down the database, as expected :-/
2020-03-23T19:07:24  <okurz> can that explain that e.g. https://progress.opensuse.org/diary_entries shows "internal error" and updating tickets does not work? Is the database read-only?
2020-03-23T19:07:29  <kl_eisbaer> I can stop the migration, but the script is nearly done
2020-03-23T19:07:41  <kl_eisbaer> okurz: it's not read only
2020-03-23T19:07:45  <okurz> hm
2020-03-23T19:07:50  <kl_eisbaer> okurz: the "problem" is that I'm doing it on galera2
2020-03-23T19:07:52  <okurz> what's the ETA on migration done?
2020-03-23T19:07:58  <kl_eisbaer> which is also the machine used by haproxy for writes
2020-03-23T19:08:24  <kl_eisbaer> ...and as long as the mysqld process is working in general (haproxy tests this), it is still used.
2020-03-23T19:08:28  <okurz> and I think it's not critical if the problem is gone until Asian day starts :)
2020-03-23T19:08:34  <kl_eisbaer> I can switch haproxy manually to use another node
2020-03-23T19:08:50  <kl_eisbaer> I expect it to be finished in the next 2 hours
2020-03-23T19:09:28  <kl_eisbaer> cboltz: just if you wonder why I enabled apparmor for memcached: https://github.com/memcached/memcached/issues/629
2020-03-23T19:09:39  <cboltz> I'd prefer to have working wikis and progress ;-) so yes, please switch haproxy
2020-03-23T19:09:47  <cboltz> yes, I've seen that (via blog.fefe.de)
2020-03-23T19:09:58  <kl_eisbaer> ok
2020-03-23T19:10:10  <cboltz> (aka nightmare delivery service ;-)
2020-03-23T19:10:12  <kl_eisbaer> omw to anna
2020-03-23T19:11:13  <kl_eisbaer> switched
2020-03-23T19:11:45  <kl_eisbaer> hm: might be not a good time to restart memcached and switching the database node
2020-03-23T19:12:25  <okurz> should I expect progress.o.o to work again?
2020-03-23T19:12:35  <cboltz> indeed, better keep the cache ;-)
2020-03-23T19:13:50  <kl_eisbaer> ok: I kill the script for now
2020-03-23T19:15:39  <cboltz> FYI: https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/360 should cover your memcached changes
2020-03-23T19:16:04  <kl_eisbaer> hm: script stopped, haproxy migrated to galera3, apache on riesling restarted
2020-03-23T19:16:20  <kl_eisbaer> but I still see 403
2020-03-23T19:16:32  <kl_eisbaer> btw: the apache has no webmaster Email address configured
2020-03-23T19:16:42  <kl_eisbaer> ah: back now
2020-03-23T19:16:49  <cboltz> yes :-)
2020-03-23T19:18:54  <kl_eisbaer> sorry: this was worse than expected
2020-03-23T19:19:00  <okurz> still no progress on progress though ;)
2020-03-23T19:19:19  <kl_eisbaer> progress is ruby sh*t
2020-03-23T19:19:25  <kl_eisbaer> this means the app needs a kick
2020-03-23T19:19:28  <kl_eisbaer> as usual
2020-03-23T19:19:40  <okurz> I can try to restart redmine
2020-03-23T19:19:54  <kl_eisbaer> done already
2020-03-23T19:20:28  <okurz> can't say that this helped
2020-03-23T19:20:40  <kl_eisbaer> need to wait for haproxy to catch up - for me it's up
2020-03-23T19:20:57  <okurz> yes, it's up but still showing the same symptoms
2020-03-23T19:21:07  <kl_eisbaer> okurz: what exactly does not work?
2020-03-23T19:21:27  <okurz> do `tail -f /srv/www/vhosts/redmine/log/production.log` and try to access e.g. https://progress.opensuse.org/diary_entries when logged in
2020-03-23T19:21:35  <okurz> something about "ActiveRecord::StatementInvalid (Mysql2::Error: Field 'id' doesn't have a default value: INSERT INTO `journals` (`created_on`, `journalized_id`, `journalized_type`, `notes`, `private_notes`, `user_id`) VALUES ('2020-03-23 18:28:36', 63772, 'Issue', '/etc/openqa/workers.ini on seattle10.arch defines QEMURAM which overrides any values set by openQA or by manually specified test variables', 0, 17668)):"
2020-03-23T19:22:37  <cboltz> I also get an internal error for https://progress.opensuse.org/projects/opensuse-admin/issues
2020-03-23T19:22:46  <okurz> good, reproducible :)
2020-03-23T19:22:47  <cboltz> (while some other pages work, so it depends what you test)
2020-03-23T19:22:56  <okurz> yes, accessing some pages is fine
2020-03-23T19:23:52  <kl_eisbaer> omw
2020-03-23T19:31:35  <okurz> I hope that means you try to fix it? ;)
2020-03-23T19:38:12  <kl_eisbaer> at least this is what I hope ...
2020-03-23T19:38:41  <okurz> and I will be grateful :)
2020-03-23T19:54:54  <kl_eisbaer> looks like I need to reset the DB: this means that the last 3 hours will get lost
2020-03-23T19:55:14  <kl_eisbaer> but this allows to go on instead of analyzing for another 2 hours or so
2020-03-23T20:03:04  -heroes-bot- PROBLEM: HAProxy on elsa.infra.opensuse.org - HAPROXY CRITICAL - Active service galera3 is DOWN on galera proxy ! Active service redmine is DOWN on redmine proxy ! ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=elsa.infra.opensuse.org&service=HAProxy
2020-03-23T20:03:07  -heroes-bot- PROBLEM: HAProxy on anna.infra.opensuse.org - HAPROXY CRITICAL - Active service galera3 is DOWN on galera proxy ! Active service redmine is DOWN on redmine proxy ! ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=anna.infra.opensuse.org&service=HAProxy
2020-03-23T20:03:11  -heroes-bot- PROBLEM: HTTP progress on redmine.infra.opensuse.org - HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 311 bytes in 0.001 second response time ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=redmine.infra.opensuse.org&service=HTTP%20progress
2020-03-23T20:03:28  <okurz> I guess that's ok
2020-03-23T20:07:26  * kl_eisbaer questions himself when we can finally switch over to the upgraded redmine instance and leave the SLE11 one where it belongs to ...
2020-03-23T20:09:59  <cboltz> Estu plans to work on it in the next days - last time I tested, creating tickets by mail (admin@) didn't work yet
2020-03-23T20:10:23  <kl_eisbaer> that would be very, very welcome.
2020-03-23T20:10:33  <kl_eisbaer> Meanwhile this system is the oldest one in our Fuhrpark
2020-03-23T20:17:28  <okurz> btw, the most recent updates of tickets from progress.o.o I have received email notifications for are from 2020-03-23 1508 UTC, so no significant data lost if we go three hours back … or even more :)
2020-03-23T20:44:46  -heroes-bot- PROBLEM: HTTP etherpad on etherpad.infra.opensuse.org - connect to address 192.168.47.56 and port 9001: Connection refused ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=etherpad.infra.opensuse.org&service=HTTP%20etherpad
2020-03-23T22:10:43  <cboltz> hmm, interesting maintenance mode message:
2020-03-23T22:11:17  <cboltz>     Sorry, api.opensuse.org is in maintenance mode at the moment. If the situation persists, check the opensuse-buildservice@opensuse.org mailinglist or https://status.opensuse.org/ .
2020-03-23T22:11:17  <cboltz> especially because it's shown on etherpad.o.o ;-)
2020-03-23T22:40:24  -heroes-bot- PROBLEM: HTTP wiki on riesling.infra.opensuse.org - HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1381 bytes in 0.029 second response time ; See https://monitor.opensuse.org/icinga/cgi-bin/extinfo.cgi?type=2&host=riesling.infra.opensuse.org&service=HTTP%20wiki
2020-03-23T22:45:47  <cboltz> kl_eisbaer: just wondering - when do you expect the databases to be back?
2020-03-23T22:46:38  <kl_eisbaer> cboltz: galera1 went of ouf service with filesystem corruption :-(
2020-03-23T22:47:52  <cboltz> :-(
2020-03-23T22:48:47  <kl_eisbaer> an funnily, the wiki's currently bombard the remaining DBs with requests: set max_connections is set to 200 - but the slots are filled up like crazy
2020-03-23T22:49:47  <cboltz> maybe queued requests
2020-03-23T22:49:56  <cboltz> I just restarted apache, let's see if it helps
2020-03-23T22:50:28  <cboltz> ps Zaux   looks more normal again
2020-03-23T22:51:14  <cboltz> but en.o.o says "cannot access the database"
2020-03-23T22:54:45  <kl_eisbaer> cboltz: the cluster is currently not in a good shape
2020-03-23T22:54:52  <kl_eisbaer> and I'm too tired to debug
2020-03-23T22:55:18  <cboltz> looking at the time, that's completely understandable
2020-03-23T22:56:31  <cboltz> maybe get some sleep and continue tomorrow morning?
2020-03-23T22:56:53  <kl_eisbaer> jip, might be better
2020-03-23T22:57:31  <kl_eisbaer> flush hosts; -> hangs :-/
2020-03-23T22:58:20  <cboltz> I'll put a notice on status.o.o that we have database issues
2020-03-23T22:58:26  <kl_eisbaer> jip, thanks
2020-03-23T22:58:45  <cboltz> any wishes/proposals for the text, or should I just write something?
2020-03-23T22:59:01  <kl_eisbaer> feel free to write whatever you like.
2020-03-23T22:59:05  <kl_eisbaer> My brain is dead atm
2020-03-23T22:59:17  <cboltz> ok
2020-03-23T22:59:41  <cboltz> then - good night, sleep well
2020-03-23T22:59:50  <cboltz> and good luck with fixing everything tomorrow ;-)
2020-03-23T22:59:50  <kl_eisbaer>  Aborted_connects                              | 14068
2020-03-23T22:59:55  <kl_eisbaer> JFYI ...
2020-03-23T23:00:02  <kl_eisbaer>  Connections                                   | 431853 |