2020-07-20T02:14:23 *** okurz_ is now known as okurz
2020-07-20T08:57:37 JFYI: there is a new (private) DNS domain now: vpn.opensuse.org (with subdomains {udp,tcp}.vpn.opensuse.org) - incl. reverse DNS. I hope that makes it easier in the future to detect who is currently logged in.
2020-07-20T11:55:09 cboltz, is the pdns-recursor service ok? Things are starting to slow down, not as bad as last time, but noticeable....
2020-07-20T12:36:23 looks like it's dnsmasq on both servers now ;-) - but both are up
2020-07-20T12:40:45 sorry: the two dnsmasq services are my "fault". I saw pdns-recursor becoming unresponsive again - and decided to switch over...
2020-07-20T12:41:41 no problem, but I'd hope for a better response time ;-)
2020-07-20T12:42:18 possibly related question: resolv.conf on anna and elsa also have some external nameservers listed
2020-07-20T12:42:36 I'm not sure if this is a good idea because those servers can't resolve *.infra.o.o
2020-07-20T12:44:27 * malcolmlewis forums seem to be getting slower :-(
2020-07-20T12:56:54 cboltz: see "options timeout:1 attempts:1" - there is no rotate in that line. This means that the first "nameserver" line has preference. And only if this nameserver does not respond (after 1 attempt or if there is a timeout), the next nameserver is taken. Which means: the external DNS servers are there only as very, very last resort.
2020-07-20T12:58:05 but dnsmasq uses these DNS servers as well for queries not in the cache - or not defined via "servers=/$ZONE/$ip" in the config
2020-07-20T13:03:24 is someone still playing along with the DNS servers on anna/elsa ?
2020-07-20T13:03:44 I get connection timeouts or delayed answers from time to time :-/
2020-07-20T13:20:10 * malcolmlewis has no infrastructure access... but sees that a couple of DNS tickets have been opened...
2020-07-20T13:22:11 thank you to whoever fixed the DNS issue... :)
2020-07-20T13:25:50 this is not really fixed - at least this is how it looks to me. It might be that we overload the two internal DNS servers from time to time? - Needs further investigation.
2020-07-20T13:26:06 kl_eisbaer: I only looked (without changing or restarting anything) - and I still sometimes get timeouts
2020-07-20T13:26:16 so indeed not fixed yet
2020-07-20T13:32:50 party: I increased the dns-forward-max and cache-size settings in dnsmasq, and tried to restart (not reload) the daemon afterward. Maybe I was not brave enough to wait for the last process to exit, but I decided to hard-kill the last dnsmasq processes after some seconds.
2020-07-20T13:33:13 But right now, with freshly restarted dnsmasq processes on anna/elsa, I still see timeouts.
2020-07-20T13:33:28 But I even experience timeouts to download.infra.opensuse.org
2020-07-20T13:33:38 which is not DNS related (IP is cached)
2020-07-20T13:41:27 bah humbug, gone all slow again
2020-07-20T13:47:43 malcolmlewis FYI: I deployed a separate caching DNS on the forum machine now. This should hopefully fix the problem with the forums.
2020-07-20T13:48:23 But I still wonder why a) there is some "overall traffic slowness" inside the intra network and b) why the forum needs to make so many DNS queries
2020-07-20T13:50:48 looks like I need to dive a bit deeper into the vbforums stuff :-/
2020-07-20T14:20:36 kl_eisbaer, thanks :) I have no idea, hopefully we can move away from vB sooner rather than later...
2020-07-20T14:22:50 kl_eisbaer, not sure that worked :( it all seems related to the png files
2020-07-20T14:23:59 residing down in the images directory
2020-07-20T14:31:57 looks like matrix.infra.opensuse.org is running DoS against the internal DNS servers :-/
2020-07-20T14:33:37 looks like the synapse server is constantly re-starting its federation - and somehow trying to connect to all(?) matrix servers ...
2020-07-20T14:34:04 lcp: is synapse/matrix already in production?
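Editor's note on the resolv.conf remark at 12:56:54: the fallback order described there (no "rotate", so the first nameserver is always tried first and the next one only after a timeout) can be sketched roughly as below. This is a simplified model, not the real resolver code; `LookupTimeout`, `resolve_in_order` and the `query` callback are hypothetical names made up for this illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


class LookupTimeout(Exception):
    """Raised by the (hypothetical) query callback when a server does not answer in time."""


def resolve_in_order(
    name: str,
    nameservers: Sequence[str],
    query: Callable[[str, str], str],
) -> Optional[Tuple[str, str]]:
    """Try each nameserver in the listed order, as without the "rotate" option.

    The next server is only consulted when the previous one times out.
    Returns (server, answer) for the first server that answers, or None
    when every server timed out.
    """
    for server in nameservers:
        try:
            return server, query(server, name)
        except LookupTimeout:
            # Only a timeout moves us on to the next "nameserver" line.
            continue
    return None
```

In this model an external nameserver listed last is only reached when every server before it has timed out, which matches the "very, very last resort" description. It also shows cboltz's concern: once that happens, the answer for an internal name comes from a server that does not know the internal zone.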
2020-07-20T14:38:06 lcp: FYI: I installed pdns-recursor now on matrix.infra.opensuse.org - looks like this reduces the DNS load
2020-07-20T14:41:01 malcolmlewis: at least DNS queries are now down to a "normal" level. I don't know if you changed something else, but forums.opensuse.org is speeding up again
2020-07-20T14:47:12 kl_eisbaer, nope, nothing, as I don't have server access....
2020-07-20T14:47:46 malcolmlewis ^^ this sounds like something we should definitely fix :-)
2020-07-20T14:48:03 malcolmlewis do you already have a heroes (aka: FreeIPA) account ?
2020-07-20T14:49:23 kl_eisbaer, nope, just admin cp access via web ui. I do notice everything is cached (images); it wasn't before
2020-07-20T14:50:16 and if I go to a new thread it downloads what is missing
2020-07-20T14:50:21 all quick again
2020-07-20T14:50:30 hm: what does cached mean exactly? IMHO the forum is behind the login machines - they normally do no caching at all
2020-07-20T14:50:54 so you mean browser caching locally ?
2020-07-20T14:51:10 kl_eisbaer, in firefox network debugger, it says cached
2020-07-20T14:51:36 ah, ok. So this is your local machine / browser doing the caching
2020-07-20T14:52:17 kl_eisbaer, let me close it for a few, let the connections disappear and see what happens
2020-07-20T14:52:17 I was starting to wonder where in the chain (from the forum machine over to the login-proxy over to your ISP and your machine) some proxy servers cache content for you ;-)
2020-07-20T14:52:35 I hope it should be as fast as before.
2020-07-20T14:53:24 kl_eisbaer, well, I do run a couple of pihole machines locally
2020-07-20T14:54:45 The reason for all the slowness seemed to be a DNS problem: the forum is doing quite some DNS lookups - and as lcp decided to somehow DoS our internal DNS servers with his matrix machine (hehe: no worries, lcp, I do similar stuff), the forum server could - for example - not resolve the mysql server from time to time. Resulting in very slow responsiveness.
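Editor's note: the effect of putting a caching resolver on the forum and matrix machines can be illustrated with a toy cache. The sketch below is an assumption-laden model, not how dnsmasq or pdns-recursor work internally: it uses one fixed TTL for every name (real resolvers honour the TTL of each record), and `CachingResolver` and its `upstream` callback are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy local DNS cache: repeated lookups of a name hit upstream once per TTL."""

    def __init__(
        self,
        upstream: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream          # stands in for the internal DNS servers
        self._ttl = ttl_seconds            # assumption: one fixed TTL for all names
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}
        self.upstream_queries = 0          # how often upstream was actually asked

    def resolve(self, name: str) -> str:
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]                  # still fresh: no upstream traffic
        self.upstream_queries += 1
        address = self._upstream(name)
        self._cache[name] = (address, now + self._ttl)
        return address
```

With such a cache in front, a burst of requests that all need the same database hostname costs one upstream query per TTL window instead of one per request.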
2020-07-20T14:54:54 kl_eisbaer, well it looks good, images that aren't cached download in less than a second, some were taking 20 seconds
2020-07-20T14:56:03 I think that somehow either the apache or the forum software is currently trying to resolve each and every request to a DNS name. Now think about how many requests your browser alone is initiating to the forums for one single page...
2020-07-20T14:56:42 I tried to reduce the amount of queries going to our internal DNS now with caching DNS servers on the forum and matrix servers.
2020-07-20T14:57:14 kl_eisbaer, I see forums, static and beans
2020-07-20T14:57:15 but we should really have a look at both machines to identify why they need so many DNS queries at all - and if we can reduce them
2020-07-20T14:57:34 malcolmlewis: yes, and one additional connect for each image or css ...
2020-07-20T14:58:14 kl_eisbaer, ahh, a thread I just looked at was 29 get requests
2020-07-20T14:59:17 jip. now multiply this with each real user looking at a forum page ... and you might imagine that a DNS request for each of these requests might slow down a machine ;-)
2020-07-20T14:59:22 looks like anywhere between 20 and 30 per thread
2020-07-20T14:59:53 But I see that there is a >1.1G debug1_log file on the forum machine. Maybe pjessen was already debugging
2020-07-20T15:01:48 JFYI: forums see ~30-50 requests/sec - sometimes 80-100 requests/sec, if everything works.
2020-07-20T15:02:04 wow, a busy place ;)
2020-07-20T15:02:13 https://monitor.opensuse.org/grafana/d-solo/YyV2BduWk/base-metrics?var-hostname=forum.infra.opensuse.org&var-service=apache-status&var-command=apache-status&panelId=11&orgId=1&theme=light&from=1589567188789&to=1595257305396
2020-07-20T15:02:57 ^^ I hope you can see the graph. This is the statistic provided by the apache running on the forums machine
2020-07-20T15:03:22 kl_eisbaer, the logo is bouncing...
2020-07-20T15:05:46 hm. I have to admit that I'm not sure if you need to be logged in (with a heroes account) for that graph....
2020-07-20T15:05:48 I see it....
2020-07-20T15:07:33 perfect :-) There should be even more "machine statistics" here: https://monitor.opensuse.org/grafana/d/YyV2BduWk/base-metrics?orgId=1&refresh=30s&var-hostname=forum.infra.opensuse.org - but I did not find the time (yet) to get all the pictures up and running...
2020-07-20T15:08:57 kl_eisbaer, so both those graphs can be revisited to see the latest info?
2020-07-20T15:11:07 jip
2020-07-20T15:11:57 https://monitor.opensuse.org/grafana/d/L3NnUp8Zz/elasticsearch?search=open&orgId=1 should show you all open dashboards.
2020-07-20T15:12:21 kl_eisbaer, ok, thanks :)
2020-07-20T15:12:23 kl_eisbaer: if you don't have enough problems yet ;-) - there's a chronyc on daffy1 eating 100% CPU, and a dead chronyd
2020-07-20T15:12:51 This includes statistics about our databases (elasticsearch for the wiki search, PostgreSQL and Galera for the applications that store their data in them, like forums does in Galera) and the "base metrics" from our monitoring
2020-07-20T15:13:16 cboltz: hehe: once daffy2 is back again, I will reboot daffy1 ;-)
2020-07-20T15:13:30 kl_eisbaer: haha, I didn't expect it was the matrix machine
2020-07-20T15:14:37 lcp: sometimes it's easy to figure out: I just enabled "log-queries" on anna to see the flood of requests ...
2020-07-20T15:15:34 I can tell the machine can't stop logging because it sure enough ran out of space because of that again
2020-07-20T15:15:36 but this is hopefully solved now. So you can try the next thing ;-)
2020-07-20T15:26:28 cboltz: salt '*' cmd.run 'rm /var/run/chrony-helper/lock ; systemctl try-restart chronyd.service;'
2020-07-20T15:28:17 cboltz: can I kick you out of daffy1? (reboot needed)
2020-07-20T15:28:55 I just logged out ;-)
2020-07-20T16:16:40 time for Feierabend - CU!
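Editor's note on the "log-queries" remark at 15:14:37: once dnsmasq logs queries, finding the noisiest client is mostly a counting exercise. The sketch below assumes log lines of the form `dnsmasq[PID]: query[TYPE] NAME from CLIENT`; the sample lines, hostnames and addresses are invented for illustration and are not taken from anna or elsa, and `top_clients` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed shape of a dnsmasq query log line: "... query[A] name from client"
QUERY_RE = re.compile(
    r"query\[(?P<qtype>[^\]]+)\]\s+(?P<name>\S+)\s+from\s+(?P<client>\S+)"
)


def top_clients(log_lines: Iterable[str], limit: int = 5) -> List[Tuple[str, int]]:
    """Count logged queries per client and return the busiest clients first."""
    counts: Counter = Counter()
    for line in log_lines:
        match = QUERY_RE.search(line)
        if match:  # lines such as "forwarded ..." or "reply ..." are skipped
            counts[match.group("client")] += 1
    return counts.most_common(limit)


# Invented sample data (documentation address ranges, not real hosts).
SAMPLE_LOG = [
    "Jul 20 14:31:57 dnsmasq[1234]: query[A] matrix.example.org from 192.0.2.10",
    "Jul 20 14:31:57 dnsmasq[1234]: forwarded matrix.example.org to 198.51.100.1",
    "Jul 20 14:31:58 dnsmasq[1234]: query[AAAA] matrix.example.org from 192.0.2.10",
    "Jul 20 14:31:58 dnsmasq[1234]: query[A] db.example.org from 192.0.2.20",
    "Jul 20 14:31:59 dnsmasq[1234]: query[SRV] _matrix._tcp.example.org from 192.0.2.10",
]
```

Calling `top_clients(SAMPLE_LOG)` on the sample ranks 192.0.2.10 first; on a real log the same ranking would point at the machine causing the flood.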
2020-07-20T17:27:57 cboltz: eh, it seems synapse is now hitting a wall, by failing a lot of dns lookups
2020-07-20T17:27:59 dns lookup errors*
2020-07-20T18:22:55 nice :-/
2020-07-20T18:31:32 cboltz: I replaced pdns-recursor with unbound, let's see if the ddos repeats
2020-07-20T18:31:59 yeah, let's benchmark some DNS resolvers ;-)
2020-07-20T18:33:16 at the very least, synapse isn't freaking out anymore
2020-07-20T18:33:29 I am more curious if the rest of the infra does
2020-07-20T18:34:10 we'll see ;-)
2020-07-20T19:08:10 alright, the first dns resolution error, when trying to get metadata for centos paste
2020-07-20T19:09:38 now the question is - is this a problem in unbound, or with the upstream DNS server (I guess anna/elsa)?
2020-07-20T19:17:42 I couldn't tell you, I decided against unbound because it sucks at logging >:D
2020-07-20T19:18:00 and switched back to pdns
2020-07-20T19:18:03 ok, so what will you break next? ;-)
2020-07-20T19:25:44 idk, I was thinking the mailing lists >:D
2020-07-20T19:25:49 or should I take care of the freeipa update
2020-07-20T19:26:09 whatever you prefer ;-)
2020-07-20T19:27:02 I could also offer to move travel support from connect.o.o to its own VM (that VM already exists and is bored)
2020-07-20T19:29:38 cboltz: I'm curious what happened with the opensuse.org -> www.opensuse.org redirect cors stuff
2020-07-20T19:30:01 because looking into the console, nothing has changed :c
2020-07-20T19:30:37 good question - if in doubt, your console is right
2020-07-20T19:32:39 oh yeah, that's a thing
2020-07-20T19:33:12 I was gonna remake the tsp in bootstrap 4 first though
2020-07-20T19:33:40 which made me bother with obs and osem theming stuff
2020-07-20T19:33:46 that was a disaster >;D
2020-07-20T19:34:10 I'm happy about everything that gets moved off the old SLE 11 that hosts connect, so maybe move tsp first and then style it? ;-)
2020-07-20T19:34:48 tsp is on the same vm as connect?
2020-07-20T19:34:54 yes
2020-07-20T19:35:05 so I have access there >:D
2020-07-20T19:35:21 there's also a bored tsp.infra.o.o
2020-07-20T19:35:59 was there ever anything done to it?
2020-07-20T19:36:19 I have to look at how tsp is set up on the connect machine first
2020-07-20T19:36:26 we have a web_tsp role in salt that does some base setup (nginx etc.)
2020-07-20T19:37:43 it's mysql database setup, is that in haproxy already?
2020-07-20T19:37:58 AFAIK it's using sqlite currently
2020-07-20T19:38:29 well, I don't see that in the database.yml
2020-07-20T19:40:55 no, it has a `boosters_tsp` mysql db on 192.168.47.4 according to the server
2020-07-20T19:41:16 ok, then I remembered this wrong, sorry
2020-07-20T19:42:16 I should probably see if that db is real tbh
2020-07-20T19:42:28 because trusting configs might lead me nowhere fast >:D
2020-07-20T19:42:36 hmm, database.yml looks more like sqlite
2020-07-20T19:43:09 well, what matters is the production environment
2020-07-20T19:43:16 /home/ancor/travel-support-program/db/production.sqlite3 - last changed today
2020-07-20T19:43:28 oh
2020-07-20T19:43:38 my problem is looking at a wrong place
2020-07-20T19:44:01 I vote for /home/ancor/travel-support-program/config/database.yml ;-)
2020-07-20T19:44:39 you are probably right, I was looking at /srv/www
2020-07-20T19:44:44 which would be a more obvious place to keep that stuff
2020-07-20T19:45:15 I tend to start looking at the webserver config ;-)
2020-07-20T19:45:33 that's probably a good idea to do
2020-07-20T19:45:37 (especially on "historically grown" servers)
2020-07-20T19:45:57 yeah, I have too much trust in order
2020-07-20T19:46:24 and to make things more funny - on boosters, you'll first have to check which webserver is running
2020-07-20T19:46:41 there's config for (at least) apache and lighttpd in /etc ;-)
2020-07-20T19:47:53 we could migrate that db to haproxy, assuming tsp supports mysql well enough (and assuming there is something like pgloader for mysql)
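Editor's note on "I should probably see if that db is real": one low-risk way to check whether a file such as production.sqlite3 is a live SQLite database is to open it read-only and list its tables. A minimal sketch using Python's standard sqlite3 module follows; `list_tables` is a hypothetical helper, and the table names it prints depend entirely on the application, so none are assumed here.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Open an SQLite file read-only and return the names of its tables.

    mode=ro makes sure the file is neither created nor modified; a missing
    file raises sqlite3.OperationalError instead of creating an empty db.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

For example, `list_tables("/home/ancor/travel-support-program/db/production.sqlite3")` would list the application's tables if the file is in use. Together with the file's modification time this gives a second opinion besides database.yml.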
2020-07-20T19:48:34 no idea - the only thing I know about tsp is that it's ruby - a language I don't know ;-)
2020-07-20T19:49:04 I don't like sqlite databases just being around the systems
2020-07-20T19:49:39 (I had to have like 3 on matrix server and it's just a bad experience)
2020-07-20T19:49:48 right - that sqlite database is the only file we'd need to backup (while everything else is on github or salt)
2020-07-20T19:49:54 at least I moved dimension into postgres since that's supported now
2020-07-20T19:50:04 you should, it's a pretty nice language
2020-07-20T19:50:11 on the positive side, tsp is low-traffic, so sqlite shouldn't have load problems ;-)
2020-07-20T19:50:42 well, I looked at ruby once (in a "ruby for beginners" session at an oSC some years ago)
2020-07-20T19:51:02 it's not a bad language, but it feels "too close to python", and that would confuse me ;-)
2020-07-20T19:51:03 yeah, it's not an issue of load thankfully
2020-07-20T19:51:15 I have had too much of that today
2020-07-20T19:52:02 too much of python? I can't imagine that this is possible :-P
2020-07-20T19:52:18 too much of load issues
2020-07-20T19:52:29 ah ;-)
2020-07-20T19:52:50 running git status in /home/ancor/travel-support-program shows quite a bit more than the db
2020-07-20T19:53:22 and this has more commits than upstream :o
2020-07-20T19:54:03 sounds interesting[tm]...
2020-07-20T19:54:24 oh yeah, ancorgs had a field day with stuff in there
2020-07-20T19:54:40 * lcp sent a long message: < https://matrix.org/_matrix/media/r0/download/matrix.org/NALkyjiDgHXFahPzvxWagVpL >
2020-07-20T19:54:59 it's pretty much patches like that in a lot of places
2020-07-20T19:55:03 lol
2020-07-20T19:55:54 and ancorgs personally gets emails about errors in tsp according to this config
2020-07-20T19:56:25 yeah, I've just noticed this in the git diff output
2020-07-20T19:57:17 quite a while ago, he promised to help with the move - but hasn't been too active
2020-07-20T19:57:34 maybe we need to ask him if his promise is still valid
2020-07-20T19:57:49 well, he is quite busy with yast, so his actual job ;)
2020-07-20T19:58:20 I know, but nevertheless - I have a feeling that he's the only one who really understands these changes
2020-07-20T19:58:59 the alternative is to simply deploy the latest version from github (ignoring the git diff etc.) and to wait what explodes - but I won't call this a good idea ;-)
2020-07-20T19:59:42 a little more extensive than the current one at least ;)
2020-07-20T20:00:36 at the very least, we could prep a baseline salt config for tsp
2020-07-20T20:00:44 nice, the messages were sent out of order
2020-07-20T20:00:52 just reverse them in your head >:D
2020-07-20T20:01:52 well, there's already some base config (roles/web_tsp) - but it's mostly the nginx config + some packages
2020-07-20T20:02:11 (in other words: the things I was able to do)
2020-07-20T21:45:08 cboltz: let's start with that https://gitlab.infra.opensuse.org/infra/salt/-/merge_requests/424
2020-07-20T21:45:29 I will try to get it working with puma instead of the current rails server
2020-07-20T21:50:00 looking at "connect_api_key" in the secrets - wouldn't it make sense to make tsp standalone (without any connection to connect)?
2020-07-20T21:50:28 it doesn't have any connection, but I am keeping that api key for the sake of safekeeping
2020-07-20T21:50:45 ah, ok
2020-07-20T21:51:04 just noticed that when scrolling down to site.yml ;-)
2020-07-20T21:53:51 tsp.service: I tend to use a separate user (not wwwrun) whenever possible - but since tsp will be alone on that VM, it isn't too critical
2020-07-20T21:55:42 but looking at profile/tsp/init.sls, you also do the git.latest and bundler install as wwwrun - and that's where I'm too paranoid ;-)
2020-07-20T21:56:00 unless the running service needs to modify these files, they should be owned by a different user
2020-07-20T21:57:52 yeah, I thought so, it doesn't make much sense
2020-07-20T22:03:27 actually it probably doesn't
2020-07-20T22:03:39 it could be root owned
2020-07-20T22:03:57 that would mean you have to run git and bundler install as root
2020-07-20T22:04:04 better create a tsp user
2020-07-20T22:04:15 good point
2020-07-20T22:07:28 done
2020-07-20T22:12:53 looks good :-)
2020-07-20T22:13:41 do you want to add a sudo rule (for the to-be-created "tsp-admins" group), or should I merge as is?
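Editor's note on the ownership point at 21:56:00 (files the service does not need to modify should not be owned by the user it runs as): a deployment can be spot-checked by listing everything under the checkout that belongs to the service user. This is a sketch for POSIX systems only; `paths_owned_by` is a hypothetical helper, and the checkout path and user names in the comment are placeholders, not values from the salt merge request.

```python
from pathlib import Path
from typing import List


def paths_owned_by(root: str, uid: int) -> List[str]:
    """Return every path under root (root included) whose owner is uid.

    lstat() is used so that symlinks are judged by their own owner
    rather than by the owner of whatever they point at.
    """
    base = Path(root)
    candidates = [base, *base.rglob("*")]
    return sorted(str(path) for path in candidates if path.lstat().st_uid == uid)


# Hypothetical usage: resolve the service user's uid, then check the checkout.
#   import pwd
#   uid = pwd.getpwnam("wwwrun").pw_uid          # user the service runs as
#   offenders = paths_owned_by("/srv/www/tsp", uid)   # placeholder path
#   An empty list means the running service owns none of the deployed files.
```

An empty result is what the "separate tsp user" arrangement aims for, apart from directories the application really has to write to.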
2020-07-20T22:14:36 I will add it tbh
2020-07-20T22:15:53 added
2020-07-20T22:16:22 thanks
2020-07-20T22:16:25 set to automerge
2020-07-20T22:49:59 lcp: you should now be able to use sudo on tsp.i.o.o
2020-07-20T22:50:05 I also did half of the highstate
2020-07-20T22:50:21 and then opened port 443 in the firewall so that the next highstate run can actually access github ;-)
2020-07-20T22:50:56 cool, thanks
2020-07-20T22:51:18 I moved over the db to tsp machine, so I have something to work with
2020-07-20T22:51:50 "moved" as in "copied" in actual terms
2020-07-20T22:53:01 I will add tsp to postgres and pgbouncer pg_hba files so I can test that part
2020-07-20T22:53:18 aaaand I won't be able to test much more because I will need login proxy
2020-07-20T22:54:57 I'm too tired to do the login proxy stuff, please remind me tomorrow ;-)
2020-07-20T22:58:46 you got it
2020-07-20T23:00:00 I'm never too tired to understand such hints ;-)