Gerrit/Operations

From Wikitech

Restarting

Restarting Gerrit is a last resort. We used to restart it often because we misunderstood some of its behavior and because of a nasty memory leak. As of February 2021, a restart should not be performed without first thoroughly reviewing the current behavior and taking traces. Those traces are a great help in identifying a potential bug or a configuration tuning.

If, after investigating, you are still at a loss or really have no other option, you can restart Gerrit through a cookbook run: sudo cookbook sre.gerrit.restart-gerrit --host gerrit2002.

The service takes a few seconds to come back, during which any end user operation will error out (this affects some Puppet catalogues, CI, and developers).

Monitoring

JavaMelody monitors the state of the Gerrit JVM. Its metrics are collected by Prometheus from https://gerrit.wikimedia.org/r/monitoring?prometheus

Important Graphs

Gerrit metrics

On top of the JavaMelody data, Gerrit has internal metrics.

For users with the viewCaches or View Metrics capabilities, various internal Gerrit metrics can be retrieved via:

This requires authentication. It complements gerrit show-caches.

We use the metrics-reporter-prometheus plugin, which exposes metrics collected by Prometheus from https://gerrit.wikimedia.org/r/plugins/metrics-reporter-prometheus/metrics . Those Gerrit metrics can also be seen on the JavaMelody MBeans page under the metrics branch.
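When inspecting either Prometheus endpoint by hand, it can help to keep only the samples for one family of metrics. The following is a minimal sketch, assuming the endpoints return the standard Prometheus text exposition format; the function name filter_metrics and the jvm_ prefix in the usage note are hypothetical, and the endpoints may require credentials that are not shown here.

```shell
# filter_metrics PREFIX: read Prometheus text-format metrics on stdin and
# print only the sample lines whose metric name starts with PREFIX.
# Comment lines (# HELP / # TYPE) are dropped.
filter_metrics() {
    awk -v p="$1" '!/^#/ && index($0, p) == 1'
}
```

A possible use, with jvm_ standing in for a real metric prefix: curl -s 'https://gerrit.wikimedia.org/r/monitoring?prometheus' | filter_metrics jvm_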

See Gerrit Grafana dashboards folder.

Use gerrit ssh commands

To use Gerrit ssh commands (see command list documentation), you need to be a member of the LDAP group gerritadmin. To be added (more info on the SRE/LDAP documentation), run ldapvi -b ou=groups cn=gerritadmin and then wait for the synchronization to happen. You will then be able to run commands such as those in Gerrit/Operations#Killing ssh connections, or ssh user@gerrit.wikimedia.org -p 29418 gerrit show-caches.

Please avoid running replication commands without asking RelEng or Collab first (see https://wikitech.wikimedia.org/wiki/Gerrit/Administration#Forcing_Replication_re-runs).

Logs

Logs are consumed by our logging infrastructure and are available in the Kibana dashboards for Gerrit (application logs) and for Apache access logs.

Main logs

Logs are available on the gerrit servers at: /var/log/gerrit/. There are a number of logfiles:

  • gerrit.log: This is the main log file and will show stacktraces and errors
  • gerrit.json: Like gerrit.log but not really human readable. Used for sending structured logs to logstash.
  • sshd_log: Log of sshd events
  • gc_log: Logs for git gc, not for JVM garbage collection (the JVM logs are available in /srv/gerrit/jvmlogs)
  • plugin_log: Info about plugins being loaded and reloaded; this information is also in gerrit.log

HTTP Logs

Gerrit sits behind Apache. Access and error logs are both in /var/log/apache2:

  • gerrit.wikimedia.org.https.access.log
  • gerrit.wikimedia.org.https.error.log

Find Gerrit logs in Kibana by searching with type:log4j.

JVM

Thread Dump

A thread dump is often useful in troubleshooting. To capture a thread dump, use jstack. This command should be safe to run at any time, and is run frequently while Gerrit is running:

sudo -u gerrit2 jstack -l "$(pgrep -u gerrit2 java)" > "/srv/gerrit/jstack-$(date +%Y-%m-%d-%H%M%S).dump"

It's often useful to upload the resulting file to https://fastthread.io/ to detect problems.
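A single dump is a snapshot; taking a few dumps some seconds apart makes it easier to spot threads that stay stuck. The sketch below wraps the jstack command above in a loop. The function names, the default count and the default interval are hypothetical, and it assumes a single Java process runs as gerrit2.

```shell
# dump_path [DIR]: print a timestamped thread dump path under DIR.
dump_path() {
    printf '%s/jstack-%s.dump\n' "${1:-/srv/gerrit}" "$(date +%Y-%m-%d-%H%M%S)"
}

# take_dumps [COUNT] [INTERVAL_SECONDS] [DIR]: write COUNT thread dumps,
# waiting INTERVAL_SECONDS between them. Runs in a subshell so that its
# variables do not leak into the calling shell.
take_dumps() (
    count="${1:-3}"; interval="${2:-10}"; dir="${3:-/srv/gerrit}"
    pid="$(pgrep -u gerrit2 java)" || { echo "no Gerrit JVM found" >&2; exit 1; }
    i=1
    while [ "$i" -le "$count" ]; do
        out="$(dump_path "$dir")"
        sudo -u gerrit2 jstack -l "$pid" > "$out" || exit 1
        echo "wrote $out"
        # Only sleep between dumps, not after the last one.
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
        i=$((i + 1))
    done
)
```

For example, take_dumps 3 10 would write three dumps ten seconds apart into /srv/gerrit.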

Java trace

This command isn't run very often and it is unclear how safe it is to run. It is kept here for folks who are familiar with jstat.

Display a summary of garbage collection statistics every 1000 ms:

sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jstat -gcutil "$(pgrep -u gerrit2 java)" 1000
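When watching the jstat output for a while, a small filter can flag samples where the old generation is nearly full. This is a minimal sketch, assuming the -gcutil layout in which the fourth column (O) is the old generation occupancy in percent; the function name and the default threshold of 90 are hypothetical.

```shell
# old_gen_above [THRESHOLD]: read `jstat -gcutil` output on stdin and print
# one line for each sample whose old generation occupancy (4th column, O)
# is above THRESHOLD percent. The header line is skipped because its 4th
# field is not numeric.
old_gen_above() {
    awk -v t="${1:-90}" '$4 ~ /^[0-9.]+$/ && $4 + 0 > t + 0 { print "old gen at " $4 "%" }'
}
```

The jstat command above can be piped into it, for example with | old_gen_above 85 appended.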

Java heap usage

Requires openjdk-X-dbg for the debugging symbols.

sudo /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap -heap "$(pgrep -u gerrit2 java)"

Access h2 account_patch_reviews

On copies of account_patch_reviews* files:

java -cp h2-1.3.176.jar org.h2.tools.Shell -url jdbc:h2:/home/hashar/account_patch_reviews

This gives you an SQL prompt:

sql> show columns from ACCOUNT_PATCH_REVIEWS
...> ;
FIELD        | TYPE         | NULL | KEY | DEFAULT
ACCOUNT_ID   | INTEGER(10)  | NO   | PRI | 0
CHANGE_ID    | INTEGER(10)  | NO   | PRI | 0
PATCH_SET_ID | INTEGER(10)  | NO   | PRI | 0
FILE_NAME    | VARCHAR(255) | NO   | PRI | ''
(4 rows, 16 ms)

Blocking misbehaving bots / IPs

If necessary, misbehaving IP addresses or user agents can be blocked by editing modules/profile/templates/gerrit/apache.erb in the operations/puppet public git repository and merging the change.

example change

Throttling IPs

Since September 2024 (implemented in phab:T365259), there is another method of throttling abusive traffic, using nftables. See Firewall#Throttling_with_nftables and the profile::firewall::nftables_throttling keys in Hiera.

You can also observe data related to this on the Grafana dashboard for Gerrit.

Killing ssh connections

A user may reach the limit of 8 concurrent ssh connections and then report that they can no longer push to Gerrit over ssh.

A member of the Gerrit admins can run commands like these to kill connections for them:


ssh user@gerrit.wikimedia.org -p 29418 gerrit show-connections
ssh user@gerrit.wikimedia.org -p 29418 gerrit close-connection <connection ID>
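To see quickly which user is close to the limit, the show-connections output can be summarized per user. This is a sketch under an assumption about the output layout: each session line starts with a hexadecimal session ID and carries the user name in the fourth column. The function name count_connections is hypothetical, and sessions with an empty user column would be miscounted.

```shell
# count_connections: read `gerrit show-connections` output on stdin and
# print "COUNT USER" lines, highest count first. Header and separator
# lines are ignored because their first field is not a hex session ID.
count_connections() {
    awk '$1 ~ /^[0-9a-f]+$/ && NF >= 5 { n[$4]++ }
         END { for (u in n) print n[u], u }' | sort -rn
}
```

For example: ssh user@gerrit.wikimedia.org -p 29418 gerrit show-connections | count_connections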

Failover (in case primary host is unavailable)

  1. Disable puppet on the elected replacement
  2. Enable read only with the cookbook
    sudo cookbook sre.gerrit.read-only-toggle --host gerrit1003 --toggle on
    
  3. Run the local_backup cookbook (Gerrit/Operations#Warming_up_local_backup)
    sudo cookbook sre.gerrit.localbackup --source gerrit1003
    
  4. Switch role locally on the gerrit systemd unit with
    systemctl edit gerrit.service --full
    
    1. Remove --replica from ExecStart=
    2. Remove --enable-httpd from ExecStart=
    3. Save and quit
    4. Run systemctl daemon-reload
  5. Switch the CNAME discovery record in our DNS repo in templates/wmnet#1044 as described in this task to designate the newly promoted replacement. For this you'll need to follow DNS#Emergency Measures
  6. Restart gerrit with systemctl restart gerrit && journalctl -fln50 -u gerrit
  7. Confirm that you can now access the new primary and clone projects both via SSH and HTTPS
    git clone --depth=1 https://gerrit.wikimedia.org/r/mediawiki/extensions.git
    git clone --depth=1 "ssh://gerrit.wikimedia.org:29418/mediawiki/extensions"
    
  8. Disable read only mode:
    sudo cookbook sre.gerrit.read-only-toggle --host gerrit1003 --toggle off
    
  9. Backport the changes to puppet
  10. Resume replication
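The access check in step 7 can also be sketched as a small helper. It uses git ls-remote rather than a full clone, which is a lighter substitute and not what the step itself prescribes. The function name check_access and the GIT override variable are hypothetical; the URLs follow the pattern of the clone commands in step 7.

```shell
# check_access [PROJECT]: verify that PROJECT answers over both HTTPS and
# SSH. Prints OK or FAIL per URL and returns non-zero if any check fails.
# GIT may be set to another command (for a dry run); it defaults to git.
check_access() (
    project="${1:-mediawiki/extensions}"
    git_cmd="${GIT:-git}"
    rc=0
    for url in \
        "https://gerrit.wikimedia.org/r/${project}.git" \
        "ssh://gerrit.wikimedia.org:29418/${project}"; do
        if "$git_cmd" ls-remote --heads "$url" >/dev/null 2>&1; then
            echo "OK   $url"
        else
            echo "FAIL $url"
            rc=1
        fi
    done
    exit "$rc"
)
```

Running check_access with no argument checks mediawiki/extensions, the project used in step 7.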

Switch replicas over

In T406334 we switched replicas, editing DNS and Puppet as follows:


Switchover (planned maintenance)

If you're migrating on gerrit2003, please see https://phabricator.wikimedia.org/T338470#10506291

Ensure that the user running Gerrit also owns the synced data.

Schedule and Announce Downtime

Announce the scheduled downtime for Gerrit services.

Please also take note of that specific section of Gerrit's administration page: https://wikitech.wikimedia.org/wiki/Gerrit/Administration#Forcing_Replication_re-runs

Prepare patches

Those hiera keys have to be updated:

profile::gerrit::active_host
profile::gerrit::replica_hosts
profile::gerrit::lfs_sync_dest

(see example patchset)

You can submit those patches but do not merge them.

You also have to update the DNS configuration (see example patchset).

Automated switchover

Caution: Be aware that warming up the transfers will erase all data on the target instance.

Warming up transfer

sudo cookbook sre.gerrit.sync-instances --source gerrit1003 --replica gerrit2003 --chown --distrust

The --distrust and --chown arguments are given to bypass Gerrit internal replication; this will change in the future. With those arguments, the cookbook runs the rsync commands listed in the manual switchover procedure and then fixes ownership of the transferred files.

Warming up local backup

sudo cookbook sre.gerrit.localbackup --source gerrit1003

This runs a local backup so that a snapshot of our data exists on the source instance before anything else is done.

Switching Over

sudo cookbook sre.gerrit.switchover --switch-from-host gerrit1003 --switch-to-host gerrit2003 --distrust --chown

The --distrust and --chown arguments are also required here, because they are passed on to the other cookbooks used to perform the switchover. The cookbook will guide you through the next steps, asking you to merge the Puppet patch and then the DNS patch.

Manual switchover

Prepare emergency rollback commands for DNS

Prepare in advance the commands listed in "DNS#Update DNS if Gerrit is down", so that you can revert quickly if needed.

Stop Puppet across all gerrit instances

sudo disable-puppet 'gerrit maintenance'

Merge the puppet changes to prepare the switch over to dst-gerrit

Merge the Puppet changes to prepare for the desired state.

Begin Scheduled Downtime

Announce the start of the scheduled downtime on IRC #wikimedia-operations and on Slack #engineering-all.

Downtime Management

sudo cookbook sre.hosts.downtime -r 'maintenance' -D 30 src-gerrit.wikimedia.org && sudo cookbook sre.hosts.downtime -r 'maintenance' -H 1 dst-gerrit.wikimedia.org

Manually schedule downtime for checks connected to the virtual server "gerrit.wikimedia.org" on icinga.wikimedia.org.

Update DNS

Run sudo -i authdns-update on ns0.wikimedia.org, review the diff but do not commit yet.

Stop Gerrit on dst-gerrit

Execute the following commands on dst-gerrit:

sudo systemctl stop gerrit

Make source and destination read-only

Execute the following commands on both instances:

touch /etc/gerrit/gerrit.readonly

This disables write operations and offers some protection against data corruption.
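Before starting the data synchronization, it is worth confirming that the marker file is in place on both instances. This is a minimal sketch; the function names are hypothetical, and the default path is the marker file created in this step.

```shell
# is_read_only [MARKER]: succeed if the read-only marker file exists.
# MARKER defaults to the path used in this step.
is_read_only() {
    [ -e "${1:-/etc/gerrit/gerrit.readonly}" ]
}

# report_read_only [MARKER]: print a one-line status for the local host.
report_read_only() {
    if is_read_only "$1"; then
        echo "read-only marker present"
    else
        echo "read-only marker absent"
    fi
}
```

Running report_read_only on each instance shows whether the touch command above has been applied there.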

Data Synchronization from src-gerrit

rsync -avpPz --delete /var/lib/gerrit2/review_site/ rsync://dst-gerrit.wikimedia.org/gerrit-var-lib/
rsync -avpPz --delete --exclude='*.hprof' /srv/gerrit/ rsync://dst-gerrit.wikimedia.org/gerrit-data/

Stop Gerrit on src-gerrit

Execute the following commands on src-gerrit:

sudo systemctl stop gerrit

Repeat Data Synchronization on src-gerrit

Repeat the rsync commands from the "Data Synchronization from src-gerrit" step.

This ensures consistency across nodes.

Start Gerrit on dst-gerrit

Execute the following command on dst-gerrit:

sudo systemctl start gerrit

Finalize DNS Update

Confirm the DNS change and merge it.

Finalize Puppet Update

You can also speed up the subsequent deployment by running sudo run-puppet-agent on dst-gerrit.

Testing

Announce Downtime Conclusion

Announce that the downtime is over.

Post-Migration Tasks

  1. Check if replication is running: ssh gerrit.wikimedia.org -p 29418 gerrit show-queue --by-queue --wide
  2. Validate that https://gerrit-replica.wikimedia.org/ returns an HTTP 404 error on /
  3. Determine the grace period duration.
  4. Ensure src-gerrit has Puppet disabled and/or services are masked, if it needs to be decommissioned.
  5. If needed, decommission the old host as per T336427.

Runbooks

GerritHAProxyBackendUnavailable

The Gerrit service is not available to the tcp-proxy hosts. Most likely the active Gerrit server is down and the tcp-proxy hosts can no longer reach it. Another possible cause is maintenance on the active Gerrit server for which no downtime was set on the tcp-proxy* hosts.

If there is no planned maintenance, troubleshoot the active Gerrit server (see the steps above) and make sure the gerrit, apache2 and envoy services are running. Also verify network connectivity from the tcp-proxy hosts.

GerritHAProxyServiceUnavailable

The Gerrit tcp-proxy service is not available in at least one PoP and there is fewer than one available tcp-proxy per DC. Most likely the active Gerrit server is down and the tcp-proxy hosts can no longer reach it, so they are marked as DOWN. A PoP might also have been depooled without being properly downtimed.

If there is no planned maintenance, troubleshoot network connectivity from the tcp-proxy hosts to the active Gerrit server and make sure the gerrit, apache2 and envoy services are running.