I recently deployed Harbor for a customer. This is the version of Harbor that comes pre-packaged as a ‘Tile’, for use in Tanzu Kubernetes Grid Integrated Edition (TKGi), formerly known as Pivotal Container Service (PKS).

The tool that deploys Harbor, in this case, is Ops Manager, which gives you a nice interface where you can configure all the essential settings for Harbor.

One of the things you can set is the certificate to be used by Harbor.


In this case, we had the customer generate a certificate for Harbor for us. The bottom field is meant for the certificate of the Certificate Authority that issued the Harbor certificate.

I made a simple mistake here. Previously, the customer had generated certificates from their root CA. This time, however, they had set up an intermediate, issuing CA. I did not know this, and assumed the certificate chain was the same as before. So I pasted the wrong certificate into this field: the root CA, instead of the intermediate, issuing CA.

When I tried to deploy Harbor using Opsman, the deployment failed: “Error: ‘harbor-app/4d891315-d61e-4891-8512-486b7f93e5a2 (0)’ is not running after update. Review logs for failed jobs: harbor”

How it failed is interesting, and that is what this post is mostly about.

Opsman uses BOSH under the covers to create and manage the VMs, and their contents, for any product it deploys.

BOSH uses the Monit tool to monitor the health of its VMs, and Monit’s verdict is also used to determine whether a deployment completed successfully or not.

In fact, it is left to Monit to start and stop the various processes on a BOSH-managed VM. Monit contains a config for specific processes: how to start and stop them, and how to monitor their health. This can be as simple as watching a process ID, or it can involve custom scripts. In the case of Harbor, it’s a set of custom scripts, which I will detail below.

To troubleshoot this issue further, we had to dig a bit deeper into the logs. There are two ways to do this: you can download a log bundle using the Opsman UI,


Or you can SSH into the VM using the BOSH command-line tool, and view the logs live in the /var/vcap/sys/log directory.

Examining the log structure, there are some things to note.

First of all, because this is a BOSH deployment of Harbor, there are various folders that refer to BOSH-specific items.
Harbor itself runs as a set of Docker containers, so you will also find a split between logs coming from Docker (in this case, the output of Docker Compose) and the Harbor app components themselves.

If you look in the Harbor folder, you find the various logs that relate to the starting and stopping of Harbor, and then a further folder that contains the Harbor app-component logs (one per container).

Opsman told us that the Harbor app itself was not starting. And we know it actually uses Monit, driving a set of scripts, to start, stop, and monitor Harbor.

The monit log can be found here: /var/vcap/monit/monit.log

As I said, Harbor consists of a set of running Docker containers. If you wish to view these directly, you can actually use Docker on the VM.

SSH into the VM:

We need to run as root, and we need to make sure the Docker client can find the local Docker daemon:

sudo su -
alias docker='/var/vcap/packages/docker/bin/docker -H unix:///var/vcap/sys/run/docker/dockerd.sock'

Now we can simply run a ‘docker ps’ and see our containers.

The cool thing is that you can simply kill all these processes if you want, and Monit will restart them. That can be very useful when testing things.

Let’s have a look at the Monit configuration for Harbor:

monit -v status

Monit uses a specific script to start and stop Harbor: /var/vcap/jobs/docker/bin/ctl
If Monit hits its failure condition, it will use the same script to try and restart Harbor.
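For reference, the Monit entry that drives this looks roughly like the sketch below. This is an approximation pieced together from the paths in this post, not the verbatim config from the VM:

```
check process harbor
  with pidfile /var/vcap/sys/run/harbor/harbor.pid
  start program = "/var/vcap/jobs/docker/bin/ctl start"
  stop program = "/var/vcap/jobs/docker/bin/ctl stop"
  group vcap
```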

The output of the ctl script is saved to ctl.stdout.log.
In that file, we can see that the Harbor startup is timing out, at least according to the script.

[Mon Feb 15 15:30:43 UTC 2021] Harbor service is not ready. Waiting for 5 seconds then check again.
[Mon Feb 15 15:30:48 UTC 2021] Harbor service is not ready. Waiting for 5 seconds then check again.
[Mon Feb 15 15:30:53 UTC 2021] Error: Harbor Service failed to start in 180 seconds.

Now the odd thing here was that when I checked ‘docker ps’, all the containers were actually running. In fact, I could even reach the Harbor web UI without any problem.


So Harbor was actually working. Why then was the ctl script concluding that the startup had failed? What was it tripping over?

Harbor is actually coming up normally every time. It’s Monit that is confused.

Monit is set to check for the existence of the file ‘/var/vcap/sys/run/harbor/harbor.pid’.

However, for whatever reason, this file did not exist when I checked. So Monit keeps thinking Harbor failed to start, and tries to restart the whole container set.

This restart keeps failing, as all the containers are already running; the command ‘/var/vcap/jobs/loggr-system-metrics-agent/bin/ctl start’ doesn’t seem to actually do anything in this case.
So Monit ends up in a loop, and BOSH (via Monit) reports the VM as being in a ‘failing’ state. This is why the deployment ‘fails’, even though it actually didn’t.

So what is in the directory ‘/var/vcap/sys/run/harbor/’?

harbor.tmp.pid, not the harbor.pid that Monit is expecting.


Now there was another thing that caught my attention: the cron.log file was filling up with these mysterious Python errors:

curl -s --cacert /var/vcap/jobs/harbor/config/ca.crt https://<harbor FQDN>/api/v2.0/systeminfo
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/var/vcap/packages/python/python2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

This log output in cron.log was a bit confusing. We can see it doing a curl command using the CA cert, but then it spits out a bunch of Python errors. Are the two related? Where is this coming from?

To understand what is going on, I needed to dig into the scripts.

As it turns out, the ctl script actually uses a different script altogether to do a health check on Harbor.

Ctl contains a function called ‘waitForHarbor’.
This merely calls ‘/bin/status_check’ repeatedly, and waits up to 180 seconds for it to succeed.
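The polling logic in ‘waitForHarbor’ can be sketched roughly like this. This is a reconstruction based on the log messages, not the verbatim ctl code; the check command is a parameter here so the sketch is self-contained:

```shell
# Rough sketch of the waitForHarbor pattern: poll a check command every
# 5 seconds, give up after 180 seconds (timings taken from the log output).
# The real ctl invokes the status_check script instead of "$1".
waitForHarbor() {
  check_cmd="$1"
  elapsed=0
  while [ "$elapsed" -lt 180 ]; do
    if $check_cmd; then
      return 0
    fi
    echo "Harbor service is not ready. Waiting for 5 seconds then check again."
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "Error: Harbor Service failed to start in 180 seconds."
  return 1
}

# With a check that succeeds immediately, it returns right away:
waitForHarbor true && echo "Harbor is ready"
```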

And it is the output of this ‘/bin/status_check’ script that is being logged to cron.log.


The ctl script is also responsible for maintaining the harbor.pid file. And this file is the health indicator that Monit actually triggers on.
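A plausible explanation for the harbor.tmp.pid we found is that ctl writes a temporary pid file during startup and only renames it to harbor.pid once the health check passes; since the check never passes, the rename never happens. A minimal sketch of that pattern (my assumption, not the verbatim ctl code; using a temp dir instead of /var/vcap/sys/run/harbor):

```shell
# Write the pid to harbor.tmp.pid first; promote it to harbor.pid only
# after the health check succeeds (assumed pattern, not verbatim ctl code).
run_dir=$(mktemp -d)
echo "$$" > "${run_dir}/harbor.tmp.pid"

health_check=true   # stand-in for the real status_check call
if $health_check; then
  mv "${run_dir}/harbor.tmp.pid" "${run_dir}/harbor.pid"
fi

ls "${run_dir}"   # prints: harbor.pid
```

With health_check failing instead, only harbor.tmp.pid would remain, which is exactly what we found on the VM, and Monit would never see the pid file it is watching for.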

So that explains the behavior we are seeing. But why is it not passing ‘waitForHarbor’, a.k.a. the ‘/bin/status_check’ script?

When we look at the ‘/bin/status_check’ script, we see it contains a bunch of health checks.

The source of the file can be found here, if you want to see for yourself: https://github.com/vmware/harbor-boshrelease/blob/master/jobs/harbor/templates/bin/status_check.erb.sh

One section, the curl-and-Python one-liner shown below, immediately caught my eye.

You can actually run the entire script yourself, and then it becomes obvious where those Python errors were coming from.

So what is it doing here?

curl --cacert tells curl to validate the certificate of the URL you specify against the CA cert you provide.

If validation fails, curl normally prints a certificate error.

However, in the check script, curl is run with -s, for silent. In that case, it fails silently: curl won’t produce any output at all.

url=`${curl_command} ${protocol}://${harbor_url}/api/v2.0/systeminfo | python -c "import sys, json; print json.load(sys.stdin)['registry_url']"`

But the script still pipes curl’s output to Python, to do some kind of JSON breakdown of it.

If curl doesn’t fail, and the CA cert validates against the URL, then the Python JSON filter will simply return the URL again.
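Both cases are easy to reproduce locally (using python3 here; the VM bundles Python 2.7, but the failure mode is the same; harbor.example.com is a made-up value):

```shell
# Valid JSON, as the systeminfo endpoint would return it:
# the filter simply echoes the registry_url field back.
echo '{"registry_url": "https://harbor.example.com"}' | \
  python3 -c "import sys, json; print(json.load(sys.stdin)['registry_url'])"
# prints: https://harbor.example.com

# Empty input, which is what a silently failing curl -s produces:
# json.load raises a ValueError, the traceback we saw in cron.log.
printf '' | \
  python3 -c "import sys, json; print(json.load(sys.stdin)['registry_url'])" \
  || echo "JSON parse failed"
```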

And this is where it fails. This section of the script contains no failure handling for the case where the CA cert you set in the config doesn’t actually validate against the cert used by Harbor itself. And that was the case for me: I had set the wrong CA cert (the root CA, instead of the intermediate, issuing CA).
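A simple guard in the script would have made the failure obvious. Below is a sketch of what such handling could look like; fetch_systeminfo is a hypothetical stand-in for the real curl call, so the sketch is self-contained:

```shell
# Sketch: fail loudly on an empty curl response instead of feeding empty
# input to the JSON parser. fetch_systeminfo stands in for
# ${curl_command} ${protocol}://${harbor_url}/api/v2.0/systeminfo
fetch_systeminfo() {
  printf '%s' "${SYSTEMINFO_RESPONSE}"
}

check_registry_url() {
  response=$(fetch_systeminfo)
  if [ -z "$response" ]; then
    echo "status_check: empty response from the Harbor API;" \
         "the CA cert probably failed to validate against the Harbor cert" >&2
    return 1
  fi
  printf '%s' "$response" | \
    python3 -c "import sys, json; print(json.load(sys.stdin)['registry_url'])"
}

# Simulate the TLS failure: an empty response now yields a clear error.
SYSTEMINFO_RESPONSE=""
check_registry_url || echo "check failed, with a readable reason in the log"
```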

So this was the root cause of Monit failing the VM: it was not getting past this part of the status_check script. But it is not really obvious from the logs, not even from cron.log, what exactly is going wrong!

The irony here is that Harbor was actually working fine. In fact, I have not been able to find any indication that Harbor itself requires the CA cert at all! It’s only the check script that requires it, and that seems to be the only reason you have to provide the CA cert in the Opsman Tile config!

Robert Kloosterhuis, Virtualization Consultant
