GitLab CI/CD vs. Heat: How a Hardware Failure Sabotaged CI/CD Jobs
A ROCK 5 A serves as a node in my homelab Kubernetes (k8s) cluster. Among other things, it runs GitLab CI jobs with container builds (Buildah, Kaniko). After some debugging and extensive troubleshooting, an unusually high error rate in the CI/CD system turned out to be caused by a hardware fault. In several years of working with GitLab CI, this was one of the “craziest” bugs I have encountered.
Error description
The GitLab CI jobs run in a Kubernetes cluster via the GitLab Runner Kubernetes executor. For native arm64 builds, a ROCK 5 A is used as a node in the cluster.
From the beginning, my CI jobs experienced sporadic errors affecting TLS connections, layer caching, compression, decompression, and container pull/push. Because the setup was still in a test phase and I was continuously tuning settings in the CI jobs and the GitLab Runner, I assumed the sporadic errors were caused by misconfiguration or insufficient resources, such as too little memory.
As part of further stabilization measures, I therefore checked and adjusted the resource requests and limits in particular. There were always phases in which everything worked smoothly until various errors suddenly occurred again.
Error messages related to TLS/SSL or HTTPS connections also suggested that there might be a problem with the network or internet connection. However, an initial analysis of the traffic with Wireshark showed nothing unusual. Since most of the errors only occurred on the arm64 node, I suspected a problem related to the arm64 CPU architecture.
Only a parallel test with different build tools, such as Buildah and Skopeo, yielded a useful clue: both tools showed almost identical sporadic error patterns, so the cause was unlikely to lie in any single tool.
Various error patterns in detail
tls: bad record MAC
$ /syft scan ${TRACE+-vv} $DOCKER_SNAPSHOT_IMAGE $DOCKER_SBOM_OPTS -o cyclonedx-json=reports/docker-sbom-${basename}.cyclonedx.json
could not determine source: errors occurred attempting to resolve 'registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-feat-retry-build-amd64':
- docker: docker not available: failed to connect to Docker daemon. Ensure Docker is running and accessible
- podman: podman not available: no host address
- containerd: containerd not available: no grpc connection or services is available: unavailable
- oci-registry: unable to populate layer cache dir="/tmp/stereoscope-3483325947/oci-registry-image-1395683340/sha256:d7d53246cf32b5617c62ec6db76daadad242335800b3767fb8ab5372d0c24df7" : local error: tls: bad record MAC
- additionally, the following providers failed with file does not exist: docker-archive, oci-archive, oci-dir, singularity, oci-dir, local-file, local-directory
51.51 MiB / 51.51 MiB [---------------------------------------------------] 100.00% 1.92 MiB p/s 27s
2024/08/17 16:30:30 WARN '--vuln-type' is deprecated. Use '--pkg-types' instead.
2024-08-17T16:30:30Z INFO [vuln] Vulnerability scanning is enabled
2024-08-17T16:30:30Z INFO [secret] Secret scanning is enabled
2024-08-17T16:30:30Z INFO [secret] If your scanning is slow, please try '--scanners vuln' to disable secret scanning
2024-08-17T16:30:30Z INFO [secret] Please see also https://aquasecurity.github.io/trivy/v0.54/docs/scanner/secret#recommendation for faster secret detection
2024-08-17T16:31:53Z FATAL Fatal error image scan error: scan error: scan failed: failed analysis: analyze error: pipeline error: failed to analyze layer (sha256:4b6bc5885679abe54fc17878bb74d45875f6b7747e9ff89adf0620a631354146): walk error: failed to extract the archive: local error: tls: bad record MAC
error verifying sha256 checksum after reading x bytes; got “sha256:x”, want “sha256:y”
$ /syft scan ${TRACE+-vv} $DOCKER_SNAPSHOT_IMAGE $DOCKER_SBOM_OPTS -o cyclonedx-json=reports/docker-sbom-${basename}.cyclonedx.json
could not determine source: errors occurred attempting to resolve 'registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-main-amd64':
- docker: docker not available: failed to connect to Docker daemon. Ensure Docker is running and accessible
- podman: podman not available: no host address
- containerd: containerd not available: no grpc connection or services is available: unavailable
- oci-registry: unable to populate layer cache dir="/tmp/stereoscope-601115183/oci-registry-image-1244415430/sha256:cc9e89b59b07f31c85950499be0fb00281e767f543e3518240cadef521456b0d" : error verifying sha256 checksum after reading 48836061 bytes; got "sha256:e615f70755ea5f5e51525e97c94f7f8eb79df2f61427e5cb549676fae6e3a19e", want "sha256:8314597ff06fa934871480701d61ccf24999f5a025e70a7a196e5b1f9690cae4"
- additionally, the following providers failed with file does not exist: docker-archive, oci-archive, oci-dir, singularity, oci-dir, local-file, local-directory
error building image: error building stage: failed to get filesystem from image: error reading tar 3: archive/tar: invalid tar header
INFO[0018] Retrieving image manifest ghcr.io/containerd/nerdctl:v1.7.6@sha256:c7ed7c98f3da0d60d1b0b3e897460bd6fe306a7e4febe600533da035a3651f0e
INFO[0018] Returning cached image manifest
INFO[0018] Executing 0 build triggers
INFO[0018] Building stage 'ghcr.io/containerd/nerdctl:v1.7.6@sha256:c7ed7c98f3da0d60d1b0b3e897460bd6fe306a7e4febe600533da035a3651f0e' [idx: '2', base-idx: '-1']
error building image: error building stage: failed to get filesystem from image: error reading tar 3: archive/tar: invalid tar header
DIGEST_INVALID: provided digest did not match uploaded content; map[Digest:sha256:x Reason:map[]]
INFO[0446] Pushing image to registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-renovate-ghcr-io-terraform-linters-tflint-0-x-arm64
error pushing image: failed to push to destination registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-renovate-ghcr-io-terraform-linters-tflint-0-x-arm64: PUT https://registry.gitlab.com/v2/csautter/gitlab-ci-cd-templates/snapshot/blobs/uploads/604cd097-2ce3-4ae9-8744-30b39ce78849?_state=REDACTED&digest=sha256%3A2348448717b77b6e478a55d3d317a57eb4d6630f09bf36b038946749daa893a1: DIGEST_INVALID: provided digest did not match uploaded content; map[Digest:sha256:2348448717b77b6e478a55d3d317a57eb4d6630f09bf36b038946749daa893a1 Reason:map[]]
Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: copy error: error verifying sha256 checksum after reading 54016433 bytes; got “sha256:x”, want “sha256:y”
51.51 MiB / 51.51 MiB [---------------------------------------------------] 100.00% 1.99 MiB p/s 26s
2024-08-17T16:30:30Z FATAL Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: copy error: error verifying sha256 checksum after reading 54016433 bytes; got "sha256:48972fb93363b0612c2bc4f715a82b92ed8387bbfffeae77dbe7f557918130c3", want "sha256:f0bef38f54917f45165ae49dfcbbbeddf0d02bb9f698946390be766a0ee3f8cf"
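The error patterns above have one thing in common: data that should be identical at two points (a TLS record MAC, a layer checksum, a pushed digest) turns out to differ. One inexpensive way to test for this kind of silent corruption outside of CI is to hash the same file repeatedly and check that the digest never changes. The following is only a minimal sketch, assuming sha256sum from GNU coreutils is installed; the helper name count_distinct_digests is hypothetical and not part of any tool mentioned here.

```shell
#!/bin/sh
# Hash the same file several times and print how many distinct digests appear.
# Identical input must always give an identical digest, so the expected
# result is 1. Any other value means reading or hashing was not reproducible.
# Usage: count_distinct_digests FILE [RUNS]   (hypothetical helper)
count_distinct_digests() {
    file=$1
    runs=${2:-20}
    i=0
    while [ "$i" -lt "$runs" ]; do
        # sha256sum prints "<digest>  <file>"; keep only the digest column
        sha256sum "$file" | cut -d ' ' -f 1
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}
```

A call such as count_distinct_digests /path/to/some-large-file 50 can be run on the suspect node while it is under load. A result of 1 does not prove the hardware is healthy, since sporadic faults may simply not have occurred during the run, but any other result is a strong hint that the problem lies below the application layer.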
System test
To identify hardware problems, I tested the CPU with stress-ng in verify mode. Some cores showed clear errors.
Stress test the CPU with stress-ng in verify mode
Numerous calculation errors occurred, such as inaccurate calculations of constants or incorrect square roots.
$ sudo apt-get install stress-ng
$ stress-ng --cpu 0 --verify --verbose --timeout 5m
stress-ng: debug: [886181] stress-ng 0.13.12
stress-ng: debug: [886181] system: Linux rock-5a 5.10.110-37-rockchip #27a257394 SMP Thu May 23 02:38:59 UTC 2024 aarch64
stress-ng: debug: [886181] RAM total: 15.6G, RAM free: 4.1G, swap free: 7.8G
stress-ng: debug: [886181] 8 processors online, 8 processors configured
stress-ng: info: [886181] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [886181] dispatching hogs: 8 cpu
stress-ng: debug: [886181] cache allocate: shared cache buffer size: 3072K
stress-ng: debug: [886181] starting stressors
stress-ng: debug: [886182] stress-ng-cpu: started [886182] (instance 0)
stress-ng: debug: [886182] stress-ng-cpu using method 'all'
stress-ng: debug: [886183] stress-ng-cpu: started [886183] (instance 1)
stress-ng: debug: [886183] stress-ng-cpu using method 'all'
stress-ng: debug: [886184] stress-ng-cpu: started [886184] (instance 2)
stress-ng: debug: [886184] stress-ng-cpu using method 'all'
stress-ng: debug: [886185] stress-ng-cpu: started [886185] (instance 3)
stress-ng: debug: [886185] stress-ng-cpu using method 'all'
stress-ng: debug: [886186] stress-ng-cpu: started [886186] (instance 4)
stress-ng: debug: [886187] stress-ng-cpu: started [886187] (instance 5)
stress-ng: debug: [886187] stress-ng-cpu using method 'all'
stress-ng: debug: [886188] stress-ng-cpu: started [886188] (instance 6)
stress-ng: debug: [886188] stress-ng-cpu using method 'all'
stress-ng: debug: [886181] 8 stressors started
stress-ng: debug: [886186] stress-ng-cpu using method 'all'
stress-ng: debug: [886189] stress-ng-cpu: started [886189] (instance 7)
stress-ng: debug: [886189] stress-ng-cpu using method 'all'
stress-ng: fail: [886187] stress-ng-cpu: sqrtf error detected on sqrt(2308358344)
stress-ng: fail: [886187] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886186] stress-ng-cpu: Apéry's const not accurate enough
stress-ng: fail: [886186] stress-ng-cpu: sqrtf error detected on sqrt(3853404901)
stress-ng: fail: [886183] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886183] stress-ng-cpu: Apéry's const not accurate enough
stress-ng: fail: [886188] stress-ng-cpu: Stirling's approximation of factorial(59) out of range
stress-ng: fail: [886182] stress-ng-cpu: sqrtf error detected on sqrt(1186671297)
stress-ng: fail: [886185] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886182] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886183] stress-ng-cpu: sqrtf error detected on sqrt(2136759488)
stress-ng: fail: [886188] stress-ng-cpu: Newton-Raphson sqrt computation took more iterations than expected
stress-ng: fail: [886188] stress-ng-cpu: Newton-Raphson sqrt not accurate enough
stress-ng: fail: [886187] stress-ng-cpu: sqrtf error detected on sqrt(1951583473)
stress-ng: fail: [886187] stress-ng-cpu: Stirling's approximation of factorial(117) out of range
stress-ng: fail: [886185] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886182] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886184] stress-ng-cpu: prime error detected, number of primes has been miscalculated
stress-ng: fail: [886184] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886189] stress-ng-cpu: sqrtf error detected on sqrt(3283974744)
stress-ng: fail: [886183] stress-ng-cpu: sqrtf error detected on sqrt(2730736404)
stress-ng: fail: [886188] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail: [886189] stress-ng-cpu: sqrtf error detected on sqrt(4136192236)
stress-ng: fail: [886185] stress-ng-cpu: Stirling's approximation of factorial(100) out of range
stress-ng: fail: [886183] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
info: 5 failures reached, aborting stress process
stress-ng: debug: [886183] stress-ng-cpu: exited [886183] (instance 1)
stress-ng: debug: [886185] stress-ng-cpu: exited [886185] (instance 3)
stress-ng: debug: [886186] stress-ng-cpu: exited [886186] (instance 4)
stress-ng: debug: [886182] stress-ng-cpu: exited [886182] (instance 0)
stress-ng: debug: [886181] process [886182] terminated
stress-ng: error: [886181] process 886183 (stress-ng-cpu) terminated with an error, exit status=2 (stressor failed)
stress-ng: debug: [886181] process [886183] terminated
stress-ng: debug: [886187] stress-ng-cpu: exited [886187] (instance 5)
stress-ng: debug: [886188] stress-ng-cpu: exited [886188] (instance 6)
stress-ng: debug: [886184] stress-ng-cpu: exited [886184] (instance 2)
stress-ng: debug: [886181] process [886184] terminated
stress-ng: debug: [886181] process [886185] terminated
stress-ng: debug: [886181] process [886186] terminated
stress-ng: debug: [886181] process [886187] terminated
stress-ng: debug: [886181] process [886188] terminated
stress-ng: debug: [886189] stress-ng-cpu: exited [886189] (instance 7)
stress-ng: debug: [886181] process [886189] terminated
stress-ng: info: [886181] unsuccessful run completed in 300.10s (5 mins, 0.10 secs)
stress-ng: debug: [886181] metrics-check: all stressor metrics validated and sane
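With this much output, it helps to condense the fail lines into a count per worker process. The following is a minimal sketch that reads stress-ng output on standard input and relies only on the line format visible in the log above; the function name summarize_failures is hypothetical.

```shell
#!/bin/sh
# Count "fail:" lines per stress-ng worker PID.
# Input:  stress-ng output on stdin
# Output: one "<pid> <failure count>" line per worker, sorted by PID
summarize_failures() {
    awk '
        $1 == "stress-ng:" && $2 == "fail:" {
            pid = $3
            gsub(/\[|\]/, "", pid)   # strip the square brackets around the PID
            count[pid]++
        }
        END {
            for (pid in count) printf "%s %d\n", pid, count[pid]
        }
    ' | sort -n
}
```

It could be used as stress-ng --cpu 0 --verify --timeout 5m 2>&1 | summarize_failures. Note that a worker PID does not identify a physical core, because the scheduler may move workers between cores. To test one core at a time, a single worker can be pinned with taskset from util-linux, for example taskset -c 3 stress-ng --cpu 1 --verify --timeout 1m.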
As a workaround, I took the faulty cores offline. This temporarily prevented the errors in the CI jobs.
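cd /sys/devices/system/cpu
# list the per-core directories (cpu0, cpu1, ...)
ls -d cpu[0-9]*
# disable a specific core (writing to sysfs requires root)
echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online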
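The same sysfs interface can be wrapped in two small helpers, one to switch a core on or off and one to list the current state of all cores. This is a sketch under assumptions: the function names and the CPU_SYSFS_ROOT variable are hypothetical, the variable exists only so the helpers can be tried against a scratch directory, and the script must run as root to change real cores.

```shell
#!/bin/sh
# Location of the per-core control files; overridable for dry runs.
CPU_SYSFS_ROOT=${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}

# Take a core offline (0) or bring it back online (1).
# Usage: set_core_online CPU_INDEX 0|1
set_core_online() {
    cpu=$1
    state=$2
    case $state in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac
    ctl="$CPU_SYSFS_ROOT/cpu$cpu/online"
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (missing file or insufficient privileges)" >&2
        return 1
    fi
    echo "$state" > "$ctl"
}

# Print "cpuN <state>" for every core that exposes an online control file.
list_core_states() {
    for ctl in "$CPU_SYSFS_ROOT"/cpu[0-9]*/online; do
        [ -r "$ctl" ] || continue
        name=$(basename "$(dirname "$ctl")")
        printf '%s %s\n' "$name" "$(cat "$ctl")"
    done
}
```

Changes made through this interface do not survive a reboot, so the cores come back online after a restart unless the commands are run again.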
Later, errors occurred again despite the cores being deactivated.
Monitoring CPU temperature with lm-sensors
In the next step, I reactivated all cores and monitored the CPU temperature during the stress test using the lm-sensors package.
$ sudo apt-get install lm-sensors
$ sudo sensors-detect
$ sudo watch sensors
Every 2.0s: sensors rock-5a: Tue Aug 20 19:48:13 2024
npu_thermal-virtual-0
Adapter: Virtual device
temp1: +72.1°C
center_thermal-virtual-0
Adapter: Virtual device
temp1: +70.2°C
bigcore1_thermal-virtual-0
Adapter: Virtual device
temp1: +79.5°C
soc_thermal-virtual-0
Adapter: Virtual device
temp1: +75.8°C (crit = +115.0°C)
gpu_thermal-virtual-0
Adapter: Virtual device
temp1: +69.3°C
littlecore_thermal-virtual-0
Adapter: Virtual device
temp1: +76.7°C
bigcore0_thermal-virtual-0
Adapter: Virtual device
temp1: +79.5°C
The CPU temperature rose to over 80°C in some cases, which is theoretically still within the permissible range. However, I observed an increase in faulty calculations at temperatures of around 70-80°C.
The SoC (RK3588S) is specified to limit its maximum internal temperature to 80°C before throttling the clock speeds to maintain reliability within the allowed temperature range. If the ROCK 5A is intended to be used continuously in high performance applications, it may be necessary to use external cooling methods (for example, heat sink, fan, etc.) which will allow the SoC to continue running at maximum clock speed indefinitely below its predefined 80°C peak temperature limiter.
Source: https://docs.rs-online.com/a9f0/A700000010117420.pdf
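The temperatures can also be read without lm-sensors, directly from the kernel's thermal zones, which report values in millidegrees Celsius. The following is a minimal sketch that prints every zone and flags those at or above a threshold. The function name, the THERMAL_SYSFS_ROOT variable, and the default threshold of 70°C are assumptions; the threshold is taken from the range in which I saw the faulty calculations increase, not from any specification.

```shell
#!/bin/sh
# Location of the thermal zones; overridable for dry runs.
THERMAL_SYSFS_ROOT=${THERMAL_SYSFS_ROOT:-/sys/class/thermal}

# Print "<zone type> <temperature in °C> <OK|HOT>" for each thermal zone.
# Usage: report_thermal_zones [THRESHOLD_CELSIUS]
report_thermal_zones() {
    threshold=${1:-70}
    for zone in "$THERMAL_SYSFS_ROOT"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        zone_type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        milli=$(cat "$zone/temp")
        # The kernel reports millidegrees Celsius; convert and compare in awk
        awk -v t="$zone_type" -v m="$milli" -v lim="$threshold" 'BEGIN {
            c = m / 1000
            printf "%s %.1f %s\n", t, c, (c >= lim ? "HOT" : "OK")
        }'
    done
}
```

Running it in a loop next to the stress test, for example while true; do date; report_thermal_zones 70; sleep 2; done, produces a simple log that can be compared against the timestamps of failed calculations.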
A further stress test with an additional external fan reduced the SoC temperature to below 50°C and the errors disappeared completely.
Conclusion
Apparently, my ROCK 5 A operated within its specified temperature range, yet it still needed additional cooling to run without errors. The fault was particularly interesting because at first glance nothing pointed to a hardware problem, and the resulting error messages were very unspecific.
I reported the fault to the dealer, RS Components, and received a replacement for my defective ROCK 5 A without any hassle. With the new ROCK 5 A, the errors no longer occur.