
GitLab CI/CD vs. Heat: How a Hardware Failure Sabotaged CI/CD Jobs

GitLab CI/CD nodes overheat

A ROCK 5 A serves as a node in my homelab Kubernetes (k8s) cluster. Among other things, it runs GitLab CI jobs with container builds (Buildah, Kaniko). After some debugging and extensive troubleshooting, an unusually high error rate in the CI/CD system turned out to be a hardware fault. After several years of experience with GitLab CI, this was one of the “craziest” bugs I have encountered.

Error description

The GitLab CI jobs run in a Kubernetes cluster via the GitLab Runner Kubernetes executor. For native arm64 builds, a ROCK 5 A serves as a node in the cluster.

From the beginning, my CI jobs experienced sporadic errors around TLS connections, layer caching, compression, decompression, and container pull/push. Because the setup was still experimental and I was continuously tuning various settings in the CI jobs and the GitLab Runner, I initially assumed the sporadic errors were caused by misconfiguration or insufficient resources, such as too little memory.

As part of further stabilization efforts, I therefore reviewed and adjusted the resource requests and limits in particular. There were repeated phases in which everything ran smoothly, until various errors suddenly reappeared.
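For context, the knobs I tuned live in the Runner's config.toml. A minimal sketch of the relevant section for the GitLab Runner Kubernetes executor; the concrete values here are illustrative, not the ones from my setup:

```toml
# config.toml - GitLab Runner, Kubernetes executor (illustrative values)
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # resource requests/limits applied to the build container
    cpu_request    = "500m"
    cpu_limit      = "4"
    memory_request = "1Gi"
    memory_limit   = "4Gi"
    # schedule jobs onto the arm64 node for native arm64 builds
    [runners.kubernetes.node_selector]
      "kubernetes.io/arch" = "arm64"
```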

Error messages related to TLS/SSL or HTTPS connections also suggested that there might be a problem with the network or internet connection. However, an initial analysis of the traffic with Wireshark showed nothing unusual. Since most of the errors only occurred on the arm64 node, I suspected a problem related to the arm64 CPU architecture.

Only a parallel test with different build tools such as Buildah and Skopeo produced a meaningful clue: both tools showed almost identical sporadic error patterns.

Various error patterns in detail

tls: bad record MAC

tls: bad record MAC
$ /syft scan ${TRACE+-vv} $DOCKER_SNAPSHOT_IMAGE $DOCKER_SBOM_OPTS -o cyclonedx-json=reports/docker-sbom-${basename}.cyclonedx.json
could not determine source: errors occurred attempting to resolve 'registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-feat-retry-build-amd64':
  - docker: docker not available: failed to connect to Docker daemon. Ensure Docker is running and accessible
  - podman: podman not available: no host address
  - containerd: containerd not available: no grpc connection or services is available: unavailable
  - oci-registry: unable to populate layer cache dir="/tmp/stereoscope-3483325947/oci-registry-image-1395683340/sha256:d7d53246cf32b5617c62ec6db76daadad242335800b3767fb8ab5372d0c24df7" : local error: tls: bad record MAC
  - additionally, the following providers failed with file does not exist: docker-archive, oci-archive, oci-dir, singularity, oci-dir, local-file, local-directory
tls: bad record MAC
51.51 MiB / 51.51 MiB [---------------------------------------------------] 100.00% 1.92 MiB p/s 27s
2024/08/17 16:30:30 WARN '--vuln-type' is deprecated. Use '--pkg-types' instead.
2024-08-17T16:30:30Z	INFO	[vuln] Vulnerability scanning is enabled
2024-08-17T16:30:30Z	INFO	[secret] Secret scanning is enabled
2024-08-17T16:30:30Z	INFO	[secret] If your scanning is slow, please try '--scanners vuln' to disable secret scanning
2024-08-17T16:30:30Z	INFO	[secret] Please see also https://aquasecurity.github.io/trivy/v0.54/docs/scanner/secret#recommendation for faster secret detection
2024-08-17T16:31:53Z	FATAL	Fatal error	image scan error: scan error: scan failed: failed analysis: analyze error: pipeline error: failed to analyze layer (sha256:4b6bc5885679abe54fc17878bb74d45875f6b7747e9ff89adf0620a631354146): walk error: failed to extract the archive: local error: tls: bad record MAC

error verifying sha256 checksum after reading x bytes; got “sha256:x”, want “sha256:y”

error verifying sha256 checksum after reading x bytes; got "sha256:x", want "sha256:y"
$ /syft scan ${TRACE+-vv} $DOCKER_SNAPSHOT_IMAGE $DOCKER_SBOM_OPTS -o cyclonedx-json=reports/docker-sbom-${basename}.cyclonedx.json
could not determine source: errors occurred attempting to resolve 'registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-main-amd64':
  - docker: docker not available: failed to connect to Docker daemon. Ensure Docker is running and accessible
  - podman: podman not available: no host address
  - containerd: containerd not available: no grpc connection or services is available: unavailable
  - oci-registry: unable to populate layer cache dir="/tmp/stereoscope-601115183/oci-registry-image-1244415430/sha256:cc9e89b59b07f31c85950499be0fb00281e767f543e3518240cadef521456b0d" : error verifying sha256 checksum after reading 48836061 bytes; got "sha256:e615f70755ea5f5e51525e97c94f7f8eb79df2f61427e5cb549676fae6e3a19e", want "sha256:8314597ff06fa934871480701d61ccf24999f5a025e70a7a196e5b1f9690cae4"
  - additionally, the following providers failed with file does not exist: docker-archive, oci-archive, oci-dir, singularity, oci-dir, local-file, local-directory

error building image: error building stage: failed to get filesystem from image: error reading tar 3: archive/tar: invalid tar header

error building image: error building stage: failed to get filesystem from image: error reading tar 3: archive/tar: invalid tar header
INFO[0018] Retrieving image manifest ghcr.io/containerd/nerdctl:v1.7.6@sha256:c7ed7c98f3da0d60d1b0b3e897460bd6fe306a7e4febe600533da035a3651f0e 
INFO[0018] Returning cached image manifest              
INFO[0018] Executing 0 build triggers                   
INFO[0018] Building stage 'ghcr.io/containerd/nerdctl:v1.7.6@sha256:c7ed7c98f3da0d60d1b0b3e897460bd6fe306a7e4febe600533da035a3651f0e' [idx: '2', base-idx: '-1'] 
error building image: error building stage: failed to get filesystem from image: error reading tar 3: archive/tar: invalid tar header

DIGEST_INVALID: provided digest did not match uploaded content; map[Digest:sha256:x Reason:map[]]

DIGEST_INVALID: provided digest did not match uploaded content; map[Digest:sha256:x Reason:map[]]
INFO[0446] Pushing image to registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-renovate-ghcr-io-terraform-linters-tflint-0-x-arm64 
error pushing image: failed to push to destination registry.gitlab.com/csautter/gitlab-ci-cd-templates/snapshot:3tcd-renovate-ghcr-io-terraform-linters-tflint-0-x-arm64: PUT https://registry.gitlab.com/v2/csautter/gitlab-ci-cd-templates/snapshot/blobs/uploads/604cd097-2ce3-4ae9-8744-30b39ce78849?_state=REDACTED&digest=sha256%3A2348448717b77b6e478a55d3d317a57eb4d6630f09bf36b038946749daa893a1: DIGEST_INVALID: provided digest did not match uploaded content; map[Digest:sha256:2348448717b77b6e478a55d3d317a57eb4d6630f09bf36b038946749daa893a1 Reason:map[]]

Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: copy error: error verifying sha256 checksum after reading 54016433 bytes; got “sha256:x”, want “sha256:y”

Fatal error init error: DB error: failed to download vulnerability DB: database download error: oci download error: copy error: error verifying sha256 checksum after reading 54016433 bytes; got "sha256:x", want "sha256:y"
51.51 MiB / 51.51 MiB [---------------------------------------------------] 100.00% 1.99 MiB p/s 26s
2024-08-17T16:30:30Z	FATAL	Fatal error	init error: DB error: failed to download vulnerability DB: database download error: oci download error: copy error: error verifying sha256 checksum after reading 54016433 bytes; got "sha256:48972fb93363b0612c2bc4f715a82b92ed8387bbfffeae77dbe7f557918130c3", want "sha256:f0bef38f54917f45165ae49dfcbbbeddf0d02bb9f698946390be766a0ee3f8cf"

System test

To identify hardware problems, I tested the CPU with stress-ng in verify mode. Some cores showed clear errors.

Stress test the CPU with stress-ng in verify mode

Numerous calculation errors occurred, such as inaccurate calculations of constants or incorrect square roots.

test with stress-ng
$ sudo apt-get install stress-ng
$ stress-ng --cpu 0 --verify --verbose --timeout 5m
stress-ng: debug: [886181] stress-ng 0.13.12
stress-ng: debug: [886181] system: Linux rock-5a 5.10.110-37-rockchip #27a257394 SMP Thu May 23 02:38:59 UTC 2024 aarch64
stress-ng: debug: [886181] RAM total: 15.6G, RAM free: 4.1G, swap free: 7.8G
stress-ng: debug: [886181] 8 processors online, 8 processors configured
stress-ng: info:  [886181] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info:  [886181] dispatching hogs: 8 cpu
stress-ng: debug: [886181] cache allocate: shared cache buffer size: 3072K
stress-ng: debug: [886181] starting stressors
stress-ng: debug: [886182] stress-ng-cpu: started [886182] (instance 0)
stress-ng: debug: [886182] stress-ng-cpu using method 'all'
stress-ng: debug: [886183] stress-ng-cpu: started [886183] (instance 1)
stress-ng: debug: [886183] stress-ng-cpu using method 'all'
stress-ng: debug: [886184] stress-ng-cpu: started [886184] (instance 2)
stress-ng: debug: [886184] stress-ng-cpu using method 'all'
stress-ng: debug: [886185] stress-ng-cpu: started [886185] (instance 3)
stress-ng: debug: [886185] stress-ng-cpu using method 'all'
stress-ng: debug: [886186] stress-ng-cpu: started [886186] (instance 4)
stress-ng: debug: [886187] stress-ng-cpu: started [886187] (instance 5)
stress-ng: debug: [886187] stress-ng-cpu using method 'all'
stress-ng: debug: [886188] stress-ng-cpu: started [886188] (instance 6)
stress-ng: debug: [886188] stress-ng-cpu using method 'all'
stress-ng: debug: [886181] 8 stressors started
stress-ng: debug: [886186] stress-ng-cpu using method 'all'
stress-ng: debug: [886189] stress-ng-cpu: started [886189] (instance 7)
stress-ng: debug: [886189] stress-ng-cpu using method 'all'
stress-ng: fail:  [886187] stress-ng-cpu: sqrtf error detected on sqrt(2308358344)
stress-ng: fail:  [886187] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886186] stress-ng-cpu: Apéry's const not accurate enough
stress-ng: fail:  [886186] stress-ng-cpu: sqrtf error detected on sqrt(3853404901)
stress-ng: fail:  [886183] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886183] stress-ng-cpu: Apéry's const not accurate enough
stress-ng: fail:  [886188] stress-ng-cpu: Stirling's approximation of factorial(59) out of range
stress-ng: fail:  [886182] stress-ng-cpu: sqrtf error detected on sqrt(1186671297)
stress-ng: fail:  [886185] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886182] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886183] stress-ng-cpu: sqrtf error detected on sqrt(2136759488)
stress-ng: fail:  [886188] stress-ng-cpu: Newton-Raphson sqrt computation took more iterations than expected
stress-ng: fail:  [886188] stress-ng-cpu: Newton-Raphson sqrt not accurate enough
stress-ng: fail:  [886187] stress-ng-cpu: sqrtf error detected on sqrt(1951583473)
stress-ng: fail:  [886187] stress-ng-cpu: Stirling's approximation of factorial(117) out of range
stress-ng: fail:  [886185] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886182] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886184] stress-ng-cpu: prime error detected, number of primes has been miscalculated
stress-ng: fail:  [886184] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886189] stress-ng-cpu: sqrtf error detected on sqrt(3283974744)
stress-ng: fail:  [886183] stress-ng-cpu: sqrtf error detected on sqrt(2730736404)
stress-ng: fail:  [886188] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
stress-ng: fail:  [886189] stress-ng-cpu: sqrtf error detected on sqrt(4136192236)
stress-ng: fail:  [886185] stress-ng-cpu: Stirling's approximation of factorial(100) out of range
stress-ng: fail:  [886183] stress-ng-cpu: calculation of Euler-Mascheroni constant not as accurate as expected
info: 5 failures reached, aborting stress process
stress-ng: debug: [886183] stress-ng-cpu: exited [886183] (instance 1)
stress-ng: debug: [886185] stress-ng-cpu: exited [886185] (instance 3)
stress-ng: debug: [886186] stress-ng-cpu: exited [886186] (instance 4)
stress-ng: debug: [886182] stress-ng-cpu: exited [886182] (instance 0)
stress-ng: debug: [886181] process [886182] terminated
stress-ng: error: [886181] process 886183 (stress-ng-cpu) terminated with an error, exit status=2 (stressor failed)
stress-ng: debug: [886181] process [886183] terminated
stress-ng: debug: [886187] stress-ng-cpu: exited [886187] (instance 5)
stress-ng: debug: [886188] stress-ng-cpu: exited [886188] (instance 6)
stress-ng: debug: [886184] stress-ng-cpu: exited [886184] (instance 2)
stress-ng: debug: [886181] process [886184] terminated
stress-ng: debug: [886181] process [886185] terminated
stress-ng: debug: [886181] process [886186] terminated
stress-ng: debug: [886181] process [886187] terminated
stress-ng: debug: [886181] process [886188] terminated
stress-ng: debug: [886189] stress-ng-cpu: exited [886189] (instance 7)
stress-ng: debug: [886181] process [886189] terminated
stress-ng: info:  [886181] unsuccessful run completed in 300.10s (5 mins, 0.10 secs)
stress-ng: debug: [886181] metrics-check: all stressor metrics validated and sane

As a workaround, I temporarily deactivated the faulty cores, which prevented the errors in the CI jobs for a while.

disable defective CPU core
cd /sys/devices/system/cpu
ls cpu*
cat online   # ranges of cores currently online
# disable a specific core (requires root)
echo 0 > /sys/devices/system/cpu/cpu3/online
# re-enable it later with: echo 1 > /sys/devices/system/cpu/cpu3/online

Later, errors occurred again despite the cores being deactivated.

Monitoring CPU temperature with lm-sensors

In the next step, I reactivated all cores and monitored the CPU temperature during the stress test using the lm-sensors package.

watch temperatures with lm-sensors
$ sudo apt-get install lm-sensors
$ sudo sensors-detect
$ sudo watch sensors

Every 2.0s: sensors                                                                    rock-5a: Tue Aug 20 19:48:13 2024

npu_thermal-virtual-0
Adapter: Virtual device
temp1:        +72.1°C

center_thermal-virtual-0
Adapter: Virtual device
temp1:        +70.2°C

bigcore1_thermal-virtual-0
Adapter: Virtual device
temp1:        +79.5°C

soc_thermal-virtual-0
Adapter: Virtual device
temp1:        +75.8°C  (crit = +115.0°C)

gpu_thermal-virtual-0
Adapter: Virtual device
temp1:        +69.3°C

littlecore_thermal-virtual-0
Adapter: Virtual device
temp1:        +76.7°C

bigcore0_thermal-virtual-0
Adapter: Virtual device
temp1:        +79.5°C

The CPU temperature rose to over 80°C in some cases, which is theoretically still within the permissible range. However, I observed an increase in faulty calculations at temperatures of around 70-80°C.
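The sensors output can also be reproduced directly from the thermal sysfs interface, which is handy for logging during a stress run. A minimal sketch, assuming the standard thermal_zone layout (the kernel reports temperatures in millidegrees Celsius):

```shell
#!/bin/sh
# Convert a kernel millidegree reading to degrees Celsius.
mc_to_celsius() {
  awk '{ printf "%.1f", $1 / 1000 }'
}

# Print every thermal zone with its current temperature;
# zones that cannot be read are skipped.
for zone in /sys/class/thermal/thermal_zone*; do
  [ -r "$zone/temp" ] || continue
  printf '%s: %s°C\n' "$(cat "$zone/type")" "$(mc_to_celsius < "$zone/temp")"
done
```

Wrapped in `watch`, this gives the same continuous view as `watch sensors` above.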

The SoC (RK3588S) is specified to limit its maximum internal temperature to 80°C before throttling the clock speeds to maintain reliability within the allowed temperature range. If the ROCK 5A is intended to be used continuously in high performance applications, it may be necessary to use external cooling methods (for example, heat sink, fan, etc.) which will allow the SoC to continue running at maximum clock speed indefinitely below its predefined 80°C peak temperature limiter.

Source: https://docs.rs-online.com/a9f0/A700000010117420.pdf

A further stress test with an additional external fan reduced the SoC temperature to below 50°C, and the errors disappeared completely.

Conclusion

My ROCK 5 A apparently operates within its specified temperature range, yet additional cooling is required for error-free operation. The bug was particularly interesting because, at first glance, nothing pointed to a hardware problem, and the resulting error messages were very unspecific.

I reported the fault to the dealer RS Components and received a replacement for my defective ROCK 5 A without any problems. With the new ROCK 5 A, I have had no further issues.

