
@rohitreddy1698 rohitreddy1698 commented Jun 18, 2025

Problem
TLS Certs for Kafka Not Initialised
The Milvus Operator currently has no provision to run its validation checks against an externally deployed Kafka that is SSL-enabled with certificates signed by a private CA.
When deployed, the Milvus CR fails, reporting that the MsgStream is NotReady with an "SSL certificate signed by unknown authority" error, and the operator does not proceed to deploy the Milvus components at all.

Credentials must be passed as plain strings
Milvus components (via librdkafka) require plain string fields for Kafka auth/TLS:
saslUsername, saslPassword, tlsCACert, tlsCert, tlsKey, tlsKeyPassword.
Today the Operator passes SecretRefs through in user.yaml, and defaults can inject
Kerberos options, causing librdkafka failures during validation.
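
For illustration, after resolution the kafka section that Milvus actually consumes holds literal values rather than SecretRefs. A hypothetical rendered fragment (the values and mount path below are made up, not from the PR):

```yaml
kafka:
  saslUsername: "milvus-client"        # hypothetical value, resolved from a Secret
  saslPassword: "s3cr3t"               # hypothetical value, resolved from a Secret
  ssl:
    enabled: true
    tlsCACert: "/secrets/kafka/ssl/ca-cert"  # in-pod path to the mounted CA cert
```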

What this PR does

  • Controller: resolves Kafka SecretRefs during reconcile into runtime-visible paths/strings.
  • Runtime: enhances scripts/run.sh to materialize SASL/TLS from mounted Secret volumes
    (or env fallbacks) into the final milvus.yaml consumed by Milvus.
  • Clears stray kafka.properties (e.g., accidental Kerberos defaults) to avoid validation errors.
  • Keeps Secrets in Kubernetes Secrets; does not copy secret values into ConfigMaps/CRs.

Why this approach

  • No Milvus core changes; all resolution happens in operator/runtime.
  • Compatible with current Milvus config contract.
  • Security: secrets remain in Secret volumes, only written in-pod to milvus.yaml.
  • Enables reconciliation of Kafka signed with Private CA

Backwards compatibility

  • If users already provide literal saslUsername/saslPassword or TLS file paths in the CR,
    those are respected (files win if Secret mount exists; mixed-mode supported).
  • If only caCertSecret is provided, only tlsCACert is set (no accidental mTLS enable).

How to use

spec:
  config:
    kafka:
      brokerList: ["broker:9094"]
      securityProtocol: SASL_SSL
      saslMechanisms: SCRAM-SHA-512
      saslUsernameSecret: { name: kafka-passwords, key: sasl-username }
      saslPasswordSecret: { name: kafka-passwords, key: sasl-password }
      ssl:
        enabled: true
        caCertSecret:  { name: kafka-ssl-secret, key: ca-cert }
        # (optional)
        certSecret:    { name: kafka-ssl-secret, key: client-cert }
        keySecret:     { name: kafka-ssl-secret, key: client-key }
        keyPasswordSecret: { name: kafka-passwords, key: tls-key-password }

Issue References :
milvus-io/milvus#27977

Even though there is provision to configure the Kafka certs in the Milvus CRD definition file, the configured certificates are not used: the TLS object is simply initialised to an empty TLS object. Specifically here :

This PR contains the changes to address this issue:

  • actually initialises the certs in kafka.go passed with the Milvus CRD.
  • plus the controller/runtime changes listed under "What this PR does" above.

@sre-ci-robot
Collaborator

Welcome @rohitreddy1698! It looks like this is your first PR to zilliztech/milvus-operator 🎉

@rohitreddy1698 rohitreddy1698 force-pushed the external-kafka-tls-private-ca branch from 9ffd2be to 8e0c4e1 Compare June 18, 2025 06:07
@haorenfsa
Collaborator

haorenfsa commented Jun 25, 2025

Thank you @rohitreddy1698. Sorry for the delay on my side. There are two solutions, as I see it:

  1. add an option to skip Kafka checks
  2. read the private CA from Secrets

The first one is simpler to me, and takes less time. We may also need the same option for other dependencies.

The 2nd is what you've been doing. But maybe the operator doesn't need to actually mount that secret, since it has permission to read Secrets directly via the Kubernetes API.

@rohitreddy1698
Author

Hello @haorenfsa,

Apologies for the delay in responding, and thank you for taking the time to review this and add your suggestions.

These were your suggestions:
  1. add an option to skip Kafka checks
  2. read the private CA from Secrets

On the first: I understand there should be an option to skip checks in staging, but in production it may be necessary to verify the health of dependencies. So I will raise a separate PR providing an option to skip checks for all the dependencies.

On the second: I am now taking this approach.

Please let me know if you are okay with this, or if you have any other suggestions.

Thank you,
Rohit

@rohitreddy1698 rohitreddy1698 force-pushed the external-kafka-tls-private-ca branch from 8e0c4e1 to ae8697a Compare August 25, 2025 08:31
@rohitreddy1698
Author

Hi @haorenfsa,

Apologies for the delay in sharing the updated code.
As suggested, I have made the changes: the code now reads the certs from Secrets, and additionally I have enabled reading the SASL credentials directly from Secrets as well (previously both the certs and the passwords were read from files).
I have also changed the run.sh script so it can materialize these secrets into plain strings, as librdkafka expects them.

Thanks and Regards,
Rohit Mothe

@rohitreddy1698
Author

/assign @haorenfsa

@rohitreddy1698
Author

Hi @haorenfsa,

Did you get a chance to review the new code?

Thank you!
Rohit Mothe

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rohitreddy1698
To complete the pull request process, please ask for approval from haorenfsa after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rohitreddy1698
Author

Hello @haorenfsa,

Just checking in on this.
If you've had a chance to review this PR and any changes are needed, please let me know.

Thank you!

@rohitreddy1698
Author

rohitreddy1698 commented Sep 23, 2025

Hello @haorenfsa, @LoveEachDay

Just checking in on this.
If you've had a chance to review this PR and any changes are needed, please let me know.

Thank you!

@haorenfsa
Collaborator

haorenfsa commented Oct 3, 2025

Hi @rohitreddy1698, one UT failed.

--- FAIL: TestGetKafkaDialer (0.00s)
    --- FAIL: TestGetKafkaDialer/securityProtocol=SASL_PLAINTEXT (0.00s)
        kafka_test.go:60: 
            	Error Trace:	/home/runner/work/milvus-operator/milvus-operator/pkg/external/kafka_test.go:60
            	Error:      	Received unexpected error:
            	            	SASL enabled ("") but username or password is empty (checked secrets and literals)
            	Test:       	TestGetKafkaDialer/securityProtocol=SASL_PLAINTEXT

And also the lint, seems due to using deprecated function of go lib

  Error: pkg/external/kafka.go:99:6: SA1019: x509.IsEncryptedPEMBlock has been deprecated since Go 1.16 because it shouldn't be used: Legacy PEM encryption as specified in RFC 1423 is insecure by design. Since it does not authenticate the ciphertext, it is vulnerable to padding oracle attacks that can let an attacker recover the plaintext. (staticcheck)
  		if x509.IsEncryptedPEMBlock(block) {
  		   ^
  Error: pkg/external/kafka.go:103:16: SA1019: x509.DecryptPEMBlock has been deprecated since Go 1.16 because it shouldn't be used: Legacy PEM encryption as specified in RFC 1423 is insecure by design. Since it does not authenticate the ciphertext, it is vulnerable to padding oracle attacks that can let an attacker recover the plaintext. (staticcheck)
  			der, err := x509.DecryptPEMBlock(block, password)
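
For reference, the deprecated calls can be replaced with a stdlib-only guard that simply rejects legacy RFC 1423 encryption (the commit later in this PR mentions moving PKCS#8 handling to youmark/pkcs8). This is a minimal sketch, not the PR's actual code; the helper name is hypothetical:

```go
package main

import (
	"encoding/pem"
	"errors"
	"fmt"
)

// rejectLegacyPEM decodes a PEM block and refuses RFC 1423-encrypted keys,
// replacing the deprecated x509.IsEncryptedPEMBlock / x509.DecryptPEMBlock.
// Legacy encryption advertises itself via a DEK-Info header on the block.
func rejectLegacyPEM(pemBytes []byte) (*pem.Block, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	if _, ok := block.Headers["DEK-Info"]; ok {
		return nil, errors.New("legacy RFC 1423 PEM encryption is insecure; re-encrypt the key as PKCS#8")
	}
	return block, nil
}

func main() {
	legacy := []byte(`-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,0123456789ABCDEF0123456789ABCDEF

AAAA
-----END RSA PRIVATE KEY-----`)
	_, err := rejectLegacyPEM(legacy)
	fmt.Println(err)
}
```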

@rohitreddy1698
Author

Hi @haorenfsa,

On the failed UT: the test expects passing no credentials for SASL_PLAINTEXT to proceed and only fail at runtime, but I have a hard check for empty credentials. I will fix the test and update the PR.

On the lint: I have updated my code to remove the deprecated x509.IsEncryptedPEMBlock / x509.DecryptPEMBlock calls.

Will be raising a new PR by Monday.

Thank You,
Rohit

@rohitreddy1698 rohitreddy1698 force-pushed the external-kafka-tls-private-ca branch from 9945f80 to 3fa3842 Compare October 23, 2025 10:46
@rohitreddy1698 rohitreddy1698 force-pushed the external-kafka-tls-private-ca branch from 3fa3842 to aff5da2 Compare October 23, 2025 15:10
@rohitreddy1698
Author

Hello @haorenfsa,

I have updated the code with the latest changes, and I have also made a few changes to the unit test cases for the Kafka check in the operator.
I have made a few optimisations to prevent leaking secret material into ConfigMaps; all sensitive values are read in-pod at runtime.

Can you please review and let me know whether everything is fine or something is still needed.

Regards,
Rohit Mothe

@rohitreddy1698
Author

Hi @haorenfsa, @LoveEachDay

Can you please help review the code and let me know if anything else is needed from my side for the PR to be merged.

Thank you!

@rohitreddy1698 rohitreddy1698 force-pushed the external-kafka-tls-private-ca branch 2 times, most recently from 4037d15 to ef82c8e Compare October 29, 2025 05:44
@rohitreddy1698 rohitreddy1698 force-pushed the external-kafka-tls-private-ca branch from ef82c8e to dd359c7 Compare October 29, 2025 05:45
Read SASL/TLS credentials via SecretRefs and mount them to Milvus component pods,
then materialize them into the runtime config (no secrets in ConfigMaps).

- controllers: inject SecretRef volumes into Deployments and add checksum
  annotations to trigger safe restarts on SecretRef changes
- scripts/run.sh: resolve saslUsername/saslPassword and TLS (CA/cert/key/password)
  from mounted files or env; merge into /milvus/configs/milvus.yaml
- external/kafka.go: support modern PKCS#8 (youmark/pkcs8), reject legacy PEM
  encryption, allow empty creds for PLAIN, enforce non-empty for SCRAM
- tests: add/adjust UTs for GetKafkaDialer and GetKafkaConfFromCR

Security: prevents leaking secret material into ConfigMaps; all sensitive values
are read in-pod at runtime.

Signed-off-by: rohit-mothe <[email protected]>
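
The checksum-annotation mechanism mentioned in the commit message above (restart Deployments when a referenced Secret changes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's actual code; the function name is hypothetical:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// secretChecksum returns a deterministic SHA-256 hex digest of a Secret's
// data. Placed in a pod-template annotation, any change to the Secret changes
// the annotation, which makes the Deployment roll its pods safely.
func secretChecksum(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for determinism
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator so ("ab","c") != ("a","bc")
		h.Write(data[k])
		h.Write([]byte{0})
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := secretChecksum(map[string][]byte{"sasl-username": []byte("u"), "sasl-password": []byte("p")})
	b := secretChecksum(map[string][]byte{"sasl-password": []byte("p"), "sasl-username": []byte("u")})
	fmt.Println(a == b, len(a) == 64)
}
```

The digest would typically be set under the pod template's annotations (e.g. a key like `checksum/kafka-secret`, name hypothetical) during reconcile.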
@rohitreddy1698 rohitreddy1698 force-pushed the external-kafka-tls-private-ca branch from dd359c7 to 9d20d86 Compare October 29, 2025 06:17
@rohitreddy1698
Author

Hi @haorenfsa, @LoveEachDay

Can you please help review the code and let me know if anything else is needed from my side for the PR to be merged.

Thank you!

@codecov

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 23.82671% with 211 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.37%. Comparing base (a2faab9) to head (af06de0).

Files with missing lines               Patch %   Lines
pkg/controllers/kafka_secrets.go       24.44%    94 Missing and 8 partials ⚠️
pkg/external/kafka.go                  25.19%    90 Missing and 5 partials ⚠️
pkg/controllers/milvus_controller.go    0.00%    14 Missing ⚠️

❌ Your project check has failed because the head coverage (73.37%) is below the target coverage (75.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #359      +/-   ##
==========================================
- Coverage   76.67%   73.37%   -3.30%     
==========================================
  Files          65       66       +1     
  Lines        5890     6129     +239     
==========================================
- Hits         4516     4497      -19     
- Misses       1143     1388     +245     
- Partials      231      244      +13     


Comment on lines +22 to +100

# Kafka SASL/TLS materialization from Secrets

# Default mount points injected by the controller; overridable for tests.
PW_DIR="${KAFKA_PW_DIR:-/secrets/kafka/passwords}"
SSL_DIR="${KAFKA_SSL_DIR:-/secrets/kafka/ssl}"
CONF_FILE="${MilvusConfigRootPath}/milvus.yaml"

# Helper to overlay small YAML fragments into milvus.yaml
write_overlay() {
  local tmp
  tmp="$(mktemp)"
  cat >"$tmp"
  /milvus/tools/merge -s "$tmp" -d "$CONF_FILE"
  rm -f "$tmp"
}

# --- SASL credentials ---
# Preferred: read from mounted files; fallback: env variables; else: leave whatever is in user.yaml.
if [ -f "${PW_DIR}/sasl-username" ] && [ -f "${PW_DIR}/sasl-password" ]; then
  SASL_USER="$(tr -d '\r\n' < "${PW_DIR}/sasl-username")"
  SASL_PASS="$(tr -d '\r\n' < "${PW_DIR}/sasl-password")"
  write_overlay <<EOF
kafka:
  saslUsername: "${SASL_USER}"
  saslPassword: "${SASL_PASS}"
EOF
elif [ -n "${SASL_USERNAME:-}" ] && [ -n "${SASL_PASSWORD:-}" ]; then
  write_overlay <<EOF
kafka:
  saslUsername: "${SASL_USERNAME}"
  saslPassword: "${SASL_PASSWORD}"
EOF
fi

# --- TLS material ---
TLS_ENABLE=false
CA_FILE=""
CERT_FILE=""
KEY_FILE=""
KEY_PW=""

# CA cert (required for SSL, optional for PLAINTEXT/SASL_PLAINTEXT)
if [ -f "${SSL_DIR}/ca-cert" ]; then
  CA_FILE="${SSL_DIR}/ca-cert"
  TLS_ENABLE=true
fi

# Support either tls.crt/tls.key (common TLS secret) or client-cert/client-key (some orgs)
if [ -f "${SSL_DIR}/tls.crt" ] && [ -f "${SSL_DIR}/tls.key" ]; then
  CERT_FILE="${SSL_DIR}/tls.crt"
  KEY_FILE="${SSL_DIR}/tls.key"
  TLS_ENABLE=true
elif [ -f "${SSL_DIR}/client-cert" ] && [ -f "${SSL_DIR}/client-key" ]; then
  CERT_FILE="${SSL_DIR}/client-cert"
  KEY_FILE="${SSL_DIR}/client-key"
  TLS_ENABLE=true
fi

# Optional private key password (PKCS#8 recommended)
if [ -f "${PW_DIR}/tls-key-password" ]; then
  KEY_PW="$(tr -d '\r\n' < "${PW_DIR}/tls-key-password")"
fi

# If any TLS material is present (or CA only), set kafka.ssl.* accordingly.
if [ "${TLS_ENABLE}" = "true" ]; then
  {
    echo "kafka:"
    echo "  ssl:"
    echo "    enabled: true"
    [ -n "${CA_FILE}" ] && echo "    tlsCACert: \"${CA_FILE}\""
    if [ -n "${CERT_FILE}" ] && [ -n "${KEY_FILE}" ]; then
      echo "    tlsCert: \"${CERT_FILE}\""
      echo "    tlsKey: \"${KEY_FILE}\""
      [ -n "${KEY_PW}" ] && echo "    tlsKeyPassword: \"${KEY_PW}\""
    fi
  } | write_overlay
fi

Collaborator

@haorenfsa haorenfsa Nov 14, 2025
we should render the cert & key file paths directly into the ConfigMap in the operator instead of doing the merge while initializing

Collaborator

@haorenfsa haorenfsa Nov 14, 2025
Ideally, a patch should be made to Milvus so it can read the secrets from files set in the config file, but this would take much more effort; we may do this later.

Collaborator
And this script should only run when Milvus is configured to use Kafka.

@haorenfsa
Collaborator

haorenfsa commented Nov 14, 2025

@rohitreddy1698 So 3 things here:

  • The cert & key paths should be rendered in the controller's code instead of in run.sh; the username & password part can be kept here.
  • The script should only run when Milvus uses Kafka.
  • It can be a temporary solution, but we should move the script into Milvus' code afterwards.

@rohitreddy1698
Author

Hello @haorenfsa,

I am almost done with the code changes you asked for; I will publish them in a day or two.

Regards,
Rohit Mothe
