Lessons Learned from Deploying Confluent Kafka on AKS

Deploying Confluent Kafka on Azure Kubernetes Service (AKS) offers powerful capabilities for event streaming, but comes with unique challenges. In this blog post, I’ll share practical insights from a recent deployment, covering common issues and their solutions.
Container Registry Configuration
One of the first hurdles was configuring our Azure Container Registry (ACR) correctly.
Importing Confluent Images
When working with Confluent images, you’ll need to import them into your ACR:
set ACR_NAME=your-acr-name
# Import key Confluent components with specific versions
az acr import --name %ACR_NAME% --source docker.io/confluentinc/cp-kafka --image cp-kafka:7.7.2
az acr import --name %ACR_NAME% --source docker.io/confluentinc/cp-server --image cp-server:7.7.2
az acr import --name %ACR_NAME% --source docker.io/confluentinc/cp-server-connect --image cp-server-connect:7.7.2
az acr import --name %ACR_NAME% --source docker.io/confluentinc/cp-ksqldb-server --image cp-ksqldb-server:7.7.2
az acr import --name %ACR_NAME% --source docker.io/confluentinc/cp-schema-registry --image cp-schema-registry:7.7.2
az acr import --name %ACR_NAME% --source docker.io/confluentinc/cp-kafka-rest --image cp-kafka-rest:7.7.2
az acr import --name %ACR_NAME% --source docker.io/confluentinc/confluent-operator --image confluent-operator:0.1145.6
az acr import --name %ACR_NAME% --source docker.io/confluentinc/confluent-init-container --image confluent-init-container:2.9.4
To avoid Docker Hub rate limits, authenticate first:
set DOCKER_USER=your-username
set DOCKER_PAT=your-pat
az acr import --name %ACR_NAME% --source docker.io/confluentinc/cp-kafka --image cp-kafka:7.7.2 --username %DOCKER_USER% --password %DOCKER_PAT%
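After the imports complete, it's worth confirming that the repositories and tags actually landed in the registry. A quick check, reusing the same %ACR_NAME% variable:
az acr repository list --name %ACR_NAME% --output table
az acr repository show-tags --name %ACR_NAME% --repository cp-kafka --output table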
Confluent for Kubernetes Deployment
Helm Installation
Deploying Confluent for Kubernetes (CFK) requires special attention to image registry and versioning:
helm upgrade --install confluent-operator .\confluent-for-kubernetes -n kafkapreprod `
  --set image.registry=your-acr.azurecr.io `
  --set global.registry.fqdn=your-acr.azurecr.io `
  --set image.repository=confluent-operator `
  --set global.initContainer.repository=confluent-init-container `
  --set image.tag="0.1145.6" `
  --set kafka.image.tag="7.7.2" `
  --set connect.image.tag="7.7.2" `
  --set controlcenter.image.tag="7.7.2" `
  --set ksqldb.image.tag="7.7.2" `
  --set schemaregistry.image.tag="7.7.2" `
  --set licenseKey="your-license-key-here"
When using a local ACR, you need to remove the confluentinc/ repository prefix and specify each component's image tag explicitly.
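Before moving on to the Kafka custom resources, a quick sanity check that the operator came up and the CRDs were registered saves troubleshooting time later:
helm list -n kafkapreprod
kubectl get pods -n kafkapreprod
kubectl get crds | findstr platform.confluent.io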
Persistent Volume Management
AKS persistent volume management can be tricky with Kafka:
Stuck Persistent Volumes
If you ever need to force delete stuck PVs:
# First change the Retain policy to Delete for the affected PVs
$retainPVs = @(
    "pvc-cb30adbe-ab1b-4deb-bdd7-61996cf00244",
    "pvc-ccf4f715-c799-4a69-bcd7-29560b9de23b"
)
foreach ($pv in $retainPVs) {
    kubectl patch pv $pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
}
# Then remove the finalizers and force delete each PV
foreach ($pv in $retainPVs) {
    kubectl patch pv $pv -p '{"metadata":{"finalizers":null}}' --type=merge
    kubectl delete pv $pv --force --grace-period=0
}
Remember that helm delete intentionally doesn't delete PVs, to protect your data.
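To see which PVs back your Kafka claims and what reclaim policy each one currently has, a quick inspection with kubectl custom columns helps before touching anything:
kubectl get pv -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name,STATUS:.status.phase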
SSL/TLS Configuration
Configuring Kafka with SSL requires careful certificate handling:
P12 Certificate Conversion
When using P12 certificates:
# Convert P12 to PEM
openssl pkcs12 -in certificate.p12 -out certificate.pem -nokeys
If the P12 file was created with legacy encryption algorithms (typical of keystores exported from Windows tooling), newer OpenSSL 3.x builds need the -legacy flag:
openssl pkcs12 -legacy -in truststore.p12 -out truststore.pem -nokeys
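In practice you also need the private key and the CA chain, packaged into the Kubernetes secret that the Kafka custom resource references. A minimal sketch, assuming hypothetical file and secret names and CFK's usual fullchain.pem / privkey.pem / cacerts.pem key layout:
# Extract the unencrypted private key and the CA chain from the same P12
openssl pkcs12 -in certificate.p12 -out private-key.pem -nocerts -nodes
openssl pkcs12 -in certificate.p12 -out ca-chain.pem -cacerts -nokeys
# Bundle the PEM files into a secret for the Kafka CR (names are illustrative)
kubectl create secret generic tls-kafka -n kafkapreprod --from-file=fullchain.pem=certificate.pem --from-file=privkey.pem=private-key.pem --from-file=cacerts.pem=ca-chain.pem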
Role-Based Access Control
RBAC is critical for secure Kafka operations:
Schema Registry Permissions
When Schema Registry logs authorization errors on its internal topics, grant its principal ownership of those topics with a ConfluentRoleBinding:
apiVersion: platform.confluent.io/v1beta1
kind: ConfluentRoleBinding
metadata:
  name: sr-schemas-owner
  namespace: kafkapreprod
spec:
  principal:
    name: "User:sr"
    type: User
  role: ResourceOwner
  resourcePatterns:
    - name: "_schemas"
      patternType: LITERAL
      resourceType: Topic
    - name: "_schemaregistry_kafkapreprod"
      patternType: LITERAL
      resourceType: Topic
  kafkaRestClassRef:
    name: default
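Apply the binding and check that CFK reconciled it (the manifest file name here is hypothetical):
kubectl apply -f sr-schemas-owner.yaml -n kafkapreprod
kubectl get confluentrolebinding sr-schemas-owner -n kafkapreprod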
Troubleshooting Common Issues
INCONSISTENT_CLUSTER_ID Error
This typically occurs when Kafka brokers have different cluster IDs:
[ERROR] org.apache.kafka.raft.KafkaRaftClient handleUnexpectedError - [RaftManager id=0]
Unexpected error INCONSISTENT_CLUSTER_ID in FETCH response
Resolution requires cleaning up Kafka resources and redeploying:
# Delete Kafka pods
kubectl delete pod kafka-0 kafka-1 kafka-2 -n kafkapreprod --force --grace-period=0
# Delete Kafka PVCs
kubectl delete pvc data0-kafka-0 data0-kafka-1 data0-kafka-2 -n kafkapreprod --force
# Restart Kafka deployment
kubectl rollout restart statefulset kafka -n kafkapreprod
If you need to preserve a specific cluster ID, ensure it’s set consistently across redeployments.
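To see which cluster ID each broker actually recorded, you can inspect meta.properties on the data volume; the path below assumes CFK's default data mount layout and may differ in your setup:
kubectl exec kafka-0 -n kafkapreprod -- cat /mnt/data/data0/logs/meta.properties
kubectl exec kafka-1 -n kafkapreprod -- cat /mnt/data/data0/logs/meta.properties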
Metric Reporter TLS Issues
When the brokers report TLS errors from the Confluent metric reporter, check that the metricReporter block in the Kafka resource enables TLS and references the secret holding the JKS password:
metricReporter:
  enabled: true
  tls:
    enabled: true
    jksPassword:
      secretRef: kafka-metric-secret
      key: password
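Assuming the secret name and key shown above, the referenced secret can be created with:
kubectl create secret generic kafka-metric-secret -n kafkapreprod --from-literal=password=your-jks-password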
Azure Storage Integration
For tiered storage backed by Azure Blob Storage, add these broker overrides (placed under configOverrides in the Kafka resource):
server:
  - confluent.tier.feature=true
  - confluent.tier.enable=true
  - confluent.tier.backend=AzureBlockBlob
  - confluent.tier.azure.block.blob.container=broker-sandbox
  - confluent.tier.azure.block.blob.cred.file.path=/mnt/secrets/credential/blob-cred.json
Remember to create and mount the Azure credentials secret properly.
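The cred.file.path above follows CFK's mountedSecrets convention, which mounts secrets at /mnt/secrets/<secret-name>; under that assumption the secret is named credential, holds the blob-cred.json file, and is listed under mountedSecrets in the Kafka resource:
kubectl create secret generic credential -n kafkapreprod --from-file=blob-cred.json=./blob-cred.json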
Conclusion
Deploying Confluent Kafka on AKS provides a powerful and scalable event streaming platform, but self-hosting Kafka on AKS can be daunting. I can personally attest to this - my first setup of Confluent Kafka was an exercise in patience and perseverance. What seemed straightforward in documentation became a labyrinth of interdependent configurations, cryptic error messages, and late nights troubleshooting unexpected behaviors. The learning curve was steep, and at times I questioned if self-hosting was the right approach at all.
While there are challenges around storage management, authentication, cluster configuration, and maintaining high availability, understanding these common issues and their solutions will help ensure a successful deployment.
For production environments, I strongly recommend implementing proper backup and recovery tooling like Velero. Velero can help you:
- Back up and restore your entire Kafka cluster, including all CRDs and state
- Create scheduled backups for disaster recovery
- Migrate cluster resources between environments
- Protect against accidental deletions or corruptions
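For example, with Velero already installed against an Azure backup location, a one-off backup, a daily schedule, and a restore for the Kafka namespace might look like this (names are illustrative):
velero backup create kafka-preprod-backup --include-namespaces kafkapreprod
velero schedule create kafka-preprod-daily --schedule "0 2 * * *" --include-namespaces kafkapreprod
velero restore create --from-backup kafka-preprod-backup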
Remember to maintain backups of critical data, especially when modifying persistent volumes, and always follow security best practices for credential management. Proper disaster recovery planning will save you from many of the headaches discussed in this post - headaches I’ve experienced firsthand.
Given the complexity involved, organizations should also consider Confluent Cloud as an alternative to self-managed deployments, particularly for production workloads where reliability is critical.
Happy streaming!