
Troubleshooting KubeVirt: CNI Conflicts, CDI Errors, and Common Fixes

Abubakar Siddiq Ango, Senior Developer Advocate
Mar 17, 2026 · 12 min read · Intermediate

KubeVirt adds a virtualization layer on top of Kubernetes, which means debugging spans two stacks: the Kubernetes layer (pods, services, PVs) and the KVM/libvirt layer (QEMU processes, virtio drivers, disk images). When something breaks, you need to know which stack to look at first and how to trace the problem from one layer into the other.

This guide covers the five most frequently reported issues in KubeVirt deployments, drawn from GitHub issues, community forums, and production experience. Each issue follows the same structure: symptoms you will see, how to diagnose, the root cause, and how to fix it.

Prerequisites

Before you start, make sure you have these tools ready: kubectl, virtctl, and SSH access to your cluster nodes for node-level debugging. You will need all three at various points in this guide.
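To confirm everything is wired up, here is a quick sanity check (this assumes KubeVirt is installed in its default kubevirt namespace):

# Cluster access and KubeVirt health
kubectl get kubevirt -n kubevirt   # PHASE should read Deployed
virtctl version                    # prints client and server versions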

Issue 1: CNI Conflicts — VMs Cannot Reach the Network

This is the single most common problem teams hit when deploying KubeVirt for the first time. It comes down to a fundamental mismatch between what Kubernetes networking provides and what virtual machines expect.

Symptoms

  • The VM boots successfully but cannot ping external hosts.
  • The VM can ping the node IP but not other pods or the internet.
  • ARP requests from the VM are not being resolved.

Diagnosis

Start by checking the VMI status and network configuration:

# Check the VMI status and network info
kubectl get vmi <vm-name> -o yaml | grep -A 20 interfaces

Next, inspect the network namespace inside the virt-launcher pod:

# Check the virt-launcher pod's network namespace
kubectl exec -it virt-launcher-<vm-name>-xxxxx -- ip addr

If the interfaces look correct, check whether traffic is actually reaching the VM’s virtual bridge:

# Check if traffic reaches the VM's virtual bridge
kubectl exec -it virt-launcher-<vm-name>-xxxxx -- tcpdump -i k6t-eth0 -n

If you see ARP requests leaving the VM but no replies coming back, you have a Layer 2/Layer 3 conflict.

Root Cause

Standard Kubernetes CNIs (Calico, Cilium, Flannel) operate at Layer 3 — they route IP packets between pods. But many VMs expect Layer 2 connectivity: ARP resolution, broadcast traffic, and direct Ethernet framing. The virtual ethernet bridge (k6t-eth0) inside the virt-launcher pod bridges the VM’s NIC to the pod network, but Layer 2 frames do not traverse the CNI’s overlay correctly. The CNI only knows how to forward IP packets, not raw Ethernet frames, so ARP requests from the VM disappear into a void.

Fix

You have two options depending on your requirements.

For basic connectivity (VM just needs internet and pod-to-pod access): Use the default masquerade binding mode. This NATs VM traffic through the pod IP, which works with any CNI because the VM’s traffic exits the pod as regular IP packets:

spec:
  domain:
    devices:
      interfaces:
        - name: default
          masquerade: {}
  networks:
    - name: default
      pod: {}

Masquerade mode handles the translation between the VM’s internal network and the pod network transparently. The VM gets an internal IP (typically 10.0.2.15), and all outbound traffic is NATed through the pod’s IP address.
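A quick way to verify masquerade mode from inside the guest, assuming a Linux VM with curl installed:

# Open the serial console and log in with the guest's credentials
virtctl console <vm-name>

# Inside the guest:
ip addr show eth0                        # expect the internal 10.0.2.15 address
curl -sI http://example.com | head -1    # any HTTP response confirms NAT egress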

For Layer 2 access (VM needs DHCP, broadcast, or VLANs): Install Multus CNI and create a NetworkAttachmentDefinition with a bridge plugin:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: br-lan
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bridge",
      "bridge": "br-lan",
      "ipam": {}
    }

Then attach your VM to this network:

spec:
  domain:
    devices:
      interfaces:
        - name: lan
          bridge: {}
  networks:
    - name: lan
      multus:
        networkName: br-lan
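After applying both manifests, confirm the attachment definition is registered and that the VMI picked up the secondary interface (net-attach-def is the short name Multus registers for the CRD):

kubectl get net-attach-def br-lan
kubectl get vmi <vm-name> -o yaml | grep -A 10 interfaces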

Warning: Layer 2/Layer 3 friction is the number one networking issue in KubeVirt. If your VMs need Layer 2 (which most legacy VMs do), you must use Multus. There is no workaround within the default pod network. Plan for this early in your deployment rather than retrofitting it later.

Issue 2: CDI Importer Errors — DataVolumes Stuck

The Containerized Data Importer (CDI) handles importing disk images into PVCs. When it fails, your VMs cannot start because there is no disk to boot from.

Symptoms

  • DataVolume stuck in ImportScheduled or ImportInProgress for an extended period.
  • CDI importer pod shows errors or restarts repeatedly.
  • kubectl get dv <name> shows progress stuck at 0% or a low percentage.

Diagnosis

# Check DataVolume status
kubectl describe dv <datavolume-name>

# Find and check the importer pod
kubectl get pods -l cdi.kubevirt.io/storage.import.importPvcName=<pvc-name>
kubectl logs <cdi-importer-pod-name>

The importer pod logs are where the real answers live. The DataVolume status will tell you it is stuck; the pod logs will tell you why.

Common Causes and Fixes

Cause A: PVC size too small. The target PVC must be at least as large as the disk image’s virtual size. QCOW2 images have a virtual size that can be much larger than their file size on disk — a 2GB QCOW2 file might expand to a 50GB virtual disk.

# Check virtual size of a QCOW2 image
qemu-img info <image-file>

Look for the “virtual size” field. Set your DataVolume’s PVC size to at least this value, plus some headroom for filesystem overhead (roughly 5-10% extra).
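The output looks roughly like this (sizes are illustrative):

image: disk.qcow2
file format: qcow2
virtual size: 50 GiB (53687091200 bytes)
disk size: 2.1 GiB

Here the DataVolume needs at least 50Gi plus headroom, not the roughly 2GB the file size suggests.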

Cause B: StorageClass does not support dynamic provisioning. CDI needs to create scratch PVCs during the import process for format conversion and validation. If the default StorageClass does not support dynamic provisioning, the scratch PVC stays in a Pending state and the import never progresses.

Fix: Set a default StorageClass that supports dynamic provisioning, or specify spec.storage.storageClassName explicitly in your DataVolume manifest. Verify with kubectl get sc — the default class should have (default) next to its name and a provisioner that supports dynamic provisioning.
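For reference, a minimal DataVolume that pins both the size and the StorageClass (the URL and class name are placeholders):

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: fedora-dv
spec:
  source:
    http:
      url: "http://example.com/images/fedora.qcow2"
  storage:
    storageClassName: fast-rwo
    resources:
      requests:
        storage: 55Gi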

Cause C: Unsupported source format. CDI supports QCOW2, RAW, VMDK (with conversion), and ISO out of the box. Other formats like VHD or VHDX require manual conversion before import.

# qemu-img identifies the VHD format as "vpc"
qemu-img convert -f vpc -O qcow2 source.vhd target.qcow2

Cause D: Network issues pulling the image. If the DataVolume source is an HTTP URL, the importer pod needs outbound network access. Check for proxy configuration, DNS resolution failures, or firewall rules blocking egress from the cluster. The importer pod runs in the same namespace as your DataVolume, so namespace-level NetworkPolicies can also block it.
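One way to test reachability is a throwaway pod in the same namespace, which exercises the same DNS, proxy, and NetworkPolicy path the importer uses (curlimages/curl is a public image; substitute your own if your registry is restricted):

kubectl run cdi-net-test --rm -it --restart=Never \
  --image=curlimages/curl -- -sI <image-url>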

Tip: For large images (50GB+), imports can take 30+ minutes even on fast networks. Check the CDI importer pod logs for progress updates before assuming the process is stuck. The logs report percentage completion and throughput — if those numbers are moving, the import is working.

Issue 3: VM Fails to Start — virt-launcher CrashLoopBackOff

When a VM fails to start, the symptom almost always surfaces as a failing virt-launcher pod. This pod is the wrapper process that manages the QEMU instance for your VM, so any issue with the VM configuration, node capabilities, or resource availability shows up here.

Symptoms

  • VirtualMachineInstance stuck in Scheduling or Failed.
  • virt-launcher pod in CrashLoopBackOff.
  • Events show errors related to device access or resource limits.

Diagnosis

# Check VMI events
kubectl describe vmi <vm-name>

# Check virt-launcher pod logs
kubectl logs virt-launcher-<vm-name>-xxxxx

# On the node, check kubelet logs for device and scheduling errors (requires SSH)
journalctl -u kubelet | grep <vm-name>

The VMI events will usually contain the first clue. The virt-launcher logs contain the detailed error from QEMU or libvirt.

Common Causes and Fixes

Cause A: KVM device not available. You will see errors like device /dev/kvm not found or failed to create domain. This means the node does not have hardware virtualization enabled, or the /dev/kvm device is not accessible to the pod.

Fix: SSH into the node and check that /dev/kvm exists. If it does not, enable VT-x (Intel) or AMD-V (AMD) in the BIOS/UEFI settings. If you are running on cloud instances, you need instances that support nested virtualization (not all instance types do). As a last resort for development environments, you can enable software emulation in the KubeVirt CR by setting useEmulation: true, but never use this in production — it is orders of magnitude slower.
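To check the node directly (both commands assume SSH access):

# Does the KVM device exist?
ls -l /dev/kvm

# Does the CPU advertise hardware virtualization? (vmx = Intel, svm = AMD)
grep -cE 'vmx|svm' /proc/cpuinfo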

Cause B: Insufficient resources. Errors like insufficient memory or CPU pinning failed mean the VM is requesting more CPU or memory than is available on any schedulable node.

Fix: Check node allocatable resources:

kubectl describe node <node-name> | grep -A 10 "Allocatable"

Compare against your VM’s resource requests. Remember that Kubernetes system components, DaemonSets, and other pods consume resources too. Either reduce the VM’s resource requests or add capacity to the cluster.

Cause C: Image pull failure. If you are using a containerDisk source, the virt-launcher pod must pull the container image that contains the disk. Image pull errors — wrong registry URL, missing authentication, or the image tag does not exist — cause CrashLoopBackOff with image-related error messages in the pod events.

Fix: Verify the image URL is correct, check that imagePullSecrets are configured in the VM’s namespace, and test pulling the image manually with crictl (crictl talks to the container runtime directly, so it runs on the node over SSH, not through kubectl):

crictl pull <image-url>

Issue 4: Live Migration Failures

Live migration is one of the key advantages of running VMs on KubeVirt — it lets you move running VMs between nodes for maintenance, load balancing, or upgrades. But it has strict prerequisites that are not always obvious.

Symptoms

  • Migration stuck in Scheduling or TargetReady phase.
  • Migration fails with timeout or eviction errors.
  • kubectl get vmim shows failed migrations.

Diagnosis

# Check migration status
kubectl get vmim -A
kubectl describe vmim <migration-name>

# Check target node's virt-handler logs
kubectl logs -n kubevirt virt-handler-xxxxx --since=5m

The migration object’s events will tell you which phase failed. The virt-handler logs on the target node show what went wrong during the actual migration attempt.

Prerequisites for Live Migration

Before troubleshooting the specific error, verify these three prerequisites are met:

  1. Shared storage. The VM’s PVCs must use a ReadWriteMany (RWX) access mode. Both the source and target nodes need simultaneous access to the same storage. Local storage, hostPath volumes, and ReadWriteOnce PVCs do not work.

  2. Compatible CPU models. The source and target nodes must have compatible CPU feature sets. If the source node has AVX-512 and the target does not, migration will fail. Use spec.domain.cpu.model: host-model for flexibility across heterogeneous clusters, or pin to a specific baseline model (see the snippet after this list).

  3. Sufficient network bandwidth. Large VMs with active workloads generate dirty pages faster than migration can copy them. If the dirty page rate exceeds the migration bandwidth, the migration never converges and eventually times out.
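For the CPU model requirement, the relevant fragment of the VM spec looks like this (host-model tracks the host CPU while staying migratable to compatible nodes):

spec:
  domain:
    cpu:
      model: host-model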

Common Fixes

Switch storage to a distributed backend that supports RWX access — Rook-Ceph, Longhorn, or your cloud provider’s shared filesystem. This is non-negotiable for live migration.

If network bandwidth is a constraint, set an explicit migration bandwidth limit to prevent migration traffic from saturating the network:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      bandwidthPerMigration: 64Mi

You can also raise the per-GiB completion timeout and the stalled-progress timeout to give large VMs more time to converge:

spec:
  configuration:
    migrations:
      completionTimeoutPerGiB: 800
      progressTimeout: 150
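After adjusting these settings, trigger a migration manually and watch it converge:

virtctl migrate <vm-name>
kubectl get vmim -w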

Warning: Without shared storage (RWX), live migration is impossible. VMs on local storage are pinned to their node. Plan your storage backend accordingly before you need to evacuate nodes for maintenance.

Issue 5: Poor VM Performance

Your VM starts and runs, but it feels noticeably slower than a bare-metal or traditional hypervisor setup. This is not an inherent limitation of KubeVirt — it is almost always a configuration issue.

Symptoms

  • The VM feels sluggish despite adequate resource allocation.
  • CPU-intensive workloads run significantly slower than expected.
  • Disk I/O benchmarks show numbers far below host-level benchmarks.

Diagnosis

First, check whether software emulation is active:

# Check if using software emulation
kubectl get kubevirt kubevirt -n kubevirt -o jsonpath='{.spec.configuration.developerConfiguration.useEmulation}'

If that returns true, you have found your problem. Next, check the CPU model the VM is actually using:

virtctl console <vm-name>
# Inside the VM, run:
lscpu | grep "Model name"

If the model name shows something generic like “QEMU Virtual CPU” instead of your actual host CPU, you are not getting CPU passthrough.

Common Causes and Fixes

Cause A: Software emulation. If useEmulation: true, the VM runs entirely on QEMU without KVM hardware acceleration. This is 10-100x slower than hardware-accelerated virtualization. It exists only for development and testing on machines without VT-x/AMD-V.

Fix: Enable hardware virtualization on your nodes and set useEmulation: false (or remove the setting entirely, since false is the default).
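The setting lives in the KubeVirt CR; a minimal fragment with emulation disabled:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      useEmulation: false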

Cause B: No CPU passthrough. By default, KubeVirt presents a generic CPU model to VMs. This ensures broad compatibility but hides performance-enhancing CPU features from the guest. For CPU-intensive workloads, pass through the host CPU features:

spec:
  domain:
    cpu:
      model: host-passthrough
      dedicatedCpuPlacement: true

The dedicatedCpuPlacement option pins VM vCPUs to physical cores, eliminating scheduling jitter. Note that this requires the CPU manager to be enabled on the node (kubelet --cpu-manager-policy=static).
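To confirm the policy on a node (the path assumes a kubeadm-style kubelet config; adjust for your distribution):

grep cpuManagerPolicy /var/lib/kubelet/config.yaml
# Expect: cpuManagerPolicy: static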

Cause C: No hugepages. For memory-intensive VMs, the default 4KB page size causes translation lookaside buffer (TLB) pressure and frequent page table walks. Hugepages (2MB or 1GB) reduce this overhead significantly:

spec:
  domain:
    memory:
      hugepages:
        pageSize: 2Mi
    resources:
      requests:
        memory: 4Gi

The node must have hugepages pre-allocated. On each node:

echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

For persistent configuration, add hugepagesz=2M hugepages=2048 to the kernel boot parameters.
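Verify the allocation took effect, both on the node and from the Kubernetes side:

# On the node
grep HugePages /proc/meminfo

# From the cluster: hugepages show up as an allocatable resource
kubectl describe node <node-name> | grep hugepages-2Mi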

Cause D: Virtio drivers missing. This primarily affects Windows VMs. Without virtio drivers, KubeVirt falls back to IDE emulation for disks and e1000 emulation for network — both are dramatically slower than their virtio counterparts. Disk throughput can drop by 5-10x without virtio.

Fix: Install the virtio-win drivers inside the VM guest. You can attach the virtio-win ISO as a secondary CD-ROM during VM creation and install the drivers from within Windows Device Manager.
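A sketch of attaching the drivers as a CD-ROM during VM creation (the containerDisk image shown is the one the KubeVirt project publishes; verify the image and tag for your version):

spec:
  domain:
    devices:
      disks:
        - name: virtio-drivers
          cdrom:
            bus: sata
  volumes:
    - name: virtio-drivers
      containerDisk:
        image: quay.io/kubevirt/virtio-container-disk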

Diagnostic Commands Reference

Keep this table handy when troubleshooting. These are the commands you will reach for most often:

Command                                                   Purpose
kubectl get vmi                                           List running VM instances
kubectl describe vmi <name>                               Detailed VMI status and events
virtctl console <name>                                    Access VM serial console
virtctl ssh <name>                                        SSH into VM (requires guest agent)
kubectl logs virt-launcher-<name>-xxx                     virt-launcher pod logs
kubectl get dv                                            DataVolume import status
kubectl get vmim                                          Migration status
kubectl -n kubevirt logs -l kubevirt.io=virt-handler      virt-handler logs
kubectl -n kubevirt logs -l kubevirt.io=virt-controller   virt-controller logs


Summary

The most common KubeVirt issues fall into five categories: networking (Layer 2/Layer 3 CNI friction), disk image import (CDI sizing and format issues), VM startup failures (KVM device, resources, image pulls), live migration (storage and CPU compatibility), and performance (emulation, CPU model, hugepages). For networking, use masquerade mode for simple setups and Multus with a bridge plugin for Layer 2 requirements. For performance, ensure hardware virtualization is enabled and consider CPU passthrough and hugepages for demanding workloads. When in doubt, start with the virt-launcher pod logs — they contain the most useful error messages in the KubeVirt stack.