KVM crashes when attempting to restart migration

Operations performed:
Sequence to trigger crash:

    * Start two kvm systems, one on gerph (primary), one on nbuild2 (listening for incoming migration) - do not use -daemonize
    * On gerph, connect to monitor.
    * "migrate -d -b tcp:nbuild2:4444"
    * "info migrate"
    * "migrate_cancel"
    * "info migrate"
    * "migrate -d -b tcp:nbuild2:4444"
    * crashed with assertion:
kvm: block-migration.c:355: flush_blks: Assertion `block_mig_state.read_done >= 0' failed.
                 Connection closed by foreign host.
[1]+  Aborted                 (core dumped) kvm -drive file=./copy-disk2.img,boot=on -m 4096 -serial mon:telnet::23023,server,nowait -balloon virtio -vnc :99 -usbdevice tablet -net nic,macaddr=f6:a6:31:53:89:9a,model=rtl8139,vlan=0 -net tap,vlan=0


Repeating the operations above often dies in different places; just repeat the cancel and restart the operation. Because the KVM system dies, the underlying VM is obviously terminated.

Distribution:

jfletcher@gerph:~$ lsb_release -rd
Description:	Ubuntu 10.04.3 LTS
Release:	10.04

Package:

jfletcher@gerph:~$ apt-cache policy kvm
kvm:
  Installed: 1:84+dfsg-0ubuntu16+0.12.3+noroms+0ubuntu9.15
  Candidate: 1:84+dfsg-0ubuntu16+0.12.3+noroms+0ubuntu9.15
  Version table:
 *** 1:84+dfsg-0ubuntu16+0.12.3+noroms+0ubuntu9.15 0
        500 http://gb.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
        500 http://security.ubuntu.com/ubuntu/ lucid-security/main Packages
        100 /var/lib/dpkg/status
     1:84+dfsg-0ubuntu16+0.12.3+noroms+0ubuntu9 0
        500 http://gb.archive.ubuntu.com/ubuntu/ lucid/main Packages

Thanks for taking the time to submit this bug and helping to make Ubuntu better.

Just to be sure I understand right, if you simply let the migration continue rather than canceling it, you don't get an error, right?  I'll mark this low priority under that assumption.  If I'm wrong, then priority should be raised.

(Leaving status New until I manage to reproduce)

That's correct for the testing I have performed.

I have been able to perform repeated migrate/migrate_cancel operations much more quickly than I have been able to perform actual migrations, therefore the test set of migrate operations after a cancel is at least an order of magnitude larger than the test set of completing migrations.

Background in case it's relevant:
I was doing this to test the behaviour if (for example) the target system failed during the migration and it was necessary to cancel and restart, as such resilience is important for the services I maintain.

If there's any more information required, I'm happy to provide help :-)

If you *need* to use the live migration (rather than offline migration by copying the disk images) you have already made a decision that the service is sufficiently important that you cannot have downtime on it. If the live migration could fail, and resuming it could crash (as reported), this is going to be a serious concern and most likely not a risk you would wish to take with a service that you have already decided is so vital as to not need downtime.

The migration feature that if used might crash, is not a feature I would like to trust my valuable services to.

Therefore I would suggest that this crash have the same priority as the migration feature. If migration is a low priority feature then it would be find as 'low' priority', but if the live migration is an important feature to have then it needs to be solid.

As an administrator of services, I play have a game of Russian-roulette with them, and migration is that game at present.

Oops,  I meant "I cannot play a game of ..."

Definately confirmed on lucid.

In quantal migration always fails, but still after one failed attempt, if I do 'migrate_cancel' and re-try the migration, I get the same error.

This causes you to lose your VM state.

upstream git head qemu still behaves the same as quantal qemu-kvm (1.1.0), marking a affecting upstream.

Can you still reproduce this issue with the latest version of QEMU (currently v2.8)?

I haven't attempted to reproduce the issue recently, I'm afraid. I've changed jobs twice in the intervening time, so the immediate issue for me has gone away. If I find an opportunity, I shall try to reproduce with the most recent versions.


[Expired for qemu-kvm (Ubuntu) because there has been no activity for 60 days.]

[Expired for QEMU because there has been no activity for 60 days.]