Commit message (author, age, files, lines)
* cpr: relax blockdev migration blockers (Steve Sistare, 2023-11-01, 7 files, -7/+7)
| | | | | | | | | | | | | | | | Some blockdevs block migration because they do not support sharing across hosts and/or do not support dirty bitmaps. These prohibitions do not apply if the old and new qemu processes do not run concurrently, and if new qemu starts on the same host as old, which is the case for cpr. Narrow the scope of these blockers so they only apply to normal mode. They will not block cpr modes when they are added in subsequent patches. No functional change until a new mode is added. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <1698263069-406971-4-git-send-email-steven.sistare@oracle.com>
* migration: per-mode blockers (Steve Sistare, 2023-11-01, 3 files, -17/+132)
  Extend the blocker interface so that a blocker can be registered for one or more migration modes. The existing interfaces register a blocker for all modes, and the new interfaces take a varargs list of modes.

  Internally, maintain a separate blocker list per mode. The same Error object may be added to multiple lists. When a blocker is deleted, it is removed from every list, and the Error is freed.

  No functional change until a new mode is added.

  Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
  Reviewed-by: Juan Quintela <quintela@redhat.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <1698263069-406971-3-git-send-email-steven.sistare@oracle.com>
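The per-mode bookkeeping described above can be pictured with a small, self-contained toy model. This is only a sketch under stated assumptions: every name here (toy_reason, toy_add_blocker_modes(), toy_del_blocker(), the two-entry mode enum, the -1 list terminator) is hypothetical and is not QEMU's actual interface. It only shows one shared reason object living in several per-mode lists and being removed from all of them before it is freed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy model; none of these names are QEMU's. */
enum { TOY_MODE_NORMAL, TOY_MODE_OTHER, TOY_MODE__MAX };
enum { TOY_MAX_BLOCKERS = 8 };

struct toy_reason {
    const char *msg;            /* stands in for an Error object */
};

/* One blocker list per mode; a NULL slot is free. */
static struct toy_reason *toy_blockers[TOY_MODE__MAX][TOY_MAX_BLOCKERS];

struct toy_reason *toy_reason_new(const char *msg)
{
    struct toy_reason *r = malloc(sizeof(*r));

    if (r == NULL) {
        abort();
    }
    r->msg = msg;
    return r;
}

/* Add `reason` to the list of every mode named; the list ends with -1. */
void toy_add_blocker_modes(struct toy_reason *reason, int mode, ...)
{
    va_list ap;

    va_start(ap, mode);
    for (; mode != -1; mode = va_arg(ap, int)) {
        size_t i = 0;

        assert(mode >= 0 && mode < TOY_MODE__MAX);
        while (i < TOY_MAX_BLOCKERS && toy_blockers[mode][i] != NULL) {
            i++;
        }
        assert(i < TOY_MAX_BLOCKERS);   /* toy limit, no error path */
        toy_blockers[mode][i] = reason;
    }
    va_end(ap);
}

/* Remove the reason from every per-mode list, then free it exactly once. */
void toy_del_blocker(struct toy_reason **reasonp)
{
    struct toy_reason *r = *reasonp;

    if (r == NULL) {
        return;
    }
    for (int mode = 0; mode < TOY_MODE__MAX; mode++) {
        for (size_t i = 0; i < TOY_MAX_BLOCKERS; i++) {
            if (toy_blockers[mode][i] == r) {
                toy_blockers[mode][i] = NULL;
            }
        }
    }
    free(r);
    *reasonp = NULL;
}

/* A mode is blocked while its list holds at least one reason. */
bool toy_is_blocked(int mode)
{
    assert(mode >= 0 && mode < TOY_MODE__MAX);
    for (size_t i = 0; i < TOY_MAX_BLOCKERS; i++) {
        if (toy_blockers[mode][i] != NULL) {
            return true;
        }
    }
    return false;
}
```

In this sketch, registering a reason with toy_add_blocker_modes(r, TOY_MODE_NORMAL, -1) blocks only the normal mode, which mirrors how a blocker can be scoped to a subset of modes.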
* migration: mode parameter (Steve Sistare, 2023-11-01, 7 files, -3/+74)
| | | | | | | | | | | | | | Create a mode migration parameter that can be used to select alternate migration algorithms. The default mode is normal, representing the current migration algorithm, and does not need to be explicitly set. No functional change until a new mode is added, except that the mode is shown by the 'info migrate' command. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <1698263069-406971-2-git-send-email-steven.sistare@oracle.com>
* migration: Add tracepoints for downtime checkpoints (Peter Xu, 2023-11-01, 3 files, -7/+25)
  This patch is inspired by Joao Martins's patch here:

  https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com

  Add tracepoints for major downtime checkpoints on both src and dst. They share the same tracepoint with a string showing its stage. Besides the checkpoints in the previous patch, this patch also adds destination checkpoints.

  On src, we have these checkpoints added:

    - src-downtime-start: right before vm stops on src
    - src-vm-stopped: after vm is fully stopped
    - src-iterable-saved: after all iterables saved (END sections)
    - src-non-iterable-saved: after all non-iterable saved (FULL sections)
    - src-downtime-stop: migration fully completed

  On dst, we have these checkpoints added:

    - dst-precopy-loadvm-completes: after loadvm all done for precopy
    - dst-precopy-bh-*: record BH steps to resume VM for precopy
    - dst-postcopy-bh-*: record BH steps to resume VM for postcopy

  On dst side, we don't have a good way to trace total time consumed by iterable or non-iterable for now. We can mark it by 1st time receiving a FULL / END section, but rather than that let's just rely on the other tracepoints added for vmstates to back up the information.

  With this patch, one can enable "vmstate_downtime*" tracepoints and it'll enable all tracepoints necessary for downtime measurements.

  Drop the loadvm_postcopy_handle_run_bh() tracepoint alongside, because it serves the same purpose, which was only for postcopy. We then have a unified prefix for all downtime-relevant tracepoints.

  Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
  Signed-off-by: Peter Xu <peterx@redhat.com>
  Reviewed-by: Juan Quintela <quintela@redhat.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231030163346.765724-6-peterx@redhat.com>
* migration: migration_stop_vm() helper (Peter Xu, 2023-11-01, 2 files, -3/+10)
| | | | | | | | | | | Provide a helper for non-COLO use case of migration to stop a VM. This prepares for adding some downtime relevant tracepoints to migration, where they may or may not apply to COLO. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231030163346.765724-5-peterx@redhat.com>
* migration: Add per vmstate downtime tracepoints (Peter Xu, 2023-11-01, 2 files, -4/+47)
  We have a bunch of savevm_section* tracepoints. They're good for analyzing the migration stream, but not always suitable for analyzing the migration downtime. Two major problems:

    - savevm_section* tracepoints dump all sections, while we only care about the sections that contribute to the downtime
    - They don't have an identifier to show the type of sections, so there is no easy way to filter downtime information either.

  We could add the type into the tracepoints, but this patch keeps them untouched and instead adds a set of downtime-specific tracepoints, so one can enable "vmstate_downtime*" tracepoints and get a full picture of how the downtime is distributed across iterative and non-iterative vmstate save/load.

  Note that here both save() and load() need to be traced, because both of them may contribute to the downtime. The contribution is not a simple "add them together", though: the src can be doing a save() of device1 while the dest is load()ing device2, so they can happen concurrently.

  Tracking both sides makes sense because device load() and save() can be imbalanced: one device can save() super fast but load() super slow, or vice versa. We can't figure that out without tracing both.

  Signed-off-by: Peter Xu <peterx@redhat.com>
  Reviewed-by: Juan Quintela <quintela@redhat.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231030163346.765724-4-peterx@redhat.com>
* migration: Add migration_downtime_start|end() helpers (Peter Xu, 2023-11-01, 1 file, -13/+24)
| | | | | | | | | | Unify the three users on recording downtimes with the same pair of helpers. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231030163346.765724-3-peterx@redhat.com>
* migration: Set downtime_start even for postcopy (Peter Xu, 2023-11-01, 1 file, -2/+3)
| | | | | | | | | | | | | | | Postcopy calculates its downtime separately. It always sets MigrationState.downtime properly, but not MigrationState.downtime_start. Make postcopy do the same as other modes on properly recording the timestamp when the VM is going to be stopped. Drop the temporary variable in postcopy_start() along the way. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231030163346.765724-2-peterx@redhat.com>
* migration: Use vmstate_register_any() for vmware_vga (Juan Quintela, 2023-11-01, 1 file, -1/+1)
| | | | | | | | | | I have no idea if we can have more than one vmware_vga device, so play it safe. Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020090731.28701-14-quintela@redhat.com>
* migration: Use vmstate_register_any() for eeprom93xx (Juan Quintela, 2023-11-01, 1 file, -1/+1)
| | | | | | | | | | | We can have more than one eeprom93xx. For instance: e100_nic_realize() -> eeprom93xx_new() Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020090731.28701-13-quintela@redhat.com>
* migration: Use vmstate_register_any() for audio (Juan Quintela, 2023-11-01, 1 file, -1/+1)
  We can have more than one audio backend.

  void audio_init_audiodevs(void)
  {
      AudiodevListEntry *e;

      QSIMPLEQ_FOREACH(e, &audiodevs, next) {
          audio_init(e->dev, &error_fatal);
      }
  }

  Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231020090731.28701-12-quintela@redhat.com>
* migration: Improve example and documentation of vmstate_register() (Juan Quintela, 2023-11-01, 1 file, -4/+8)
| | | | | | Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020090731.28701-11-quintela@redhat.com>
* migration: Check in savevm_state_handler_insert for dups (Peter Xu, 2023-11-01, 1 file, -0/+14)
  Before finally registering a SaveStateEntry, check for duplicate entries. This notifies us as early as possible, instead of leaving silent migration failures that can be hard to diagnose.

  For example, this patch will generate a message like this (without the previous fixes on x2apic) whenever we boot a VM instance with "-smp 200,maxcpus=288,sockets=2,cores=72,threads=2", and QEMU will bail out even before the VM starts:

  savevm_state_handler_insert: Detected duplicate SaveStateEntry: id=apic, instance_id=0x0

  Suggested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Signed-off-by: Peter Xu <peterx@redhat.com>
  Reviewed-by: Juan Quintela <quintela@redhat.com>
  Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231020090731.28701-10-quintela@redhat.com>
* migration: Hack to maintain backwards compatibility for ppc (Juan Quintela, 2023-11-01, 4 files, -4/+68)
  Current code does:

    - register pre_2_10_vmstate_dummy_icp with "icp/server" and instance depending on cpu number
    - for newer machines, it registers vmstate_icp with "icp/server" name and instance 0
    - now it unregisters "icp/server" for the 1st instance.

  This is wrong at many levels:

    - we shouldn't have two VMStateDescriptions with the same name
    - In case this is the only solution that we can come up with, it needs to be:
      * register pre_2_10_vmstate_dummy_icp
      * unregister pre_2_10_vmstate_dummy_icp
      * register real vmstate_icp

  Created vmstate_replace_hack_for_ppc() with warnings left and right that it is a hack.

  CC: Cedric Le Goater <clg@kaod.org>
  CC: Daniel Henrique Barboza <danielhb413@gmail.com>
  CC: David Gibson <david@gibson.dropbear.id.au>
  CC: Greg Kurz <groug@kaod.org>
  Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231020090731.28701-8-quintela@redhat.com>
* migration: Use VMSTATE_INSTANCE_ID_ANY for slirp (Juan Quintela, 2023-11-01, 1 file, -2/+3)
  Each user network connection creates a new slirp instance. We register more than one slirp instance for number 0.

  qemu-system-x86_64: -netdev user,id=hs1: savevm_state_handler_insert: Detected duplicate SaveStateEntry: id=slirp, instance_id=0x0
  Broken pipe
  ../../../../../mnt/code/qemu/full/tests/qtest/libqtest.c:195: kill_qemu() tried to terminate QEMU process but encountered exit status 1 (expected 0)
  Aborted (core dumped)

  Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231020090731.28701-6-quintela@redhat.com>
* migration: Use vmstate_register_any() for isa-ide (Juan Quintela, 2023-11-01, 1 file, -1/+1)
  Otherwise qom-test fails.

  ok 4 /i386/qom/x-remote
  qemu-system-i386: savevm_state_handler_insert: Detected duplicate SaveStateEntry: id=isa-ide, instance_id=0x0
  Broken pipe
  ../../../../../mnt/code/qemu/full/tests/qtest/libqtest.c:195: kill_qemu() tried to terminate QEMU process but encountered exit status 1 (expected 0)
  Aborted (core dumped)
  $

  Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231020090731.28701-4-quintela@redhat.com>
* migration: Use vmstate_register_any() (Juan Quintela, 2023-11-01, 11 files, -17/+12)
  These are the easiest cases, where we were already using VMSTATE_INSTANCE_ID_ANY.

  Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231020090731.28701-3-quintela@redhat.com>
* migration: Create vmstate_register_any() (Juan Quintela, 2023-11-01, 1 file, -0/+17)
  We have lots of cases where we are using an instance_id==0 when we should be using VMSTATE_INSTANCE_ID_ANY (-1). Basically, everything that can have more than one instance needs either a proper instance_id or -1, in which case the system picks one for it.

  vmstate_register_any(): We register with -1.

  Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
  Signed-off-by: Juan Quintela <quintela@redhat.com>
  Message-ID: <20231020090731.28701-2-quintela@redhat.com>
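The interplay between a fixed instance id, the "any" id, and the duplicate check added elsewhere in this series can be sketched with a toy registry. Everything here is an assumption made for illustration: toy_register(), TOY_INSTANCE_ID_ANY and the fixed-size table are hypothetical names, and picking "highest id in use plus one" is this sketch's own choice, not a statement about how QEMU assigns ids.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy registry; not QEMU's savevm handler list. */
#define TOY_INSTANCE_ID_ANY (-1)
enum { TOY_MAX_ENTRIES = 32 };

struct toy_entry {
    const char *idstr;
    int instance_id;
};

static struct toy_entry toy_entries[TOY_MAX_ENTRIES];
static size_t toy_nentries;

/*
 * Register (idstr, instance_id). Returns the instance id actually used,
 * or -1 if the pair is already registered or the table is full.
 * With TOY_INSTANCE_ID_ANY, the registry picks an unused id itself.
 */
int toy_register(const char *idstr, int instance_id)
{
    int next_free = 0;

    assert(idstr != NULL && instance_id >= TOY_INSTANCE_ID_ANY);
    for (size_t i = 0; i < toy_nentries; i++) {
        if (strcmp(toy_entries[i].idstr, idstr) != 0) {
            continue;
        }
        if (toy_entries[i].instance_id == instance_id) {
            return -1;          /* duplicate (idstr, instance_id) pair */
        }
        if (toy_entries[i].instance_id >= next_free) {
            next_free = toy_entries[i].instance_id + 1;
        }
    }
    if (toy_nentries == TOY_MAX_ENTRIES) {
        return -1;
    }
    if (instance_id == TOY_INSTANCE_ID_ANY) {
        instance_id = next_free;
    }
    toy_entries[toy_nentries].idstr = idstr;
    toy_entries[toy_nentries].instance_id = instance_id;
    toy_nentries++;
    return instance_id;
}
```

In this model, two devices that both hard-code instance id 0 under the same name collide on the second registration, while two devices that both pass TOY_INSTANCE_ID_ANY receive distinct ids.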
* hw/s390x/s390-stattrib: Don't call register_savevm_live() during instance_init() (Thomas Huth, 2023-11-01, 1 file, -14/+15)
| | | | | | | | | | | | | We must not call register_savevm_live() from an instance_init() function (since this could be called multiple times during device introspection). Move this to the realize() function instead. Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020150554.664422-4-thuth@redhat.com>
* hw/s390x/s390-stattrib: Simplify handling of the "migration-enabled" property (Thomas Huth, 2023-11-01, 1 file, -20/+7)
| | | | | | | | | | | | There's no need for dedicated handlers here if they don't do anything special. Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Acked-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020150554.664422-3-thuth@redhat.com>
* hw/s390x/s390-skeys: Don't call register_savevm_live() during instance_init() (Thomas Huth, 2023-11-01, 1 file, -27/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | Since the instance_init() function immediately tries to set the property to "true", the s390_skeys_set_migration_enabled() tries to register a savevm handler during instance_init(). However, instance_init() functions can be called multiple times, e.g. for introspection of devices. That means multiple instances of devices can be created during runtime (which is fine as long as they all don't get realized, too), so the "Prevent double registration of savevm handler" check in the s390_skeys_set_migration_enabled() function does not work at all as expected (since there could be more than one instance). Thus we must not call register_savevm_live() from an instance_init() function at all. Move this to the realize() function instead. This way we can also get rid of the property getter and setter functions completely, simplifying the code along the way quite a bit. Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Acked-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020150554.664422-2-thuth@redhat.com>
* hw/ipmi: Don't call vmstate_register() from instance_init() functions (Thomas Huth, 2023-11-01, 3 files, -55/+56)
| | | | | | | | | | | | instance_init() can be called multiple times, e.g. during introspection of the device. We should not install the vmstate handlers here. Do it in the realize() function instead. Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Acked-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020145554.662751-1-thuth@redhat.com>
* Merge tag 'for-upstream' of https://repo.or.cz/qemu/kevin into staging (Stefan Hajnoczi, 2023-11-01, 42 files, -331/+1437)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block layer patches - virtio-blk: use blk_io_plug_call() instead of notification BH - mirror: allow switching from background to active mode - qemu-img rebase: add compression support - Fix locking in media change monitor commands - Fix a few blockjob-related deadlocks when using iothread # -----BEGIN PGP SIGNATURE----- # # iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmVBTkERHGt3b2xmQHJl # ZGhhdC5jb20ACgkQfwmycsiPL9ZiqRAAqvsWbblmEGJ7TBKYQK3f8QshJ66RxzbC # 4eSjKHrciWNTeeIeU8r8OvFcPPoTcPXxpcmasD2gsAxG5W5N8vkPbBkW+YT4YdDJ # pWJXrbJ15nILC4DmnR1ARVtvxKgv9zy5LSm5bjss1K+OSYJl/nx+ILjmfVZnYDF7 # z1dP/G0JxKKm4JzAIdBE3uZS+6Q5kx/wGYlJv8EQmlH3DYfsJfy6Lthe9jfw8ijg # lSqLoQ+D0lEd6Bk4XbkUqqBxFcYBWTfU6qPZoyIO94zCTwTG9yIjmoivxmmfwQZq # cJUTGGZjcxpJYnvcC6P13WgcWBtcD9L2kYFVH0JyjpwcSg9cCGHMF66n9pSlyEGq # DUikwVzbTwOotwzYQyM88v4ET+2+Qdcwn8pRbv9PllEczh0kAsUAEuxSgtz4NEcN # bZrap/16xHFybNOKkMZcmpqxspT5NXKbDODUP0IvbSYMOYpWS983nBTxwMRpyHog # 2TFDZu4DjNiPkI2BcYM5VOKk6diNowZFShcEKvoaOLX/n9EBhP0tjoH9VUn1800F # myHrhF2jpIf9GhErMWB7N2W3/0aK0pqdQgbpVnd1ARDdIdYkr7G/S+50D9K80b6n # 0q2E7br4S5bcsY0HQzBL9YARSayY+lVOssLoolCWEsYzijdBQmAvs5THajFKcism # /idI6nlp2Vs= # =RdxS # -----END PGP SIGNATURE----- # gpg: Signature made Wed 01 Nov 2023 03:58:09 JST # gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6 # gpg: issuer "kwolf@redhat.com" # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full] # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6 * tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (27 commits) iotests: add test for changing mirror's copy_mode mirror: return mirror-specific information upon query blockjob: query driver-specific info via a new 'query' driver method qapi/block-core: turn BlockJobInfo into a union 
qapi/block-core: use JobType for BlockJobInfo's type mirror: implement mirror_change method block/mirror: determine copy_to_target only once block/mirror: move dirty bitmap to filter block/mirror: set actively_synced even after the job is ready blockjob: introduce block-job-change QMP command virtio-blk: remove batch notification BH virtio: use defer_call() in virtio_irqfd_notify() util/defer-call: move defer_call() to util/ block: rename blk_io_plug_call() API to defer_call() blockdev: mirror: avoid potential deadlock when using iothread block: avoid potential deadlock during bdrv_graph_wrlock() in bdrv_close() blockjob: drop AioContext lock before calling bdrv_graph_wrlock() iotests: Test media change with iothreads block: Fix locking in media change monitor commands iotests: add tests for "qemu-img rebase" with compression ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * iotests: add test for changing mirror's copy_mode (Fiona Ebner, 2023-10-31, 2 files, -0/+198)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One part of the test is using a throttled source to ensure that there are no obvious issues when changing the copy_mode while there are ongoing requests (source and target images are compared at the very end). The other part of the test is using a throttled target to ensure that the change to active mode actually happened. This is done by hitting the throttling limit, issuing a synchronous write and then immediately verifying the target side. QSD is used, because otherwise, a synchronous write would hang there. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231031135431.393137-11-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * mirror: return mirror-specific information upon query (Fiona Ebner, 2023-10-31, 3 files, -21/+50)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To start out, only actively-synced is returned. For example, this is useful for jobs that started out in background mode and switched to active mode. Once actively-synced is true, it's clear that the mode switch has been completed. Note that completion of the switch might happen much earlier, e.g. if the switch happens before the job is ready, once all background operations have finished. It's assumed that whether the disks are actively-synced or not is more interesting than whether the mode switch completed. That information can still be added if required in the future. In presence of an iothread, the actively_synced member is now shared between the iothread and the main thread, so turn accesses to it atomic. Requires to adapt the output for iotest 109. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231031135431.393137-10-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * blockjob: query driver-specific info via a new 'query' driver method (Fiona Ebner, 2023-10-31, 2 files, -0/+11)
| | | | | | | | | | | | | | Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231031135431.393137-9-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qapi/block-core: turn BlockJobInfo into a union (Fiona Ebner, 2023-10-31, 1 file, -3/+5)
| | | | | | | | | | | | | | | | | | | | In preparation to additionally return job-type-specific information. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Message-ID: <20231031135431.393137-8-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qapi/block-core: use JobType for BlockJobInfo's type (Fiona Ebner, 2023-10-31, 3 files, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to turn BlockJobInfo into a union with @type as the discriminator. That requires it to be an enum. Even without that requirement, it's nicer to have an enum instead of a str here. No functional change is intended. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-ID: <20231031135431.393137-7-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * mirror: implement mirror_change method (Fiona Ebner, 2023-10-31, 2 files, -4/+53)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | which allows switching the @copy-mode from 'background' to 'write-blocking'. This is useful for management applications, so they can start out in background mode to avoid limiting guest write speed and switch to active mode when certain criteria are fulfilled. In presence of an iothread, the copy_mode member is now shared between the iothread and the main thread, so turn accesses to it atomic. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231031135431.393137-6-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block/mirror: determine copy_to_target only once (Fiona Ebner, 2023-10-31, 1 file, -23/+18)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to allow changing the copy_mode via QMP. When running in an iothread, it could be that copy_mode is changed from the main thread in between reading copy_mode in bdrv_mirror_top_pwritev() and reading copy_mode in bdrv_mirror_top_do_write(), so they might end up disagreeing about whether copy_to_target is true or false. Avoid that scenario by determining copy_to_target only once and passing it to bdrv_mirror_top_do_write() as an argument. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Message-ID: <20231031135431.393137-5-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
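The race described above, and the fix of sampling the shared mode exactly once, can be illustrated with a minimal C11 sketch. The names are hypothetical (toy_job, toy_pwritev(), toy_do_write(), the two-value mode enum) and the write path is reduced to a counter; it is a sketch of the pattern under those assumptions, not the mirror driver's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the job state and its copy modes. */
enum { TOY_COPY_BACKGROUND, TOY_COPY_WRITE_BLOCKING };

struct toy_job {
    atomic_int copy_mode;   /* may be changed by another thread */
    int writes_copied;      /* counts writes that were copied to the target */
};

void toy_job_init(struct toy_job *job)
{
    atomic_init(&job->copy_mode, TOY_COPY_BACKGROUND);
    job->writes_copied = 0;
}

/* The callee trusts the caller's decision and never re-reads copy_mode. */
static void toy_do_write(struct toy_job *job, bool copy_to_target)
{
    if (copy_to_target) {
        job->writes_copied++;
    }
}

void toy_pwritev(struct toy_job *job)
{
    bool copy_to_target;

    assert(job != NULL);
    /* Single atomic read; every later step uses this one snapshot. */
    copy_to_target = atomic_load(&job->copy_mode) == TOY_COPY_WRITE_BLOCKING;

    /* ... any preparation that depends on copy_to_target goes here ... */

    toy_do_write(job, copy_to_target);
}
```

Because the decision travels as an argument, a mode change that lands between the preparation step and the write cannot make the two steps disagree within one request.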
| * block/mirror: move dirty bitmap to filter (Fiona Ebner, 2023-10-31, 1 file, -4/+13)
| | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to allow switching to active mode without draining. Initialization of the bitmap in mirror_dirty_init() still happens with the original/backing BlockDriverState, which should be fine, because the mirror top has the same length. Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231031135431.393137-4-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block/mirror: set actively_synced even after the job is ready (Fiona Ebner, 2023-10-31, 1 file, -3/+3)
| | | | | | | | | | | | | | | | | | | | | | | | In preparation to allow switching from background to active mode. This ensures that setting actively_synced will not be missed when the switch happens after the job is ready. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Message-ID: <20231031135431.393137-3-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * blockjob: introduce block-job-change QMP command (Fiona Ebner, 2023-10-31, 7 files, -1/+82)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | which will allow changing job-type-specific options after job creation. In the JobVerbTable, the same allow bits as for set-speed are used, because set-speed can be considered an existing change command. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Message-ID: <20231031135431.393137-2-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * virtio-blk: remove batch notification BH (Stefan Hajnoczi, 2023-10-31, 1 file, -47/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a batching mechanism for virtio-blk Used Buffer Notifications that is no longer needed because the previous commit added batching to virtio_notify_irqfd(). Note that this mechanism was rarely used in practice because it is only enabled when EVENT_IDX is not negotiated by the driver. Modern drivers enable EVENT_IDX. Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230913200045.1024233-5-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * virtio: use defer_call() in virtio_irqfd_notify() (Stefan Hajnoczi, 2023-10-31, 5 files, -1/+28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used Buffer Notifications from an IOThread. This involves an eventfd write(2) syscall. Calling this repeatedly when completing multiple I/O requests in a row is wasteful. Use the defer_call() API to batch together virtio_irqfd_notify() calls made during thread pool (aio=threads), Linux AIO (aio=native), and io_uring (aio=io_uring) completion processing. Behavior is unchanged for emulated devices that do not use defer_call_begin()/defer_call_end() since defer_call() immediately invokes the callback when called outside a defer_call_begin()/defer_call_end() region. fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could be noise. Detailed performance data and configuration specifics are available here: https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd This duplicates the BH that virtio-blk uses for batching. The next commit will remove it. Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230913200045.1024233-4-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * util/defer-call: move defer_call() to util/ (Stefan Hajnoczi, 2023-10-31, 13 files, -7/+27)
    The networking subsystem may wish to use defer_call(), so move the code to util/ where it can be reused.

    As a reminder of what defer_call() does:

    This API defers a function call within a defer_call_begin()/defer_call_end() section, allowing multiple calls to batch up. This is a performance optimization that is used in the block layer to submit several I/O requests at once instead of individually:

      defer_call_begin(); <-- start of section
      ...
      defer_call(my_func, my_obj); <-- deferred my_func(my_obj) call
      defer_call(my_func, my_obj); <-- another
      defer_call(my_func, my_obj); <-- another
      ...
      defer_call_end(); <-- end of section, my_func(my_obj) is called once

    Suggested-by: Ilya Maximets <i.maximets@ovn.org>
    Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    Message-ID: <20230913200045.1024233-3-stefanha@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Reviewed-by: Kevin Wolf <kwolf@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
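The batching behaviour described above (identical calls inside a section collapse into one call at the end, and a call outside any section runs immediately) can be modelled in a few lines. This is a single-threaded toy under stated assumptions: the toy_* names, the global pending queue and its fixed size are all hypothetical and are not how the real implementation stores its state.

```c
#include <assert.h>
#include <stddef.h>

/* Toy, single-threaded model of deferred calls; not QEMU's code. */
typedef void (*toy_fn)(void *opaque);

enum { TOY_MAX_PENDING = 16 };

static struct {
    toy_fn fn;
    void *opaque;
} toy_pending[TOY_MAX_PENDING];
static size_t toy_npending;
static unsigned toy_depth;      /* nesting depth of begin/end sections */

void toy_defer_call_begin(void)
{
    toy_depth++;
}

void toy_defer_call(toy_fn fn, void *opaque)
{
    assert(fn != NULL);
    if (toy_depth == 0) {
        fn(opaque);             /* outside a section: call right away */
        return;
    }
    for (size_t i = 0; i < toy_npending; i++) {
        if (toy_pending[i].fn == fn && toy_pending[i].opaque == opaque) {
            return;             /* same call already queued: batch it */
        }
    }
    if (toy_npending == TOY_MAX_PENDING) {
        fn(opaque);             /* toy queue full: fall back to calling now */
        return;
    }
    toy_pending[toy_npending].fn = fn;
    toy_pending[toy_npending].opaque = opaque;
    toy_npending++;
}

void toy_defer_call_end(void)
{
    size_t n;

    if (toy_depth == 0 || --toy_depth > 0) {
        return;                 /* unbalanced end, or still nested */
    }
    n = toy_npending;
    toy_npending = 0;
    for (size_t i = 0; i < n; i++) {
        toy_pending[i].fn(toy_pending[i].opaque);
    }
}

/* Example callback: counts how often it actually ran. */
static void toy_count(void *opaque)
{
    ++*(int *)opaque;
}
```

With toy_count and an int counter as the callback and argument, three toy_defer_call() invocations inside one section bump the counter once, when the outermost toy_defer_call_end() runs.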
| * block: rename blk_io_plug_call() API to defer_call() (Stefan Hajnoczi, 2023-10-31, 9 files, -79/+76)
    Prepare to move the blk_io_plug_call() API out of the block layer so that other subsystems can use this deferred call mechanism. Rename it to defer_call() but leave the code in block/plug.c.

    The next commit will move the code out of the block layer.

    Suggested-by: Ilya Maximets <i.maximets@ovn.org>
    Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
    Reviewed-by: Paul Durrant <paul@xen.org>
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    Message-ID: <20230913200045.1024233-2-stefanha@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Reviewed-by: Kevin Wolf <kwolf@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * blockdev: mirror: avoid potential deadlock when using iothread (Fiona Ebner, 2023-10-31, 1 file, -2/+12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bdrv_getlength() function is a generated co-wrapper and uses AIO_WAIT_WHILE() to wait for the spawned coroutine. AIO_WAIT_WHILE() expects the lock to be acquired exactly once. Fix a case where it may be acquired twice. This can happen when the source node is explicitly specified as the @replaces parameter or if the source node is a filter node. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231019131936.414246-4-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block: avoid potential deadlock during bdrv_graph_wrlock() in bdrv_close() (Fiona Ebner, 2023-10-31, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | by passing the BlockDriverState along, so the held AioContext can be dropped before polling. See commit 31b2ddfea3 ("graph-lock: Unlock the AioContext while polling") which introduced this functionality for more information. The only way to reach bdrv_close() is via bdrv_unref() and for calling that the BlockDriverState's AioContext lock is supposed to be held. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231019131936.414246-3-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * blockjob: drop AioContext lock before calling bdrv_graph_wrlock() (Fiona Ebner, 2023-10-31, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Same rationale as in 31b2ddfea3 ("graph-lock: Unlock the AioContext while polling"). Otherwise, a deadlock can happen. The alternative would be to pass a BlockDriverState along to bdrv_graph_wrlock(), but there is no BlockDriverState readily available and it's also better conceptually, because the lock is held for the job. The function is always called with the job's AioContext lock held, via one of the .abort, .clean, .free or .prepare job driver functions. Thus, it's safe to drop it. While mirror_exit_common() does hold a second AioContext lock while calling block_job_remove_all_bdrv(), that is for the main thread's AioContext and does not need to be dropped (bdrv_graph_wrlock(bs) also skips dropping the lock if bdrv_get_aio_context(bs) == qemu_get_aio_context()). Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20231019131936.414246-2-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * iotests: Test media change with iothreads  Kevin Wolf  2023-10-31  1  -2/+4
| | iotests case 118 already tests all relevant operations for media change
| | with multiple devices, however never with iothreads. This changes the
| | test so that the virtio-scsi tests run with an iothread.
| |
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| | Message-ID: <20231013153302.39234-3-kwolf@redhat.com>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block: Fix locking in media change monitor commands  Kevin Wolf  2023-10-31  1  -0/+5
| | blk_insert_bs() requires that the caller holds the AioContext lock for
| | the node to be inserted. Since commit c066e808e11, neglecting to do so
| | causes a crash when the child has to be moved to a different AioContext
| | to attach it to the BlockBackend.
| |
| | This fixes qmp_blockdev_insert_anon_medium(), which is called for the
| | QMP commands 'blockdev-insert-medium' and 'blockdev-change-medium', to
| | correctly take the lock.
| |
| | Cc: qemu-stable@nongnu.org
| | Fixes: https://issues.redhat.com/browse/RHEL-3922
| | Fixes: c066e808e11a5c181b625537b6c78e0de27a4801
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| | Message-ID: <20231013153302.39234-2-kwolf@redhat.com>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * iotests: add tests for "qemu-img rebase" with compression  Andrey Drobyshev  2023-10-31  4  -0/+345
| | The test cases considered so far:
| |
| | 314 (new test suite):
| |
| | 1. Check that compression mode isn't compatible with "-f raw" (raw
| |    format doesn't support compression).
| | 2. Check that rebasing an image onto no backing file preserves the data
| |    and writes the copied clusters actually compressed.
| | 3. Same as 2, but with a raw backing file (i.e. the clusters copied from
| |    the backing are originally uncompressed -- we check they end up
| |    compressed after being merged).
| | 4. Remove a single delta from a backing chain, perform the same checks
| |    as in 2.
| | 5. Check that even when backing and overlay are initially uncompressed,
| |    copied clusters end up compressed when rebase with compression is
| |    performed.
| |
| | 271:
| |
| | 1. Check that when target image has subclusters, rebase with compression
| |    will make an entire cluster containing the written subcluster
| |    compressed.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Message-ID: <20230919165804.439110-9-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qemu-img: add compression option to rebase subcommand  Andrey Drobyshev  2023-10-31  3  -10/+26
| | If we rebase an image whose backing file has compressed clusters, we
| | might end up wasting disk space since the copied clusters are now
| | uncompressed. In order to have better control over this, let's add
| | "--compress" option to the "qemu-img rebase" command.
| |
| | Note that this option affects only the clusters which are actually being
| | copied from the original backing file. The clusters which were
| | uncompressed in the target image will remain so.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Denis V. Lunev <den@openvz.org>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Message-ID: <20230919165804.439110-8-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * iotests/{024, 271}: add testcases for qemu-img rebase  Andrey Drobyshev  2023-10-31  4  -0/+211
| | As the previous commit changes the logic of "qemu-img rebase" (it's
| | using write alignment now), let's add a couple more test cases which
| | would ensure it works correctly. In particular, the following scenarios:
| |
| | 024: add test case for rebase within one backing chain when the overlay
| |      cluster size > backing's cluster size;
| | 271: add test case for rebase images that contain subclusters. Check
| |      that no extra allocations are being made.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Message-ID: <20230919165804.439110-7-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qemu-img: rebase: avoid unnecessary COW operations  Andrey Drobyshev  2023-10-31  1  -20/+54
| | When rebasing an image from one backing file to another, we need to
| | compare data from old and new backings. If the diff between that data
| | happens to be unaligned to the target cluster size, we might end up
| | doing partial writes, which would lead to copy-on-write and additional
| | IO.
| |
| | Consider the following simple case (virtual_size == cluster_size == 64K):
| |
| |     base <-- inc1 <-- inc2
| |
| |     qemu-io -c "write -P 0xaa 0 32K" base.qcow2
| |     qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
| |     qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
| |     qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
| |     qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
| |
| | While doing rebase, we'll write a half of the cluster to inc2, and block
| | layer will have to read the 2nd half of the same cluster from the base
| | image inc1 while doing this write operation, although the whole cluster
| | is already read earlier to perform data comparison.
| |
| | In order to avoid these unnecessary IO cycles, let's make sure every
| | write request is aligned to the overlay subcluster boundaries. Using
| | subcluster size is universal: for images which don't have subclusters,
| | this size equals the cluster size, so in any case we end up aligning to
| | the smallest unit of allocation.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Message-ID: <20230919165804.439110-6-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
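
The alignment idea in the commit above can be sketched in a few lines. This is a minimal Python model, not QEMU's C code: the function name `align_write_request` and its parameters are hypothetical, and clamping the end of the range to the image size is an assumption made so the example stays within the image.

```python
from typing import Tuple


def align_write_request(offset: int, length: int,
                        alignment: int, image_size: int) -> Tuple[int, int]:
    """Widen [offset, offset + length) so both ends sit on alignment boundaries.

    `alignment` stands for the overlay's subcluster size (equal to the
    cluster size for images without subclusters). The end is clamped to
    `image_size`. All names here are illustrative, not QEMU identifiers.
    """
    if alignment <= 0 or length <= 0 or offset < 0:
        raise ValueError("offset must be >= 0; length and alignment must be > 0")
    # Round the start down to the nearest boundary.
    start = (offset // alignment) * alignment
    # Round the end up to the nearest boundary (ceiling division).
    end = -(-(offset + length) // alignment) * alignment
    end = min(end, image_size)
    return start, end - start


if __name__ == "__main__":
    K = 1024
    # The 64K example from the commit message: only the first 32K differ,
    # but the write is widened to cover the whole 64K cluster.
    print(align_write_request(0, 32 * K, 64 * K, 64 * K))  # (0, 65536)
```

With a full-cluster write the block layer has nothing left to copy from the backing file, which is the extra IO the commit sets out to avoid.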
| * qemu-img: add chunk size parameter to compare_buffers()  Andrey Drobyshev  2023-10-31  1  -9/+15
| | Add @chsize param to the function which, if non-zero, would represent
| | the chunk size to be used for comparison. If it's zero, then
| | BDRV_SECTOR_SIZE is used as default chunk size, which is the previous
| | behaviour.
| |
| | In particular, we're going to use this param in img_rebase() to make the
| | write requests aligned to a predefined alignment value.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Eric Blake <eblake@redhat.com>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Message-ID: <20230919165804.439110-5-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
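
As a rough illustration of what a chunk size parameter changes, here is a small Python model of a chunked buffer comparison. It is a sketch under assumptions, not the C function from qemu-img: the return convention (a "differs" flag plus the number of leading bytes sharing that state) and the constant `SECTOR_SIZE = 512`, standing in for BDRV_SECTOR_SIZE, are chosen for this example.

```python
from typing import Tuple

SECTOR_SIZE = 512  # assumption: stand-in for QEMU's BDRV_SECTOR_SIZE


def compare_buffers(buf1: bytes, buf2: bytes, chsize: int = 0) -> Tuple[bool, int]:
    """Compare two equal-length buffers chunk by chunk.

    Returns (differs, pnum): whether the first chunk differs, and how many
    leading bytes are made of chunks in that same state. A chsize of 0
    falls back to SECTOR_SIZE, mirroring the default described above.
    """
    if len(buf1) != len(buf2) or not buf1:
        raise ValueError("buffers must be non-empty and of equal length")
    if chsize == 0:
        chsize = SECTOR_SIZE
    total = len(buf1)
    i = min(total, chsize)
    differs = buf1[:i] != buf2[:i]
    while i < total:
        n = min(total - i, chsize)
        # Stop at the first chunk whose state flips (equal <-> different).
        if (buf1[i:i + n] != buf2[i:i + n]) != differs:
            break
        i += n
    return differs, i


if __name__ == "__main__":
    a = bytes(2048)
    b = bytes(1024) + b"\x01" + bytes(1023)  # one differing byte at offset 1024
    print(compare_buffers(a, b))              # default 512-byte chunks
    print(compare_buffers(a, b, chsize=2048)) # one 2048-byte chunk
```

A larger chunk size makes the reported ranges coarser, so a caller that writes out "different" ranges gets requests that are multiples of the chunk size.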
| * qemu-img: rebase: use backing files' BlockBackend for buffer alignment  Andrey Drobyshev  2023-10-31  1  -2/+7
| | Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
| | the data read from the old and new backing files are aligned using
| | BlockDriverState (or BlockBackend later on) referring to the target
| | image. However, this isn't quite right, because buf_new is only being
| | used for reading from the new backing, while buf_old is being used for
| | both reading from the old backing and writing to the target. Let's take
| | that into account and use more appropriate values as alignments.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Message-ID: <20230919165804.439110-4-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qemu-iotests: 024: add rebasing test case for overlay_size > backing_size  Andrey Drobyshev  2023-10-31  2  -0/+87
| | Before the previous commit, rebase was getting infinitely stuck in case
| | of rebasing within the same backing chain and when overlay_size >
| | backing_size. Let's add this case to the rebasing test 024 to make sure
| | it doesn't break again.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Denis V. Lunev <den@openvz.org>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Message-ID: <20230919165804.439110-3-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qemu-img: rebase: stop when reaching EOF of old backing file  Andrey Drobyshev  2023-10-31  1  -1/+12
| | When we're rebasing within one backing chain and the target image is
| | larger than the old backing file, bdrv_is_allocated_above() ends up
| | setting *pnum = 0. As a result, the target offset isn't incremented, and
| | we get stuck in an infinite for loop. Let's detect this case and proceed
| | further down the loop body, as the offsets beyond the old backing size
| | need to be explicitly zeroed.
| |
| | Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Denis V. Lunev <den@openvz.org>
| | Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
| | Message-ID: <20230919165804.439110-2-andrey.drobyshev@virtuozzo.com>
| | Reviewed-by: Kevin Wolf <kwolf@redhat.com>
| | Signed-off-by: Kevin Wolf <kwolf@redhat.com>
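
The infinite loop described above comes from advancing an offset by a count that can be zero. The toy Python model below shows the shape of the problem and of the fix. It is only a sketch: `rebase_ranges` is a hypothetical name, and the `pnum` computation is a simplified stand-in for an allocation query, not the behaviour of bdrv_is_allocated_above() in general.

```python
from typing import List, Tuple


def rebase_ranges(target_size: int, old_backing_size: int,
                  step: int) -> List[Tuple[int, int, bool]]:
    """Walk the target image and return (offset, length, past_old_eof) ranges.

    Simplified stand-in for the rebase loop: `pnum` is the number of bytes
    of the current request that lie inside the old backing file.
    """
    if step <= 0:
        raise ValueError("step must be > 0")
    ranges = []
    offset = 0
    while offset < target_size:
        n = min(step, target_size - offset)
        # Hypothetical query result: 0 once we are past the old backing's EOF.
        pnum = max(0, min(n, old_backing_size - offset))
        past_old_eof = pnum == 0
        if not past_old_eof:
            n = pnum
        # Without the past_old_eof check, n would become 0 here and
        # `offset += n` would never advance: the infinite loop.
        ranges.append((offset, n, past_old_eof))
        offset += n
    return ranges


if __name__ == "__main__":
    # Target of 10 units over an old backing file of 6 units, 4 units per step.
    for r in rebase_ranges(10, 6, 4):
        print(r)
```

Ranges flagged as past the old backing's end are the ones the commit message says still need to be processed, since those offsets have to be explicitly zeroed.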