summary refs log tree commit diff stats
path: root/include/block (follow)
Commit message (Collapse)AuthorAgeFilesLines
* blockjob: mark block_job_remove_all_bdrv() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-1/+1
| | | | | | | | | | | The function block_job_remove_all_bdrv() calls bdrv_graph_wrlock_drained(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-49-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_open_child_common() and its callers GRAPH_UNLOCKEDFiona Ebner2025-07-141-4/+5
| | | | | | | | | | | | | | The function bdrv_open_child_common() calls bdrv_graph_wrlock_drained(), which must be called with the graph unlocked. Mark it and its two callers bdrv_open_file_child() and bdrv_open_child() as GRAPH_UNLOCKED. This requires temporarily unlocking in vmdk_parse_extents() and making the locked section shorter in vmdk_open(). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-48-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_close() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-1/+1
| | | | | | | | | | | | | | | | | | The functions blk_log_writes_close(), blkverify_close(), quorum_close(), vmdk_close() via vmdk_free_extents(), and other bdrv_close() implementations call bdrv_graph_wrlock_drained(), which must be called with the graph unlocked. They are reached via the BlockDriver's bdrv_close() callback and the bdrv_close() wrapper, which are also marked as GRAPH_UNLOCKED_PTR and GRAPH_UNLOCKED. Furthermore, the function bdrv_close() also calls bdrv_drained_begin() and bdrv_graph_wrlock_drained(), so there are additional reasons for marking it GRAPH_UNLOCKED. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-47-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_close_all() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-1/+1
| | | | | | | | | | The function bdrv_close_all() calls bdrv_drain_all(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-46-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_drop_intermediate() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-3/+4
| | | | | | | | | | The function bdrv_drop_intermediate() calls bdrv_drained_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-45-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_insert_node() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-2/+3
| | | | | | | | | | The function bdrv_insert_node() calls bdrv_drained_begin() which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-44-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_replace_child_bs() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-2/+2
| | | | | | | | | | The function bdrv_replace_child_bs() calls bdrv_drained_begin() which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-43-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_inactivate_all() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-1/+1
| | | | | | | | | | The function bdrv_inactivate_all() calls bdrv_drain_all_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-37-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_inactivate() as GRAPH_RDLOCK and move drain to callersFiona Ebner2025-07-141-1/+1
| | | | | | | | | | | | | | The function bdrv_inactivate() calls bdrv_drain_all_begin(), which needs to be called with the graph unlocked, so either bdrv_inactivate() should be marked as GRAPH_UNLOCKED or the drain needs to be moved to the callers. The caller in qmp_blockdev_set_active() requires that the locked section covers bdrv_find_node() too, so the latter alternative is chosen. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-36-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_reopen_queue() and bdrv_reopen_multiple() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-4/+5
| | | | | | | | | | | | | | | | The function bdrv_reopen_queue() can call bdrv_drain_all_begin(), which must be called with the graph unlocked. The function bdrv_reopen_multiple() calls bdrv_reopen_prepare() which must be called with the graph unlocked. To mark bdrv_reopen_queue() as GRAPH_UNLOCKED, it is necessary to make the locked section in reopen_backing_file() shorter. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-35-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block/snapshot: mark bdrv_all_delete_snapshot() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-3/+3
| | | | | | | | | | The function bdrv_all_delete_snapshot() calls bdrv_drain_all_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-33-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: drop wrapper for bdrv_set_backing_hd_drained()Fiona Ebner2025-07-141-4/+1
| | | | | | | | | | | | Nearly all callers (outside of the tests) are already using the _drained() variant of the function. It doesn't seem worth keeping. Simply adapt the remaining callers of bdrv_set_backing_hd() and rename bdrv_set_backing_hd_drained() to bdrv_set_backing_hd(). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-31-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_set_backing_hd() as GRAPH_UNLOCKEDFiona Ebner2025-07-141-2/+3
| | | | | | | | | | The function bdrv_set_backing_hd() calls bdrv_drain_all_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-29-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: add bdrv_graph_wrlock_drained() convenience wrapperFiona Ebner2025-07-141-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | Many write-locked sections are also drained sections. A new bdrv_graph_wrunlock_drained() wrapper around bdrv_graph_wrunlock() is introduced, which will begin a drained section first. A global variable is used so bdrv_graph_wrunlock() knows if it also needs to end such a drained section. Both the aio_poll call in bdrv_graph_wrlock() and the aio_bh_poll() in bdrv_graph_wrunlock() can re-enter a write-locked section. While for the latter, ending the drain could be moved to before the call, the former requires that the variable is a counter and not just a boolean. Since the wrapper calls bdrv_drain_all_begin(), which must be called with the graph unlocked, mark the wrapper as GRAPH_UNLOCKED too. The switch to the new helpers was generated with the following commands and then manually checked: find . -name '*.c' -exec sed -i -z 's/bdrv_drain_all_begin();\n\s*bdrv_graph_wrlock();/bdrv_graph_wrlock_drained();/g' {} ';' find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();\n\s*bdrv_drain_all_end();/bdrv_graph_wrunlock();/g' {} ';' Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-25-f.ebner@proxmox.com> [kwolf: Removed redundant GRAPH_UNLOCKED] Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: never use atomics to access bs->quiesce_counterFiona Ebner2025-07-141-1/+1
| | | | | | | | | | | | | | | | | | All accesses of bs->quiesce_counter are in the main thread, either after a GLOBAL_STATE_CODE() macro or in a function with GRAPH_WRLOCK annotation. This is essentially a revert of 414c2ec358 ("block: access quiesce_counter with atomic ops"). At that time, neither the GLOBAL_STATE_CODE() macro nor the GRAPH_WRLOCK annotation existed. Even if the field was only accessed in the main thread back then (did not check if that is actually the case), it wouldn't have been easy to verify. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-24-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_drained_begin() and friends as GRAPH_UNLOCKEDFiona Ebner2025-06-042-3/+3
| | | | | | | | | | | | All of bdrv_drain_all_begin(), bdrv_drain_all() and bdrv_drained_begin() poll and are not allowed to be called with the block graph lock held. Mark the function as such. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-20-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: move drain outside of quorum_del_child()Fiona Ebner2025-06-041-0/+7
| | | | | | | | | | | | | | | | | | | | The quorum_del_child() callback runs under the graph lock, so it is not allowed to drain. It is only called as the .bdrv_del_child() callback, which is only called in the bdrv_del_child() function, which also runs under the graph lock. The bdrv_del_child() function is called by qmp_x_blockdev_change(). A drained section was already introduced there by commit "block: move drain out of quorum_add_child()". This finally finishes moving out the drain to places that are not under the graph lock started in "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-17-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: move drain outside of quorum_add_child()Fiona Ebner2025-06-041-0/+7
| | | | | | | | | | | | | | | | | | This is part of resolving the deadlock mentioned in commit "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". The quorum_add_child() callback runs under the graph lock, so it is not allowed to drain. It is only called as the .bdrv_add_child() callback, which is only called in the bdrv_add_child() function, which also runs under the graph lock. The bdrv_add_child() function is called by qmp_x_blockdev_change(), where a drained section is introduced. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-15-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: move drain outside of bdrv_root_attach_child()Fiona Ebner2025-06-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | This is part of resolving the deadlock mentioned in commit "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". The function bdrv_root_attach_child() runs under the graph lock, so it is not allowed to drain. It is called by: 1. blk_insert_bs(), where a drained section is introduced. 2. block_job_add_bdrv(), which holds the graph lock itself. block_job_add_bdrv() is called by: 1. mirror_start_job() 2. stream_start() 3. commit_start() 4. backup_job_create() 5. block_job_create() 6. In the test_blockjob_common_drain_node() unit test In all callers, a drained section is introduced. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-13-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: move drain outside of bdrv_try_change_aio_context()Fiona Ebner2025-06-041-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is part of resolving the deadlock mentioned in commit "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". Convert the function to a _locked() version that has to be called with the graph lock held and add a convenience wrapper that has to be called with the graph unlocked, which drains and takes the lock itself. Since bdrv_try_change_aio_context() is global state code, the wrapper is too. Callers are adapted to use the appropriate variant, depending on whether the caller already holds the lock. In the test_set_aio_context() unit test, prior drains can be removed, because draining already happens inside the new wrapper. Note that bdrv_attach_child_common_abort(), bdrv_attach_child_common() and bdrv_root_unref_child() hold the graph lock and are not actually allowed to drain either. This will be addressed in the following commits. Functions like qmp_blockdev_mirror() query the nodes to act on before draining and locking. In theory, draining could invalidate those nodes. This kind of issue is not addressed by these commits. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-10-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCKFiona Ebner2025-06-041-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED. Note that even if bdrv_drained_begin() were already marked as GRAPH_UNLOCKED, TSA would not complain about the instance in bdrv_change_aio_context() before this change, because it is preceded by a bdrv_graph_rdunlock_main_loop() call. It is not correct to release the lock here, and in case the caller holds a write lock, it wouldn't actually release the lock. In combination with block-stream, there is a deadlock that can happen because of this [0]. In particular, it can happen that main thread IO thread 1. acquires write lock in blk_co_do_preadv_part(): 2. have non-zero blk->in_flight 3. try to acquire read lock 4. begin drain Steps 3 and 4 might be switched. Draining will poll and get stuck, because it will see the non-zero in_flight counter. But the IO thread will not make any progress either, because it cannot acquire the read lock. After this change, all paths to bdrv_change_aio_context() drain: bdrv_change_aio_context() is called by: 1. bdrv_child_cb_change_aio_ctx() which is only called via the change_aio_ctx() callback, see below. 2. bdrv_child_change_aio_context(), see below. 3. bdrv_try_change_aio_context(), where a drained section is introduced. The change_aio_ctx() callback is called by: 1. bdrv_attach_child_common_abort(), where a drained section is introduced. 2. bdrv_attach_child_common(), where a drained section is introduced. 3. bdrv_parent_change_aio_context(), see below. bdrv_child_change_aio_context() is called by: 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained section is invariant. 2. child_job_change_aio_ctx(), which is only called via the change_aio_ctx() callback, see above. bdrv_parent_change_aio_context() is called by: 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained section is invariant. This resolves all code paths. Note that bdrv_attach_child_common() and bdrv_attach_child_common_abort() hold the graph write lock and callers of bdrv_try_change_aio_context() might too, so they are not actually allowed to drain either. This will be addressed in the following commits. More granular draining is not trivially possible, because bdrv_change_aio_context() can recursively call itself e.g. via bdrv_child_change_aio_context(). [0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/ Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-9-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark bdrv_child_change_aio_context() GRAPH_RDLOCKFiona Ebner2025-06-041-3/+4
| | | | | | | | | | | | This is a small step in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED. More concretely, it is in preparation to move the drain out of bdrv_change_aio_context() and marking that function as GRAPH_RDLOCK. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-8-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: mark change_aio_ctx() callback and instances as GRAPH_RDLOCK(_PTR)Fiona Ebner2025-06-041-3/+3
| | | | | | | | | | | | This is a small step in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED. More concretely, it is in preparation to move the drain out of bdrv_change_aio_context() and marking that function as GRAPH_RDLOCK. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-7-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* mirror: Drop redundant zero_target parameterEric Blake2025-05-141-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The two callers to a mirror job (drive-mirror and blockdev-mirror) set zero_target precisely when sync mode == FULL, with the one exception that drive-mirror skips zeroing the target if it was newly created and reads as zero. But given the previous patch, that exception is equally captured by target_is_zero. Meanwhile, there is another slight wrinkle, fortunately caught by iotest 185: if the caller uses "sync":"top" but the source has no backing file, the code in blockdev.c was changing sync to be FULL, but only after it had set zero_target=false. In mirror.c, prior to recent patches, this didn't matter: the only places that inspected sync were setting is_none_mode (both TOP and FULL had set that to false), and mirror_start() setting base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL. But now that we are passing sync around, the slammed sync mode would result in a new pre-zeroing pass even when the user had passed "sync":"top" in an effort to skip pre-zeroing. Fortunately, the assignment of base when bs has no backing chain still works out to NULL if we don't slam things. So with the forced change of sync ripped out of blockdev.c, the sync mode is passed through the full callstack unmolested, and we can now reliably reconstruct the same settings as what used to be passed in by zero_target=false, without the redundant parameter. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-24-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> [eblake: Fix regression in iotest 185] Signed-off-by: Eric Blake <eblake@redhat.com>
* mirror: Allow QMP override to declare target already zeroEric Blake2025-05-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | QEMU has an optimization for a just-created drive-mirror destination that is not possible for blockdev-mirror (which can't create the destination) - any time we know the destination starts life as all zeroes, we can skip a pre-zeroing pass on the destination. Recent patches have added an improved heuristic for detecting if a file contains all zeroes, and we plan to use that heuristic in upcoming patches. But since a heuristic cannot quickly detect all scenarios, and there may be cases where the caller is aware of information that QEMU cannot learn quickly, it makes sense to have a way to tell QEMU to assume facts about the destination that can make the mirror operation faster. Given our existing example of "qemu-img convert --target-is-zero", it is time to expose this override in QMP for blockdev-mirror as well. This patch results in some slight redundancy between the older s->zero_target (set any time mode==FULL and the destination image was not just created - ie. clear if drive-mirror is asking to skip the pre-zero pass) and the newly-introduced s->target_is_zero (in addition to the QMP override, it is set when drive-mirror creates the destination image); this will be cleaned up in the next patch. There is also a subtlety that we must consider. When drive-mirror is passing target_is_zero on behalf of a just-created image, we know the image is sparse (skipping the pre-zeroing keeps it that way), so it doesn't matter whether the destination also has "discard":"unmap" and "detect-zeroes":"unmap". But now that we are letting the user set the knob for target-is-zero, if the user passes a pre-existing file that is fully allocated, it is fine to leave the file fully allocated under "detect-zeroes":"on", but if the file is open with "detect-zeroes":"unmap", we should really be trying harder to punch holes in the destination for every region of zeroes copied from the source. The easiest way to do this is to still run the pre-zeroing pass (turning the entire destination file sparse before populating just the allocated portions of the source), even though that currently results in double I/O to the portions of the file that are allocated. A later patch will add further optimizations to reduce redundant zeroing I/O during the mirror operation. Since "target-is-zero":true is designed for optimizations, it is okay to silently ignore the parameter rather than erroring if the user ever sets the parameter in a scenario where the mirror job can't exploit it (for example, when doing "sync":"top" instead of "sync":"full", we can't pre-zero, so setting the parameter won't make a speed difference). Signed-off-by: Eric Blake <eblake@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-ID: <20250509204341.3553601-23-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* block: Add new bdrv_co_is_all_zeroes() functionEric Blake2025-05-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | There are some optimizations that require knowing if an image starts out as reading all zeroes, such as making blockdev-mirror faster by skipping the copying of source zeroes to the destination. The existing bdrv_co_is_zero_fast() is a good building block for answering this question, but it tends to give an answer of 0 for a file we just created via QMP 'blockdev-create' or similar (such as 'qemu-img create -f raw'). Why? Because file-posix.c insists on allocating a tiny header to any file rather than leaving it 100% sparse, due to some filesystems that are unable to answer alignment probes on a hole. But teaching file-posix.c to read the tiny header doesn't scale - the problem of a small header is also visible when libvirt sets up an NBD client to a just-created file on a migration destination host. So, we need a wrapper function that handles a bit more complexity in a common manner for all block devices - when the BDS is mostly a hole, but has a small non-hole header, it is still worth the time to read that header and check if it reads as all zeroes before giving up and returning a pessimistic answer. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-19-eblake@redhat.com>
* block: Expand block status mode from bool to flagsEric Blake2025-05-143-15/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is purely mechanical, changing bool want_zero into an unsigned int for bitwise-or of flags. As of this patch, all implementations are unchanged (the old want_zero==true is now mode==BDRV_WANT_PRECISE which is a superset of BDRV_WANT_ZERO); but the callers in io.c that used to pass want_zero==false are now prepared for future driver changes that can now distinguish bewteen BDRV_WANT_ZERO vs. BDRV_WANT_ALLOCATED. The next patch will actually change the file-posix driver along those lines, now that we have more-specific hints. As for the background why this patch is useful: right now, the file-posix driver recognizes that if allocation is being queried, the entire image can be reported as allocated (there is no backing file to refer to) - but this throws away information on whether the entire image reads as zero (trivially true if lseek(SEEK_HOLE) at offset 0 returns -ENXIO, a bit more complicated to prove if the raw file was created with 'qemu-img create' since we intentionally allocate a small chunk of all-zero data to help with alignment probing). Later patches will add a generic algorithm for seeing if an entire file reads as zeroes. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-16-eblake@redhat.com>
* blockdev-backup: Add error handling option for copy-before-write jobsRaman Dzehtsiar2025-05-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch extends the blockdev-backup QMP command to allow users to specify how to behave when IO errors occur during copy-before-write operations. Previously, the behavior was fixed and could not be controlled by the user. The new 'on-cbw-error' option can be set to one of two values: - 'break-guest-write': Forwards the IO error to the guest and triggers the on-source-error policy. This preserves snapshot integrity at the expense of guest IO operations. - 'break-snapshot': Allows the guest OS to continue running normally, but invalidates the snapshot and aborts related jobs. This prioritizes guest operation over backup consistency. This enhancement provides more flexibility for backup operations in different environments where requirements for guest availability versus backup consistency may vary. The default behavior remains unchanged to maintain backward compatibility. Signed-off-by: Raman Dzehtsiar <Raman.Dzehtsiar@gmail.com> Message-ID: <20250414090025.828660-1-Raman.Dzehtsiar@gmail.com> Acked-by: Markus Armbruster <armbru@redhat.com> [vsementsov: fix long lines] Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
* block: Remove unused callback function *bdrv_aio_pdiscardSunny Zhu2025-04-251-4/+0
| | | | | | | | | | | | | | | | | | The bytes type in *bdrv_aio_pdiscard should be int64_t rather than int. There are no drivers implementing the *bdrv_aio_pdiscard() callback, it appears to be an unused function. Therefore, we'll simply remove it instead of fixing it. Additionally, coroutine-based callbacks are preferred. If someone needs to implement bdrv_aio_pdiscard, a coroutine-based version would be straightforward to implement. Signed-off-by: Sunny Zhu <sunnyzhyy@qq.com> Message-ID: <tencent_7140D2E54157D98CF3D9E64B1A007A1A7906@qq.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* aio-posix: Separate AioPolledEvent per AioHandlerKevin Wolf2025-03-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adaptive polling has a big problem: It doesn't consider that an event loop can wait for many different events that may have very different typical latencies. For example, think of a guest that tends to send a new I/O request soon after the previous I/O request completes, but the storage on the host is rather slow. In this case, getting the new request from guest quickly means that polling is enabled, but the next thing is performing the I/O request on the backend, which is slow and disables polling again for the next guest request. This means that in such a scenario, polling could help for every other event, but is only ever enabled when it can't succeed. In order to fix this, keep a separate AioPolledEvent for each AioHandler. We will then know that the backend file descriptor always has a high latency and isn't worth polling for, but we also know that the guest is always fast and we should poll for it. This solves at least half of the problem, we can now keep polling for those cases where it makes sense and get the improved performance from it. Since the event loop doesn't know which event will be next, we still do some unnecessary polling while we're waiting for the slow disk. I made some attempts to be more clever than just randomly growing and shrinking the polling time, and even to let callers be explicit about when they expect a new event, but so far this hasn't resulted in improved performance or even caused performance regressions. For now, let's just fix the part that is easy enough to fix, we can revisit the rest later. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250307221634.71951-6-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* aio: Create AioPolledEventKevin Wolf2025-03-131-1/+5
| | | | | | | | | | As a preparation for having multiple adaptive polling states per AioContext, move the 'ns' field into a separate struct. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250307221634.71951-4-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* file-posix: Support FUA writesKevin Wolf2025-03-131-2/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until now, FUA was always emulated with a separate flush after the write for file-posix. The overhead of processing a second request can reduce performance significantly for a guest disk that has disabled the write cache, especially if the host disk is already write through, too, and the flush isn't actually doing anything. Advertise support for REQ_FUA in write requests and implement it for Linux AIO and io_uring using the RWF_DSYNC flag for write requests. The thread pool still performs a separate fdatasync() call. This can be improved later by using the pwritev2() syscall if available. As an example, this is how fio numbers can be improved in some scenarios with this patch (all using virtio-blk with cache=directsync on an nvme block device for the VM, fio with ioengine=libaio,direct=1,sync=1): | old | with FUA support ------------------------------+---------------+------------------- bs=4k, iodepth=1, numjobs=1 | 45.6k iops | 56.1k iops bs=4k, iodepth=1, numjobs=16 | 183.3k iops | 236.0k iops bs=4k, iodepth=16, numjobs=1 | 258.4k iops | 311.1k iops However, not all scenarios are clear wins. On another slower disk I saw little to no improvment. In fact, in two corner case scenarios, I even observed a regression, which I however consider acceptable: 1. On slow host disks in a write through cache mode, when the guest is using virtio-blk in a separate iothread so that polling can be enabled, and each completion is quickly followed up with a new request (so that polling gets it), it can happen that enabling FUA makes things slower - the additional very fast no-op flush we used to have gave the adaptive polling algorithm a success so that it kept polling. Without it, we only have the slow write request, which disables polling. This is a problem in the polling algorithm that will be fixed later in this series. 2. With a high queue depth, it can be beneficial to have flush requests for another reason: The optimisation in bdrv_co_flush() that flushes only once per write generation acts as a synchronisation mechanism that lets all requests complete at the same time. This can result in better batching and if the disk is very fast (I only saw this with a null_blk backend), this can make up for the overhead of the flush and improve throughput. In theory, we could optionally introduce a similar artificial latency in the normal completion path to achieve the same kind of completion batching. This is not implemented in this series. Compatibility is not a concern for the kernel side of io_uring, it has supported RWF_DSYNC from the start. However, io_uring_prep_writev2() is not available before liburing 2.2. Linux AIO started supporting it in Linux 4.13 and libaio 0.3.111. The kernel is not a problem for any supported build platform, so it's not necessary to add runtime checks. However, openSUSE is still stuck with an older libaio version that would break the build. We must detect the presence of the writev2 functions in the user space libraries at build time to avoid build failures. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250307221634.71951-2-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* Merge tag 'accel-cpus-20250306' of https://github.com/philmd/qemu into stagingStefan Hajnoczi2025-03-072-3/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Generic CPUs / accelerators patch queue - Merge "qemu/clang-tsa.h" within "qemu/compiler.h" - Various cleanups around accelerators initialization code (better user/system split) - Various trivial cleanups in accel/tcg/, Guard few TCG calls with tcg_enabled() - Explicit disassemble_info endianness - Improve dual-endianness support for MicroBlaze # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmfJw08ACgkQ4+MsLN6t # wN70whAAtfcdWtqseFfb6fvDtjflgxN51Ui0iaOECXUA18USKriGy34eBcMYMiM2 # +eKgU7+jI6JGE4+burcgWUsPpFFF951/A8+lyIbFgO5yToTDmC+qNe4XfmMAIyXq # uf9Obr2c0Xk9luh4odb+jPAQodw/7G1fKgcCVIJNDCl/xEcPhS9eNpTaHwcVnkWI # K6KrxWXOsqG6+evJBPWYoXtOOyt0+JcwAsJoGhprwtGm3P9+jSVXsgeGsJVyZcna # f32JtjWL754O8XeMkOn4x6rt58VrCIMKI9xT7keDyuhTCq0Zki9RO2nMU2dSw5mN # AfL9hxqUy0Nijnyslg3ugujDfTePsNyLdwwH7n0mnoD72ELi6WnhDsmOThuEB3Rd # 4/kdwTJfA/rlWk/GF1tbKW7AvQZokRARtzmL3V0HmGJu57lX+2JuszEdYBkqDEP7 # GH1I10B2yANUm+C9y3X8qWOU7Ws433ebJeJoZuyfnbZ9Me+UfRmql/oS+V8ata2i # fArEItpldUFrWRyYLkTbXrh2dgyV9yJTEir/lzOzeAZZzyabTbjf2z9qnh976GGO # 1QnDy5QA4f54kDBUZe7JK26TZsHPch7cgqXW6f8tRlJF7A9hxGK8d2TUV/lC3/vx # LUOlWNu03PhiruYmZEcWOsY3Jt9jRCF6lIryrnaJsqnVOVmMUMM= # =3TRh # -----END PGP SIGNATURE----- # gpg: Signature made Thu 06 Mar 2025 23:46:23 HKT # gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE # gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [full] # Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE * tag 'accel-cpus-20250306' of https://github.com/philmd/qemu: (54 commits) include: Poison TARGET_PHYS_ADDR_SPACE_BITS definition system: Open-code qemu_init_arch_modules() using target_name() target/i386: Mark WHPX APIC region as little-endian target/alpha: Do not mix exception flags and FPCR bits target/riscv: Convert misa_mxl_max using GLib macros target/riscv: Declare RISCVCPUClass::misa_mxl_max as RISCVMXL target/xtensa: Finalize config in xtensa_register_core() target/sparc: Constify SPARCCPUClass::cpu_def target/i386: Constify X86CPUModel uses disas: Remove target_words_bigendian() call in initialize_debug_target() target/xtensa: Set disassemble_info::endian value in disas_set_info() target/sh4: Set disassemble_info::endian value in disas_set_info() target/riscv: Set disassemble_info::endian value in disas_set_info() target/ppc: Set disassemble_info::endian value in disas_set_info() target/mips: Set disassemble_info::endian value in disas_set_info() target/microblaze: Set disassemble_info::endian value in disas_set_info target/arm: Set disassemble_info::endian value in disas_set_info() target: Set disassemble_info::endian value for big-endian targets target: Set disassemble_info::endian value for little-endian targets target/mips: Fix possible MSA int overflow ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * qemu/compiler: Absorb 'clang-tsa.h'Philippe Mathieu-Daudé2025-03-062-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We already have "qemu/compiler.h" for compiler-specific arrangements, automatically included by "qemu/osdep.h" for each source file. No need to explicitly include a header for a Clang particularity. Suggested-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250117170201.91182-1-philmd@linaro.org>
* | thread-pool: Implement generic (non-AIO) pool supportMaciej S. Szmigiero2025-03-061-0/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Migration code wants to manage device data sending threads in one place. QEMU has an existing thread pool implementation, however it is limited to queuing AIO operations only and essentially has a 1:1 mapping between the current AioContext and the AIO ThreadPool in use. Implement generic (non-AIO) ThreadPool by essentially wrapping Glib's GThreadPool. This brings a few new operations on a pool: * thread_pool_wait() operation waits until all the submitted work requests have finished. * thread_pool_set_max_threads() explicitly sets the maximum thread count in the pool. * thread_pool_adjust_max_threads_to_work() adjusts the maximum thread count in the pool to equal the number of still waiting in queue or unfinished work. Reviewed-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/qemu-devel/b1efaebdbea7cb7068b8fb74148777012383e12b.1741124640.git.maciej.szmigiero@oracle.com Signed-off-by: Cédric Le Goater <clg@redhat.com>
* | thread-pool: Rename AIO pool functions to *_aio() and data types to *AioMaciej S. Szmigiero2025-03-062-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | These names conflict with ones used by future generic thread pool equivalents. Generic names should belong to the generic pool type, not specific (AIO) type. Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/qemu-devel/70f9e0fb4b01042258a1a57996c64d19779dc7f0.1741124640.git.maciej.szmigiero@oracle.com Signed-off-by: Cédric Le Goater <clg@redhat.com>
* | thread-pool: Remove thread_pool_submit() functionMaciej S. Szmigiero2025-03-061-2/+1
|/ | | | | | | | | | | | | | This function name conflicts with one used by a future generic thread pool function and it was only used by one test anyway. Update the trace event name in thread_pool_submit_aio() accordingly. Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/qemu-devel/6830f07777f939edaf0a2d301c39adcaaf3817f0.1741124640.git.maciej.szmigiero@oracle.com Signed-off-by: Cédric Le Goater <clg@redhat.com>
* hw/ufs: Add temperature event notification supportKeoseong Park2025-03-051-1/+12
| | | | | | | | | | | | | | | | | This patch introduces temperature event notification support to the UFS emulation. It enables the emulated UFS device to generate temperature-related events, including high and low temperature notifications, in compliance with the UFS specification. With this feature, UFS drivers can now handle temperature exception events during testing and development within the emulated environment. This enhances validation and debugging capabilities for thermal event handling in UFS implementations. Signed-off-by: Keoseong Park <keosung.park@samsung.com> Reviewed-by: Jeuk Kim <jeuk20.kim@samsung.com> Message-ID: <20250225064146epcms2p50889cb0066e2d4734f2386de325bcdf6@epcms2p5> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
* hw/nvme: set error status code explicitly for misc commandsKlaus Jensen2025-02-261-0/+1
| | | | | | | | | | | | | | The nvme_aio_err() does not handle Verify, Compare, Copy and other misc commands and defaults to setting the error status code to Internal Device Error. For some of these commands, we know better, so set it explicitly. For the commands using the nvme_misc_cb() callback (Copy, Flush, ...), if no status code has explicitly been set by the lower handlers, default to Internal Device Error as previously. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: rework csi handlingKlaus Jensen2025-02-261-4/+6
| | | | | | | | | | | | | The controller incorrectly allows a zoned namespace to be attached even if CS.CSS is configured to only support the NVM command set for I/O queues. Rework handling of namespace command sets in general by attaching supported namespaces when the controller is started instead of, like now, statically when realized. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: be compliant wrt. dsm processing limitsKlaus Jensen2025-02-251-0/+2
| | | | | | | | | | | | | The specification states that, > The controller shall set all three processing limit fields (i.e., the > DMRL, DMRSL and DMSL fields) to non-zero values or shall clear all > three processing limit fields to 0h. So, set the DMRL and DMSL fields in addition to DMRSL. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* nvme: fix iocs status code valuesKlaus Jensen2025-02-251-2/+2
| | | | | | | The status codes related to I/O Command Sets are in the wrong group. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: add knob for doorbell buffer config supportKlaus Jensen2025-02-251-1/+1
| | | | | | | | Add a 'dbcs' knob to allow Doorbell Buffer Config command to be disabled. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: make oacs dynamicKlaus Jensen2025-02-251-1/+2
| | | | | | | | Virtualization Management needs sriov-related parameters. Only report support for the command when that conditions are true. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: Add OCP SMART / Health Information Extended Log PageStephen Bates2025-02-251-0/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Open Compute Project [1] includes a Datacenter NVMe SSD Specification [2]. The most recent version of this specification (as of November 2024) is 2.6.1. This specification layers on top of the NVM Express specifications [3] to provide additional functionality. A key part of of this is the 512 Byte OCP SMART / Health Information Extended log page that is defined in Section 4.8.6 of the specification. We add a controller argument (ocp) that toggles on/off the SMART log extended structure. To accommodate different vendor specific specifications like OCP, we add a multiplexing function (nvme_vendor_specific_log) which will route to the different log functions based on arguments and log ids. We only return the OCP extended SMART log when the command is 0xC0 and ocp has been turned on in the nvme argumants. Though we add the whole nvme SMART log extended structure, we only populate the physical_media_units_{read,written}, log_page_version and log_page_uuid. This patch is based on work done by Joel but has been modified enough that he requested a co-developed-by tag rather than a signed-off-by. [1]: https://www.opencompute.org/ [2]: https://www.opencompute.org/documents/datacenter-nvme-ssd-specification-v2-6-1-pdf [3]: https://nvmexpress.org/specifications/ Signed-off-by: Stephen Bates <sbates@raithlin.com> Co-developed-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* nbd/server: Allow users to adjust handshake limit in QMPEric Blake2025-02-111-3/+3
| | | | | | | | | | | | | | | | | | | | Although defaulting the handshake limit to 10 seconds was a nice QoI change to weed out intentionally slow clients, it can interfere with integration testing done with manual NBD_OPT commands over 'nbdsh --opt-mode'. Expose a QMP knob 'handshake-max-secs' to allow the user to alter the timeout away from the default. The parameter name here intentionally matches the spelling of the constant added in commit fb1c2aaa98, and not the command-line spelling added in the previous patch for qemu-nbd; that's because in QMP, longer names serve as good self-documentation, and unlike the command line, machines don't have problems generating longer spellings. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250203222722.650694-6-eblake@redhat.com> [eblake: s/max-secs/max-seconds/ in QMP] Acked-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
* Merge tag 'for-upstream' of https://repo.or.cz/qemu/kevin into stagingStefan Hajnoczi2025-02-103-1/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block layer patches - Managing inactive nodes (enables QSD migration with shared storage) - Fix swapped values for BLOCK_IO_ERROR 'device' and 'qom-path' - vpc: Read images exported from Azure correctly - scripts/qemu-gdb: Support coroutine dumps in coredumps - Minor cleanups # -----BEGIN PGP SIGNATURE----- # # iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmek34IRHGt3b2xmQHJl # ZGhhdC5jb20ACgkQfwmycsiPL9bDpxAAnTvwmdazAXG0g9GzqvrEB/+6rStjAsqE # 9MTWV4WxyN41d0RXxN8CYKb8CXSiTRyw6r3CSGNYEI2eShe9e934PriSkZm41HyX # n9Yh5YxqGZqitzvPtx62Ii/1KG+PcjQbfHuK1p4+rlKa0yQ2eGlio1JIIrZrCkBZ # ikZcQUrhIyD0XV8hTQ2+Ysa+ZN6itjnlTQIG3gS3m8f8WR7kyUXD8YFMQFJFyjVx # NrAIpLnc/ln9+5PZR9tje8U7XEn2KCgI5pgGaQnrd0h0G1H4ig8ogzYYnKTLhjU/ # AmQpS8np8Tyg6S1UZTiekEq0VuAhThEQc5b3sGbmHWH/R2ABMStyf18oCBAkPzZ7 # s6h+3XzTKKY2Q5Q3ZG/ANkUJjTNBhdj1fcaARvbSWsqsuk5CWX/I3jzvgihFtCSs # eGu+b/bLeW6P7hu4qPHBcgLHuB1Fc7Rd2t4BoIGM1wcO2CeC9DzUKOiIMZOEJIh0 # GGqCkEWDHgckDTakD4/vSqm0UDKt6FSlQC9ga/ILBY3IB5HpHoArY58selymy28i # X7MgAvbjdsmNuUuXDZZOiObcFt3j8jlmwPJpPyzXPQIiPX1RXeBPRhVAEeZCKn6Z # tfHr72SJdMeVOGXVTvOrJ2iW+4g03rPdmkDFCUhpOwo62RODq7ahvCIXsNf3nEFR # rSB3T1M/8EM= # =iQLP # -----END PGP SIGNATURE----- # gpg: Signature made Thu 06 Feb 2025 11:12:50 EST # gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6 # gpg: issuer "kwolf@redhat.com" # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full] # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6 * tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (25 commits) block: remove unused BLOCK_OP_TYPE_DATAPLANE iotests: Add (NBD-based) tests for inactive nodes iotests: Add qsd-migrate case iotests: Add filter_qtest() nbd/server: Support inactive nodes block/export: Add option to allow export of inactive nodes block: Drain nodes before inactivating them block/export: Don't ignore image activation error in blk_exp_add() block: Support inactive nodes in blk_insert_bs() block: Add blockdev-set-active QMP command block: Add option to create inactive nodes block: Fix crash on block_resize on inactive node block: Don't attach inactive child to active node migration/block-active: Remove global active flag block: Inactivate external snapshot overlays when necessary block: Allow inactivating already inactive nodes block: Add 'active' field to BlockDeviceInfo block-backend: Fix argument order when calling 'qapi_event_send_block_io_error()' scripts/qemu-gdb: Support coroutine dumps in coredumps scripts/qemu-gdb: Simplify fs_base fetching for coroutines ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * block: remove unused BLOCK_OP_TYPE_DATAPLANEStefan Hajnoczi2025-02-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BLOCK_OP_TYPE_DATAPLANE prevents BlockDriverState from being used by virtio-blk/virtio-scsi with IOThread. Commit b112a65c52aa ("block: declare blockjobs and dataplane friends!") eliminated the main reason for this blocker in 2014. Nowadays the block layer supports I/O from multiple AioContexts, so there is even less reason to block IOThread users. Any legitimate reasons related to interference would probably also apply to non-IOThread users. The only remaining users are bdrv_op_unblock(BLOCK_OP_TYPE_DATAPLANE) calls after bdrv_op_block_all(). If we remove BLOCK_OP_TYPE_DATAPLANE their behavior doesn't change. Existing bdrv_op_block_all() callers that don't explicitly unblock BLOCK_OP_TYPE_DATAPLANE seem to do so simply because no one bothered to rather than because it is necessary to keep BLOCK_OP_TYPE_DATAPLANE blocked. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250203182529.269066-1-stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block/export: Add option to allow export of inactive nodesKevin Wolf2025-02-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an option in BlockExportOptions to allow creating an export on an inactive node without activating the node. This mode needs to be explicitly supported by the export type (so that it doesn't perform any operations that are forbidden for inactive nodes), so this patch alone doesn't allow this option to be successfully used yet. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250204211407.381505-13-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block: Add blockdev-set-active QMP commandKevin Wolf2025-02-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The system emulator tries to automatically activate and inactivate block nodes at the right point during migration. However, there are still cases where it's necessary that the user can do this manually. Images are only activated on the destination VM of a migration when the VM is actually resumed. If the VM was paused, this doesn't happen automatically. The user may want to perform some operation on a block device (e.g. taking a snapshot or starting a block job) without also resuming the VM yet. This is an example where a manual command is necessary. Another example is VM migration when the image files are opened by an external qemu-storage-daemon instance on each side. In this case, the process that needs to hand over the images isn't even part of the migration and can't know when the migration completes. Management tools need a way to explicitly inactivate images on the source and activate them on the destination. This adds a new blockdev-set-active QMP command that lets the user change the status of individual nodes (this is necessary in qemu-storage-daemon because it could be serving multiple VMs and only one of them migrates at a time). For convenience, operating on all devices (like QEMU does automatically during migration) is offered as an option, too, and can be used in the context of single VM. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250204211407.381505-9-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>