summary refs log tree commit diff stats
path: root/hw/core/qdev-properties-system.c (unfollow)
Commit message (Collapse)AuthorFilesLines
2025-05-14tests: Add iotest mirror-sparse for recent patchesEric Blake2-0/+490
Prove that blockdev-mirror can now result in sparse raw destination files, regardless of whether the source is raw or qcow2. By making this a separate test, it was possible to test effects of individual patches for the various pieces that all have to work together for a sparse mirror to be successful. Note that ./check -file produces different job lengths than ./check -qcow2 (the test uses a filter to normalize); that's because when deciding how much of the image to be mirrored, the code looks at how much of the source image was allocated (for qcow2, this is only the written clusters; for raw, it is the entire file). But the important part is that the destination file ends up smaller than 3M, rather than the 20M it used to be before this patch series. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-28-eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-05-14iotests/common.rc: add disk_usage functionAndrey Drobyshev2-5/+6
Move the definition from iotests/250 to common.rc. This is used to detect real disk usage of sparse files. In particular, we want to use it for checking subclusters-based discards. Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Reviewed-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-ID: <20240913163942.423050-6-andrey.drobyshev@virtuozzo.com> Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-27-eblake@redhat.com>
2025-05-14mirror: Skip writing zeroes when target is already zeroEric Blake1-14/+93
When mirroring, the goal is to ensure that the destination reads the same as the source; this goal is met whether the destination is sparse or fully-allocated (except when explicitly punching holes, then merely reading zero is not enough to know if it is sparse, so we still want to punch the hole). Avoiding a redundant write to zero (whether in the background because the zero cluster was marked in the dirty bitmap, or in the foreground because the guest is writing zeroes) when the destination already reads as zero makes mirroring faster, and avoids allocating the destination merely because the source reports as allocated. The effect is especially pronounced when the source is a raw file. That's because when the source is a qcow2 file, the dirty bitmap only visits the portions of the source that are allocated, which tend to be non-zero. But when the source is a raw file, bdrv_co_is_allocated_above() reports the entire file as allocated so mirror_dirty_init sets the entire dirty bitmap, and it is only later during mirror_iteration that we change to consulting the more precise bdrv_co_block_status_above() to learn where the source reads as zero. Remember that since a mirror operation can write a cluster more than once (every time the guest changes the source, the destination is also changed to keep up), and the guest can change whether a given cluster reads as zero, is discarded, or has non-zero data over the course of the mirror operation, we can't take the shortcut of relying on s->target_is_zero (which is static for the life of the job) in mirror_co_zero() to see if the destination is already zero, because that information may be stale. Any solution we use must be dynamic in the face of the guest writing or discarding a cluster while the mirror has been ongoing. We could just teach mirror_co_zero() to do a block_status() probe of the destination, and skip the zeroes if the destination already reads as zero, but we know from past experience that extra block_status() calls are not always cheap (tmpfs, anyone?), especially when they are random access rather than linear. Use of block_status() of the source by the background task in a linear fashion is not our bottleneck (it's a background task, after all); but since mirroring can be done while the source is actively being changed, we don't want a slow block_status() of the destination to occur on the hot path of the guest trying to do random-access writes to the source. So this patch takes a slightly different approach: any time we have to track dirty clusters, we can also track which clusters are known to read as zero. For sync=TOP or when we are punching holes from "detect-zeroes":"unmap", the zero bitmap starts out empty, but prevents a second write zero to a cluster that was already zero by an earlier pass; for sync=FULL when we are not punching holes, the zero bitmap starts out full if the destination reads as zero during initialization. Either way, I/O to the destination can now avoid redundant write zero to a cluster that already reads as zero, all without having to do a block_status() per write on the destination. With this patch, if I create a raw sparse destination file, connect it with QMP 'blockdev-add' while leaving it at the default "discard": "ignore", then run QMP 'blockdev-mirror' with "sync": "full", the destination remains sparse rather than fully allocated. Meanwhile, a destination image that is already fully allocated remains so unless it was opened with "detect-zeroes": "unmap". And any time writing zeroes is skipped, the job counters are not incremented. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-26-eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-05-14mirror: Skip pre-zeroing destination if it is already zeroEric Blake4-13/+33
When doing a sync=full mirroring, we can skip pre-zeroing the destination if it already reads as zeroes and we are not also trying to punch holes due to detect-zeroes. With this patch, there are fewer scenarios that have to pass in an explicit target-is-zero, while still resulting in a sparse destination remaining sparse. A later patch will then further improve things to skip writing to the destination for parts of the image where the source is zero; but even with just this patch, it is possible to see a difference for any source that does not report itself as fully allocated, coupled with a destination BDS that can quickly report that it already reads as zero. (For a source that reports as fully allocated, such as a file, the rest of mirror_dirty_init() still sets the entire dirty bitmap to true, so even though we avoided the pre-zeroing, we are not yet avoiding all redundant I/O). Iotest 194 detects the difference made by this patch: for a file source (where block status reports the entire image as allocated, and therefore we end up writing zeroes everywhere in the destination anyways), the job length remains the same. But for a qcow2 source and a destination that reads as all zeroes, the dirty bitmap changes to just tracking the allocated portions of the source, which results in faster completion and smaller job statistics. For the test to pass with both ./check -file and -qcow2, a new python filter is needed to mask out the now-varying job amounts (this matches the shell filters _filter_block_job_{offset,len} in common.filter). A later test will also be added which further validates expected sparseness, so it does not matter that 194 is no longer explicitly looking at how many bytes were copied. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-25-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-05-14mirror: Drop redundant zero_target parameterEric Blake4-26/+11
The two callers to a mirror job (drive-mirror and blockdev-mirror) set zero_target precisely when sync mode == FULL, with the one exception that drive-mirror skips zeroing the target if it was newly created and reads as zero. But given the previous patch, that exception is equally captured by target_is_zero. Meanwhile, there is another slight wrinkle, fortunately caught by iotest 185: if the caller uses "sync":"top" but the source has no backing file, the code in blockdev.c was changing sync to be FULL, but only after it had set zero_target=false. In mirror.c, prior to recent patches, this didn't matter: the only places that inspected sync were setting is_none_mode (both TOP and FULL had set that to false), and mirror_start() setting base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL. But now that we are passing sync around, the slammed sync mode would result in a new pre-zeroing pass even when the user had passed "sync":"top" in an effort to skip pre-zeroing. Fortunately, the assignment of base when bs has no backing chain still works out to NULL if we don't slam things. So with the forced change of sync ripped out of blockdev.c, the sync mode is passed through the full callstack unmolested, and we can now reliably reconstruct the same settings as what used to be passed in by zero_target=false, without the redundant parameter. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-24-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> [eblake: Fix regression in iotest 185] Signed-off-by: Eric Blake <eblake@redhat.com>
2025-05-14mirror: Allow QMP override to declare target already zeroEric Blake5-14/+44
QEMU has an optimization for a just-created drive-mirror destination that is not possible for blockdev-mirror (which can't create the destination) - any time we know the destination starts life as all zeroes, we can skip a pre-zeroing pass on the destination. Recent patches have added an improved heuristic for detecting if a file contains all zeroes, and we plan to use that heuristic in upcoming patches. But since a heuristic cannot quickly detect all scenarios, and there may be cases where the caller is aware of information that QEMU cannot learn quickly, it makes sense to have a way to tell QEMU to assume facts about the destination that can make the mirror operation faster. Given our existing example of "qemu-img convert --target-is-zero", it is time to expose this override in QMP for blockdev-mirror as well. This patch results in some slight redundancy between the older s->zero_target (set any time mode==FULL and the destination image was not just created - ie. clear if drive-mirror is asking to skip the pre-zero pass) and the newly-introduced s->target_is_zero (in addition to the QMP override, it is set when drive-mirror creates the destination image); this will be cleaned up in the next patch. There is also a subtlety that we must consider. When drive-mirror is passing target_is_zero on behalf of a just-created image, we know the image is sparse (skipping the pre-zeroing keeps it that way), so it doesn't matter whether the destination also has "discard":"unmap" and "detect-zeroes":"unmap". But now that we are letting the user set the knob for target-is-zero, if the user passes a pre-existing file that is fully allocated, it is fine to leave the file fully allocated under "detect-zeroes":"on", but if the file is open with "detect-zeroes":"unmap", we should really be trying harder to punch holes in the destination for every region of zeroes copied from the source. The easiest way to do this is to still run the pre-zeroing pass (turning the entire destination file sparse before populating just the allocated portions of the source), even though that currently results in double I/O to the portions of the file that are allocated. A later patch will add further optimizations to reduce redundant zeroing I/O during the mirror operation. Since "target-is-zero":true is designed for optimizations, it is okay to silently ignore the parameter rather than erroring if the user ever sets the parameter in a scenario where the mirror job can't exploit it (for example, when doing "sync":"top" instead of "sync":"full", we can't pre-zero, so setting the parameter won't make a speed difference). Signed-off-by: Eric Blake <eblake@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-ID: <20250509204341.3553601-23-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-05-14mirror: Pass full sync mode rather than bool to internalsEric Blake1-12/+12
Out of the five possible values for MirrorSyncMode, INCREMENTAL and BITMAP are already rejected up front in mirror_start, leaving NONE, TOP, and FULL as the remaining values that the code was collapsing into a single bool is_none_mode. Furthermore, mirror_dirty_init() is only reachable for modes TOP and FULL, as further guided by s->zero_target. However, upcoming patches want to further optimize the pre-zeroing pass of a sync=full mirror in mirror_dirty_init(), while avoiding that pass on a sync=top action. Instead of throwing away context by collapsing these two values into s->is_none_mode=false, it is better to pass s->sync_mode throughout the entire operation. For active commit, the desired semantics match sync mode TOP. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-22-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-05-14mirror: Minor refactoringEric Blake1-11/+11
Commit 5791ba52 (v9.2) pre-initialized ret in mirror_dirty_init to silence a false positive compiler warning, even though in all code paths where ret is used, it was guaranteed to be reassigned beforehand. But since the function returns -errno, and -1 is not always the right errno, it's better to initialize to -EIO. An upcoming patch wants to track two bitmaps in do_sync_target_write(); this will be easier if the current variables related to the dirty bitmap are renamed. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-21-eblake@redhat.com>
2025-05-14iotests: Improve iotest 194 to mirror dataEric Blake1-0/+1
Mirroring a completely sparse image to a sparse destination should be practically instantaneous. It isn't yet, but the test will be more realistic if it has some non-zero to mirror as well as the holes. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-20-eblake@redhat.com>
2025-05-14block: Add new bdrv_co_is_all_zeroes() functionEric Blake2-0/+64
There are some optimizations that require knowing if an image starts out as reading all zeroes, such as making blockdev-mirror faster by skipping the copying of source zeroes to the destination. The existing bdrv_co_is_zero_fast() is a good building block for answering this question, but it tends to give an answer of 0 for a file we just created via QMP 'blockdev-create' or similar (such as 'qemu-img create -f raw'). Why? Because file-posix.c insists on allocating a tiny header to any file rather than leaving it 100% sparse, due to some filesystems that are unable to answer alignment probes on a hole. But teaching file-posix.c to read the tiny header doesn't scale - the problem of a small header is also visible when libvirt sets up an NBD client to a just-created file on a migration destination host. So, we need a wrapper function that handles a bit more complexity in a common manner for all block devices - when the BDS is mostly a hole, but has a small non-hole header, it is still worth the time to read that header and check if it reads as all zeroes before giving up and returning a pessimistic answer. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-19-eblake@redhat.com>
2025-05-14block: Let bdrv_co_is_zero_fast consolidate adjacent extentsEric Blake1-12/+15
Some BDS drivers have a cap on how much block status they can supply in one query (for example, NBD talking to an older server cannot inspect more than 4G per query; and qcow2 tends to cap its answers rather than cross a cluster boundary of an L1 table). Although the existing callers of bdrv_co_is_zero_fast are not passing in that large of a 'bytes' parameter, an upcoming caller wants to query the entire image at once, and will thus benefit from being able to treat adjacent zero regions in a coalesced manner, rather than claiming the region is non-zero merely because pnum was truncated and didn't match the incoming bytes. While refactoring this into a loop, note that there is no need to assign pnum prior to calling bdrv_co_common_block_status_above() (it is guaranteed to be assigned deeper in the callstack). Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-18-eblake@redhat.com>
2025-05-14file-posix, gluster: Handle zero block status hint betterEric Blake2-2/+3
Although the previous patch to change 'bool want_zero' into a bitmask made no semantic change, it is now time to differentiate. When the caller specifically wants to know what parts of the file read as zero, we need to use lseek and actually reporting holes, rather than short-circuiting and advertising full allocation. This change will be utilized in later patches to let mirroring optimize for the case when the destination already reads as zeroes. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-17-eblake@redhat.com>
2025-05-14block: Expand block status mode from bool to flagsEric Blake25-86/+99
This patch is purely mechanical, changing bool want_zero into an unsigned int for bitwise-or of flags. As of this patch, all implementations are unchanged (the old want_zero==true is now mode==BDRV_WANT_PRECISE which is a superset of BDRV_WANT_ZERO); but the callers in io.c that used to pass want_zero==false are now prepared for future driver changes that can now distinguish bewteen BDRV_WANT_ZERO vs. BDRV_WANT_ALLOCATED. The next patch will actually change the file-posix driver along those lines, now that we have more-specific hints. As for the background why this patch is useful: right now, the file-posix driver recognizes that if allocation is being queried, the entire image can be reported as allocated (there is no backing file to refer to) - but this throws away information on whether the entire image reads as zero (trivially true if lseek(SEEK_HOLE) at offset 0 returns -ENXIO, a bit more complicated to prove if the raw file was created with 'qemu-img create' since we intentionally allocate a small chunk of all-zero data to help with alignment probing). Later patches will add a generic algorithm for seeing if an entire file reads as zeroes. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-16-eblake@redhat.com>
2025-05-14hw/i386/amd_iommu: Allow migration when explicitly create the AMDVI-PCI deviceSuravee Suthikulpanit2-1/+49
Add migration support for AMD IOMMU model by saving necessary AMDVIState parameters for MMIO registers, device table, command buffer, and event buffers. Also change devtab_len type from size_t to uint64_t to avoid 32-bit build issue. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20250504170405.12623-3-suravee.suthikulpanit@amd.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/i386/amd_iommu: Isolate AMDVI-PCI from amd-iommu device to allow full ↵Suravee Suthikulpanit3-26/+38
control over the PCI device creation Current amd-iommu model internally creates an AMDVI-PCI device. Here is a snippet from info qtree: bus: main-system-bus type System dev: amd-iommu, id "" xtsup = false pci-id = "" intremap = "on" device-iotlb = false pt = true ... dev: q35-pcihost, id "" MCFG = -1 (0xffffffffffffffff) pci-hole64-size = 34359738368 (32 GiB) below-4g-mem-size = 134217728 (128 MiB) above-4g-mem-size = 0 (0 B) smm-ranges = true x-pci-hole64-fix = true x-config-reg-migration-enabled = true bypass-iommu = false bus: pcie.0 type PCIE dev: AMDVI-PCI, id "" addr = 01.0 romfile = "" romsize = 4294967295 (0xffffffff) rombar = -1 (0xffffffffffffffff) multifunction = false x-pcie-lnksta-dllla = true x-pcie-extcap-init = true failover_pair_id = "" acpi-index = 0 (0x0) x-pcie-err-unc-mask = true x-pcie-ari-nextfn-1 = false x-max-bounce-buffer-size = 4096 (4 KiB) x-pcie-ext-tag = true busnr = 0 (0x0) class Class 0806, addr 00:01.0, pci id 1022:0000 (sub 1af4:1100) ... This prohibits users from specifying the PCI topology for the amd-iommu device, which becomes a problem when trying to support VM migration since it does not guarantee the same enumeration of AMD IOMMU device. Therefore, allow the 'AMDVI-PCI' device to optionally be pre-created and associated with a 'amd-iommu' device via a new 'pci-id' parameter on the latter. For example: -device AMDVI-PCI,id=iommupci0,bus=pcie.0,addr=0x05 \ -device amd-iommu,intremap=on,pt=on,xtsup=on,pci-id=iommupci0 \ For backward-compatibility, internally create the AMDVI-PCI device if not specified on the CLI. Co-developed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20250504170405.12623-2-suravee.suthikulpanit@amd.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14intel_iommu: Take locks when looking for and creating address spacesCLEMENT MATHIEU--DRIF1-1/+24
vtd_find_add_as can be called by multiple threads which leads to a race condition. Taking the IOMMU lock ensures we avoid such a race. Moreover we also need to take the bql to avoid an assert to fail in memory_region_add_subregion_overlap when actually allocating a new address space. Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com> Message-Id: <20250430124750.240412-3-clement.mathieu--drif@eviden.com> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14tests/functional: Skip the screendump tests if the command is not availableThomas Huth3-6/+12
It is possible nowadays to compile QEMU without pixman support - in that case the screendump command is not available and the related tests fail. Thus skip these tests if the screendump command could not be executed. Signed-off-by: Thomas Huth <thuth@redhat.com> Message-ID: <20250325081713.283490-2-thuth@redhat.com>
2025-05-14tests/functional/test_s390x_tuxrun: Check whether the machine is availableThomas Huth1-0/+1
The s390x tuxrun test lacks the call to self.set_machine(), so this test is currently failing in case the 's390-ccw-virtio' machine has not been compiled into the binary. Add the check now to fix it. Signed-off-by: Thomas Huth <thuth@redhat.com> Message-ID: <20250424090640.664217-1-thuth@redhat.com>
2025-05-14include/hw/dma/xlnx_dpdma: Remove dependency on console.hThomas Huth1-1/+0
console.h brings a dependency on the <epoxy/opengl.h> and the pixman header file (if available), so we should avoid to include this file if it is not really necessary. console.h does not seem to be necessary for the xlnx_dpdma code, so drop the include here. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Thomas Huth <thuth@redhat.com> Message-ID: <20250508144120.163009-2-thuth@redhat.com>
2025-05-14intel_iommu: Use BQL_LOCK_GUARD to manage cleanup automaticallyCLEMENT MATHIEU--DRIF1-9/+1
vtd_switch_address_space needs to take the BQL if not already held. Use BQL_LOCK_GUARD to make the iommu implementation more consistent. Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com> Message-Id: <20250430124750.240412-2-clement.mathieu--drif@eviden.com> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14virtio: Move virtio_reset()Akihiko Odaki1-45/+43
Move virtio_reset() to a later part of the file to remove the forward declaration of virtio_set_features_nocheck() and to prepare the situation that we need more operations to perform during reset. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250421-reset-v2-2-e4c1ead88ea1@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14virtio: Call set_features during resetAkihiko Odaki1-1/+3
virtio-net expects set_features() will be called when the feature set used by the guest changes to update the number of virtqueues but it is not called during reset, which will clear all features, leaving the queues added for VIRTIO_NET_F_MQ or VIRTIO_NET_F_RSS. Not only these extra queues are visible to the guest, they will cause segmentation fault during migration. Call set_features() during reset to remove those queues for virtio-net as we call set_status(). It will also prevent similar bugs for virtio-net and other devices in the future. Fixes: f9d6dbf0bf6e ("virtio-net: remove virtio queues if the guest doesn't support multiqueue") Buglink: https://issues.redhat.com/browse/RHEL-73842 Cc: qemu-stable@nongnu.org Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250421-reset-v2-1-e4c1ead88ea1@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14vhost-scsi: support VIRTIO_SCSI_F_HOTPLUGDongli Zhang1-0/+3
So far there isn't way to test host kernel vhost-scsi event queue path, because VIRTIO_SCSI_F_HOTPLUG isn't supported by QEMU. virtio-scsi.c and vhost-user-scsi.c already support VIRTIO_SCSI_F_HOTPLUG as property "hotplug". Add support to vhost-scsi.c to help evaluate and test event queue. To test the feature: 1. Create vhost-scsi target with targetcli. targetcli /backstores/fileio create name=storage file_or_dev=disk01.raw targetcli /vhost create naa.1123451234512345 targetcli /vhost/naa.1123451234512345/tpg1/luns create /backstores/fileio/storage 2. Create QEMU instance with vhost-scsi. -device vhost-scsi-pci,wwpn=naa.1123451234512345,hotplug=true 3. Once guest bootup, hotplug a new LUN from host. targetcli /backstores/fileio create name=storage02 file_or_dev=disk02.raw targetcli /vhost/naa.1123451234512345/tpg1/luns create /backstores/fileio/storage02 Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Message-Id: <20250203005215.1502-1-dongli.zhang@oracle.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com>
2025-05-14vhost-user: return failure if backend crash when live migrationHaoqian He26-109/+160
Live migration should be terminated if the vhost-user backend crashes before the migration completes. Specifically, since the vhost device will be stopped when VM is stopped before the end of the live migration, in current implementation if the backend crashes, vhost-user device set_status() won't return failure, live migration won't perceive the disconnection between QEMU and the backend. When the VM is migrated to the destination, the inflight IO will be resubmitted, and if the IO was completed out of order before, it will cause IO error. To fix this issue: 1. Add the return value to set_status() for VirtioDeviceClass. a. For the vhost-user device, return failure when the backend crashes. b. For other virtio devices, always return 0. 2. Return failure if vhost_dev_stop() failed for vhost-user device. If QEMU loses connection with the vhost-user backend, virtio set_status() can return failure to the upper layer, migration_completion() can handle the error, terminate the live migration, and restore the VM, so that inflight IO can be completed normally. Signed-off-by: Haoqian He <haoqian.he@smartx.com> Message-Id: <20250416024729.3289157-4-haoqian.he@smartx.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14vhost: return failure if stop virtqueue failed in vhost_dev_stopHaoqian He2-13/+18
This patch captures the error of vhost_virtqueue_stop() in vhost_dev_stop() and returns the error upward. Specifically, if QEMU is disconnected from the vhost backend, some actions in vhost_dev_stop() will fail, such as sending vhost-user messages to the backend (GET_VRING_BASE, SET_VRING_ENABLE) and vhost_reset_status. Considering that both set_vring_enable and vhost_reset_status require setting the specific virtio feature bit, we can capture vhost_virtqueue_stop()'s error to indicate that QEMU has lost connection with the backend. This patch is the pre patch for 'vhost-user: return failure if backend crashes when live migration', which makes the live migration aware of the loss of connection with the vhost-user backend and aborts the live migration. Signed-off-by: Haoqian He <haoqian.he@smartx.com> Message-Id: <20250416024729.3289157-3-haoqian.he@smartx.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14system/runstate: add VM state change cb with return valueHaoqian He8-20/+62
This patch adds the new VM state change cb type `VMChangeStateHandlerWithRet`, which has return value for `VMChangeStateEntry`. Thus, we can register a new VM state change cb with return value for device. Note that `VMChangeStateHandler` and `VMChangeStateHandlerWithRet` are mutually exclusive and cannot be provided at the same time. This patch is the pre patch for 'vhost-user: return failure if backend crashes when live migration', which makes the live migration aware of the loss of connection with the vhost-user backend and aborts the live migration. Virtio device will use VMChangeStateHandlerWithRet. Signed-off-by: Haoqian He <haoqian.he@smartx.com> Message-Id: <20250416024729.3289157-2-haoqian.he@smartx.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14pci-testdev.c: Add membar-backed option for backing membarStephen Bates1-2/+10
The pci-testdev device allows for an optional BAR. We have historically used this without backing to test that systems and OSes can accomodate large PCI BARs. However to help test p2pdma operations it is helpful to add an option to back this BAR with host memory. We add a membar-backed boolean parameter and when set to true or on we do a host RAM backing. The default is false which ensures backward compatability. Signed-off-by: Stephen Bates <sbates@raithlin.com> Message-Id: <Z_6JhDtn5PlaDgB_@MKMSTEBATES01.amd.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14pcie_sriov: Make a PCI device with user-created VF ARI-capableAkihiko Odaki4-10/+24
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-9-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14docs: Document composable SR-IOV deviceAkihiko Odaki3-0/+38
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-8-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14virtio-net: Implement SR-IOV VFAkihiko Odaki1-0/+1
A virtio-net device can be added as a SR-IOV VF to another virtio-pci device that will be the PF. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-7-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14virtio-pci: Implement SR-IOV PFAkihiko Odaki2-5/+16
Allow user to attach SR-IOV VF to a virtio-pci PF. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-6-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14pcie_sriov: Allow user to create SR-IOV deviceAkihiko Odaki4-78/+286
A user can create a SR-IOV device by specifying the PF with the sriov-pf property of the VFs. The VFs must be added before the PF. A user-creatable VF must have PCIDeviceClass::sriov_vf_user_creatable set. Such a VF cannot refer to the PF because it is created before the PF. A PF that user-creatable VFs can be attached calls pcie_sriov_pf_init_from_user_created_vfs() during realization and pcie_sriov_pf_exit() when exiting. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-5-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14pcie_sriov: Check PCI Express for SR-IOV PFAkihiko Odaki1-0/+5
SR-IOV requires PCI Express. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-4-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14pcie_sriov: Ensure PF and VF are mutually exclusiveAkihiko Odaki1-0/+5
A device cannot be a SR-IOV PF and a VF at the same time. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-3-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/pci: Fix SR-IOV VF number calculationAkihiko Odaki1-1/+5
pci_config_get_bar_addr() had a division by vf_stride. vf_stride needs to be non-zero when there are multiple VFs, but the specification does not prohibit to make it zero when there is only one VF. Do not perform the division for the first VF to avoid division by zero. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-2-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/pci: Do not add ROM BAR for SR-IOV VFAkihiko Odaki1-0/+8
A SR-IOV VF cannot have a ROM BAR. Co-developed-by: Yui Washizu <yui.washidu@gmail.com> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250314-sriov-v9-1-57dae8ae3ab5@daynix.com> Tested-by: Yui Washizu <yui.washidu@gmail.com> Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14docs/cxl: Add serial number for persistent-memdevYuquan Wang1-9/+9
Add serial number parameter in the cxl persistent examples. Signed-off-by: Yuquan Wang <wangyuquan1236@phytium.com.cn> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20250305092501.191929-9-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/cxl/cxl-mailbox-utils: CXL CCI Get/Set alert config commandsSweta Kumari3-0/+134
1) get alert configuration(Opcode 4201h) 2) set alert configuration(Opcode 4202h) Signed-off-by: Sweta Kumari <s5.kumari@samsung.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20250305092501.191929-7-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/cxl/cxl-mailbox-utils: Media operations Sanitize and Write Zeros commands ↵Vinayak Holikatti2-0/+208
CXL r3.2(8.2.10.9.5.3) CXL spec 3.2 section 8.2.10.9.5.3 describes media operations commands. CXL devices supports media operations Sanitize and Write zero command. Signed-off-by: Vinayak Holikatti <vinayak.kh@samsung.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20250305092501.191929-6-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/cxl: factor out calculation of sanitize duration from cmd_santize_overwriteVinayak Holikatti1-26/+35
Move the code of calculation of sanitize duration into function for usability by other sanitize routines Estimate times based on: https://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Vinayak Holikatti <vinayak.kh@samsung.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20250305092501.191929-5-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/cxl/cxl-mailbox-utils: Add support for Media operations discovery ↵Vinayak Holikatti1-0/+125
commands cxl r3.2 (8.2.10.9.5.3) CXL spec 3.2 section 8.2.10.9.5.3 describes media operations commands. CXL devices supports media operations discovery command. Signed-off-by: Vinayak Holikatti <vinayak.kh@samsung.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20250305092501.191929-4-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/cxl: Support get/set mctp response payload sizeDavidlohr Bueso1-2/+56
Add Get/Set Response Message Limit commands. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Fan Ni <fan.ni@samsung.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20250305092501.191929-3-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/cxl: Support aborting background commandsDavidlohr Bueso5-13/+86
As of 3.1 spec, background commands can be canceled with a new abort command. Implement the support, which is advertised in the CEL. No ad-hoc context undoing is necessary as all the command logic of the running bg command is done upon completion. Arbitrarily, the on-going background cmd will not be aborted if already at least 85% done; A mutex is introduced to stabilize mbox request cancel command vs the timer callback being fired scenarios (as well as reading the mbox registers). While some operations under critical regions may be unnecessary (irq notifying, cmd callbacks), this is not a path where performance is important, so simplicity is preferred. Tested-by: Ajay Joshi <ajay.opensrc@micron.com> Reviewed-by: Ajay Joshi <ajay.opensrc@micron.com> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20250305092501.191929-2-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-05-14hw/loongarch/boot: Adjust the loading position of the initrdXianglai Li1-9/+43
When only the -kernel parameter is used to load the elf kernel, the initrd is loaded in the ram. If the initrd size is too large, the loading fails, resulting in a VM startup failure. This patch first loads initrd near the kernel. When the nearby memory space of the kernel is insufficient, it tries to load it to the starting position of high memory. If there is still not enough, qemu will report an error and ask the user to increase the memory space for the virtual machine to boot. Signed-off-by: Xianglai Li <lixianglai@loongson.cn> Message-Id: <20250506080946.817092-1-lixianglai@loongson.cn> Signed-off-by: Song Gao <gaosong@loongson.cn>
2025-05-14hw/intc/loongarch_pch: Merge three memory region into oneBibo Mao3-73/+1
Since memory region iomem supports memory access size with 1/2/4/8, it can be used for memory region iomem8 and iomem32_high. Now remove memory region iomem8 and iomem32_high, merge them into iomem together. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Song Gao <gaosong@loongson.cn> Message-Id: <20250507023754.1877445-5-maobibo@loongson.cn> Signed-off-by: Song Gao <gaosong@loongson.cn>
2025-05-14hw/intc/loongarch_pch: Set flexible memory access size with iomem regionBibo Mao1-3/+10
The original iomem region only supports 4 bytes access size, set it ok with 1/2/4/8 bytes. Also unaligned memory access is not supported. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Song Gao <gaosong@loongson.cn> Message-Id: <20250507023754.1877445-4-maobibo@loongson.cn> Signed-off-by: Song Gao <gaosong@loongson.cn>
2025-05-14hw/intc/loongarch_pch: Rename memory region iomem32_low with iomemBibo Mao2-20/+8
Rename memory region iomem32_low with iomem, also change ops name as follows: loongarch_pch_pic_reg32_low_ops --> loongarch_pch_pic_ops loongarch_pch_pic_low_readw --> loongarch_pch_pic_read loongarch_pch_pic_low_writew --> loongarch_pch_pic_write Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Song Gao <gaosong@loongson.cn> Message-Id: <20250507023754.1877445-3-maobibo@loongson.cn> Signed-off-by: Song Gao <gaosong@loongson.cn>
2025-05-14hw/intc/loongarch_pch: Use unified trace event for memory region opsBibo Mao2-24/+8
Add trace event trace_loongarch_pch_pic_read(), replaces the following three events: trace_loongarch_pch_pic_low_readw() trace_loongarch_pch_pic_high_readw() trace_loongarch_pch_pic_readb() The similiar with write trace event. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Song Gao <gaosong@loongson.cn> Message-Id: <20250507023754.1877445-2-maobibo@loongson.cn> Signed-off-by: Song Gao <gaosong@loongson.cn>
2025-05-14hw/intc/loongarch_pch: Use generic write callback for iomem8 regionBibo Mao1-21/+10
Add iomem8 region register write operation emulation in generic write function loongarch_pch_pic_write(), and use this function for iomem8 region. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Song Gao <gaosong@loongson.cn> Message-Id: <20250507023754.1877445-1-maobibo@loongson.cn> Signed-off-by: Song Gao <gaosong@loongson.cn>
2025-05-14hw/intc/loongarch_pch: Use generic write callback for iomem32_high regionBibo Mao1-23/+5
Add iomem32_high region register write operation emulation in generic write function loongarch_pch_pic_write(), and use this function for iomem32_high region. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Song Gao <gaosong@loongson.cn> Message-Id: <20250507023148.1877287-12-maobibo@loongson.cn> Signed-off-by: Song Gao <gaosong@loongson.cn>