summary refs log tree commit diff stats
path: root/hw/nvme/ctrl.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* pcie_sriov: Fix broken MMIO accesses from SR-IOV VFsDamien Bergamini2025-10-051-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Starting with commit cab1398a60eb, SR-IOV VFs are realized as soon as pcie_sriov_pf_init() is called. Because pcie_sriov_pf_init() must be called before pcie_sriov_pf_init_vf_bar(), the VF BARs types won't be known when the VF realize function calls pcie_sriov_vf_register_bar(). This breaks the memory regions of the VFs (for instance with igbvf): $ lspci ... Region 0: Memory at 281a00000 (64-bit, prefetchable) [virtual] [size=16K] Region 3: Memory at 281a20000 (64-bit, prefetchable) [virtual] [size=16K] $ info mtree ... address-space: pci_bridge_pci_mem 0000000000000000-ffffffffffffffff (prio 0, i/o): pci_bridge_pci 0000000081a00000-0000000081a03fff (prio 1, i/o): igbvf-mmio 0000000081a20000-0000000081a23fff (prio 1, i/o): igbvf-msix and causes MMIO accesses to fail: Invalid write at addr 0x281A01520, size 4, region '(null)', reason: rejected Invalid read at addr 0x281A00C40, size 4, region '(null)', reason: rejected To fix this, VF BARs are now registered with pci_register_bar() which has a type parameter and pcie_sriov_vf_register_bar() is removed. Fixes: cab1398a60eb ("pcie_sriov: Reuse SR-IOV VF device instances") Signed-off-by: Damien Bergamini <damien.bergamini@eviden.com> Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com> Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <20250901151314.1038020-1-clement.mathieu--drif@eviden.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* hw/nvme: cap MDTS value for internal limitationKeith Busch2025-08-111-0/+5
| | | | | | | | | | | | | | | The emulated device had let the user set whatever max transfers size they wanted, including no limit. However the device does have an internal limit of 1024 segments. NVMe doesn't report max segments, though. This is implicitly inferred based on the MDTS and MPSMIN values. IOV_MAX is currently 1024 which 4k PRPs can exceed with 2MB transfers. Don't allow MDTS values that can exceed this, otherwise users risk seeing "internal error" status to their otherwise protocol compliant commands. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: revert CMIC behaviorKlaus Jensen2025-08-111-13/+17
| | | | | | | | | | | | | | | | | Commit cd59f50ab017 ("hw/nvme: always initialize a subsystem") causes the controller to always set the CMIC.MCTRS ("Multiple Controllers") bit. While spec-compliant, this is a deviation from the previous behavior where this was only set if an nvme-subsys device was explicitly created (to configure a subsystem with multiple controllers/namespaces). Revert the behavior to only set CMIC.MCTRS if an nvme-subsys device is created explicitly. Reported-by: Alan Adamson <alan.adamson@oracle.com> Fixes: cd59f50ab017 ("hw/nvme: always initialize a subsystem") Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Tested-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: fix namespace attachmentKlaus Jensen2025-08-111-2/+6
| | | | | | | | | | | | | | | Commit 6ccca4b6bb9f ("hw/nvme: rework csi handling") introduced a bug in Namespace Attachment, causing it to a) not allow a controller to attach namespaces to other controllers b) assert if a valid non-attached namespace is detached This fixes both issues. Fixes: 6ccca4b6bb9f ("hw/nvme: rework csi handling") Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2976 Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* treewide: fix paths for relocated files in commentsSean Wei2025-07-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the docs directory restructuring, several comments refer to paths that no longer exist. Replace these references to the current file locations so readers can find the correct files. Related commits --------------- 189c099f75f (Jul 2021) docs: collect the disparate device emulation docs into one section Rename docs/system/{ => devices}/nvme.rst 5f4c96b779f (Feb 2023) docs/system/loongarch: update loongson3.rst and rename it to virt.rst Rename docs/system/loongarch/{loongson3.rst => virt.rst} fe0007f3c1d (Sep 2023) exec: Rename cpu.c -> cpu-target.c Rename cpus-common.c => cpu-common.c 42fa9665e59 (Apr 2025) exec: Restrict 'cpu_ldst.h' to accel/tcg/ Rename include/{exec/cpu_ldst.h => accel/tcg/cpu-ldst.h} Signed-off-by: Sean Wei <me@sean.taipei> Message-ID: <20250616.qemu.relocated.06@sean.taipei> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com>
* hw/nvme/ctrl: skip automatic zero-init of large arraysDaniel P. Berrangé2025-06-121-3/+3
| | | | | | | | | | | | | | | | | | | | | The 'nvme_map_sgl' method has a 256 element array used for copying data from the device. Skip the automatic zero-init of this array to eliminate the performance overhead in the I/O hot path. The 'segment' array will be fully initialized when reading data from the device. The 'nme_changed_nslist' method has a 4k byte array that is manually initialized with memset(). The compiler ought to be intelligent enough to turn the memset() into a static initialization operation, and thus not duplicate the automatic zero-init. Replacing memset() with '{}' makes it unambiguous that the array is statically initialized. Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Message-id: 20250610123709.835102-24-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* qom: Make InterfaceInfo[] uses constPhilippe Mathieu-Daudé2025-04-251-1/+1
| | | | | | | | | | | Mechanical change using: $ sed -i -E 's/\(InterfaceInfo.?\[/\(const InterfaceInfo\[/g' \ $(git grep -lE '\(InterfaceInfo.?\[\]\)') Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-Id: <20250424194905.82506-7-philmd@linaro.org>
* qom: Have class_init() take a const data argumentPhilippe Mathieu-Daudé2025-04-251-1/+1
| | | | | | | | | | Mechanical change using gsed, then style manually adapted to pass checkpatch.pl script. Suggested-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20250424194905.82506-4-philmd@linaro.org>
* hw/nvme: fix attachment of private namespacesKlaus Jensen2025-04-081-1/+6
| | | | | | | | | | | | | | | | | | Fix regression when attaching private namespaces that gets attached to the wrong controller. Keep track of the original controller "owner" of private namespaces, and only attach if this matches on controller enablement. Fixes: 6ccca4b6bb9f ("hw/nvme: rework csi handling") Reported-by: Alan Adamson <alan.adamson@oracle.com> Suggested-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com> Tested-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250408-fix-private-ns-v1-1-28e169b6b60b@samsung.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
* pci: Use PCI PM capability initializerAlex Williamson2025-03-061-2/+1
| | | | | | | | | | | | | | | | | | | | | | | Switch callers directly initializing the PCI PM capability with pci_add_capability() to use pci_pm_init(). Cc: Dmitry Fleytman <dmitry.fleytman@gmail.com> Cc: Akihiko Odaki <akihiko.odaki@daynix.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Stefan Weil <sw@weilnetz.de> Cc: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Klaus Jensen <its@irrelevant.dk> Cc: Jesper Devantier <foss@defmacro.it> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> Cc: Cédric Le Goater <clg@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/qemu-devel/20250225215237.3314011-3-alex.williamson@redhat.com Signed-off-by: Cédric Le Goater <clg@redhat.com>
* hw/nvme: remove nvme_aio_err()Klaus Jensen2025-02-261-37/+23
| | | | | | | | nvme_rw_complete_cb() is the only remaining user of nvme_aio_err(), so open code the status code setting instead. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: set error status code explicitly for misc commandsKlaus Jensen2025-02-261-6/+22
| | | | | | | | | | | | | | The nvme_aio_err() does not handle Verify, Compare, Copy and other misc commands and defaults to setting the error status code to Internal Device Error. For some of these commands, we know better, so set it explicitly. For the commands using the nvme_misc_cb() callback (Copy, Flush, ...), if no status code has explicitly been set by the lower handlers, default to Internal Device Error as previously. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: only set command abort requested when cancelled due to AbortKlaus Jensen2025-02-261-4/+2
| | | | | | | | | | The Command Abort Requested status code should only be set if the command was explicitly cancelled due to an Abort command. Or, in the case the cancel was due to Submission Queue deletion, set the status code to Command Aborted due to SQ Deletion. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: rework csi handlingKlaus Jensen2025-02-261-92/+121
| | | | | | | | | | | | | The controller incorrectly allows a zoned namespace to be attached even if CS.CSS is configured to only support the NVM command set for I/O queues. Rework handling of namespace command sets in general by attaching supported namespaces when the controller is started instead of, like now, statically when realized. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: be compliant wrt. dsm processing limitsKlaus Jensen2025-02-251-9/+15
| | | | | | | | | | | | | The specification states that, > The controller shall set all three processing limit fields (i.e., the > DMRL, DMRSL and DMSL fields) to non-zero values or shall clear all > three processing limit fields to 0h. So, set the DMRL and DMSL fields in addition to DMRSL. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* nvme: fix iocs status code valuesKlaus Jensen2025-02-251-2/+2
| | | | | | | The status codes related to I/O Command Sets are in the wrong group. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: add knob for doorbell buffer config supportKlaus Jensen2025-02-251-3/+8
| | | | | | | | Add a 'dbcs' knob to allow Doorbell Buffer Config command to be disabled. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: make oacs dynamicKlaus Jensen2025-02-251-7/+18
| | | | | | | | Virtualization Management needs sriov-related parameters. Only report support for the command when that conditions are true. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: always initialize a subsystemKlaus Jensen2025-02-251-16/+20
| | | | | | | If no nvme-subsys is explicitly configured, instantiate one. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: Add OCP SMART / Health Information Extended Log PageStephen Bates2025-02-251-0/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Open Compute Project [1] includes a Datacenter NVMe SSD Specification [2]. The most recent version of this specification (as of November 2024) is 2.6.1. This specification layers on top of the NVM Express specifications [3] to provide additional functionality. A key part of of this is the 512 Byte OCP SMART / Health Information Extended log page that is defined in Section 4.8.6 of the specification. We add a controller argument (ocp) that toggles on/off the SMART log extended structure. To accommodate different vendor specific specifications like OCP, we add a multiplexing function (nvme_vendor_specific_log) which will route to the different log functions based on arguments and log ids. We only return the OCP extended SMART log when the command is 0xC0 and ocp has been turned on in the nvme argumants. Though we add the whole nvme SMART log extended structure, we only populate the physical_media_units_{read,written}, log_page_version and log_page_uuid. This patch is based on work done by Joel but has been modified enough that he requested a co-developed-by tag rather than a signed-off-by. [1]: https://www.opencompute.org/ [2]: https://www.opencompute.org/documents/datacenter-nvme-ssd-specification-v2-6-1-pdf [3]: https://nvmexpress.org/specifications/ Signed-off-by: Stephen Bates <sbates@raithlin.com> Co-developed-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* pcie_sriov: Ensure VF addr does not overflowAkihiko Odaki2025-02-201-8/+14
| | | | | | | | | pci_new() aborts when creating a VF with addr >= PCI_DEVFN_MAX. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20250116-reuse-v20-7-7cb370606368@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* Merge tag 'exec-20241220' of https://github.com/philmd/qemu into stagingStefan Hajnoczi2024-12-211-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Accel & Exec patch queue - Ignore writes to CNTP_CTL_EL0 on HVF ARM (Alexander) - Add '-d invalid_mem' logging option (Zoltan) - Create QOM containers explicitly (Peter) - Rename sysemu/ -> system/ (Philippe) - Re-orderning of include/exec/ headers (Philippe) Move a lot of declarations from these legacy mixed bag headers: . "exec/cpu-all.h" . "exec/cpu-common.h" . "exec/cpu-defs.h" . "exec/exec-all.h" . "exec/translate-all" to these more specific ones: . "exec/page-protection.h" . "exec/translation-block.h" . "user/cpu_loop.h" . "user/guest-host.h" . "user/page-protection.h" # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmdlnyAACgkQ4+MsLN6t # wN6mBw//QFWi7CrU+bb8KMM53kOU9C507tjn99LLGFb5or73/umDsw6eo/b8DHBt # KIwGLgATel42oojKfNKavtAzLK5rOrywpboPDpa3SNeF1onW+99NGJ52LQUqIX6K # A6bS0fPdGG9ZzEuPpbjDXlp++0yhDcdSgZsS42fEsT7Dyj5gzJYlqpqhiXGqpsn8 # 4Y0UMxSL21K3HEexlzw2hsoOBFA3tUm2ujNDhNkt8QASr85yQVLCypABJnuoe/// # 5Ojl5wTBeDwhANET0rhwHK8eIYaNboiM9fHopJYhvyw1bz6yAu9jQwzF/MrL3s/r # xa4OBHBy5mq2hQV9Shcl3UfCQdk/vDaYaWpgzJGX8stgMGYfnfej1SIl8haJIfcl # VMX8/jEFdYbjhO4AeGRYcBzWjEJymkDJZoiSWp2NuEDi6jqIW+7yW1q0Rnlg9lay # ShAqLK5Pv4zUw3t0Jy3qv9KSW8sbs6PQxtzXjk8p97rTf76BJ2pF8sv1tVzmsidP # 9L92Hv5O34IqzBu2oATOUZYJk89YGmTIUSLkpT7asJZpBLwNM2qLp5jO00WVU0Sd # +kAn324guYPkko/TVnjC/AY7CMu55EOtD9NU35k3mUAnxXT9oDUeL4NlYtfgrJx6 # x1Nzr2FkS68+wlPAFKNSSU5lTjsjNaFM0bIJ4LCNtenJVP+SnRo= # =cjz8 # -----END PGP SIGNATURE----- # gpg: Signature made Fri 20 Dec 2024 11:45:20 EST # gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE # gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [unknown] # gpg: WARNING: This key is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. # Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE * tag 'exec-20241220' of https://github.com/philmd/qemu: (59 commits) util/qemu-timer: fix indentation meson: Do not define CONFIG_DEVICES on user emulation system/accel-ops: Remove unnecessary 'exec/cpu-common.h' header system/numa: Remove unnecessary 'exec/cpu-common.h' header hw/xen: Remove unnecessary 'exec/cpu-common.h' header target/mips: Drop left-over comment about Jazz machine target/mips: Remove tswap() calls in semihosting uhi_fstat_cb() target/xtensa: Remove tswap() calls in semihosting simcall() helper accel/tcg: Un-inline translator_is_same_page() accel/tcg: Include missing 'exec/translation-block.h' header accel/tcg: Move tcg_cflags_has/set() to 'exec/translation-block.h' accel/tcg: Restrict curr_cflags() declaration to 'internal-common.h' qemu/coroutine: Include missing 'qemu/atomic.h' header exec/translation-block: Include missing 'qemu/atomic.h' header accel/tcg: Declare cpu_loop_exit_requested() in 'exec/cpu-common.h' exec/cpu-all: Include 'cpu.h' earlier so MMU_USER_IDX is always defined target/sparc: Move sparc_restore_state_to_opc() to cpu.c target/sparc: Uninline cpu_get_tb_cpu_state() target/loongarch: Declare loongarch_cpu_dump_state() locally user: Move various declarations out of 'exec/exec-all.h' ... Conflicts: hw/char/riscv_htif.c hw/intc/riscv_aplic.c target/s390x/cpu.c Apply sysemu header path changes to not in the pull request. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * include: Rename sysemu/ -> system/Philippe Mathieu-Daudé2024-12-201-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Headers in include/sysemu/ are not only related to system *emulation*, they are also used by virtualization. Rename as system/ which is clearer. Files renamed manually then mechanical change using sed tool. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Tested-by: Lei Yang <leiyang@redhat.com> Message-Id: <20241203172445.28576-1-philmd@linaro.org>
* | include/hw/qdev-properties: Remove DEFINE_PROP_END_OF_LISTRichard Henderson2024-12-191-1/+0
|/ | | | | | | | | | | | | | Now that all of the Property arrays are counted, we can remove the terminator object from each array. Update the assertions in device_class_set_props to match. With struct Property being 88 bytes, this was a rather large form of terminator. Saves 30k from qemu-system-aarch64. Signed-off-by: Richard Henderson <richard.henderson@linaro.org> Tested-by: Lei Yang <leiyang@redhat.com> Link: https://lore.kernel.org/r/20241218134251.4724-21-richard.henderson@linaro.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* hw/nvme: Constify all PropertyRichard Henderson2024-12-151-1/+1
| | | | | Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
* hw/nvme: take a reference on the subsystem on vf realizationKlaus Jensen2024-12-031-0/+7
| | | | | | | | | | | | | | | | | Make sure we grab a reference on the subsystem when a VF is realized. Otherwise, the subsytem will be unrealized automatically when the VFs are unregistered and unreffed. This fixes a latent bug but was not exposed until commit 08f632848008 ("pcie: Release references of virtual functions"). This was then fixed (or rather, hidden) by commit c613ad25125b ("pcie_sriov: Do not manually unrealize"), but that was then reverted (due to other issues) in commit b0fdaee5d1ed, exposing the bug yet again. Cc: qemu-stable@nongnu.org Fixes: 08f632848008 ("pcie: Release references of virtual functions") Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: SR-IOV VFs must hardwire pci interrupt pin register to zeroKlaus Jensen2024-12-031-1/+7
| | | | | | | | | The PCI Interrupt Pin Register does not apply to VFs and MUST be hardwired to zero. Fixes: 44c2c09488db ("hw/nvme: Add support for SR-IOV") Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: fix use/unuse of msix vectorsKlaus Jensen2024-12-031-2/+3
| | | | | | | | Only call msix_{un,}use_vector() when interrupts are actually enabled for a completion queue. Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: fix msix_uninit with exclusive barKlaus Jensen2024-12-031-1/+6
| | | | | | | | | | | | Commit fa905f65c554 introduced a machine compatibility parameter to enable an exclusive bar for msix. It failed to account for this when cleaning up. Make sure that if an exclusive bar is enabled, we use the proper cleanup routine. Cc: qemu-stable@nongnu.org Fixes: fa905f65c554 ("hw/nvme: add machine compatibility parameter to enable msix exclusive bar") Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: fix handling of over-committed queuesKlaus Jensen2024-11-081-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a host chooses to use the SQHD "hint" in the CQE to know if there is room in the submission queue for additional commands, it may result in a situation where there are not enough internal resources (struct NvmeRequest) available to process the command. For a lack of a better term, the host may "over-commit" the device (i.e., it may have more inflight commands than the queue size). For example, assume a queue with N entries. The host submits N commands and all are picked up for processing, advancing the head and emptying the queue. Regardless of which of these N commands complete first, the SQHD field of that CQE will indicate to the host that the queue is empty, which allows the host to issue N commands again. However, if the device has not posted CQEs for all the previous commands yet, the device will have less than N resources available to process the commands, so queue processing is suspended. And here lies an 11 year latent bug. In the absense of any additional tail updates on the submission queue, we never schedule the processing bottom-half again unless we observe a head update on an associated full completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups (in the absense of over-commit of course). Incidentially, that "kick all associated SQs" mechanism can now be killed since we now just schedule queue processing when we return a processing resource to a non-empty submission queue, which happens to cover both edge cases. However, we must retain kicking the CQ if it was previously full. So, apparently, no previous driver tested with hw/nvme has ever used SQHD (e.g., neither the Linux NVMe driver or SPDK uses it). But then OSv shows up with the driver that actually does. I salute you. Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface") Cc: qemu-stable@nongnu.org Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388 Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: i/o cmd set independent namespace data structureArun Kumar2024-11-041-0/+31
| | | | | | | | | | Add support for the I/O Command Set Independent Namespace Data Structure (CNS 8h and 1fh). Signed-off-by: Arun Kumar <arun.kka@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Link: https://lore.kernel.org/r/20240925004407.3521406-1-arun.kka@samsung.com Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: add atomic write supportAlan Adamson2024-10-011-1/+157
| | | | | | | | | | | | | | | | | | | | Adds support for the controller atomic parameters: AWUN and AWUPF. Atomic Compare and Write Unit (ACWU) is not currently supported. Writes that adhere to the ACWU and AWUPF parameters are guaranteed to be atomic. New NVMe QEMU Parameters (See NVMe Specification for details): atomic.dn (default off) - Set the value of Disable Normal. atomic.awun=UINT16 (default: 0) atomic.awupf=UINT16 (default: 0) By default (Disable Normal set to zero), the maximum atomic write size is set to the AWUN value. If Disable Normal is set, the maximum atomic write size is set to AWUPF. Signed-off-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: add knob for CTRATT.MEMKlaus Jensen2024-10-011-1/+6
| | | | | | | | | Add a boolean prop (ctratt.mem) for setting CTRATT.MEM and default it to unset (false) to keep existing behavior of the device intact. Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Arun Kumar <arun.kka@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: support CTRATT.MEMArun Kumar2024-09-301-5/+14
| | | | | | | | | | | Indicate that 'MDTS and Size Limits Exclude Metadata (MEM)' in the Controller Attributes (CTRATT) I/O Command Set Independent Identify Controller Data Structure. Signed-off-by: Arun Kumar <arun.kka@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> [k.jensen: updated commit message] Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: clear masked events from the aer queueArun Kumar2024-09-301-2/+9
| | | | | | | | | Clear masked events from the aer queue when get log page is issued with RAE 0 without checking for the presence of outstanding aer requests. Signed-off-by: Arun Kumar <arun.kka@samsung.com> [k.jensen: remove unnecessary QTAILQ_EMPTY check] Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: report id controller metadata sgl supportKeith Busch2024-09-301-1/+2
| | | | | | | | | The controller already supports this decoding, so just set the ID_CTRL.SGLS field accordingly. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: replace assert(false) with g_assert_not_reached()Pierrick Bouvier2024-09-241-4/+4
| | | | | | | | | | | | This patch is part of a series that moves towards a consistent use of g_assert_not_reached() rather than an ad hoc mix of different assertion mechanisms. Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Message-ID: <20240919044641.386068-11-pierrick.bouvier@linaro.org> Signed-off-by: Thomas Huth <thuth@redhat.com>
* hw: Use device_class_set_legacy_reset() instead of opencodingPeter Maydell2024-09-131-1/+1
| | | | | | | | | | | | | Use device_class_set_legacy_reset() instead of opencoding an assignment to DeviceClass::reset. This change was produced with: spatch --macro-file scripts/cocci-macro-file.h \ --sp-file scripts/coccinelle/device-reset.cocci \ --keep-comments --smpl-spacing --in-place --dir hw Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-id: 20240830145812.1967042-8-peter.maydell@linaro.org
* hw/nvme: fix leak of uninitialized memory in io_mgmt_recvKlaus Jensen2024-08-201-1/+1
| | | | | | | | | | Yutaro Shimizu from the Cyber Defense Institute discovered a bug in the NVMe emulation that leaks contents of an uninitialized heap buffer if subsystem and FDP emulation are enabled. Cc: qemu-stable@nongnu.org Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* Revert "pcie_sriov: Ensure VF function number does not overflow"Michael S. Tsirkin2024-08-011-16/+8
| | | | | | This reverts commit 77718701157f6ca77ea7a57b536fa0a22f676082. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* Merge tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu ↵Richard Henderson2024-07-241-0/+62
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | into staging virtio,pci,pc: features,fixes pci: Initial support for SPDM Responders cxl: Add support for scan media, feature commands, device patrol scrub control, DDR5 ECS control, firmware updates virtio: in-order support virtio-net: support for SR-IOV emulation (note: known issues on s390, might get reverted if not fixed) smbios: memory device size is now configurable per Machine cpu: architecture agnostic code to support vCPU Hotplug Fixes, cleanups all over the place. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> # -----BEGIN PGP SIGNATURE----- # # iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmae9l8PHG1zdEByZWRo # YXQuY29tAAoJECgfDbjSjVRp8fYH/impBH9nViO/WK48io4mLSkl0EUL8Y/xrMvH # zKFCKaXq8D96VTt1Z4EGKYgwG0voBKZaCEKYU/0ARGnSlSwxINQ8ROCnBWMfn2sx # yQt08EXVMznNLtXjc6U5zCoCi6SaV85GH40No3MUFXBQt29ZSlFqO/fuHGZHYBwS # wuVKvTjjNF4EsGt3rS4Qsv6BwZWMM+dE6yXpKWk68kR8IGp+6QGxkMbWt9uEX2Md # VuemKVnFYw0XGCGy5K+ZkvoA2DGpEw0QxVSOMs8CI55Oc9SkTKz5fUSzXXGo1if+ # M1CTjOPJu6pMym6gy6XpFa8/QioDA/jE2vBQvfJ64TwhJDV159s= # =k8e9 # -----END PGP SIGNATURE----- # gpg: Signature made Tue 23 Jul 2024 10:16:31 AM AEST # gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469 # gpg: issuer "mst@redhat.com" # gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [undefined] # gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [undefined] # gpg: WARNING: The key's User ID is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. # Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67 # Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469 * tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu: (61 commits) hw/nvme: Add SPDM over DOE support backends: Initial support for SPDM socket support hw/pci: Add all Data Object Types defined in PCIe r6.0 tests/acpi: Add expected ACPI AML files for RISC-V tests/qtest/bios-tables-test.c: Enable basic testing for RISC-V tests/acpi: Add empty ACPI data files for RISC-V tests/qtest/bios-tables-test.c: Remove the fall back path tests/acpi: update expected DSDT blob for aarch64 and microvm acpi/gpex: Create PCI link devices outside PCI root bridge tests/acpi: Allow DSDT acpi table changes for aarch64 hw/riscv/virt-acpi-build.c: Update the HID of RISC-V UART hw/riscv/virt-acpi-build.c: Add namespace devices for PLIC and APLIC virtio-iommu: Add trace point on virtio_iommu_detach_endpoint_from_domain hw/vfio/common: Add vfio_listener_region_del_iommu trace event virtio-iommu: Remove the end point on detach virtio-iommu: Free [host_]resv_ranges on unset_iommu_devices virtio-iommu: Remove probe_done Revert "virtio-iommu: Clear IOMMUDevice when VFIO device is unplugged" gdbstub: Add helper function to unregister GDB register space physmem: Add helper function to destroy CPU AddressSpace ... Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
| * hw/nvme: Add SPDM over DOE supportWilfred Mallawa2024-07-221-0/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | Setup Data Object Exchange (DOE) as an extended capability for the NVME controller and connect SPDM to it (CMA) to it. Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Klaus Jensen <k.jensen@samsung.com> Message-Id: <20240703092027.644758-4-alistair.francis@wdc.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* | hw/nvme: remove useless type castYao Xingtao2024-07-221-1/+1
| | | | | | | | | | | | | | | | | | The type of req->cmd is NvmeCmd, cast the pointer of this type to NvmeCmd* is useless. Signed-off-by: Yao Xingtao <yaoxt.fnst@fujitsu.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* | hw/nvme: actually implement abortAyush Mishra2024-07-221-0/+32
| | | | | | | | | | | | | | | | | | Abort was not implemented previously, but we can implement it for AERs and asynchrnously for I/O. Signed-off-by: Ayush Mishra <ayush.m55@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* | hw/nvme: add cross namespace copy supportArun Kumar2024-07-221-92/+263
| | | | | | | | | | | | | | | | | | Extend copy command to copy user data across different namespaces via support for specifying a namespace for each source range Signed-off-by: Arun Kumar <arun.kka@samsung.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* | hw/nvme: fix memory leak in nvme_dsmZheyu Ma2024-07-221-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | The allocated memory to hold LBA ranges leaks in the nvme_dsm function. This happens because the allocated memory for iocb->range is not freed in all error handling paths. Fix this by adding a free to ensure that the allocated memory is properly freed. ASAN log: ==3075137==ERROR: LeakSanitizer: detected memory leaks Direct leak of 480 byte(s) in 6 object(s) allocated from: #0 0x55f1f8a0eddd in malloc llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:129:3 #1 0x7f531e0f6738 in g_malloc (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x5e738) #2 0x55f1faf1f091 in blk_aio_get block/block-backend.c:2583:12 #3 0x55f1f945c74b in nvme_dsm hw/nvme/ctrl.c:2609:30 #4 0x55f1f945831b in nvme_io_cmd hw/nvme/ctrl.c:4470:16 #5 0x55f1f94561b7 in nvme_process_sq hw/nvme/ctrl.c:7039:29 Cc: qemu-stable@nongnu.org Fixes: d7d1474fd85d ("hw/nvme: reimplement dsm to allow cancellation") Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: Expand VI/VQ resource to uint32Minwoo Im2024-07-111-4/+4
| | | | | | | | | | | | | VI and VQ resources cover queue resources in each VFs in SR-IOV. Current maximum I/O queue pair size is 0xffff, we can expand them to cover the full number of I/O queue pairs. This patch also fixed Identify Secondary Controller List overflow due to expand of number of secondary controllers. Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: Allocate sec-ctrl-list as a dynamic arrayMinwoo Im2024-07-111-7/+1
| | | | | | | | | | To prevent further bumping up the number of maximum VF te support, this patch allocates a dynamic array (NvmeCtrl *)->sec_ctrl_list based on number of VF supported by sriov_max_vfs property. Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: separate identify data for sec. ctrl listMinwoo Im2024-07-111-11/+10
| | | | | | | | | | | | | | | | | | Secondary controller list for virtualization has been managed by Identify Secondary Controller List data structure with NvmeSecCtrlList where up to 127 secondary controller entries can be managed. The problem hasn't arisen so far because NVME_MAX_VFS has been 127. This patch separated identify data itself from the actual secondary controller list managed by controller to support more than 127 secondary controllers with the following patch. This patch reused NvmeSecCtrlEntry structure to manage all the possible secondary controllers, and copy entries to identify data structure when the command comes in. Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
* hw/nvme: add Identify Endurance Group ListMinwoo Im2024-07-111-0/+22
| | | | | | | | | | | | | | Commit 73064edfb864 ("hw/nvme: flexible data placement emulation") intorudced NVMe FDP feature to nvme-subsys and nvme-ctrl with a single endurance group #1 supported. This means that controller should return proper identify data to host with Identify Endurance Group List (CNS 19h). But, yes, only just for the endurance group #1. This patch allows host applications to ask for which endurance group is available and utilize FDP through that endurance group. Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>