Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"
From: Nilay Shroff
Date: Tue May 21 2024 - 06:07:48 EST
On 5/21/24 15:15, Sagi Grimberg wrote:
>
>
> On 21/05/2024 11:48, Nilay Shroff wrote:
>>
>> On 5/21/24 01:50, John Meneghini wrote:
>>> From: "Ewan D. Milne" <emilne@xxxxxxxxxx>
>>>
>>> The round-robin path selector is inefficient in cases where there is a
>>> difference in latency between multiple active optimized paths. In the
>>> presence of one or more high latency paths the round-robin selector
>>> continues to the high latency path equally. This results in a bias
>>> towards the highest latency path and can cause is significant decrease
>>> in overall performance as IOs pile on the lowest latency path. This
>>> problem is particularly accute with NVMe-oF controllers.
>>>
>>> The queue-depth policy instead sends I/O requests down the path with the
>>> least amount of requests in its request queue. Paths with lower latency
>>> will clear requests more quickly and have less requests in their queues
>>> compared to higher latency paths. The goal of this path selector is to
>>> make more use of lower latency paths, which will bring down overall IO
>>> latency.
>>>
>>> Signed-off-by: Ewan D. Milne <emilne@xxxxxxxxxx>
>>> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>>> and compilation warnings, updated MODULE_PARM description, and
>>> fixed potential issue with ->current_path[] being used]
>>> Signed-off-by: Thomas Song <tsong@xxxxxxxxxxxxxxx>
>>> [jmeneghi: vairious changes and improvements, addressed review comments]
>>> Signed-off-by: John Meneghini <jmeneghi@xxxxxxxxxx>
>>> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@xxxxxxxxxx/
>>> Tested-by: Marco Patalano <mpatalan@xxxxxxxxxx>
>>> Reviewed-by: Randy Jennings <randyj@xxxxxxxxxx>
>>> Tested-by: Jyoti Rani <jani@xxxxxxxxxxxxxxx>
>>> ---
>>> drivers/nvme/host/core.c | 2 +-
>>> drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
>>> drivers/nvme/host/nvme.h | 9 ++++
>>> 3 files changed, 92 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index a066429b790d..1dd7c52293ff 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
>>> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>>> static LIST_HEAD(nvme_subsystems);
>>> -static DEFINE_MUTEX(nvme_subsystems_lock);
>>> +DEFINE_MUTEX(nvme_subsystems_lock);
>>> static DEFINE_IDA(nvme_instance_ida);
>>> static dev_t nvme_ctrl_base_chr_devt;
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 5397fb428b24..0e2b6e720e95 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>>> static const char *nvme_iopolicy_names[] = {
>>> [NVME_IOPOLICY_NUMA] = "numa",
>>> [NVME_IOPOLICY_RR] = "round-robin",
>>> + [NVME_IOPOLICY_QD] = "queue-depth",
>>> };
>>> static int iopolicy = NVME_IOPOLICY_NUMA;
>>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>> iopolicy = NVME_IOPOLICY_NUMA;
>>> else if (!strncmp(val, "round-robin", 11))
>>> iopolicy = NVME_IOPOLICY_RR;
>>> + else if (!strncmp(val, "queue-depth", 11))
>>> + iopolicy = NVME_IOPOLICY_QD;
>>> else
>>> return -EINVAL;
>>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>>> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>>> &iopolicy, 0644);
>>> MODULE_PARM_DESC(iopolicy,
>>> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>>> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>>> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>> {
>>> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
>>> struct nvme_ns *ns = rq->q->queuedata;
>>> struct gendisk *disk = ns->head->disk;
>>> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
>>> + atomic_inc(&ns->ctrl->nr_active);
>>> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
>>> + }
>>> +
>>> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
>>> return;
>>> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
>>> {
>>> struct nvme_ns *ns = rq->q->queuedata;
>>> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>>> + atomic_dec_if_positive(&ns->ctrl->nr_active);
>>> +
>>> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>> return;
>>> +
>>> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>> blk_rq_bytes(rq) >> SECTOR_SHIFT,
>>> nvme_req(rq)->start_time);
>>> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
>>> return found;
>>> }
>>>
>> I think you may also want to reset nr_active counter if in case, in-flight nvme request
>> is cancelled. If the request is cancelled then nvme_mpath_end_request() wouldn't be invoked.
>> So you may want to reset nr_active counter from nvme_cancel_request() as below:
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index bf7615cb36ee..4fea7883ce8e 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
>> bool nvme_cancel_request(struct request *req, void *data)
>> {
>> - dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
>> - "Cancelling I/O %d", req->tag);
>> + struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
>> +
>> + dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
>> /* don't abort one completed or idle request */
>> if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
>> @@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
>> nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
>> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>> + atomic_dec(&ctrl->nr_active);
>
> Don't think this matters because cancellation only happens when we
> teardown the controller anyways...
>
I think in case if we reset the nvme controller then we don't teardown
controller, isn't it? In this case we cancel all pending requests, and
later restart the controller.
Thanks,
--Nilay