* [PATCH v7 0/4] resource: Use list_head to link sibling resource
@ 2018-07-18  2:49 Baoquan He
  2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
  ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, Baoquan He

This patchset does the following:

1) Move reparent_resources() to kernel/resource.c to clean up duplicated
   code in arch/microblaze/pci/pci-common.c and
   arch/powerpc/kernel/pci-common.c.

2) Replace struct resource's sibling list, changing it from a singly
   linked list to a list_head. This clears out the pointer operations of
   the singly linked list, for better code readability.

3) Based on the list_head replacement, add a new function
   walk_system_ram_res_rev() which can do reverse iteration over
   iomem_resource's siblings.

4) Change kexec_file loading to search system RAM top-down for kernel
   loading, using walk_system_ram_res_rev().

Note: This patchset has only been tested on the x86_64 arch with
networking enabled.

One thing we need to pay attention to is that a root resource's child
member needs to be initialized explicitly: with LIST_HEAD_INIT() if
statically defined, or INIT_LIST_HEAD() if dynamically defined, just
like we do for iomem_resource/ioport_resource, or the change in
get_pci_domain_busn_res().
v6: http://lkml.kernel.org/r/20180704041038.8190-1-bhe@redhat.com
v5: http://lkml.kernel.org/r/20180612032831.29747-1-bhe@redhat.com
v4: http://lkml.kernel.org/r/20180507063224.24229-1-bhe@redhat.com
v3: http://lkml.kernel.org/r/20180419001848.3041-1-bhe@redhat.com
v2: http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
v1: http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Changelog:
v6->v7:
  Fix code bugs that the test robot reported on mips and ia64.
  Add error code descriptions to reparent_resources() according to
  Andy's comment, and fix a minor log typo.

v5->v6:
  Fix code style problems in reparent_resources() and use existing
  error codes, according to Andy's suggestion.
  Fix bugs the test robot reported.

v4->v5:
  Add new patch 0001 to move the duplicated reparent_resources() to
  kernel/resource.c so that it can be shared by different ARCH-es.
  Fix several code bugs reported by the test robot on ARCH powerpc
  and microblaze.

v3->v4:
  Fix several bugs the test robot reported.
  Rewrite the cover letter and patch logs according to reviewers'
  comments.

v2->v3:
  Rename the resource functions first_child() and sibling() to
  resource_first_child() and resource_sibling(). Dan suggested this.
  Move resource_first_child() and resource_sibling() to linux/ioport.h
  and make them inline functions. Rob suggested this. Accordingly,
  include linux/list.h in linux/ioport.h; please help review whether
  this brings efficiency degradation or code redundancy.
  The change to struct resource {} brings an increase of two pointers
  in size; mention this in the git log to be more specific. Rob
  suggested this.

v1->v2:
  Use list_head instead to link resource siblings. This is suggested
  by Andrew.
  Rewrite walk_system_ram_res_rev() after list_head is taken to link
  resource siblings.
Baoquan He (4):
  resource: Move reparent_resources() to kernel/resource.c and make it
    public
  resource: Use list_head to link sibling resource
  resource: add walk_system_ram_res_rev()
  kexec_file: Load kernel at top of system RAM if required

 arch/arm/plat-samsung/pm-check.c            |   6 +-
 arch/ia64/sn/kernel/io_init.c               |   2 +-
 arch/microblaze/pci/pci-common.c            |  41 +----
 arch/mips/pci/pci-rc32434.c                 |  12 +-
 arch/powerpc/kernel/pci-common.c            |  39 +---
 arch/sparc/kernel/ioport.c                  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h        |   4 +-
 drivers/eisa/eisa-bus.c                     |   2 +
 drivers/gpu/drm/drm_memory.c                |   3 +-
 drivers/gpu/drm/gma500/gtt.c                |   5 +-
 drivers/hv/vmbus_drv.c                      |  52 +++---
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c             |   6 +-
 drivers/nvdimm/nd.h                         |   5 +-
 drivers/of/address.c                        |   4 +-
 drivers/parisc/lba_pci.c                    |   4 +-
 drivers/pci/controller/vmd.c                |   8 +-
 drivers/pci/probe.c                         |   2 +
 drivers/pci/setup-bus.c                     |   2 +-
 include/linux/ioport.h                      |  21 ++-
 kernel/kexec_file.c                         |   2 +
 kernel/resource.c                           | 266 ++++++++++++++++++----------
 22 files changed, 260 insertions(+), 232 deletions(-)

-- 
2.13.6

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
  2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
@ 2018-07-18  2:49 ` Baoquan He
  2018-07-18  2:49 ` [PATCH v7 2/4] resource: Use list_head to link sibling resource Baoquan He
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev,
	Baoquan He, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman

reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
so that it is shared.
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/microblaze/pci/pci-common.c | 37 -----------------------------------
 arch/powerpc/kernel/pci-common.c | 35 ---------------------------------
 include/linux/ioport.h           |  1 +
 kernel/resource.c                | 42 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 43 insertions(+), 72 deletions(-)

diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index f34346d56095..7899bafab064 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -619,43 +619,6 @@ int pcibios_add_device(struct pci_dev *dev)
 EXPORT_SYMBOL(pcibios_add_device);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int __init reparent_resources(struct resource *parent,
-				     struct resource *res)
-{
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
-
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-		if (p->end < res->start)
-			continue;
-		if (res->end < p->start)
-			break;
-		if (p->start < res->start || p->end > res->end)
-			return -1;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
-	}
-	if (firstpp == NULL)
-		return -1;	/* didn't find any conflicting entries? */
-	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
-		p->parent = res;
-		pr_debug("PCI: Reparented %s [%llx..%llx] under %s\n",
-			 p->name,
-			 (unsigned long long)p->start,
-			 (unsigned long long)p->end, res->name);
-	}
-	return 0;
-}
-
-/*
  * Handle resources of PCI devices. If the world were perfect, we could
  * just allocate all the resource regions and do nothing more. It isn't.
  * On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733ffffaa..926035bb378d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1088,41 +1088,6 @@ resource_size_t pcibios_align_resource(void *data, const struct resource *res,
 EXPORT_SYMBOL(pcibios_align_resource);
 
 /*
- * Reparent resource children of pr that conflict with res
- * under res, and make res replace those children.
- */
-static int reparent_resources(struct resource *parent,
-			      struct resource *res)
-{
-	struct resource *p, **pp;
-	struct resource **firstpp = NULL;
-
-	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
-		if (p->end < res->start)
-			continue;
-		if (res->end < p->start)
-			break;
-		if (p->start < res->start || p->end > res->end)
-			return -1;	/* not completely contained */
-		if (firstpp == NULL)
-			firstpp = pp;
-	}
-	if (firstpp == NULL)
-		return -1;	/* didn't find any conflicting entries? */
-	res->parent = parent;
-	res->child = *firstpp;
-	res->sibling = *pp;
-	*firstpp = res;
-	*pp = NULL;
-	for (p = res->child; p != NULL; p = p->sibling) {
-		p->parent = res;
-		pr_debug("PCI: Reparented %s %pR under %s\n",
-			 p->name, p, res->name);
-	}
-	return 0;
-}
-
-/*
  * Handle resources of PCI devices. If the world were perfect, we could
  * just allocate all the resource regions and do nothing more. It isn't.
  * On the other hand, we cannot just re-allocate all devices, as it would
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..dfdcd0bfe54e 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -192,6 +192,7 @@ extern int allocate_resource(struct resource *root, struct resource *new,
 struct resource *lookup_resource(struct resource *root, resource_size_t start);
 int adjust_resource(struct resource *res, resource_size_t start,
 		    resource_size_t size);
+int reparent_resources(struct resource *parent, struct resource *res);
 resource_size_t resource_alignment(struct resource *res);
 static inline resource_size_t resource_size(const struct resource *res)
 {
diff --git a/kernel/resource.c b/kernel/resource.c
index 30e1bc68503b..81ccd19c1d9f 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -983,6 +983,48 @@ int adjust_resource(struct resource *res, resource_size_t start,
 }
 EXPORT_SYMBOL(adjust_resource);
 
+/**
+ * reparent_resources - reparent resource children of parent that res covers
+ * @parent: parent resource descriptor
+ * @res: resource descriptor desired by caller
+ *
+ * Returns 0 on success, -ENOTSUPP if a child resource is not completely
+ * contained by 'res', -ECANCELED if no conflicting entry is found.
+ *
+ * Reparent resource children of 'parent' that conflict with 'res'
+ * under 'res', and make 'res' replace those children.
+ */
+int reparent_resources(struct resource *parent, struct resource *res)
+{
+	struct resource *p, **pp;
+	struct resource **firstpp = NULL;
+
+	for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
+		if (p->end < res->start)
+			continue;
+		if (res->end < p->start)
+			break;
+		if (p->start < res->start || p->end > res->end)
+			return -ENOTSUPP;	/* not completely contained */
+		if (firstpp == NULL)
+			firstpp = pp;
+	}
+	if (firstpp == NULL)
+		return -ECANCELED;	/* didn't find any conflicting entries? */
+	res->parent = parent;
+	res->child = *firstpp;
+	res->sibling = *pp;
+	*firstpp = res;
+	*pp = NULL;
+	for (p = res->child; p != NULL; p = p->sibling) {
+		p->parent = res;
+		pr_debug("PCI: Reparented %s %pR under %s\n",
+			 p->name, p, res->name);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(reparent_resources);
+
 static void __init __reserve_region_with_split(struct resource *root,
 		resource_size_t start, resource_size_t end,
 		const char *name)
-- 
2.13.6
* [PATCH v7 2/4] resource: Use list_head to link sibling resource
  2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
  2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
@ 2018-07-18  2:49 ` Baoquan He
  2018-07-18  2:49 ` [PATCH v7 3/4] resource: add walk_system_ram_res_rev() Baoquan He
  2018-07-18  2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He
  3 siblings, 0 replies; 18+ messages in thread
From: Baoquan He @ 2018-07-18  2:49 UTC (permalink / raw)
  To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko
  Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev,
	Baoquan He, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, linux-mips

struct resource uses a singly linked list to link siblings, implemented
with pointer operations. Replace it with list_head for better code
readability.

Based on this list_head replacement, it will be very easy to do reverse
iteration on iomem_resource's sibling list in a later patch.

Besides, the types of struct resource's member variables sibling and
child are changed from 'struct resource *' to 'struct list_head'. This
brings an increase of two pointers in size.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: David Airlie <airlied@linux.ie>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jonathan Derrick <jonathan.derrick@intel.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: devel@linuxdriverproject.org
Cc: linux-input@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: devicetree@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linux-mips@linux-mips.org
---
 arch/arm/plat-samsung/pm-check.c            |   6 +-
 arch/ia64/sn/kernel/io_init.c               |   2 +-
 arch/microblaze/pci/pci-common.c            |   4 +-
 arch/mips/pci/pci-rc32434.c                 |  12 +-
 arch/powerpc/kernel/pci-common.c            |   4 +-
 arch/sparc/kernel/ioport.c                  |   2 +-
 arch/xtensa/include/asm/pci-bridge.h        |   4 +-
 drivers/eisa/eisa-bus.c                     |   2 +
 drivers/gpu/drm/drm_memory.c                |   3 +-
 drivers/gpu/drm/gma500/gtt.c                |   5 +-
 drivers/hv/vmbus_drv.c                      |  52 +++----
 drivers/input/joystick/iforce/iforce-main.c |   4 +-
 drivers/nvdimm/namespace_devs.c             |   6 +-
 drivers/nvdimm/nd.h                         |   5 +-
 drivers/of/address.c                        |   4 +-
 drivers/parisc/lba_pci.c                    |   4 +-
 drivers/pci/controller/vmd.c                |   8 +-
 drivers/pci/probe.c                         |   2 +
 drivers/pci/setup-bus.c                     |   2 +-
 include/linux/ioport.h                      |  17 ++-
 kernel/resource.c                           | 206 ++++++++++++++--------------
 21 files changed, 183 insertions(+), 171 deletions(-)

diff --git a/arch/arm/plat-samsung/pm-check.c b/arch/arm/plat-samsung/pm-check.c
index cd2c02c68bc3..5494355b1c49 100644
--- a/arch/arm/plat-samsung/pm-check.c
+++ b/arch/arm/plat-samsung/pm-check.c
@@ -46,8 +46,8 @@ typedef u32 *(run_fn_t)(struct resource *ptr, u32 *arg);
 static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 {
 	while (ptr != NULL) {
-		if (ptr->child != NULL)
-			s3c_pm_run_res(ptr->child, fn, arg);
+		if (!list_empty(&ptr->child))
+			s3c_pm_run_res(resource_first_child(&ptr->child), fn, arg);
 
 		if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
 				== IORESOURCE_SYSTEM_RAM) {
@@ -57,7 +57,7 @@ static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 			arg = (fn)(ptr, arg);
 		}
 
-		ptr = ptr->sibling;
+		ptr = resource_sibling(ptr);
 	}
 }
diff --git a/arch/ia64/sn/kernel/io_init.c b/arch/ia64/sn/kernel/io_init.c
index d63809a6adfa..338a7b7f194d 100644
--- a/arch/ia64/sn/kernel/io_init.c
+++ b/arch/ia64/sn/kernel/io_init.c
@@ -192,7 +192,7 @@ sn_io_slot_fixup(struct pci_dev *dev)
 		 * if it's already in the device structure, remove it before
 		 * inserting
 		 */
-		if (res->parent && res->parent->child)
+		if (res->parent && !list_empty(&res->parent->child))
 			release_resource(res);
 
 		if (res->flags & IORESOURCE_IO)
diff --git a/arch/microblaze/pci/pci-common.c b/arch/microblaze/pci/pci-common.c
index 7899bafab064..2bf73e27e231 100644
--- a/arch/microblaze/pci/pci-common.c
+++ b/arch/microblaze/pci/pci-common.c
@@ -533,7 +533,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
 		res->flags = range.flags;
 		res->start = range.cpu_addr;
 		res->end = range.cpu_addr + range.size - 1;
-		res->parent = res->child = res->sibling = NULL;
+		res->parent = NULL;
+		INIT_LIST_HEAD(&res->child);
+		INIT_LIST_HEAD(&res->sibling);
 	}
 }
diff --git a/arch/mips/pci/pci-rc32434.c b/arch/mips/pci/pci-rc32434.c
index 7f6ce6d734c0..e80283df7925 100644
--- a/arch/mips/pci/pci-rc32434.c
+++ b/arch/mips/pci/pci-rc32434.c
@@ -53,8 +53,8 @@ static struct resource rc32434_res_pci_mem1 = {
 	.start = 0x50000000,
 	.end = 0x5FFFFFFF,
 	.flags = IORESOURCE_MEM,
-	.sibling = NULL,
-	.child = &rc32434_res_pci_mem2
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_mem1.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_mem1.child),
 };
 
 static struct resource rc32434_res_pci_mem2 = {
@@ -63,8 +63,8 @@ static struct resource rc32434_res_pci_mem2 = {
 	.end = 0x6FFFFFFF,
 	.flags = IORESOURCE_MEM,
 	.parent = &rc32434_res_pci_mem1,
-	.sibling = NULL,
-	.child = NULL
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_mem2.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_mem2.child),
 };
 
 static struct resource rc32434_res_pci_io1 = {
@@ -72,6 +72,8 @@ static struct resource rc32434_res_pci_io1 = {
 	.start = 0x18800000,
 	.end = 0x188FFFFF,
 	.flags = IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(rc32434_res_pci_io1.sibling),
+	.child = LIST_HEAD_INIT(rc32434_res_pci_io1.child),
 };
 
 extern struct pci_ops rc32434_pci_ops;
@@ -208,6 +210,8 @@ static int __init rc32434_pci_init(void)
 
 	pr_info("PCI: Initializing PCI\n");
 
+	list_add(&rc32434_res_pci_mem2.sibling, &rc32434_res_pci_mem1.child);
+
 	ioport_resource.start = rc32434_res_pci_io1.start;
 	ioport_resource.end = rc32434_res_pci_io1.end;
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 926035bb378d..28fbe83c9daf 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -761,7 +761,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
 			res->flags = range.flags;
 			res->start = range.cpu_addr;
 			res->end = range.cpu_addr + range.size - 1;
-			res->parent = res->child = res->sibling = NULL;
+			res->parent = NULL;
+			INIT_LIST_HEAD(&res->child);
+			INIT_LIST_HEAD(&res->sibling);
 		}
 	}
 }
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index cca9134cfa7d..99efe4e98b16 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -669,7 +669,7 @@ static int sparc_io_proc_show(struct seq_file *m, void *v)
 	struct resource *root = m->private, *r;
 	const char *nm;
 
-	for (r = root->child; r != NULL; r = r->sibling) {
+	list_for_each_entry(r, &root->child, sibling) {
 		if ((nm = r->name) == NULL)
 			nm = "???";
 		seq_printf(m, "%016llx-%016llx: %s\n", (unsigned long long)r->start,
diff --git a/arch/xtensa/include/asm/pci-bridge.h b/arch/xtensa/include/asm/pci-bridge.h
index 0b68c76ec1e6..f487b06817df 100644
--- a/arch/xtensa/include/asm/pci-bridge.h
+++ b/arch/xtensa/include/asm/pci-bridge.h
@@ -71,8 +71,8 @@ static inline void pcibios_init_resource(struct resource *res,
 	res->flags = flags;
 	res->name = name;
 	res->parent = NULL;
-	res->sibling = NULL;
-	res->child = NULL;
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 }
diff --git a/drivers/eisa/eisa-bus.c b/drivers/eisa/eisa-bus.c
index 1e8062f6dbfc..dba78f75fd06 100644
--- a/drivers/eisa/eisa-bus.c
+++ b/drivers/eisa/eisa-bus.c
@@ -408,6 +408,8 @@ static struct resource eisa_root_res = {
 	.start = 0,
 	.end = 0xffffffff,
 	.flags = IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(eisa_root_res.sibling),
+	.child = LIST_HEAD_INIT(eisa_root_res.child),
 };
 
 static int eisa_bus_count;
diff --git a/drivers/gpu/drm/drm_memory.c b/drivers/gpu/drm/drm_memory.c
index d69e4fc1ee77..33baa7fa5e41 100644
--- a/drivers/gpu/drm/drm_memory.c
+++ b/drivers/gpu/drm/drm_memory.c
@@ -155,9 +155,8 @@ u64 drm_get_max_iomem(void)
 	struct resource *tmp;
 	resource_size_t max_iomem = 0;
 
-	for (tmp = iomem_resource.child; tmp; tmp = tmp->sibling) {
+	list_for_each_entry(tmp, &iomem_resource.child, sibling)
 		max_iomem = max(max_iomem,  tmp->end);
-	}
 
 	return max_iomem;
 }
diff --git a/drivers/gpu/drm/gma500/gtt.c b/drivers/gpu/drm/gma500/gtt.c
index 3949b0990916..addd3bc009af 100644
--- a/drivers/gpu/drm/gma500/gtt.c
+++ b/drivers/gpu/drm/gma500/gtt.c
@@ -565,7 +565,7 @@ int psb_gtt_init(struct drm_device *dev, int resume)
 int psb_gtt_restore(struct drm_device *dev)
 {
 	struct drm_psb_private *dev_priv = dev->dev_private;
-	struct resource *r = dev_priv->gtt_mem->child;
+	struct resource *r;
 	struct gtt_range *range;
 	unsigned int restored = 0, total = 0, size = 0;
 
@@ -573,14 +573,13 @@ int psb_gtt_restore(struct drm_device *dev)
 	mutex_lock(&dev_priv->gtt_mutex);
 	psb_gtt_init(dev, 1);
 
-	while (r != NULL) {
+	list_for_each_entry(r, &dev_priv->gtt_mem->child, sibling) {
 		range = container_of(r, struct gtt_range, resource);
 		if (range->pages) {
 			psb_gtt_insert(dev, range, 1);
 			size += range->resource.end - range->resource.start;
 			restored++;
 		}
-		r = r->sibling;
 		total++;
 	}
 	mutex_unlock(&dev_priv->gtt_mutex);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index b10fe26c4891..d87ec5a1bc4c 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1412,9 +1412,8 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
 {
 	resource_size_t start = 0;
 	resource_size_t end = 0;
-	struct resource *new_res;
+	struct resource *new_res, *tmp;
 	struct resource **old_res = &hyperv_mmio;
-	struct resource **prev_res = NULL;
 
 	switch (res->type) {
@@ -1461,44 +1460,36 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
 	/*
 	 * If two ranges are adjacent, merge them.
 	 */
-	do {
-		if (!*old_res) {
-			*old_res = new_res;
-			break;
-		}
-
-		if (((*old_res)->end + 1) == new_res->start) {
-			(*old_res)->end = new_res->end;
+	if (!*old_res) {
+		*old_res = new_res;
+		return AE_OK;
+	}
+	tmp = *old_res;
+	list_for_each_entry_from(tmp, &tmp->parent->child, sibling) {
+		if ((tmp->end + 1) == new_res->start) {
+			tmp->end = new_res->end;
 			kfree(new_res);
 			break;
 		}
 
-		if ((*old_res)->start == new_res->end + 1) {
-			(*old_res)->start = new_res->start;
+		if (tmp->start == new_res->end + 1) {
+			tmp->start = new_res->start;
 			kfree(new_res);
 			break;
 		}
 
-		if ((*old_res)->start > new_res->end) {
-			new_res->sibling = *old_res;
-			if (prev_res)
-				(*prev_res)->sibling = new_res;
-			*old_res = new_res;
+		if (tmp->start > new_res->end) {
+			list_add(&new_res->sibling, tmp->sibling.prev);
 			break;
 		}
-
-		prev_res = old_res;
-		old_res = &(*old_res)->sibling;
-
-	} while (1);
+	}
 
 	return AE_OK;
 }
 
 static int vmbus_acpi_remove(struct acpi_device *device)
 {
-	struct resource *cur_res;
-	struct resource *next_res;
+	struct resource *res;
 
 	if (hyperv_mmio) {
 		if (fb_mmio) {
@@ -1507,10 +1498,9 @@ static int vmbus_acpi_remove(struct acpi_device *device)
 			fb_mmio = NULL;
 		}
 
-		for (cur_res = hyperv_mmio; cur_res; cur_res = next_res) {
-			next_res = cur_res->sibling;
-			kfree(cur_res);
-		}
+		res = hyperv_mmio;
+		list_for_each_entry_from(res, &res->parent->child, sibling)
+			kfree(res);
 	}
 
 	return 0;
@@ -1596,7 +1586,8 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
 		}
 	}
 
-	for (iter = hyperv_mmio; iter; iter = iter->sibling) {
+	iter = hyperv_mmio;
+	list_for_each_entry_from(iter, &iter->parent->child, sibling) {
 		if ((iter->start >= max) || (iter->end <= min))
 			continue;
 
@@ -1639,7 +1630,8 @@ void vmbus_free_mmio(resource_size_t start, resource_size_t size)
 	struct resource *iter;
 
 	down(&hyperv_mmio_lock);
-	for (iter = hyperv_mmio; iter; iter = iter->sibling) {
+	iter = hyperv_mmio;
+	list_for_each_entry_from(iter, &iter->parent->child, sibling) {
 		if ((iter->start >= start + size) || (iter->end <= start))
 			continue;
diff --git a/drivers/input/joystick/iforce/iforce-main.c b/drivers/input/joystick/iforce/iforce-main.c
index daeeb4c7e3b0..5c0be27b33ff 100644
--- a/drivers/input/joystick/iforce/iforce-main.c
+++ b/drivers/input/joystick/iforce/iforce-main.c
@@ -305,8 +305,8 @@ int iforce_init_device(struct iforce *iforce)
 	iforce->device_memory.end = 200;
 	iforce->device_memory.flags = IORESOURCE_MEM;
 	iforce->device_memory.parent = NULL;
-	iforce->device_memory.child = NULL;
-	iforce->device_memory.sibling = NULL;
+	INIT_LIST_HEAD(&iforce->device_memory.child);
+	INIT_LIST_HEAD(&iforce->device_memory.sibling);
 
 /*
  * Wait until device ready - until it sends its first response.
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 28afdd668905..f53d410d9981 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -637,7 +637,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
  retry:
 	first = 0;
 	for_each_dpa_resource(ndd, res) {
-		struct resource *next = res->sibling, *new_res = NULL;
+		struct resource *next = resource_sibling(res), *new_res = NULL;
 		resource_size_t allocate, available = 0;
 		enum alloc_loc loc = ALLOC_ERR;
 		const char *action;
@@ -763,7 +763,7 @@ static resource_size_t scan_allocate(struct nd_region *nd_region,
 	 * an initial "pmem-reserve pass". Only do an initial BLK allocation
 	 * when none of the DPA space is reserved.
 	 */
-	if ((is_pmem || !ndd->dpa.child) && n == to_allocate)
+	if ((is_pmem || list_empty(&ndd->dpa.child)) && n == to_allocate)
 		return init_dpa_allocation(label_id, nd_region, nd_mapping, n);
 	return n;
 }
@@ -779,7 +779,7 @@ static int merge_dpa(struct nd_region *nd_region,
  retry:
 	for_each_dpa_resource(ndd, res) {
 		int rc;
-		struct resource *next = res->sibling;
+		struct resource *next = resource_sibling(res);
 		resource_size_t end = res->start + resource_size(res);
 
 		if (!next || strcmp(res->name, label_id->id) != 0
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 32e0364b48b9..da7da15e03e7 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -102,11 +102,10 @@ unsigned sizeof_namespace_label(struct nvdimm_drvdata *ndd);
 		(unsigned long long) (res ? res->start : 0), ##arg)
 
 #define for_each_dpa_resource(ndd, res) \
-	for (res = (ndd)->dpa.child; res; res = res->sibling)
+	list_for_each_entry(res, &(ndd)->dpa.child, sibling)
 
 #define for_each_dpa_resource_safe(ndd, res, next) \
-	for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \
-			res; res = next, next = next ? next->sibling : NULL)
+	list_for_each_entry_safe(res, next, &(ndd)->dpa.child, sibling)
 
 struct nd_percpu_lane {
 	int count;
diff --git a/drivers/of/address.c b/drivers/of/address.c
index 53349912ac75..e2e25719ab52 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -330,7 +330,9 @@ int of_pci_range_to_resource(struct of_pci_range *range,
 {
 	int err;
 	res->flags = range->flags;
-	res->parent = res->child = res->sibling = NULL;
+	res->parent = NULL;
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 	res->name = np->full_name;
 
 	if (res->flags & IORESOURCE_IO) {
diff --git a/drivers/parisc/lba_pci.c b/drivers/parisc/lba_pci.c
index 69bd98421eb1..7482bdfd1959 100644
--- a/drivers/parisc/lba_pci.c
+++ b/drivers/parisc/lba_pci.c
@@ -170,8 +170,8 @@ lba_dump_res(struct resource *r, int d)
 	for (i = d; i ; --i) printk(" ");
 	printk(KERN_DEBUG "%p [%lx,%lx]/%lx\n",
 		r, (long)r->start, (long)r->end, r->flags);
-	lba_dump_res(r->child, d+2);
-	lba_dump_res(r->sibling, d);
+	lba_dump_res(resource_first_child(&r->child), d+2);
+	lba_dump_res(resource_sibling(r), d);
 }
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 942b64fc7f1f..e3ace20345c7 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -542,14 +542,14 @@ static struct pci_ops vmd_ops = {
 
 static void vmd_attach_resources(struct vmd_dev *vmd)
 {
-	vmd->dev->resource[VMD_MEMBAR1].child = &vmd->resources[1];
-	vmd->dev->resource[VMD_MEMBAR2].child = &vmd->resources[2];
+	list_add(&vmd->resources[1].sibling, &vmd->dev->resource[VMD_MEMBAR1].child);
+	list_add(&vmd->resources[2].sibling, &vmd->dev->resource[VMD_MEMBAR2].child);
 }
 
 static void vmd_detach_resources(struct vmd_dev *vmd)
 {
-	vmd->dev->resource[VMD_MEMBAR1].child = NULL;
-	vmd->dev->resource[VMD_MEMBAR2].child = NULL;
+	INIT_LIST_HEAD(&vmd->dev->resource[VMD_MEMBAR1].child);
+	INIT_LIST_HEAD(&vmd->dev->resource[VMD_MEMBAR2].child);
 }
 
 /*
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..9624dd1dfd49 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -59,6 +59,8 @@ static struct resource *get_pci_domain_busn_res(int domain_nr)
 	r->res.start = 0;
 	r->res.end = 0xff;
 	r->res.flags = IORESOURCE_BUS | IORESOURCE_PCI_FIXED;
+	INIT_LIST_HEAD(&r->res.child);
+	INIT_LIST_HEAD(&r->res.sibling);
 
 	list_add_tail(&r->list, &pci_domain_busn_res_list);
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 79b1824e83b4..8e685af8938d 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -2107,7 +2107,7 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type)
 			continue;
 
 		/* Ignore BARs which are still in use */
-		if (res->child)
+		if (!list_empty(&res->child))
 			continue;
 
 		ret = add_to_list(&saved, bridge, res, 0, 0);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index dfdcd0bfe54e..b7456ae889dd 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -12,6 +12,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/list.h>
 /*
  * Resources are tree-like, allowing
  * nesting etc..
@@ -22,7 +23,8 @@ struct resource {
 	const char *name;
 	unsigned long flags;
 	unsigned long desc;
-	struct resource *parent, *sibling, *child;
+	struct list_head child, sibling;
+	struct resource *parent;
 };
 
 /*
@@ -216,7 +218,6 @@ static inline bool resource_contains(struct resource *r1, struct resource *r2)
 	return r1->start <= r2->start && r1->end >= r2->end;
 }
 
-
 /* Convenience shorthand with allocation */
 #define request_region(start,n,name)		__request_region(&ioport_resource, (start), (n), (name), 0)
 #define request_muxed_region(start,n,name)	__request_region(&ioport_resource, (start), (n), (name), IORESOURCE_MUXED)
@@ -287,6 +288,18 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
 	return (r1->start <= r2->end && r1->end >= r2->start);
 }
 
+static inline struct resource *resource_sibling(struct resource *res)
+{
+	if (res->parent && !list_is_last(&res->sibling, &res->parent->child))
+		return list_next_entry(res, sibling);
+	return NULL;
+}
+
+static inline struct resource *resource_first_child(struct list_head *head)
+{
+	return list_first_entry_or_null(head, struct resource, sibling);
+}
+
 #endif /* __ASSEMBLY__ */
 #endif	/* _LINUX_IOPORT_H */
diff --git a/kernel/resource.c b/kernel/resource.c
index 81ccd19c1d9f..c96e58d3d2f8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -31,6 +31,8 @@ struct resource ioport_resource = {
 	.start	= 0,
 	.end	= IO_SPACE_LIMIT,
 	.flags	= IORESOURCE_IO,
+	.sibling = LIST_HEAD_INIT(ioport_resource.sibling),
+	.child	= LIST_HEAD_INIT(ioport_resource.child),
 };
 EXPORT_SYMBOL(ioport_resource);
 
@@ -39,6 +41,8 @@ struct resource iomem_resource = {
 	.start	= 0,
 	.end	= -1,
 	.flags	= IORESOURCE_MEM,
+	.sibling = LIST_HEAD_INIT(iomem_resource.sibling),
+	.child	= LIST_HEAD_INIT(iomem_resource.child),
 };
 EXPORT_SYMBOL(iomem_resource);
 
@@ -57,20 +61,20 @@ static DEFINE_RWLOCK(resource_lock);
  * by boot mem after the system is up. So for reusing the resource entry
  * we need to remember the resource.
  */
-static struct resource *bootmem_resource_free;
+static struct list_head bootmem_resource_free = LIST_HEAD_INIT(bootmem_resource_free);
 static DEFINE_SPINLOCK(bootmem_resource_lock);
 
 static struct resource *next_resource(struct resource *p, bool sibling_only)
 {
 	/* Caller wants to traverse through siblings only */
 	if (sibling_only)
-		return p->sibling;
+		return resource_sibling(p);
 
-	if (p->child)
-		return p->child;
-	while (!p->sibling && p->parent)
+	if (!list_empty(&p->child))
+		return resource_first_child(&p->child);
+	while (!resource_sibling(p) && p->parent)
 		p = p->parent;
-	return p->sibling;
+	return resource_sibling(p);
 }
 
 static void *r_next(struct seq_file *m, void *v, loff_t *pos)
@@ -90,7 +94,7 @@ static void *r_start(struct seq_file *m, loff_t *pos)
 	struct resource *p = PDE_DATA(file_inode(m->file));
 	loff_t l = 0;
 	read_lock(&resource_lock);
-	for (p = p->child; p && l < *pos; p = r_next(m, p, &l))
+	for (p = resource_first_child(&p->child); p && l < *pos; p = r_next(m, p, &l))
 		;
 	return p;
 }
@@ -153,8 +157,7 @@ static void free_resource(struct resource *res)
 
 	if (!PageSlab(virt_to_head_page(res))) {
 		spin_lock(&bootmem_resource_lock);
-		res->sibling = bootmem_resource_free;
-		bootmem_resource_free = res;
+		list_add(&res->sibling, &bootmem_resource_free);
 		spin_unlock(&bootmem_resource_lock);
 	} else {
 		kfree(res);
@@ -166,10 +169,9 @@ static struct resource *alloc_resource(gfp_t flags)
 	struct resource *res = NULL;
 
 	spin_lock(&bootmem_resource_lock);
-	if (bootmem_resource_free) {
-		res = bootmem_resource_free;
-		bootmem_resource_free = res->sibling;
-	}
+	res = resource_first_child(&bootmem_resource_free);
+	if (res)
+		list_del(&res->sibling);
 	spin_unlock(&bootmem_resource_lock);
 
 	if (res)
@@ -177,6 +179,8 @@ static struct resource *alloc_resource(gfp_t flags)
 	else
 		res = kzalloc(sizeof(struct resource), flags);
 
+	INIT_LIST_HEAD(&res->child);
+	INIT_LIST_HEAD(&res->sibling);
 	return res;
 }
 
@@ -185,7 +189,7 @@ static struct resource * __request_resource(struct
resource *root, struct resour { resource_size_t start = new->start; resource_size_t end = new->end; - struct resource *tmp, **p; + struct resource *tmp; if (end < start) return root; @@ -193,64 +197,62 @@ static struct resource * __request_resource(struct resource *root, struct resour return root; if (end > root->end) return root; - p = &root->child; - for (;;) { - tmp = *p; - if (!tmp || tmp->start > end) { - new->sibling = tmp; - *p = new; + + if (list_empty(&root->child)) { + list_add(&new->sibling, &root->child); + new->parent = root; + INIT_LIST_HEAD(&new->child); + return NULL; + } + + list_for_each_entry(tmp, &root->child, sibling) { + if (tmp->start > end) { + list_add(&new->sibling, tmp->sibling.prev); new->parent = root; + INIT_LIST_HEAD(&new->child); return NULL; } - p = &tmp->sibling; if (tmp->end < start) continue; return tmp; } + + list_add_tail(&new->sibling, &root->child); + new->parent = root; + INIT_LIST_HEAD(&new->child); + return NULL; } static int __release_resource(struct resource *old, bool release_child) { - struct resource *tmp, **p, *chd; + struct resource *tmp, *next, *chd; - p = &old->parent->child; - for (;;) { - tmp = *p; - if (!tmp) - break; + list_for_each_entry_safe(tmp, next, &old->parent->child, sibling) { if (tmp == old) { - if (release_child || !(tmp->child)) { - *p = tmp->sibling; + if (release_child || list_empty(&tmp->child)) { + list_del(&tmp->sibling); } else { - for (chd = tmp->child;; chd = chd->sibling) { + list_for_each_entry(chd, &tmp->child, sibling) chd->parent = tmp->parent; - if (!(chd->sibling)) - break; - } - *p = tmp->child; - chd->sibling = tmp->sibling; + list_splice(&tmp->child, tmp->sibling.prev); + list_del(&tmp->sibling); } + old->parent = NULL; return 0; } - p = &tmp->sibling; } return -EINVAL; } static void __release_child_resources(struct resource *r) { - struct resource *tmp, *p; + struct resource *tmp, *next; resource_size_t size; - p = r->child; - r->child = NULL; - while (p) { - tmp = p; - p = 
p->sibling; - + list_for_each_entry_safe(tmp, next, &r->child, sibling) { tmp->parent = NULL; - tmp->sibling = NULL; + list_del_init(&tmp->sibling); __release_child_resources(tmp); printk(KERN_DEBUG "release child resource %pR\n", tmp); @@ -259,6 +261,8 @@ static void __release_child_resources(struct resource *r) tmp->start = 0; tmp->end = size - 1; } + + INIT_LIST_HEAD(&tmp->child); } void release_child_resources(struct resource *r) @@ -343,7 +347,8 @@ static int find_next_iomem_res(struct resource *res, unsigned long desc, read_lock(&resource_lock); - for (p = iomem_resource.child; p; p = next_resource(p, sibling_only)) { + for (p = resource_first_child(&iomem_resource.child); p; + p = next_resource(p, sibling_only)) { if ((p->flags & res->flags) != res->flags) continue; if ((desc != IORES_DESC_NONE) && (desc != p->desc)) @@ -532,7 +537,7 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags, struct resource *p; read_lock(&resource_lock); - for (p = iomem_resource.child; p ; p = p->sibling) { + list_for_each_entry(p, &iomem_resource.child, sibling) { bool is_type = (((p->flags & flags) == flags) && ((desc == IORES_DESC_NONE) || (desc == p->desc))); @@ -586,7 +591,7 @@ static int __find_resource(struct resource *root, struct resource *old, resource_size_t size, struct resource_constraint *constraint) { - struct resource *this = root->child; + struct resource *this = resource_first_child(&root->child); struct resource tmp = *new, avail, alloc; tmp.start = root->start; @@ -596,7 +601,7 @@ static int __find_resource(struct resource *root, struct resource *old, */ if (this && this->start == root->start) { tmp.start = (this == old) ? 
old->start : this->end + 1; - this = this->sibling; + this = resource_sibling(this); } for(;;) { if (this) @@ -632,7 +637,7 @@ next: if (!this || this->end == root->end) if (this != old) tmp.start = this->end + 1; - this = this->sibling; + this = resource_sibling(this); } return -EBUSY; } @@ -676,7 +681,7 @@ static int reallocate_resource(struct resource *root, struct resource *old, goto out; } - if (old->child) { + if (!list_empty(&old->child)) { err = -EBUSY; goto out; } @@ -757,7 +762,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start) struct resource *res; read_lock(&resource_lock); - for (res = root->child; res; res = res->sibling) { + list_for_each_entry(res, &root->child, sibling) { if (res->start == start) break; } @@ -790,32 +795,27 @@ static struct resource * __insert_resource(struct resource *parent, struct resou break; } - for (next = first; ; next = next->sibling) { + for (next = first; ; next = resource_sibling(next)) { /* Partial overlap? Bad, and unfixable */ if (next->start < new->start || next->end > new->end) return next; - if (!next->sibling) + if (!resource_sibling(next)) break; - if (next->sibling->start > new->end) + if (resource_sibling(next)->start > new->end) break; } - new->parent = parent; - new->sibling = next->sibling; - new->child = first; + list_add(&new->sibling, &next->sibling); + INIT_LIST_HEAD(&new->child); - next->sibling = NULL; - for (next = first; next; next = next->sibling) + /* + * From first to next, they all fall into new's region, so change them + * as new's children. 
+ */ + list_cut_position(&new->child, first->sibling.prev, &next->sibling); + list_for_each_entry(next, &new->child, sibling) next->parent = new; - if (parent->child == first) { - parent->child = new; - } else { - next = parent->child; - while (next->sibling != first) - next = next->sibling; - next->sibling = new; - } return NULL; } @@ -937,19 +937,17 @@ static int __adjust_resource(struct resource *res, resource_size_t start, if ((start < parent->start) || (end > parent->end)) goto out; - if (res->sibling && (res->sibling->start <= end)) + if (resource_sibling(res) && (resource_sibling(res)->start <= end)) goto out; - tmp = parent->child; - if (tmp != res) { - while (tmp->sibling != res) - tmp = tmp->sibling; + if (res->sibling.prev != &parent->child) { + tmp = list_prev_entry(res, sibling); if (start <= tmp->end) goto out; } skip: - for (tmp = res->child; tmp; tmp = tmp->sibling) + list_for_each_entry(tmp, &res->child, sibling) if ((tmp->start < start) || (tmp->end > end)) goto out; @@ -996,27 +994,30 @@ EXPORT_SYMBOL(adjust_resource); */ int reparent_resources(struct resource *parent, struct resource *res) { - struct resource *p, **pp; - struct resource **firstpp = NULL; + struct resource *p, *first = NULL; - for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) { + list_for_each_entry(p, &parent->child, sibling) { if (p->end < res->start) continue; if (res->end < p->start) break; if (p->start < res->start || p->end > res->end) return -ENOTSUPP; /* not completely contained */ - if (firstpp == NULL) - firstpp = pp; + if (first == NULL) + first = p; } - if (firstpp == NULL) + if (first == NULL) return -ECANCELED; /* didn't find any conflicting entries? 
*/ res->parent = parent; - res->child = *firstpp; - res->sibling = *pp; - *firstpp = res; - *pp = NULL; - for (p = res->child; p != NULL; p = p->sibling) { + list_add(&res->sibling, p->sibling.prev); + INIT_LIST_HEAD(&res->child); + + /* + * From first to p's previous sibling, they all fall into + * res's region, change them as res's children. + */ + list_cut_position(&res->child, first->sibling.prev, res->sibling.prev); + list_for_each_entry(p, &res->child, sibling) { p->parent = res; pr_debug("PCI: Reparented %s %pR under %s\n", p->name, p, res->name); @@ -1216,34 +1217,32 @@ EXPORT_SYMBOL(__request_region); void __release_region(struct resource *parent, resource_size_t start, resource_size_t n) { - struct resource **p; + struct resource *res; resource_size_t end; - p = &parent->child; + res = resource_first_child(&parent->child); end = start + n - 1; write_lock(&resource_lock); for (;;) { - struct resource *res = *p; - if (!res) break; if (res->start <= start && res->end >= end) { if (!(res->flags & IORESOURCE_BUSY)) { - p = &res->child; + res = resource_first_child(&res->child); continue; } if (res->start != start || res->end != end) break; - *p = res->sibling; + list_del(&res->sibling); write_unlock(&resource_lock); if (res->flags & IORESOURCE_MUXED) wake_up(&muxed_resource_wait); free_resource(res); return; } - p = &res->sibling; + res = resource_sibling(res); } write_unlock(&resource_lock); @@ -1278,9 +1277,7 @@ EXPORT_SYMBOL(__release_region); int release_mem_region_adjustable(struct resource *parent, resource_size_t start, resource_size_t size) { - struct resource **p; - struct resource *res; - struct resource *new_res; + struct resource *res, *new_res; resource_size_t end; int ret = -EINVAL; @@ -1291,16 +1288,16 @@ int release_mem_region_adjustable(struct resource *parent, /* The alloc_resource() result gets checked later */ new_res = alloc_resource(GFP_KERNEL); - p = &parent->child; + res = resource_first_child(&parent->child); 
write_lock(&resource_lock); - while ((res = *p)) { + while ((res)) { if (res->start >= end) break; /* look for the next resource if it does not fit into */ if (res->start > start || res->end < end) { - p = &res->sibling; + res = resource_sibling(res); continue; } @@ -1308,14 +1305,14 @@ int release_mem_region_adjustable(struct resource *parent, break; if (!(res->flags & IORESOURCE_BUSY)) { - p = &res->child; + res = resource_first_child(&res->child); continue; } /* found the target resource; let's adjust accordingly */ if (res->start == start && res->end == end) { /* free the whole entry */ - *p = res->sibling; + list_del(&res->sibling); free_resource(res); ret = 0; } else if (res->start == start && res->end != end) { @@ -1338,14 +1335,13 @@ int release_mem_region_adjustable(struct resource *parent, new_res->flags = res->flags; new_res->desc = res->desc; new_res->parent = res->parent; - new_res->sibling = res->sibling; - new_res->child = NULL; + INIT_LIST_HEAD(&new_res->child); ret = __adjust_resource(res, res->start, start - res->start); if (ret) break; - res->sibling = new_res; + list_add(&new_res->sibling, &res->sibling); new_res = NULL; } @@ -1526,7 +1522,7 @@ static int __init reserve_setup(char *str) res->end = io_start + io_num - 1; res->flags |= IORESOURCE_BUSY; res->desc = IORES_DESC_NONE; - res->child = NULL; + INIT_LIST_HEAD(&res->child); if (request_resource(parent, res) == 0) reserved = x+1; } @@ -1546,7 +1542,7 @@ int iomem_map_sanity_check(resource_size_t addr, unsigned long size) loff_t l; read_lock(&resource_lock); - for (p = p->child; p ; p = r_next(NULL, p, &l)) { + for (p = resource_first_child(&p->child); p; p = r_next(NULL, p, &l)) { /* * We can probably skip the resources without * IORESOURCE_IO attribute? 
@@ -1602,7 +1598,7 @@ bool iomem_is_exclusive(u64 addr) addr = addr & PAGE_MASK; read_lock(&resource_lock); - for (p = p->child; p ; p = r_next(NULL, p, &l)) { + for (p = resource_first_child(&p->child); p; p = r_next(NULL, p, &l)) { /* * We can probably skip the resources without * IORESOURCE_IO attribute? -- 2.13.6 -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v7 3/4] resource: add walk_system_ram_res_rev() 2018-07-18 2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He 2018-07-18 2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He 2018-07-18 2:49 ` [PATCH v7 2/4] resource: Use list_head to link sibling resource Baoquan He @ 2018-07-18 2:49 ` Baoquan He 2018-07-18 2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He 3 siblings, 0 replies; 18+ messages in thread From: Baoquan He @ 2018-07-18 2:49 UTC (permalink / raw) To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, Baoquan He

This function, a variant of walk_system_ram_res() introduced in commit
8c86e70acead ("resource: provide new functions to walk through
resources"), walks through all System RAM resources in reverse order,
i.e., from higher to lower addresses. It will be used by the kexec_file
code.
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
---
 include/linux/ioport.h |  3 +++
 kernel/resource.c      | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index b7456ae889dd..066cc263e2cc 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -279,6 +279,9 @@ extern int
 walk_system_ram_res(u64 start, u64 end, void *arg,
 		    int (*func)(struct resource *, void *));
 extern int
+walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+			int (*func)(struct resource *, void *));
+extern int
 walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
 		    void *arg, int (*func)(struct resource *, void *));

diff --git a/kernel/resource.c b/kernel/resource.c
index c96e58d3d2f8..3e18f24b90c4 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -23,6 +23,8 @@
 #include <linux/pfn.h>
 #include <linux/mm.h>
 #include <linux/resource_ext.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
 #include <asm/io.h>

@@ -443,6 +445,44 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
 }

 /*
+ * This function, being a variant of walk_system_ram_res(), calls the @func
+ * callback against all memory ranges of type System RAM which are marked as
+ * IORESOURCE_SYSTEM_RAM and IORESOURCE_BUSY, in reverse order, i.e., from
+ * higher to lower.
+ */
+int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
+			    int (*func)(struct resource *, void *))
+{
+	unsigned long flags;
+	struct resource *res;
+	int ret = -1;
+
+	flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+
+	read_lock(&resource_lock);
+	list_for_each_entry_reverse(res, &iomem_resource.child, sibling) {
+		if (start >= end)
+			break;
+		if ((res->flags & flags) != flags)
+			continue;
+		if (res->desc != IORES_DESC_NONE)
+			continue;
+		if (res->end < start)
+			break;
+
+		if ((res->end >= start) && (res->start < end)) {
+			ret = (*func)(res, arg);
+			if (ret)
+				break;
+		}
+		end = res->start - 1;
+	}
+	read_unlock(&resource_lock);
+	return ret;
+}
+
+/*
  * This function calls the @func callback against all memory ranges, which
  * are ranges marked as IORESOURCE_MEM and IORESOUCE_BUSY.
  */
-- 
2.13.6
* [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-18 2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He ` (2 preceding siblings ...) 2018-07-18 2:49 ` [PATCH v7 3/4] resource: add walk_system_ram_res_rev() Baoquan He @ 2018-07-18 2:49 ` Baoquan He 2018-07-18 22:33 ` Andrew Morton 3 siblings, 1 reply; 18+ messages in thread From: Baoquan He @ 2018-07-18 2:49 UTC (permalink / raw) To: linux-kernel, akpm, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko Cc: patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, Baoquan He, kexec

For kexec_file loading, if kexec_buf.top_down is 'true', the memory used
to load the kernel/initrd/purgatory is supposed to be allocated from the
top down. This is what we have been doing all along in the old kexec
loading interface, and that loading method is still the default in some
distributions. However, the current kexec_file loading interface doesn't
behave like this. The function arch_kexec_walk_mem() it calls doesn't
check kexec_buf.top_down, but calls walk_system_ram_res() directly to go
through all System RAM resources from bottom to top, trying to find a
memory region which can contain the kexec buffer, and then calls
locate_mem_hole_callback() to allocate memory within that region from
the top down. This causes confusion, especially now that KASLR is widely
supported: users have to figure out why the kexec/kdump kernel loading
position differs between these two interfaces in order to rule out
unnecessary noise. Hence these two interfaces need to be unified in
behaviour.
Here, add a check of kexec_buf.top_down in arch_kexec_walk_mem(); if it
is 'true', call the newly added walk_system_ram_res_rev() to search
memory regions from top to bottom for loading the kernel.

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: kexec@lists.infradead.org
---
 kernel/kexec_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index c6a3b6851372..75226c1d08ce 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -518,6 +518,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 				       IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
 				       crashk_res.start, crashk_res.end,
 				       kbuf, func);
+	else if (kbuf->top_down)
+		return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
 	else
 		return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
-- 
2.13.6
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-18 2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He @ 2018-07-18 22:33 ` Andrew Morton 2018-07-19 15:17 ` Baoquan He 0 siblings, 1 reply; 18+ messages in thread From: Andrew Morton @ 2018-07-18 22:33 UTC (permalink / raw) To: Baoquan He Cc: linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote: > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which > is used to load kernel/initrd/purgatory is supposed to be allocated from > top to down. This is what we have been doing all along in the old kexec > loading interface and the kexec loading is still default setting in some > distributions. However, the current kexec_file loading interface doesn't > do like this. The function arch_kexec_walk_mem() it calls ignores checking > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through > all resources of System RAM from bottom to up, to try to find memory region > which can contain the specific kexec buffer, then call locate_mem_hole_callback() > to allocate memory in that found memory region from top to down. This brings > confusion especially when KASLR is widely supported , users have to make clear > why kexec/kdump kernel loading position is different between these two > interfaces in order to exclude unnecessary noises. Hence these two interfaces > need be unified on behaviour. 
As far as I can tell, the above is the whole reason for the patchset, yes? To avoid confusing users.

Is that sufficient? Can we instead simplify their lives by providing better documentation or informative printks or better Kconfig text, etc?

And who *are* the people who are performing this configuration? Random system administrators? Linux distro engineers? If the latter then they presumably aren't easily confused!

In other words, I'm trying to understand how much benefit this patchset will provide to our users as a whole.
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-18 22:33 ` Andrew Morton @ 2018-07-19 15:17 ` Baoquan He 2018-07-19 19:44 ` Andrew Morton 2018-07-23 14:34 ` Michal Hocko 0 siblings, 2 replies; 18+ messages in thread From: Baoquan He @ 2018-07-19 15:17 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec Hi Andrew, On 07/18/18 at 03:33pm, Andrew Morton wrote: > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote: > > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which > > is used to load kernel/initrd/purgatory is supposed to be allocated from > > top to down. This is what we have been doing all along in the old kexec > > loading interface and the kexec loading is still default setting in some > > distributions. However, the current kexec_file loading interface doesn't > > do like this. The function arch_kexec_walk_mem() it calls ignores checking > > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through > > all resources of System RAM from bottom to up, to try to find memory region > > which can contain the specific kexec buffer, then call locate_mem_hole_callback() > > to allocate memory in that found memory region from top to down. This brings > > confusion especially when KASLR is widely supported , users have to make clear > > why kexec/kdump kernel loading position is different between these two > > interfaces in order to exclude unnecessary noises. 
Hence these two interfaces
> > need be unified on behaviour.
>
> As far as I can tell, the above is the whole reason for the patchset,
> yes? To avoid confusing users.

In fact, it's not just about avoiding user confusion. Kexec loading and
kexec_file loading essentially do the same thing; it's just that we need
to do kernel image verification on UEFI systems, so the kexec loading
code had to be ported into the kernel.

Kexec has been a formal feature in our distro, and customers owning
those kinds of very large machines can use it to speed up the reboot
process. On UEFI machines, kexec_file loading searches for a place to
put the kernel under 4G, from top to bottom. As we know, the first 4G of
space is the DMA32 zone; DMA, PCI mmcfg, BIOS, etc. all try to consume
it, so it may not be possible to find a usable space there for the
kernel/initrd. Searching from the top of the whole memory space
downwards, we don't have this worry.

At the first post, I posted AKASHI's walk_system_ram_res_rev() version.
Later you suggested using list_head to link a resource's child siblings,
to see what the code change would look like:
http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Then I posted v2:
http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
Rob Herring mentioned that other components with this tree structure
have planned to do the same thing, replacing the singly linked list with
list_head to link resource children and siblings. I quote Rob's words
below; I think this could be another reason.

~~~~~ From Rob
The DT struct device_node also has the same tree structure with
parent, child, sibling pointers and converting to list_head had been
on the todo list for a while. ACPI also has some tree walking
functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
common tree struct and helpers defined either on top of list_head or a
new struct if that saves some size.
~~~~~

> Is that sufficient? Can we instead simplify their lives by providing
> better documentation or informative printks or better Kconfig text,
> etc?
>
> And who *are* the people who are performing this configuration? Random
> system administrators? Linux distro engineers? If the latter then
> they presumably aren't easily confused!

Kexec was invented for kernel developers to speed up their kernel
rebooting. Now high-end server admins, kernel developers and QE are also
keen to use it to reboot large boxes for faster feature testing and bug
debugging. Kernel developers may know the kernel loading position well;
admins or QE might not be aware of it.

> In other words, I'm trying to understand how much benefit this patchset
> will provide to our users as a whole.

Understood. The list_head replacement patch truly involves many code
changes, and it's risky. I am willing to try any idea from reviewers,
and won't insist that it be accepted in the end. If we don't try, we
won't know what it looks like or what impact it may have. I am fine with
taking AKASHI's simple version of walk_system_ram_res_rev() to lower the
risk, even though it could be a little less efficient.

Thanks
Baoquan
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-19 15:17 ` Baoquan He @ 2018-07-19 19:44 ` Andrew Morton 2018-07-25 2:21 ` Baoquan He 2018-07-23 14:34 ` Michal Hocko 1 sibling, 1 reply; 18+ messages in thread From: Andrew Morton @ 2018-07-19 19:44 UTC (permalink / raw) To: Baoquan He Cc: linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He <bhe@redhat.com> wrote: > Hi Andrew, > > On 07/18/18 at 03:33pm, Andrew Morton wrote: > > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote: > > > > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which > > > is used to load kernel/initrd/purgatory is supposed to be allocated from > > > top to down. This is what we have been doing all along in the old kexec > > > loading interface and the kexec loading is still default setting in some > > > distributions. However, the current kexec_file loading interface doesn't > > > do like this. The function arch_kexec_walk_mem() it calls ignores checking > > > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through > > > all resources of System RAM from bottom to up, to try to find memory region > > > which can contain the specific kexec buffer, then call locate_mem_hole_callback() > > > to allocate memory in that found memory region from top to down. 
This brings > > > confusion especially when KASLR is widely supported , users have to make clear > > > why kexec/kdump kernel loading position is different between these two > > > interfaces in order to exclude unnecessary noises. Hence these two interfaces > > > need be unified on behaviour. > > > > As far as I can tell, the above is the whole reason for the patchset, > > yes? To avoid confusing users. > > > In fact, it's not just trying to avoid confusing users. Kexec loading > and kexec_file loading are just do the same thing in essence. Just we > need do kernel image verification on uefi system, have to port kexec > loading code to kernel. > > Kexec has been a formal feature in our distro, and customers owning > those kind of very large machine can make use of this feature to speed > up the reboot process. On uefi machine, the kexec_file loading will > search place to put kernel under 4G from top to down. As we know, the > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > it. It may have possibility to not be able to find a usable space for > kernel/initrd. From the top down of the whole memory space, we don't > have this worry. > > And at the first post, I just posted below with AKASHI's > walk_system_ram_res_rev() version. Later you suggested to use > list_head to link child sibling of resource, see what the code change > looks like. > http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com > > Then I posted v2 > http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com > Rob Herring mentioned that other components which has this tree struct > have planned to do the same thing, replacing the singly linked list with > list_head to link resource child sibling. Just quote Rob's words as > below. I think this could be another reason. > > ~~~~~ From Rob > The DT struct device_node also has the same tree structure with > parent, child, sibling pointers and converting to list_head had been > on the todo list for a while. 
ACPI also has some tree walking > functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a > common tree struct and helpers defined either on top of list_head or a > ~~~~~ > new struct if that saves some size. Please let's get all this into the changelogs? > > > > Is that sufficient? Can we instead simplify their lives by providing > > better documentation or informative printks or better Kconfig text, > > etc? > > > > And who *are* the people who are performing this configuration? Random > > system administrators? Linux distro engineers? If the latter then > > they presumably aren't easily confused! > > Kexec was invented for kernel developer to speed up their kernel > rebooting. Now high end sever admin, kernel developer and QE are also > keen to use it to reboot large box for faster feature testing, bug > debugging. Kernel dev could know this well, about kernel loading > position, admin or QE might not be aware of it very well. > > > > > In other words, I'm trying to understand how much benefit this patchset > > will provide to our users as a whole. > > Understood. The list_head replacing patch truly involes too many code > changes, it's risky. I am willing to try any idea from reviewers, won't > persuit they have to be accepted finally. If don't have a try, we don't > know what it looks like, and what impact it may have. I am fine to take > AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even > though it could be a little bit low efficient. The larger patch produces a better result. We can handle it ;) -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
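The core of the series — replacing struct resource's singly linked ->sibling pointer with a list_head so that walk_system_ram_res_rev() can iterate top-down — is easier to see in miniature. The userspace sketch below is illustrative only: res_node, list_init and find_top_down are invented names, not the kernel's API. It just shows why a doubly linked sibling list makes a reverse (top-down) search trivial, where a singly linked list would need recursion or repeated rescans from the head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal doubly linked list modeled on the kernel's struct list_head. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Toy resource: siblings hang off the parent's child list in ascending
 * address order, as iomem_resource's children do. */
struct res_node {
	uint64_t start, end;          /* inclusive address range */
	struct list_head sibling;     /* links into the parent's child list */
};

/* Walk siblings from highest to lowest address and return the first
 * region with at least `size` bytes -- the essence of the top-down
 * search that walk_system_ram_res_rev() would enable. */
static struct res_node *find_top_down(struct list_head *children, uint64_t size)
{
	struct list_head *p;

	for (p = children->prev; p != children; p = p->prev) {
		struct res_node *r = container_of(p, struct res_node, sibling);
		if (r->end - r->start + 1 >= size)
			return r;
	}
	return NULL;
}
```

With the old singly linked ->sibling there is no ->prev to follow, which is what made the simpler non-list_head version of walk_system_ram_res_rev() less efficient.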
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-19 19:44 ` Andrew Morton @ 2018-07-25 2:21 ` Baoquan He 0 siblings, 0 replies; 18+ messages in thread From: Baoquan He @ 2018-07-25 2:21 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec Hi Andrew, On 07/19/18 at 12:44pm, Andrew Morton wrote: > On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He <bhe@redhat.com> wrote: > > > As far as I can tell, the above is the whole reason for the patchset, > > > yes? To avoid confusing users. > > > > > > In fact, it's not just trying to avoid confusing users. Kexec loading > > and kexec_file loading are just do the same thing in essence. Just we > > need do kernel image verification on uefi system, have to port kexec > > loading code to kernel. > > > > Kexec has been a formal feature in our distro, and customers owning > > those kind of very large machine can make use of this feature to speed > > up the reboot process. On uefi machine, the kexec_file loading will > > search place to put kernel under 4G from top to down. As we know, the > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > it. It may have possibility to not be able to find a usable space for > > kernel/initrd. From the top down of the whole memory space, we don't > > have this worry. > > > > And at the first post, I just posted below with AKASHI's > > walk_system_ram_res_rev() version. 
Later you suggested to use > > list_head to link child sibling of resource, see what the code change > > looks like. > > http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com > > > > Then I posted v2 > > http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com > > Rob Herring mentioned that other components which has this tree struct > > have planned to do the same thing, replacing the singly linked list with > > list_head to link resource child sibling. Just quote Rob's words as > > below. I think this could be another reason. > > > > ~~~~~ From Rob > > The DT struct device_node also has the same tree structure with > > parent, child, sibling pointers and converting to list_head had been > > on the todo list for a while. ACPI also has some tree walking > > functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a > > common tree struct and helpers defined either on top of list_head or a > > ~~~~~ > > new struct if that saves some size. > > Please let's get all this into the changelogs? Sorry for the late reply; I was handling some urgent customer hotplug issues. I am rewriting all the change logs and the cover letter, and found I was wrong about the 2nd reason. The current kexec_file_load calls kexec_locate_mem_hole() to go through all System RAM regions; if one region is larger than the size of the kernel or initrd, it will search for a position in that region from top to down. Since kexec will jump into the 2nd kernel and doesn't need to care about the 1st kernel's data, we can always find a usable space to load the kexec kernel/initrd under 4G. So the only reason for this patch is to keep consistent with kexec_load and avoid confusion. And since x86 5-level paging mode has been added, we have another issue with top-down searching in the whole system RAM: we support dynamic switching between 4-level and 5-level. Namely, for a kernel compiled with 5-level support, 'no5lvl' can be added to the command line to force 4-level.
Then jumping from a 5-level kernel to a 4-level kernel becomes a problem: e.g. we load the kernel at the top of system RAM in 5-level paging mode, which might be above 64TB, then try to jump to a 4-level kernel whose upper limit is 64TB. For this case, we need to add a limit for kexec kernel loading when running in a 5-level kernel. All this mess makes me hesitate to choose a proper method. Maybe I should drop this patchset. > > > > > > Is that sufficient? Can we instead simplify their lives by providing > > > better documentation or informative printks or better Kconfig text, > > > etc? > > > > > > And who *are* the people who are performing this configuration? Random > > > system administrators? Linux distro engineers? If the latter then > > > they presumably aren't easily confused! > > > > Kexec was invented for kernel developer to speed up their kernel > > rebooting. Now high end sever admin, kernel developer and QE are also > > keen to use it to reboot large box for faster feature testing, bug > > debugging. Kernel dev could know this well, about kernel loading > > position, admin or QE might not be aware of it very well. > > > > > > > > In other words, I'm trying to understand how much benefit this patchset > > > will provide to our users as a whole. > > > > Understood. The list_head replacing patch truly involes too many code > > changes, it's risky. I am willing to try any idea from reviewers, won't > > persuit they have to be accepted finally. If don't have a try, we don't > > know what it looks like, and what impact it may have. I am fine to take > > AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even > > though it could be a little bit low efficient. > > The larger patch produces a better result. We can handle it ;) For this issue, if we stop changing the kexec top-down searching code, I am not sure whether we should still post the list_head replacement patches separately.
Thanks Baoquan
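The 5-level concern raised above amounts to clamping the top of the search range. A hypothetical sketch (the constant and function name are invented for illustration, not kernel API): if the kernel being loaded might run with 4-level paging, whose limit per the discussion is 64 TB, a top-down search must not place it above that boundary.

```c
#include <assert.h>
#include <stdint.h>

/* 64 TB = 2^46 bytes, the 4-level upper limit cited in the thread. */
#define PADDR_LIMIT_4LEVEL (1ULL << 46)

/* Hypothetical helper: cap the upper bound of a top-down kexec search
 * when the target kernel may only support 4-level paging. */
static uint64_t clamp_search_top(uint64_t ram_top, int target_may_be_4level)
{
	if (target_may_be_4level && ram_top > PADDR_LIMIT_4LEVEL)
		return PADDR_LIMIT_4LEVEL;
	return ram_top;
}
```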
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-19 15:17 ` Baoquan He 2018-07-19 19:44 ` Andrew Morton @ 2018-07-23 14:34 ` Michal Hocko 2018-07-25 6:48 ` Baoquan He 1 sibling, 1 reply; 18+ messages in thread From: Michal Hocko @ 2018-07-23 14:34 UTC (permalink / raw) To: Baoquan He Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On Thu 19-07-18 23:17:53, Baoquan He wrote: > Kexec has been a formal feature in our distro, and customers owning > those kind of very large machine can make use of this feature to speed > up the reboot process. On uefi machine, the kexec_file loading will > search place to put kernel under 4G from top to down. As we know, the > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > it. It may have possibility to not be able to find a usable space for > kernel/initrd. From the top down of the whole memory space, we don't > have this worry. I do not have the full context here but let me note that you should be careful when doing top-down reservation because you can easily get into hotplugable memory and break the hotremove usecase. We even warn when this is done. See memblock_find_in_range_node -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-23 14:34 ` Michal Hocko @ 2018-07-25 6:48 ` Baoquan He 2018-07-26 12:59 ` Michal Hocko 0 siblings, 1 reply; 18+ messages in thread From: Baoquan He @ 2018-07-25 6:48 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On 07/23/18 at 04:34pm, Michal Hocko wrote: > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > Kexec has been a formal feature in our distro, and customers owning > > those kind of very large machine can make use of this feature to speed > > up the reboot process. On uefi machine, the kexec_file loading will > > search place to put kernel under 4G from top to down. As we know, the > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > it. It may have possibility to not be able to find a usable space for > > kernel/initrd. From the top down of the whole memory space, we don't > > have this worry. > > I do not have the full context here but let me note that you should be > careful when doing top-down reservation because you can easily get into > hotplugable memory and break the hotremove usecase. We even warn when > this is done. See memblock_find_in_range_node Kexec read kernel/initrd file into buffer, just search usable positions for them to do the later copying. 
You can see struct kexec_segment below. For the old kexec_load, kernel/initrd are read into a user space buffer; @buf stores the user space buffer address, and @mem stores the position where kernel/initrd will be put. In kernel, it calls kimage_load_normal_segment() to copy the user space buffer to intermediate pages which are allocated with flag GFP_KERNEL. These intermediate pages are recorded as entries; later, when the user executes "kexec -e" to trigger the kexec jump, it does the final copying from the intermediate pages to the real destination pages which @mem points to. This is because we can't touch the existing data in the 1st kernel when doing kexec kernel loading. With my understanding, GFP_KERNEL will make those intermediate pages be allocated inside the immovable area, so it won't impact hotplugging. But the @mem we searched in the whole system RAM might be lost along with hotplug. Hence we need to do the kexec kernel loading again when a hotplug event is detected.

#define KEXEC_CONTROL_MEMORY_GFP (GFP_KERNEL | __GFP_NORETRY)

struct kexec_segment {
	/*
	 * This pointer can point to user memory if kexec_load() system
	 * call is used or will point to kernel memory if
	 * kexec_file_load() system call is used.
	 *
	 * Use ->buf when expecting to deal with user memory and use ->kbuf
	 * when expecting to deal with kernel memory.
	 */
	union {
		void __user *buf;
		void *kbuf;
	};
	size_t bufsz;
	unsigned long mem;
	size_t memsz;
};

Thanks Baoquan
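The two-phase copy described above can be modeled in a few lines of userspace C. This is a deliberately simplified sketch (toy_segment, toy_load and toy_execute are invented names, not kernel code): load time copies the source buffer into freshly allocated intermediate memory, and only the later "execute" step writes the final destination — which is why the first kernel's data at @mem stays untouched until the actual jump.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of a kexec segment: source buffer, intermediate pages,
 * and the final destination found by the System RAM search. */
struct toy_segment {
	const void *buf;      /* source (user space buffer in the kernel) */
	size_t bufsz;
	void *intermediate;   /* stands in for the GFP_KERNEL pages */
	void *mem;            /* final destination address */
	size_t memsz;
};

/* Load phase: copy the source into intermediate storage only;
 * the destination at ->mem is not written yet. */
static int toy_load(struct toy_segment *s)
{
	s->intermediate = calloc(1, s->memsz);
	if (!s->intermediate)
		return -1;
	memcpy(s->intermediate, s->buf, s->bufsz);
	return 0;
}

/* "kexec -e" phase: only now is the destination overwritten. */
static void toy_execute(struct toy_segment *s)
{
	memcpy(s->mem, s->intermediate, s->memsz);
}
```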
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-25 6:48 ` Baoquan He @ 2018-07-26 12:59 ` Michal Hocko 2018-07-26 13:09 ` Baoquan He 0 siblings, 1 reply; 18+ messages in thread From: Michal Hocko @ 2018-07-26 12:59 UTC (permalink / raw) To: Baoquan He Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On Wed 25-07-18 14:48:13, Baoquan He wrote: > On 07/23/18 at 04:34pm, Michal Hocko wrote: > > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > > Kexec has been a formal feature in our distro, and customers owning > > > those kind of very large machine can make use of this feature to speed > > > up the reboot process. On uefi machine, the kexec_file loading will > > > search place to put kernel under 4G from top to down. As we know, the > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > > it. It may have possibility to not be able to find a usable space for > > > kernel/initrd. From the top down of the whole memory space, we don't > > > have this worry. > > > > I do not have the full context here but let me note that you should be > > careful when doing top-down reservation because you can easily get into > > hotplugable memory and break the hotremove usecase. We even warn when > > this is done. See memblock_find_in_range_node > > Kexec read kernel/initrd file into buffer, just search usable positions > for them to do the later copying. 
You can see below struct kexec_segment, > for the old kexec_load, kernel/initrd are read into user space buffer, > the @buf stores the user space buffer address, @mem stores the position > where kernel/initrd will be put. In kernel, it calls > kimage_load_normal_segment() to copy user space buffer to intermediate > pages which are allocated with flag GFP_KERNEL. These intermediate pages > are recorded as entries, later when user execute "kexec -e" to trigger > kexec jumping, it will do the final copying from the intermediate pages > to the real destination pages which @mem pointed. Because we can't touch > the existed data in 1st kernel when do kexec kernel loading. With my > understanding, GFP_KERNEL will make those intermediate pages be > allocated inside immovable area, it won't impact hotplugging. But the > @mem we searched in the whole system RAM might be lost along with > hotplug. Hence we need do kexec kernel again when hotplug event is > detected. I am not sure I am following. If @mem is placed at movable node then the memory hotremove simply won't work, because we are seeing reserved pages and do not know what to do about them. They are not migrateable. Allocating intermediate pages from other nodes doesn't really help. The memblock code warns exactly for that reason. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-26 12:59 ` Michal Hocko @ 2018-07-26 13:09 ` Baoquan He 2018-07-26 13:12 ` Michal Hocko 0 siblings, 1 reply; 18+ messages in thread From: Baoquan He @ 2018-07-26 13:09 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On 07/26/18 at 02:59pm, Michal Hocko wrote: > On Wed 25-07-18 14:48:13, Baoquan He wrote: > > On 07/23/18 at 04:34pm, Michal Hocko wrote: > > > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > > > Kexec has been a formal feature in our distro, and customers owning > > > > those kind of very large machine can make use of this feature to speed > > > > up the reboot process. On uefi machine, the kexec_file loading will > > > > search place to put kernel under 4G from top to down. As we know, the > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > > > it. It may have possibility to not be able to find a usable space for > > > > kernel/initrd. From the top down of the whole memory space, we don't > > > > have this worry. > > > > > > I do not have the full context here but let me note that you should be > > > careful when doing top-down reservation because you can easily get into > > > hotplugable memory and break the hotremove usecase. We even warn when > > > this is done. See memblock_find_in_range_node > > > > Kexec read kernel/initrd file into buffer, just search usable positions > > for them to do the later copying. 
You can see below struct kexec_segment, > > for the old kexec_load, kernel/initrd are read into user space buffer, > > the @buf stores the user space buffer address, @mem stores the position > > where kernel/initrd will be put. In kernel, it calls > > kimage_load_normal_segment() to copy user space buffer to intermediate > > pages which are allocated with flag GFP_KERNEL. These intermediate pages > > are recorded as entries, later when user execute "kexec -e" to trigger > > kexec jumping, it will do the final copying from the intermediate pages > > to the real destination pages which @mem pointed. Because we can't touch > > the existed data in 1st kernel when do kexec kernel loading. With my > > understanding, GFP_KERNEL will make those intermediate pages be > > allocated inside immovable area, it won't impact hotplugging. But the > > @mem we searched in the whole system RAM might be lost along with > > hotplug. Hence we need do kexec kernel again when hotplug event is > > detected. > > I am not sure I am following. If @mem is placed at movable node then the > memory hotremove simply won't work, because we are seeing reserved pages > and do not know what to do about them. They are not migrateable. > Allocating intermediate pages from other nodes doesn't really help. OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove in 1st kernel, it does impact the kernel which kexec jump into if kernel is at top of system RAM and the top RAM is in movable node. > > The memblock code warns exactly for that reason. > -- > Michal Hocko > SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-26 13:09 ` Baoquan He @ 2018-07-26 13:12 ` Michal Hocko 2018-07-26 13:14 ` Michal Hocko 0 siblings, 1 reply; 18+ messages in thread From: Michal Hocko @ 2018-07-26 13:12 UTC (permalink / raw) To: Baoquan He Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On Thu 26-07-18 21:09:04, Baoquan He wrote: > On 07/26/18 at 02:59pm, Michal Hocko wrote: > > On Wed 25-07-18 14:48:13, Baoquan He wrote: > > > On 07/23/18 at 04:34pm, Michal Hocko wrote: > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > > > > Kexec has been a formal feature in our distro, and customers owning > > > > > those kind of very large machine can make use of this feature to speed > > > > > up the reboot process. On uefi machine, the kexec_file loading will > > > > > search place to put kernel under 4G from top to down. As we know, the > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > > > > it. It may have possibility to not be able to find a usable space for > > > > > kernel/initrd. From the top down of the whole memory space, we don't > > > > > have this worry. > > > > > > > > I do not have the full context here but let me note that you should be > > > > careful when doing top-down reservation because you can easily get into > > > > hotplugable memory and break the hotremove usecase. We even warn when > > > > this is done. 
See memblock_find_in_range_node > > > > > > Kexec read kernel/initrd file into buffer, just search usable positions > > > for them to do the later copying. You can see below struct kexec_segment, > > > for the old kexec_load, kernel/initrd are read into user space buffer, > > > the @buf stores the user space buffer address, @mem stores the position > > > where kernel/initrd will be put. In kernel, it calls > > > kimage_load_normal_segment() to copy user space buffer to intermediate > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages > > > are recorded as entries, later when user execute "kexec -e" to trigger > > > kexec jumping, it will do the final copying from the intermediate pages > > > to the real destination pages which @mem pointed. Because we can't touch > > > the existed data in 1st kernel when do kexec kernel loading. With my > > > understanding, GFP_KERNEL will make those intermediate pages be > > > allocated inside immovable area, it won't impact hotplugging. But the > > > @mem we searched in the whole system RAM might be lost along with > > > hotplug. Hence we need do kexec kernel again when hotplug event is > > > detected. > > > > I am not sure I am following. If @mem is placed at movable node then the > > memory hotremove simply won't work, because we are seeing reserved pages > > and do not know what to do about them. They are not migrateable. > > Allocating intermediate pages from other nodes doesn't really help. > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove > in 1st kernel, it does impact the kernel which kexec jump into if kernel > is at top of system RAM and the top RAM is in movable node. It will affect the 1st kernel (which does the memblock allocation top-down) as well. For reasons mentioned above. 
-- Michal Hocko SUSE Labs
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-26 13:12 ` Michal Hocko @ 2018-07-26 13:14 ` Michal Hocko 2018-07-26 13:37 ` Baoquan He 0 siblings, 1 reply; 18+ messages in thread From: Michal Hocko @ 2018-07-26 13:14 UTC (permalink / raw) To: Baoquan He Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On Thu 26-07-18 15:12:42, Michal Hocko wrote: > On Thu 26-07-18 21:09:04, Baoquan He wrote: > > On 07/26/18 at 02:59pm, Michal Hocko wrote: > > > On Wed 25-07-18 14:48:13, Baoquan He wrote: > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote: > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > > > > > Kexec has been a formal feature in our distro, and customers owning > > > > > > those kind of very large machine can make use of this feature to speed > > > > > > up the reboot process. On uefi machine, the kexec_file loading will > > > > > > search place to put kernel under 4G from top to down. As we know, the > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > > > > > it. It may have possibility to not be able to find a usable space for > > > > > > kernel/initrd. From the top down of the whole memory space, we don't > > > > > > have this worry. > > > > > > > > > > I do not have the full context here but let me note that you should be > > > > > careful when doing top-down reservation because you can easily get into > > > > > hotplugable memory and break the hotremove usecase. 
We even warn when > > > > > this is done. See memblock_find_in_range_node > > > > > > > > Kexec read kernel/initrd file into buffer, just search usable positions > > > > for them to do the later copying. You can see below struct kexec_segment, > > > > for the old kexec_load, kernel/initrd are read into user space buffer, > > > > the @buf stores the user space buffer address, @mem stores the position > > > > where kernel/initrd will be put. In kernel, it calls > > > > kimage_load_normal_segment() to copy user space buffer to intermediate > > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages > > > > are recorded as entries, later when user execute "kexec -e" to trigger > > > > kexec jumping, it will do the final copying from the intermediate pages > > > > to the real destination pages which @mem pointed. Because we can't touch > > > > the existed data in 1st kernel when do kexec kernel loading. With my > > > > understanding, GFP_KERNEL will make those intermediate pages be > > > > allocated inside immovable area, it won't impact hotplugging. But the > > > > @mem we searched in the whole system RAM might be lost along with > > > > hotplug. Hence we need do kexec kernel again when hotplug event is > > > > detected. > > > > > > I am not sure I am following. If @mem is placed at movable node then the > > > memory hotremove simply won't work, because we are seeing reserved pages > > > and do not know what to do about them. They are not migrateable. > > > Allocating intermediate pages from other nodes doesn't really help. > > > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact hotremove > > in 1st kernel, it does impact the kernel which kexec jump into if kernel > > is at top of system RAM and the top RAM is in movable node. > > It will affect the 1st kernel (which does the memblock allocation > top-down) as well. For reasons mentioned above. And btw. 
in the ideal world, we would restrict the memblock allocation top-down from the non-movable nodes. But I do not think we have that information ready at the time when the reservation is done. -- Michal Hocko SUSE Labs
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-26 13:14 ` Michal Hocko @ 2018-07-26 13:37 ` Baoquan He 2018-07-26 14:01 ` Michal Hocko 0 siblings, 1 reply; 18+ messages in thread From: Baoquan He @ 2018-07-26 13:37 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On 07/26/18 at 03:14pm, Michal Hocko wrote: > On Thu 26-07-18 15:12:42, Michal Hocko wrote: > > On Thu 26-07-18 21:09:04, Baoquan He wrote: > > > On 07/26/18 at 02:59pm, Michal Hocko wrote: > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote: > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote: > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > > > > > > Kexec has been a formal feature in our distro, and customers owning > > > > > > > those kind of very large machine can make use of this feature to speed > > > > > > > up the reboot process. On uefi machine, the kexec_file loading will > > > > > > > search place to put kernel under 4G from top to down. As we know, the > > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > > > > > > it. It may have possibility to not be able to find a usable space for > > > > > > > kernel/initrd. From the top down of the whole memory space, we don't > > > > > > > have this worry. 
> > > > > > > > > > > > I do not have the full context here but let me note that you should be > > > > > > careful when doing top-down reservation because you can easily get into > > > > > > hotplugable memory and break the hotremove usecase. We even warn when > > > > > > this is done. See memblock_find_in_range_node > > > > > > > > > > Kexec read kernel/initrd file into buffer, just search usable positions > > > > > for them to do the later copying. You can see below struct kexec_segment, > > > > > for the old kexec_load, kernel/initrd are read into user space buffer, > > > > > the @buf stores the user space buffer address, @mem stores the position > > > > > where kernel/initrd will be put. In kernel, it calls > > > > > kimage_load_normal_segment() to copy user space buffer to intermediate > > > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages > > > > > are recorded as entries, later when user execute "kexec -e" to trigger > > > > > kexec jumping, it will do the final copying from the intermediate pages > > > > > to the real destination pages which @mem pointed. Because we can't touch > > > > > the existed data in 1st kernel when do kexec kernel loading. With my > > > > > understanding, GFP_KERNEL will make those intermediate pages be > > > > > allocated inside immovable area, it won't impact hotplugging. But the > > > > > @mem we searched in the whole system RAM might be lost along with > > > > > hotplug. Hence we need do kexec kernel again when hotplug event is > > > > > detected. > > > > > > > > I am not sure I am following. If @mem is placed at movable node then the > > > > memory hotremove simply won't work, because we are seeing reserved pages > > > > and do not know what to do about them. They are not migrateable. > > > > Allocating intermediate pages from other nodes doesn't really help. > > > > > > OK, I forgot the 2nd kernel which kexec jump into. 
> > > It won't impact hotremove in 1st kernel, it does impact the kernel
> > > which kexec jump into if kernel is at top of system RAM and the top
> > > RAM is in movable node.
> >
> > It will affect the 1st kernel (which does the memblock allocation
> > top-down) as well. For reasons mentioned above.
>
> And btw. in the ideal world, we would restrict the memblock allocation
> top-down from the non-movable nodes. But I do not think we have that
> information ready at the time when the reservation is done.

Oh, you may be mixing kexec loading up with kdump kernel loading. For the
kdump kernel, we need to reserve a memory region during bootup with the
memblock allocator. For kexec loading, we just operate after the system is
up, and do not need to reserve any memory region. The memory used to load
them is handled in quite different ways.

Thanks
Baoquan
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-26 13:37 ` Baoquan He @ 2018-07-26 14:01 ` Michal Hocko 2018-07-26 15:10 ` Baoquan He 0 siblings, 1 reply; 18+ messages in thread From: Michal Hocko @ 2018-07-26 14:01 UTC (permalink / raw) To: Baoquan He Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On Thu 26-07-18 21:37:05, Baoquan He wrote: > On 07/26/18 at 03:14pm, Michal Hocko wrote: > > On Thu 26-07-18 15:12:42, Michal Hocko wrote: > > > On Thu 26-07-18 21:09:04, Baoquan He wrote: > > > > On 07/26/18 at 02:59pm, Michal Hocko wrote: > > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote: > > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote: > > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > > > > > > > Kexec has been a formal feature in our distro, and customers owning > > > > > > > > those kind of very large machine can make use of this feature to speed > > > > > > > > up the reboot process. On uefi machine, the kexec_file loading will > > > > > > > > search place to put kernel under 4G from top to down. As we know, the > > > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > > > > > > > it. It may have possibility to not be able to find a usable space for > > > > > > > > kernel/initrd. From the top down of the whole memory space, we don't > > > > > > > > have this worry. 
> > > > > > > > > > > > > > I do not have the full context here but let me note that you should be > > > > > > > careful when doing top-down reservation because you can easily get into > > > > > > > hotplugable memory and break the hotremove usecase. We even warn when > > > > > > > this is done. See memblock_find_in_range_node > > > > > > > > > > > > Kexec read kernel/initrd file into buffer, just search usable positions > > > > > > for them to do the later copying. You can see below struct kexec_segment, > > > > > > for the old kexec_load, kernel/initrd are read into user space buffer, > > > > > > the @buf stores the user space buffer address, @mem stores the position > > > > > > where kernel/initrd will be put. In kernel, it calls > > > > > > kimage_load_normal_segment() to copy user space buffer to intermediate > > > > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages > > > > > > are recorded as entries, later when user execute "kexec -e" to trigger > > > > > > kexec jumping, it will do the final copying from the intermediate pages > > > > > > to the real destination pages which @mem pointed. Because we can't touch > > > > > > the existed data in 1st kernel when do kexec kernel loading. With my > > > > > > understanding, GFP_KERNEL will make those intermediate pages be > > > > > > allocated inside immovable area, it won't impact hotplugging. But the > > > > > > @mem we searched in the whole system RAM might be lost along with > > > > > > hotplug. Hence we need do kexec kernel again when hotplug event is > > > > > > detected. > > > > > > > > > > I am not sure I am following. If @mem is placed at movable node then the > > > > > memory hotremove simply won't work, because we are seeing reserved pages > > > > > and do not know what to do about them. They are not migrateable. > > > > > Allocating intermediate pages from other nodes doesn't really help. > > > > > > > > OK, I forgot the 2nd kernel which kexec jump into. 
> > > > It won't impact hotremove in 1st kernel, it does impact the kernel
> > > > which kexec jump into if kernel is at top of system RAM and the top
> > > > RAM is in movable node.
> > >
> > > It will affect the 1st kernel (which does the memblock allocation
> > > top-down) as well. For reasons mentioned above.
> >
> > And btw. in the ideal world, we would restrict the memblock allocation
> > top-down from the non-movable nodes. But I do not think we have that
> > information ready at the time when the reservation is done.
>
> Oh, you may be mixing kexec loading up with kdump kernel loading. For the
> kdump kernel, we need to reserve a memory region during bootup with the
> memblock allocator. For kexec loading, we just operate after the system is
> up, and do not need to reserve any memory region. The memory used to load
> them is handled in quite different ways.

I didn't know about that. I thought both use the same underlying
reservation mechanism. My bad and sorry for the noise.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required 2018-07-26 14:01 ` Michal Hocko @ 2018-07-26 15:10 ` Baoquan He 0 siblings, 0 replies; 18+ messages in thread From: Baoquan He @ 2018-07-26 15:10 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel, linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec On 07/26/18 at 04:01pm, Michal Hocko wrote: > On Thu 26-07-18 21:37:05, Baoquan He wrote: > > On 07/26/18 at 03:14pm, Michal Hocko wrote: > > > On Thu 26-07-18 15:12:42, Michal Hocko wrote: > > > > On Thu 26-07-18 21:09:04, Baoquan He wrote: > > > > > On 07/26/18 at 02:59pm, Michal Hocko wrote: > > > > > > On Wed 25-07-18 14:48:13, Baoquan He wrote: > > > > > > > On 07/23/18 at 04:34pm, Michal Hocko wrote: > > > > > > > > On Thu 19-07-18 23:17:53, Baoquan He wrote: > > > > > > > > > Kexec has been a formal feature in our distro, and customers owning > > > > > > > > > those kind of very large machine can make use of this feature to speed > > > > > > > > > up the reboot process. On uefi machine, the kexec_file loading will > > > > > > > > > search place to put kernel under 4G from top to down. As we know, the > > > > > > > > > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > > > > > > > > > it. It may have possibility to not be able to find a usable space for > > > > > > > > > kernel/initrd. From the top down of the whole memory space, we don't > > > > > > > > > have this worry. 
> > > > > > > > > > > > > > > > I do not have the full context here but let me note that you should be > > > > > > > > careful when doing top-down reservation because you can easily get into > > > > > > > > hotplugable memory and break the hotremove usecase. We even warn when > > > > > > > > this is done. See memblock_find_in_range_node > > > > > > > > > > > > > > Kexec read kernel/initrd file into buffer, just search usable positions > > > > > > > for them to do the later copying. You can see below struct kexec_segment, > > > > > > > for the old kexec_load, kernel/initrd are read into user space buffer, > > > > > > > the @buf stores the user space buffer address, @mem stores the position > > > > > > > where kernel/initrd will be put. In kernel, it calls > > > > > > > kimage_load_normal_segment() to copy user space buffer to intermediate > > > > > > > pages which are allocated with flag GFP_KERNEL. These intermediate pages > > > > > > > are recorded as entries, later when user execute "kexec -e" to trigger > > > > > > > kexec jumping, it will do the final copying from the intermediate pages > > > > > > > to the real destination pages which @mem pointed. Because we can't touch > > > > > > > the existed data in 1st kernel when do kexec kernel loading. With my > > > > > > > understanding, GFP_KERNEL will make those intermediate pages be > > > > > > > allocated inside immovable area, it won't impact hotplugging. But the > > > > > > > @mem we searched in the whole system RAM might be lost along with > > > > > > > hotplug. Hence we need do kexec kernel again when hotplug event is > > > > > > > detected. > > > > > > > > > > > > I am not sure I am following. If @mem is placed at movable node then the > > > > > > memory hotremove simply won't work, because we are seeing reserved pages > > > > > > and do not know what to do about them. They are not migrateable. > > > > > > Allocating intermediate pages from other nodes doesn't really help. 
> > > > >
> > > > > OK, I forgot the 2nd kernel which kexec jump into. It won't impact
> > > > > hotremove in 1st kernel, it does impact the kernel which kexec jump
> > > > > into if kernel is at top of system RAM and the top RAM is in movable
> > > > > node.
> > > >
> > > > It will affect the 1st kernel (which does the memblock allocation
> > > > top-down) as well. For reasons mentioned above.
> > >
> > > And btw. in the ideal world, we would restrict the memblock allocation
> > > top-down from the non-movable nodes. But I do not think we have that
> > > information ready at the time when the reservation is done.
> >
> > Oh, you may be mixing kexec loading up with kdump kernel loading. For the
> > kdump kernel, we need to reserve a memory region during bootup with the
> > memblock allocator. For kexec loading, we just operate after the system is
> > up, and do not need to reserve any memory region. The memory used to load
> > them is handled in quite different ways.
>
> I didn't know about that. I thought both use the same underlying
> reservation mechanism. My bad and sorry for the noise.

Not at all. It's truly confusing. I often need to take time to recall
those details.
end of thread, other threads:[~2018-07-26 15:10 UTC | newest]

Thread overview: 18+ messages
2018-07-18  2:49 [PATCH v7 0/4] resource: Use list_head to link sibling resource Baoquan He
2018-07-18  2:49 ` [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public Baoquan He
2018-07-18  2:49 ` [PATCH v7 2/4] resource: Use list_head to link sibling resource Baoquan He
2018-07-18  2:49 ` [PATCH v7 3/4] resource: add walk_system_ram_res_rev() Baoquan He
2018-07-18  2:49 ` [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required Baoquan He
2018-07-18 22:33 ` Andrew Morton
2018-07-19 15:17 ` Baoquan He
2018-07-19 19:44 ` Andrew Morton
2018-07-25  2:21 ` Baoquan He
2018-07-23 14:34 ` Michal Hocko
2018-07-25  6:48 ` Baoquan He
2018-07-26 12:59 ` Michal Hocko
2018-07-26 13:09 ` Baoquan He
2018-07-26 13:12 ` Michal Hocko
2018-07-26 13:14 ` Michal Hocko
2018-07-26 13:37 ` Baoquan He
2018-07-26 14:01 ` Michal Hocko
2018-07-26 15:10 ` Baoquan He