[Ksummit-2009-discuss] [topic] cpu evacuation framework

Vaidyanathan Srinivasan svaidy at linux.vnet.ibm.com
Fri Jul 10 01:32:56 PDT 2009


There has been recent discussion on LKML about designing a fast cpu
evacuation framework that would serve the various purposes listed below.
The design challenges involve preserving user space policies like
cpusets and uniformly reducing system capacity.  I would like to propose
a discussion on this topic, since future platforms have additional uses
for cpu hotplug and related features, which were previously used only
for handling hardware failures.  Also, idling and offlining a virtual cpu
in virtualized environments have different effects on power, performance
and thermal behaviour.  Hence a discussion at KS could help validate the
requirements and list out viable approaches.

Discussion topics:

* Overloading the cpu hotplug interface:
        - Bulk cpu offline [1]

          Change the unit of offline from one cpu at a time to a cpulist
          or mask.  This will allow optimisations in notifiers and
          subsystems; for example, sched domains can be rebuilt once for
          the whole set of cpus being offlined instead of once per cpu.

	- Offline cpu state [3]

	  A general framework to decide the 'state' of each offline cpu
	  can help with power/thermal reduction, and will also add
	  flexibility in virtualized environments, where offline
	  operates on virtual cpus.

* Force idling of cpus:
	- Scheduler load balancer hacks [4]
	- High priority idle task [6]
	- CPU hard limits [7]

* Varying nature of cpu power
	- DVFS, hardware threads, turbo modes
	- Real-time tasks affecting fairness


* A fast method to move tasks, timers and interrupts deterministically
  away from a given cpu.  A combination of techniques is available
  today to meet some of these requirements, but a more deterministic
  framework could better serve the use cases listed below.

* Use system topology to group related logical cpus together, such as:
        - Threads of a core
        - Cores of a package


* Preserve user space policies like cpusets
* Minimal impact on SMP scheduling fairness

Use cases:

* Thermal management
  - Fast method to force-idle full cores and packages to momentarily
    reduce work and bring down temperature.  This is very useful in
    thermal overcommit situations.

* Reduction in average power
  - Policy/priority based reduction in system capacity in order to
    reduce average power consumption.  (This is not an energy efficiency
    optimisation.)  Since reducing average power is what brings the
    temperature down, this is mostly the same as the previous case.

* Fast configuration changes in virtualized systems
  - Quickly adjust the number of virtual cpus within the OS.

Implementation methods suggested so far:

* Cpu hotplug framework
	- Bulk cpu offline with optimisations

  [1] cpu: Bulk CPU Hotplug support
  [2] pseries: cpu: Reduce the polling interval in __cpu_up()
  [3] Make offline cpus to go to deepest idle state using

* Scheduler load balancer approach to limit capacity
  [4] Saving power by cpu evacuation sched_max_capacity_pct=n

* Cpuset based methods
* Dynamic isolcpu framework
  [5] cpuset: add new API to change cpuset top group's cpus
* High priority idle thread driver
  [6] new ACPI processor driver to force CPUs idle

* CPU cgroup hard limits
  [7] CPU hard limits

All of the above methods have drawbacks, hence a fresh discussion of
the problem may lead to a flexible and extensible framework for future
platforms.

Please let me know your thoughts on these topics.
