This patch favours moving tasks towards NUMA node that recorded a higher
number of NUMA faults during active load balancing. Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the nodes CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a PTE scans which is controlled by the numa_balancing_settle_count
sysctl. Once the settle_count number of scans has complete the schedule
is free to place the task on an alternative node if the load is imbalanced.
[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Tunable and use higher faults instead of preferred. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artifically high
count due to very recent faulting activity and may create a bouncing effect.
This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.
In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is is partially a function of the tasks memory
size. This patch alters the semantic of the min and max tunables to be
about tuning the length time it takes to complete a scan of a tasks occupied
virtual address space. Conceptually this is a lot easier to understand. There
is a "sanity" check to ensure the scan rate is never extremely fast based on
the amount of virtual memory that should be scanned in a second. The default
of 2.5G seems arbitrary but it is to have the maximum scan rate after the
patch roughly match the maximum scan rate before the patch was applied.
On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies
to numa_scan_period means that the rate scanning slows depends on HZ which
is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PTE scanning and NUMA hinting fault handling is expensive so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case that
this may never happen and no NUMA hinting fault information will be
captured. We are not ruling out the possibility that something better
can be done here but for now, this patch needs to be reverted and depend
entirely on the scan_delay to avoid punishing short-lived processes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge Linux v3.12-rc4 to fix a conflict and also to refresh the tree
before applying more scheduler patches.
Conflicts:
arch/avr32/include/asm/Kbuild
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull iommu fixes from Joerg Roedel:
"A couple of fixes from the IOMMU side:
- some small fixes for the new ARM-SMMU driver
- a register offset correction for VT-d
- add MAINTAINERS entry for drivers/iommu
Overall no really big or intrusive changes"
* tag 'iommu-fixes-v3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
x86/iommu: correct ICS register offset
MAINTAINERS: add overall IOMMU section
iommu/arm-smmu: don't enable SMMU device until probing has completed
iommu/arm-smmu: fix iommu_present() test in init
iommu/arm-smmu: fix a signedness bug
There's far too much duplication in the __wait_event macros; in order
to fix this introduce ___wait_event() a macro with the capability to
replace most other macros.
With the previous patches changing the various __wait_event*()
implementations to be more uniform; we can now collapse the lot
without also changing generated code.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131002092528.181897111@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 4c663cf ("wait: fix false timeouts when using
wait_event_timeout()") introduced an additional condition check after
a timeout but there's a few issues;
- it forgot one site
- it put the check after the main loop; not at the actual timeout
check.
Cure both; by wrapping the condition (as suggested by Oleg), this
avoids double evaluation of 'condition' which could be quite big.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131002092528.028892896@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's two patterns to check signals in the __wait_event*() macros:
if (!signal_pending(current)) {
schedule();
continue;
}
ret = -ERESTARTSYS;
break;
And the more natural:
if (signal_pending(current)) {
ret = -ERESTARTSYS;
break;
}
schedule();
Change them all into the latter form.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131002092527.956416254@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking changes from David Miller:
1) Multiply in netfilter IPVS can overflow when calculating destination
weight. From Simon Kirby.
2) Use after free fixes in IPVS from Julian Anastasov.
3) SFC driver bug fixes from Daniel Pieczko.
4) Memory leak in pcan_usb_core failure paths, from Alexey Khoroshilov.
5) Locking and encapsulation fixes to serial line CAN driver, from
Andrew Naujoks.
6) Duplex and VF handling fixes to bnx2x driver from Yaniv Rosner,
Eilon Greenstein, and Ariel Elior.
7) In lapb, if no other packets are outstanding, T1 timeouts actually
stall things and no packet gets sent. Fix from Josselin Costanzi.
8) ICMP redirects should not make it to the socket error queues, from
Duan Jiong.
9) Fix bugs in skge DMA mapping error handling, from Nikulas Patocka.
10) Fix setting of VLAN priority field on via-rhine driver, from Roget
Luethi.
11) Fix TX stalls and VLAN promisc programming in be2net driver from
Ajit Khaparde.
12) Packet padding doesn't get handled correctly in new usbnet SG
support code, from Ming Lei.
13) Fix races in netdevice teardown wrt. network namespace closing.
From Eric W. Biederman.
14) Fix potential missed initialization of net_secret if not TCP
connections are openned. From Eric Dumazet.
15) Cinterion PLXX product ID in qmi_wwan driver is wrong, from
Aleksander Morgado.
16) skb_cow_head() can change skb->data and thus packet header pointers,
don't use stale ip_hdr reference in ip_tunnel code.
17) Backend state transition handling fixes in xen-netback, from Paul
Durrant.
18) Packet offset for AH protocol is handled wrong in flow dissector,
from Eric Dumazet.
19) Taking down an fq packet scheduler instance can leave stale packets
in the queues, fix from Eric Dumazet.
20) Fix performance regressions introduced by TCP Small Queues. From
Eric Dumazet.
21) IPV6 GRE tunneling code calculates max_headroom incorrectly, from
Hannes Frederic Sowa.
22) Multicast timer handlers in ipv4 and ipv6 can be the last and final
reference to the ipv4/ipv6 specific network device state, so use the
reference put that will check and release the object if the
reference hits zero. From Salam Noureddine.
23) Fix memory corruption in ip_tunnel driver, and use skb_push()
instead of __skb_push() so that similar bugs are less hard to find.
From Steffen Klassert.
24) Add forgotten hookup of rtnl_ops in SIT and ip6tnl drivers, from
Nicolas Dichtel.
25) fq scheduler doesn't accurately rate limit in certain circumstances,
from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
pkt_sched: fq: rate limiting improvements
ip6tnl: allow to use rtnl ops on fb tunnel
sit: allow to use rtnl ops on fb tunnel
ip_tunnel: Remove double unregister of the fallback device
ip_tunnel_core: Change __skb_push back to skb_push
ip_tunnel: Add fallback tunnels to the hash lists
ip_tunnel: Fix a memory corruption in ip_tunnel_xmit
qlcnic: Fix SR-IOV configuration
ll_temac: Reset dma descriptors indexes on ndo_open
skbuff: size of hole is wrong in a comment
ipv6 mcast: use in6_dev_put in timer handlers instead of __in6_dev_put
ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put
ethernet: moxa: fix incorrect placement of __initdata tag
ipv6: gre: correct calculation of max_headroom
powerpc/83xx: gianfar_ptp: select 1588 clock source through dts file
Revert "powerpc/83xx: gianfar_ptp: select 1588 clock source through dts file"
bonding: Fix broken promiscuity reference counting issue
tcp: TSQ can use a dynamic limit
dm9601: fix IFF_ALLMULTI handling
pkt_sched: fq: qdisc dismantle fixes
...
Since commit c93bdd0e03 ("netvm: allow skb allocation to use PFMEMALLOC
reserves"), hole size is one bit less than what is written in the comment.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull NFS client bugfixes from Trond Myklebust:
- Stable fix for Oopses in the pNFS files layout driver
- Fix a regression when doing a non-exclusive file create on NFSv4.x
- NFSv4.1 security negotiation fixes when looking up the root
filesystem
- Fix a memory ordering issue in the pNFS files layout driver
* tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Give "flavor" an initial value to fix a compile warning
NFSv4.1: try SECINFO_NO_NAME flavs until one works
NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds
NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails
NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method
Merge misc fixes from Andrew Morton.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
pidns: fix free_pid() to handle the first fork failure
ipc,msg: prevent race with rmid in msgsnd,msgrcv
ipc/sem.c: update sem_otime for all operations
mm/hwpoison: fix the lack of one reference count against poisoned page
mm/hwpoison: fix false report on 2nd attempt at page recovery
mm/hwpoison: fix test for a transparent huge page
mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
block: change config option name for cmdline partition parsing
mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
mm: avoid reinserting isolated balloon pages into LRU lists
arch/parisc/mm/fault.c: fix uninitialized variable usage
include/asm-generic/vtime.h: avoid zero-length file
nilfs2: fix issue with race condition of competition between segments for dirty blocks
Documentation/kernel-parameters.txt: replace kernelcore with Movable
mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
kernel/kmod.c: check for NULL in call_usermodehelper_exec()
ipc/sem.c: synchronize the proc interface
ipc/sem.c: optimize sem_lock()
ipc/sem.c: fix race in sem_lock()
mm/compaction.c: periodically schedule when freeing pages
...
Pull char/misc driver fixes from Greg KH:
"Here are some HyperV and MEI driver fixes for 3.12-rc3. They resolve
some issues that people have been reporting for them"
* tag 'char-misc-3.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
Drivers: hv: vmbus: Terminate vmbus version negotiation on timeout
Drivers: hv: util: Correctly support ws2008R2 and earlier
mei: cancel stall timers in mei_reset
mei: bus: stop wait for read during cl state transition
mei: make me client counters less error prone
Commit 638c5115a7949(USBNET: support DMA SG) introduces DMA SG
if the usb host controller is capable of building packet from
discontinuous buffers, but missed handling padding packet when
building DMA SG.
This patch attachs the pre-allocated padding packet at the
end of the sg list, so padding packet can be sent to device
if drivers require that.
Reported-by: David Laight <David.Laight@aculab.com>
Acked-by: Oliver Neukum <oliver@neukum.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull s390 lockref enablement from Heiko Carstens:
"Enabling the new lockless lockref variant on s390 would have been
trivial until Tony Luck added a cpu_relax() call into the
CMPXCHG_LOOP(), with commit d472d9d98b ("lockref: Relax in cmpxchg
loop")
As already mentioned cpu_relax() is very expensive on s390 since it
yields() the current virtual cpu. So we are talking of several
thousand cycles. Considering this enabling the lockless lockref
variant would contradict the intention of the new semantics. And also
some quick measurements show performance regressions of 50% and more.
Simply removing the cpu_relax() call again seems also not very
desireable since Waiman Long reported that for some workloads the call
improved performance by 5%."
* 'lockref' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: enable ARCH_USE_CMPXCHG_LOCKREF
lockref: use arch_mutex_cpu_relax() in CMPXCHG_LOOP()
mutex: replace CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX with simple ifdef
Pull DeviceTree fixes from Rob Herring:
"Clean-up to fix some warnings for !OF builds and spelling fixes in
docs:
- Clean-up openrisc prom.h
- Fix warnings caused by of_irq.h ifdefs
- Spelling fix for Synopsys"
* tag 'devicetree-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dts: Fix misspelling of Synopsys
of: clean-up ifdefs in of_irq.h
openrisc: clean-up prom.h
Linus suggested to replace
#ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
#define arch_mutex_cpu_relax() cpu_relax()
#endif
with just a simple
#ifndef arch_mutex_cpu_relax
# define arch_mutex_cpu_relax() cpu_relax()
#endif
to get rid of CONFIG_HAVE_CPU_RELAX_SIMPLE. So architectures can
simply define arch_mutex_cpu_relax if they want an architecture
specific function instead of having to add a select statement in
their Kconfig in addition.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The current code does not correctly negotiate the version numbers for the util
driver when hosted on earlier hosts. The version numbers presented by this
driver were not compatible with the version numbers supported by Windows Server
2008. Fix this problem.
I would like to thank Olaf Hering (ohering@suse.com) for identifying the problem.
Reported-by: Olaf Hering <ohering@suse.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Determine if we've created a new file by examining the directory change
attribute and/or the O_EXCL flag.
This fixes a regression when doing a non-exclusive create of a new file.
If the FILE_CREATED flag is not set, the atomic_open() command will
perform full file access permissions checks instead of just checking
for MAY_OPEN.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull device-mapper fixes from Mike Snitzer:
"A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error
handling fixes for dm-multipath. A fix for the thin provisioning
target to not expose non-zero discard limits if discards are disabled.
Lastly, add two DM module parameters which allow users to tune the
emergency memory reserves that DM mainatins per device -- this helps
fix a long-standing issue for dm-multipath. The conservative default
reserve for request-based dm-multipath devices (256) has proven
problematic for users with many multipathed SCSI devices but
relatively little memory. To responsibly select a smaller value users
should use the new nr_bios tracepoint info (via commit 75afb352
"block: Add nr_bios to block_rq_remap tracepoint") to determine the
peak number of bios their workloads create"
* tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: add reserved_bio_based_ios module parameter
dm: add reserved_rq_based_ios module parameter
dm: lower bio-based mempool reservation
dm thin: do not expose non-zero discard limits if discards disabled
dm mpath: disable WRITE SAME if it fails
dm-snapshot: fix performance degradation due to small hash size
dm snapshot: workaround for a false positive lockdep warning
dm stats: fix possible counter corruption on 32-bit systems
dm mpath: do not fail path on -ENOSPC
When using per-cpu preempt_count variables we need to save/restore the
preempt_count on context switch (into per task storage; for instance
the old thread_info::preempt_count variable) because of
PREEMPT_ACTIVE.
However, this means that on fork() the preempt_count value of the last
context switch gets copied and if we had a PREEMPT_ACTIVE switch right
before cloning a child task the child task will now too have
PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE
count.
Therefore we need to make init_task_preempt_count() unconditional;
this resets whatever preempt_count we inherited from our parent
process.
Doing so for !per-cpu implementations is harmless.
For !PREEMPT_COUNT kernels we need to be careful not to start life
with an increased preempt_count.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rewrite the preempt_count macros in order to extract the 3 basic
preempt_count value modifiers:
__preempt_count_add()
__preempt_count_sub()
and the new:
__preempt_count_dec_and_test()
And since we're at it anyway, replace the unconventional
$op_preempt_count names with the more conventional preempt_count_$op.
Since these basic operators are equivalent to the previous _notrace()
variants, do away with the _notrace() versions.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to combine the preemption and need_resched test we need to
fold the need_resched information into the preempt_count value.
Since the NEED_RESCHED flag is set across CPUs this needs to be an
atomic operation, however we very much want to avoid making
preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED
infrastructure in place but at 3 sites test it and fold its value into
preempt_count; namely:
- resched_task() when setting TIF_NEED_RESCHED on the current task
- scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
remote task it follows it up with a reschedule IPI
and we can modify the cpu local preempt_count from
there.
- cpu_idle_loop() for when resched_task() found tsk_is_polling().
We use an inverted bitmask to indicate need_resched so that a 0 means
both need_resched and !atomic.
Also remove the barrier() in preempt_enable() between
preempt_enable_no_resched() and preempt_check_resched() to avoid
having to reload the preemption value and allow the compiler to use
the flags of the previuos decrement. I couldn't come up with any sane
reason for this barrier() to be there as preempt_enable_no_resched()
already has a barrier() before doing the decrement.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Replace the single preempt_count() 'function' that's an lvalue with
two proper functions:
preempt_count() - returns the preempt_count value as rvalue
preempt_count_set() - Allows setting the preempt-count value
Also provide preempt_count_ptr() as a convenience wrapper to implement
all modifying operations.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org
[ Fixed build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
regressed several workloads and caused excessive reschedule
interrupts.
The patch in question failed to notice that the x86 code had an
inverted sense of the polling state versus the new generic code (x86:
default polling, generic: default !polling).
Fix the two prominent x86 mwait based idle drivers and introduce a few
new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
usage).
Also switch the idle routines to using tif_need_resched() which is an
immediate TIF_NEED_RESCHED test as opposed to need_resched which will
end up being slightly different.
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: lenb@kernel.org
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>