sync_file_range(2) is documented to issue writeback only for pages that
are not currently being written. After all the system call has been
created for userspace to be able to issue background writeout and so
waiting for in-flight IO is undesirable there. However commit
ee53a89 ("mm: do_sync_mapping_range integrity fix") switched
do_sync_mapping_range() and thus sync_file_range() to issue writeback in
WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code
relying on WB_SYNC_ALL semantics.
These days do_sync_mapping_range() went away and we can switch
sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should
help PostgreSQL avoid large latency spikes when flushing data in the
background.
Andres measured a 20% increase in transactions per second on an SSD
disk.
Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Andres Freund <andres@anarazel.de>
Tested-By: Andres Freund <andres@anarazel.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
amazingly long NMI handlers from a trinity-induced workload involving
lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
There are any number of things that one might do to make sync() behave
better under high levels of contention, but it is also the case that
multiple concurrent sync() system calls can be satisfied by a single
sys_sync() invocation.
Given that this situation is reminiscent of rcu_barrier(), this commit
applies the rcu_barrier() approach to sys_sync(). This approach uses
a global mutex and a sequence counter. The mutex is held across the
sync() operation, which eliminates contention between concurrent sync()
operations. The counter is incremented at the beginning and end of
each sync() operation, so that it is odd while a sync() operation is in
progress and even otherwise, just like sequence locks.
The code that used to be in sys_sync() is now in do_sync(), and
sys_sync()
now handles the concurrency. The sys_sync() function first takes a
snapshot of the counter, then acquires the mutex, and then takes another
snapshot of the counter. If the values of the two snapshots indicate
that
a full do_sync() executed during the mutex acquisition, the sys_sync()
function releases the mutex and returns ("Our work is done!").
Otherwise,
sys_sync() increments the counter, invokes do_sync(), and increments
the counter again.
This approach allows a single call to do_sync() to satisfy an
arbitrarily
large number of sync() system calls, which should eliminate issues due
to large numbers of concurrent invocations of the sync() system call.
Changes since v1 (https://lkml.org/lkml/2013/7/24/683):
o Add a pair of memory barriers to keep the increments from
bleeding into the do_sync() code. (The failure probability
is insanely low, but when you have several hundred million
devices running Linux, you can expect several hundred instances
of one-in-a-million failures.)
o Actually CC some people who have experience in this area.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Curt Wohlgemuth <curtw@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Reioux <reioux@gmail.com>
Enabling software CRCs on the data blocks can be a significant (30%) performance cost, and for other reasons may not always be desired. CRC is a mechanism aiming to prevent data corruption when enabled (reduce the performance around 30%). So if you disable it (improve the performance) but your data can be corrupted. Use it at your risk.
* ***** SysFs interface :
*
* /sys/module/mmc_core/parameters/crc
*
* Enable / Disable CRC
*
* echo N > /sys/module/mmc_core/parameters/crc (Disabled) or
* echo 0 > /sys/module/mmc_core/parameters/crc (Disabled)
*
* echo Y > /sys/module/mmc_core/parameters/crc (Enabled) or
* echo 1 > /sys/module/mmc_core/parameters/crc (Enabled)
*
*
* ***** (default = Enabled)
Signed-off-by: Pafcholini <pafcholini@gmail.com>
Signed-off-by: Pafcholini <nadyaivanova14@gmail.com>
Asynchronous I/O latency to a solid-state disk greatly increased between the 2.6.32 and 3.0 kernels.
By removing the plug from do_io_submit(), we observed a 34% improvement in the I/O latency.
Unfortunately, at this level, we don't know if the request is to
a rotating disk or not.
Change-Id: I7101df956473ed9fd5dcff18e473dd93b688a5c1
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: linux-aio@kvack.org
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: ahmedradaideh <ahmed.radaideh@gmail.com>
Sleeping for an entire tick adds unnecessary latency to
hotplugging a cpu (cpu_up).
Change-Id: Iab323a79f4048bc9101ecfd368e0f275827ed4ab
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
add_random was implemented for spinning hard disks. It only slows SSDs down. Read here http://wiki.samat.org/SSD for more info.
Signed-off-by: Chester Kener <Cl3Kener@gmail.com>
Signed-off-by: engstk <eng.stk@sapo.pt>
This commit allows to enable/disable via config @jesec Fake Enforcing Status patch or @jcadduono SELinux Force Enforcing/Permissive patch
Signed-off-by: BlackMesa123 <brother12@hotmail.it>
proc: update or inserted cmdline-flags required for proper SafetyNet-support
proc: cmdline: set warranty bit-flags to 0
proc: cmdline: fix invalid handling of value-changing
Change-Id: Idec862bcf9125a79ac9505a938495f114636b4fc
commit 7d6a13f023567d573ac362502bb702eda716e654 upstream.
When we do dquot readahead in log recovery, we do not use a verifier
as the underlying buffer may not have dquots in it. e.g. the
allocation operation hasn't yet been replayed. Hence we do not want
to fail recovery because we detect an operation to be replayed has
not been run yet. This problem was addressed for inodes in commit
d891400 ("xfs: inode buffers may not be valid during recovery
readahead") but the problem was not recognised to exist for dquots
and their buffers as the dquot readahead did not have a verifier.
The result of not using a verifier is that when the buffer is then
next read to replay a dquot modification, the dquot buffer verifier
will only be attached to the buffer if *readahead is not complete*.
Hence we can read the buffer, replay the dquot changes and then add
it to the delwri submission list without it having a verifier
attached to it. This then generates warnings in xfs_buf_ioapply(),
which catches and warns about this case.
Fix this and make it handle the same readahead verifier error cases
as for inode buffers by adding a new readahead verifier that has a
write operation as well as a read operation that marks the buffer as
not done if any corruption is detected. Also make sure we don't run
readahead if the dquot buffer has been marked as cancelled by
recovery.
This will result in readahead either succeeding and the buffer
having a valid write verifier, or readahead failing and the buffer
state requiring the subsequent read to resubmit the IO with the new
verifier. In either case, this will result in the buffer always
ending up with a valid write verifier on it.
Note: we also need to fix the inode buffer readahead error handling
to mark the buffer with EIO. Brian noticed the code I copied from
there wrong during review, so fix it at the same time. Add comments
linking the two functions that handle readahead verifier errors
together so we don't forget this behavioural link in future.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 233135b763db7c64d07b728a9c66745fb0376275 upstream.
This adds a name to each buf_ops structure, so that if
a verifier fails we can print the type of verifier that
failed it. Should be a slight debugging aid, I hope.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Cc: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7d3aa7fe970791f1a674b14572a411accf2f4d4e upstream.
We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 51b07f30a71c27405259a0248206ed4e22adbee2 upstream.
Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.
(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b1438f477934f5a4d5a44df26f3079a7575d5946 upstream.
When a failure due to an inode buffer occurs, the error handling
fails to abort the inode writeback correctly. This can result in the
inode being reclaimed whilst still in the AIL, leading to
use-after-free situations as well as filesystems that cannot be
unmounted as the inode log items left in the AIL never get removed.
Fix this by ensuring fatal errors from xfs_imap_to_bp() result in
the inode flush being aborted correctly.
Reported-by: Shyam Kaushik <shyam@zadarastorage.com>
Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com>
Tested-by: Shyam Kaushik <shyam@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ad747e3b299671e1a53db74963cc6c5f6cdb9f6d upstream.
Commit 96f859d ("libxfs: pack the agfl header structure so
XFS_AGFL_SIZE is correct") allowed the freelist to use the empty
slot at the end of the freelist on 64 bit systems that was not
being used due to sizeof() rounding up the structure size.
This has caused versions of xfs_repair prior to 4.5.0 (which also
has the fix) to report this as a corruption once the filesystem has
been grown. Older kernels can also have problems (seen from a whacky
container/vm management environment) mounting filesystems grown on a
system with a newer kernel than the vm/container it is deployed on.
To avoid this problem, change the initial free list indexes not to
wrap across the end of the AGFL, hence avoiding the initialisation
of agf_fllast to the last index in the AGFL.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>