review.tizen.org Git - platform/kernel/linux-starfive.git/log

hwrng: via_rng - Fix memory scribbling on some CPUs

It has been reported that on at least one Nano CPU the xstore
instruction will write as many as 16 bytes of data to the output
buffer.

This causes memory corruption as we use rng->priv which is only
4-8 bytes long.

This patch fixes this by using an intermediate buffer on the stack
with at least 16 bytes and aligned to a 16-byte boundary.
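
The bounce-buffer idea described above can be sketched in user space as follows. This is only an illustration under stated assumptions: fake_xstore() is a hypothetical stand-in for the real instruction, and read_rng_word() and is_aligned16() are made-up names, not functions from the via-rng driver.

```c
#include <stdint.h>
#include <string.h>

#define XSTORE_MAX_BYTES 16   /* worst case reported for the Nano */

/* Hypothetical stand-in for xstore: always writes 16 bytes to dst. */
static void fake_xstore(unsigned char *dst)
{
    for (int i = 0; i < XSTORE_MAX_BYTES; i++)
        dst[i] = (unsigned char)(0xA0 + i);
}

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/*
 * Fill *out, which is only 4 bytes wide, without letting the
 * 16-byte store touch anything located after it. The store goes
 * into an aligned stack buffer first, and only the bytes the
 * caller has room for are copied out.
 */
int read_rng_word(uint32_t *out)
{
    _Alignas(16) unsigned char bounce[XSTORE_MAX_BYTES];

    if (!is_aligned16(bounce))
        return -1;

    fake_xstore(bounce);
    memcpy(out, bounce, sizeof(*out));
    return (int)sizeof(*out);
}
```

Calling read_rng_word() on a 4-byte field that is followed by other data leaves that other data untouched, which is the property the patch is after.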

The problem was observed on the following processor:

processor : 0
vendor_id : CentaurHauls
cpu family : 6
model : 15
model name : VIA Nano processor U2250 (1.6GHz Capable)
stepping : 3
cpu MHz : 1600.000
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush acpi mmx fxsr sse sse2 ss tm syscall nx lm constant_tsc up rep_good pni monitor vmx est tm2 ssse3 cx16 xtpr rng rng_en ace ace_en ace2 phe phe_en lahf_lm
bogomips : 3192.08
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:

Tested-by: Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: padlock - Move padlock.h into include/crypto

This patch moves padlock.h from drivers/crypto into include/crypto
so that it may be used by the via-rng driver.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

hwrng: via_rng - Fix asm constraints

The inline asm to invoke xstore did not specify the constraints
correctly. In particular, dx/di should have been marked as output
registers as well as input as they're modified by xstore.

Thanks to Mario Holbe for creating this patch and testing it.

Tested-by: Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: n2 - use __devexit not __exit in n2_unregister_algs

Fixes the Fedora sparc build failure. Thanks to kylem for helping with debugging.

Signed-off-by: Dennis Gilmore <dgilmore@redhat.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: mark crypto workqueues CPU_INTENSIVE

kcrypto_wq and pcrypt->wq's are used to run ciphers and may consume
a considerable amount of CPU cycles. Mark both as CPU_INTENSIVE so that
they don't block other work items.

As the workqueues are primarily used to burn CPU cycles, concurrency
levels shouldn't matter much and are left at 1. A higher value may be
beneficial and needs investigation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: mv_cesa - dont return PTR_ERR() of wrong pointer

Fix a PTR_ERR() return of the wrong pointer

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: ripemd - Set module author and update email address

Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - backlog handling fix

The previous commit, "removed redundant locking", introduced
a bug in backlog handling.
In certain cases, when the async request completion callback
calls complete() on the -EINPROGRESS code, requests are left uncompleted.
This does not happen with users that behave like the crypto test manager,
but it does happen with users that behave like dm-crypt.
The backlog needs to be checked before dequeuing the next request.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: gf128mul - Remove experimental tag

This feature no longer needs the experimental tag.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: af_alg - fix af_alg memory_allocated data type

Change data type to fix warning:

crypto/af_alg.c:35: warning: initialization from incompatible pointer type

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: aesni-intel - Fixed build with binutils 2.16

This patch fixes the problem with 2.16 binutils.

Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com>
Signed-off-by: Adrian Hoban <adrian.hoban@intel.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: af_alg - Make sure sk_security is initialized on accept()ed sockets

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

net: Add missing lockdep class names for af_alg

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

include: Install linux/if_alg.h for user-space crypto API

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-aes - checkpatch --file warning fixes

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-aes - initialize aes module once per request

AES module was initialized for every DMA transaction.
That is redundant.
Now it is initialized once per request.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-aes - unnecessary code removed

Key and IV should always be set before AES operation.
So no need to check if it has changed or not.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-aes - error handling implementation improved

The previous version had no error handling.
Requests could remain uncompleted.

Also, in the case of a DMA error, FLAGS_INIT is unset
and the accelerator will be initialized again.

Buffer size alignment is checked.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-aes - redundant locking is removed

Submitting request involved double locking for enqueuing and
dequeuing. Now it is done under the same lock.

FLAGS_BUSY is now handled under the same lock.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-aes - DMA initialization fixes for OMAP off mode

DMA parameters for constant data were initialized during driver probe().
It seems that those settings are sometimes lost when the device goes into off mode.
This patch performs DMA initialization just before use.
That solves the off mode problems.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: Use scatterwalk_crypto_chain

Use scatterwalk_crypto_chain in favor of locally defined chaining functions.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: scatterwalk - Add scatterwalk_crypto_chain helper

A lot of crypto algorithms implement their own chaining function.
So add a generic one that can be used from all the algorithms that
need scatterlist chaining.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: algif_skcipher - Handle unaligned receive buffer

As it is, if user-space passes through a receive buffer that's not
aligned to the cipher block size, we'll end up encrypting or
decrypting a partial block which causes a spurious EINVAL to be
returned.

This patch fixes this by moving the partial block test after the
af_alg_make_sg call.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: algif_skcipher - Fixed overflow when sndbuf is page aligned

When sk_sndbuf is not a multiple of PAGE_SIZE, the limit tests
in sendmsg fail as the limit variable becomes negative and we're
using an unsigned comparison.

The same thing can happen if sk_sndbuf is lowered after a sendmsg
call.

This patch fixes this by always taking the signed maximum of limit
and 0 before we perform the comparison.

It also rounds the value of sk_sndbuf down to a multiple of PAGE_SIZE
so that we don't end up allocating a page only to use a small number
of bytes in it because we're bound by sk_sndbuf.
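
The signedness problem and the fix described above can be sketched in plain C. This is a minimal illustration, not the algif_skcipher code: PAGE_SIZE_SKETCH and the function names are hypothetical, and the page size of 4096 is an assumption.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096L   /* assumed page size for the sketch */

/* Round a send-buffer size down to a whole number of pages. */
long round_down_to_page(long sndbuf)
{
    return sndbuf & ~(PAGE_SIZE_SKETCH - 1);
}

/*
 * Buggy variant: when used > sndbuf the limit is negative, and the
 * cast turns it into a huge unsigned value, so the check passes.
 */
int room_left_buggy(long sndbuf, long used, size_t want)
{
    long limit = sndbuf - used;

    return want <= (size_t)limit;
}

/*
 * Fixed variant: take the signed maximum of limit and 0 first,
 * then do the unsigned comparison.
 */
int room_left_fixed(long sndbuf, long used, size_t want)
{
    long limit = round_down_to_page(sndbuf) - used;

    if (limit < 0)
        limit = 0;
    return want <= (size_t)limit;
}
```

With sndbuf = 4096 and used = 8192, the buggy check reports room for a 100-byte write while the fixed one does not.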

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: af_alg - Add dependency on NET

Add missing dependency on NET since we require sockets for our
interface.

Should really be a select but kconfig doesn't like that:

net/Kconfig:6:error: found recursive dependency: NET -> NETWORK_FILESYSTEMS -> AFS_FS -> AF_RXRPC -> CRYPTO -> CRYPTO_USER_API_HASH -> CRYPTO_USER_API -> NET

Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: aesni-intel - Fixed build error on x86-32

Exclude AES-GCM code for x86-32 due to heavy usage of 64-bit registers
not available on x86-32.

While at it, fixed unregister order in aesni_exit().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: algif_skcipher - Pass on error from af_alg_make_sg

The error returned from af_alg_make_sg is currently lost and we
always pass on -EINVAL. This patch passes on the underlying error.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - zero-copy scatterlist handling

If the scatterlist has more than one entry, the current driver uses
an aligned buffer to copy data to the accelerator, to tackle possible
issues with DMA and SHA buffer alignment.

This commit adds more intelligence to verify SG alignment and
makes it possible to use DMA directly on the data without using the copy
buffer.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - FLAGS_FIRST is redundant and removed

bufcnt is 0 if there were no update requests before,
which is exactly what FLAGS_FIRST means.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - hash-in-progress is stored in hw format

Hash-in-progress is now stored in hw format.
Only on the final call is the hash converted to the correct format.
This speeds up the copy procedure and will allow the use of OMAP burst mode.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - crypto_ahash_final() now not need to be called.

According to Herbert Xu, a client may not always call
crypto_ahash_final().

If an error occurs during hash calculation, resources are
automatically cleaned up.

But if no hash calculation error occurs and the client never calls
crypto_ahash_final(), the internal buffer is not freed
and the clocks are not disabled.

This patch provides support for atomic crypto_ahash_update() call.
Clocks are now enabled and disabled per update request.

Data buffer is now allocated as a part of request context.
Client is obligated to free it with crypto_free_ahash().

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - removed redundunt locking

Locking for queuing and dequeuing is combined.
test_and_set_bit() is also replaced with checking under dd->lock.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - error handling improved

Introduces DMA error handling.

A DMA error is returned as the result code of the hash request.
Clients need to handle error codes and may retry the hash calculation.

Also, in the case of a DMA error, the SHAM module is set to be re-initialized.
This significantly improves stability against possible HW failures.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - DMA initialization fixes for off mode

DMA parameters for constant data were initialized during driver probe().
It seems that those settings are sometimes lost when the device goes into off mode.
This patch performs DMA initialization just before use.
That solves the off mode problems.

Fixes: NB#202786 - Aegis & SHA1 block off mode changes
Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: omap-sham - uses digest buffer in request context

Currently the driver stores digest results in req->result,
which is provided by the client. But some clients do not set it
until the final() call, which leads to a crash.
The driver now uses an internal buffer to store temporary digest results.

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: aesni-intel - Ported implementation to x86-32

The AES-NI instructions are also available in legacy mode so the 32-bit
architecture may profit from those, too.

To illustrate the performance gain here's a short summary of a dm-crypt
speed test on a Core i7 M620 running at 2.67GHz comparing both assembler
implementations:

x86:                   i586       aes-ni    delta
ECB, 256 bit:     93.8 MB/s   123.3 MB/s   +31.4%
CBC, 256 bit:     84.8 MB/s   262.3 MB/s  +209.3%
LRW, 256 bit:    108.6 MB/s   222.1 MB/s  +104.5%
XTS, 256 bit:    105.0 MB/s   205.5 MB/s   +95.7%

Additionally, due to some minor optimizations, the 64-bit version also
got a minor performance gain as seen below:

x86-64:           old impl.    new impl.    delta
ECB, 256 bit:    121.1 MB/s   123.0 MB/s    +1.5%
CBC, 256 bit:    285.3 MB/s   290.8 MB/s    +1.9%
LRW, 256 bit:    263.7 MB/s   265.3 MB/s    +0.6%
XTS, 256 bit:    251.1 MB/s   255.3 MB/s    +1.7%

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: Makefile clean up

Changed Makefile to use <modules>-y instead of <modules>-objs.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: Use vzalloc

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: algif_skcipher - User-space interface for skcipher operations

This patch adds the af_alg plugin for symmetric key ciphers,
corresponding to the ablkcipher kernel operation type.

Keys can optionally be set through the setsockopt interface.

Once a sendmsg call occurs without MSG_MORE no further writes
may be made to the socket until all previous data has been read.

IVs and whether encryption/decryption is performed can be
set through the setsockopt interface or as a control message
to sendmsg.

The interface is completely synchronous, all operations are
carried out in recvmsg(2) and will complete prior to the system
call returning.

The splice(2) interface supports reading the user-space data directly
without copying (except that the Crypto API itself may copy the data
if alignment is off).

The recvmsg(2) interface supports directly writing to user-space
without additional copying, i.e., the kernel crypto interface will
receive the user-space address as its output SG list.

Thanks to Miloslav Trmac for reviewing this and contributing
fixes and improvements.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>

crypto: algif_hash - User-space interface for hash operations

This patch adds the af_alg plugin for hash, corresponding to
the ahash kernel operation type.

Keys can optionally be set through the setsockopt interface.

Each sendmsg call will finalise the hash unless sent with a MSG_MORE
flag.

Partial hash states can be cloned using accept(2).

The interface is completely synchronous, all operations will
complete prior to the system call returning.

Both sendmsg(2) and splice(2) support reading the user-space
data directly without copying (except that the Crypto API itself
may copy the data if alignment is off).

For now only the splice(2) interface supports performing digest
instead of init/update/final. In future the sendmsg(2) interface
will also be modified to use digest/finup where possible so that
hardware that cannot return a partial hash state can still benefit
from this interface.
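
As a rough user-space illustration of the flow described above (bind to an algorithm, accept(2) an operation socket, send the data without MSG_MORE, then read the digest), here is a minimal sketch. It assumes a Linux system with linux/if_alg.h installed and a kernel that offers "sha1" through this interface. The function name af_alg_sha1() is hypothetical, and the AF_ALG fallback value of 38 is only used if the C library headers do not define it.

```c
#include <linux/if_alg.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_ALG
#define AF_ALG 38   /* fallback for older C library headers */
#endif

/*
 * Hash len bytes of data with SHA-1 through an AF_ALG socket.
 * Returns 20 on success, or -1 if AF_ALG or the algorithm is
 * unavailable or any step fails.
 */
int af_alg_sha1(const void *data, size_t len, unsigned char digest[20])
{
    struct sockaddr_alg sa;
    int tfm, op, ret = -1;

    memset(&sa, 0, sizeof(sa));
    sa.salg_family = AF_ALG;
    strcpy((char *)sa.salg_type, "hash");   /* algorithm type key */
    strcpy((char *)sa.salg_name, "sha1");   /* algorithm name */

    tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    if (tfm < 0)
        return -1;

    if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) == 0) {
        /* accept(2) yields the socket that carries one hash operation */
        op = accept(tfm, NULL, NULL);
        if (op >= 0) {
            /* No MSG_MORE flag, so this send finalises the hash. */
            if (send(op, data, len, 0) == (ssize_t)len &&
                read(op, digest, 20) == 20)
                ret = 20;
            close(op);
        }
    }
    close(tfm);
    return ret;
}
```

A caller should treat -1 as "interface not available here" rather than as a hashing error, since kernels built without the user-space crypto API will refuse the socket() or bind() call.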

Thanks to Miloslav Trmac for reviewing this and contributing
fixes and improvements.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Martin Willi <martin@strongswan.org>

crypto: af_alg - User-space interface for Crypto API

This patch creates the backbone of the user-space interface for
the Crypto API, through a new socket family AF_ALG.

Each session corresponds to one or more connections obtained from
that socket.  The number depends on the number of inputs/outputs
of that particular type of operation.  For most types there will
be a single connection/file descriptor that is used for both input
and output.  AEAD is one of the few that require two inputs.

Each algorithm type will provide its own implementation that plugs
into af_alg.  They're keyed using a string such as "skcipher" or
"hash".

IOW this patch only contains the boring bits that are required
to hold everything together.

Thanks to Miloslav Trmac for reviewing this and contributing
fixes and improvements.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: Martin Willi <martin@strongswan.org>

net - Add AF_ALG macros

This patch adds the socket family/level macros for the yet-to-be-born
AF_ALG family. The AF_ALG family provides the user-space interface
for the kernel crypto API.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>

crypto: rfc4106 - Extending the RC4106 AES-GCM test vectors

Updated RFC4106 AES-GCM testing. Some test vectors were taken from
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/gcm/gcm-test-vectors.tar.gz

Signed-off-by: Adrian Hoban <adrian.hoban@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: aesni-intel - RFC4106 AES-GCM Driver Using Intel New Instructions

This patch adds an optimized RFC4106 AES-GCM implementation for 64-bit
kernels. It supports 128-bit AES key size. This leverages the crypto
AEAD interface type to facilitate a combined AES & GCM operation to
be implemented in assembly code. The assembly code leverages Intel(R)
AES New Instructions and the PCLMULQDQ instruction.

Signed-off-by: Adrian Hoban <adrian.hoban@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com>
Signed-off-by: Erdinc Ozturk <erdinc.ozturk@intel.com>
Signed-off-by: James Guilford <james.guilford@intel.com>
Signed-off-by: Wajdi Feghali <wajdi.k.feghali@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: cast5 - simplify if-statements

I noticed that by factoring out common rounds from the
branches of the if-statements in the encryption and
decryption functions, the executable file size goes down
significantly, for crypto/cast5.ko from 26688 bytes
to 24336 bytes (amd64).
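
The kind of refactoring described above can be sketched as follows. This is an illustration only: toy_round() is a made-up round function, not the real CAST5 round, and the function names are hypothetical. The 12 versus 16 round split follows the CAST5 rule that short keys use the reduced round count.

```c
#include <stdint.h>

/* Toy round function for the sketch, NOT the real CAST5 round. */
static uint32_t toy_round(uint32_t x, int i)
{
    return ((x << 3) | (x >> 29)) ^ (0x9E3779B9u * (uint32_t)(i + 1));
}

/* Before: each branch carries its own copy of the first 12 rounds. */
uint32_t encrypt_duplicated(uint32_t x, int reduced_rounds)
{
    if (!reduced_rounds) {
        for (int i = 0; i < 16; i++)
            x = toy_round(x, i);
    } else {
        for (int i = 0; i < 12; i++)
            x = toy_round(x, i);
    }
    return x;
}

/* After: the common rounds run once, the branch adds the last four. */
uint32_t encrypt_factored(uint32_t x, int reduced_rounds)
{
    for (int i = 0; i < 12; i++)
        x = toy_round(x, i);
    if (!reduced_rounds) {
        for (int i = 12; i < 16; i++)
            x = toy_round(x, i);
    }
    return x;
}
```

Both variants compute the same result for either setting of the flag. With fully unrolled rounds, as is common in cipher code, the duplicated form emits the shared rounds twice, which is where a size reduction of this kind would come from.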

On my test system, I saw a slight speedup. This is the
first time I'm doing such a benchmark - I found a similar
one on the crypto mailing list, and I hope I did it right?

Before:
# cryptsetup create dm-test /dev/hda2 -c cast5-cbc-plain -s 128
Enter passphrase:
# dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50
52428800 bytes (52 MB) copied, 2.43484 s, 21.5 MB/s
# dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50
52428800 bytes (52 MB) copied, 2.4089 s, 21.8 MB/s
# dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50
52428800 bytes (52 MB) copied, 2.41091 s, 21.7 MB/s

After:
# cryptsetup create dm-test /dev/hda2 -c cast5-cbc-plain -s 128
Enter passphrase:
# dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50
52428800 bytes (52 MB) copied, 2.38128 s, 22.0 MB/s
# dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50
52428800 bytes (52 MB) copied, 2.29486 s, 22.8 MB/s
# dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50
52428800 bytes (52 MB) copied, 2.37162 s, 22.1 MB/s

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: hash - Fix async import on shash algorithm

The function shash_async_import did not initialise the descriptor
correctly prior to calling the underlying shash import function.

This patch adds the required initialisation.

Reported-by: Miloslav Trmac <mitr@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Merge branch 'upstream-merge' of git://git./linux/kernel/git/tytso/ext4

* 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
  ext4,jbd2: convert tracepoints to use major/minor numbers
  ext4: optimize orphan_list handling for ext4_setattr
  ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
  ext4: fix compile error in ext4_fallocate()
  ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
  ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
  ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
  ext4: rename {ext,idx}_pblock and inline small extent functions
  ext4: make various ext4 functions be static
  ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
  ext4: fix kernel oops if the journal superblock has a non-zero j_errno
  ext4: update writeback_index based on last page scanned
  ext4: implement writeback livelock avoidance using page tagging
  ext4: tidy up a void argument in inode.c
  ext4: add batched_discard into ext4 feature list
  ext4: Add batched discard support for ext4
  fs: Add FITRIM ioctl
  ext4: Use return value from sb_issue_discard()
  ext4: Check return value of sb_getblk() and friends
  ext4: use bio layer instead of buffer layer in mpage_da_submit_io
  ...

Merge branch 'next' into upstream-merge

Conflicts:
fs/ext4/inode.c
fs/ext4/mballoc.c
include/trace/events/ext4.h

Merge branch 'drm-core-next' of git://git./linux/kernel/git/airlied/drm-2.6

* 'drm-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
  drm/radeon/kms: enable unmappable vram for evergreen
  drm/radeon/kms: fix tiled db height calculation on 6xx/7xx
  drm/radeon/kms: fix handling of tex lookup disable in cs checker on r2xx

Merge branch 'for_linus' of git://git./linux/kernel/git/jack/linux-fs-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits)
  quota: Fix possible oops in __dquot_initialize()
  ext3: Update kernel-doc comments
  jbd/2: fixed typos
  ext2: fixed typo.
  ext3: Fix debug messages in ext3_group_extend()
  jbd: Convert atomic_inc() to get_bh()
  ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate()
  jbd: Fix debug message in do_get_write_access()
  jbd: Check return value of __getblk()
  ext3: Use DIV_ROUND_UP() on group desc block counting
  ext3: Return proper error code on ext3_fill_super()
  ext3: Remove unnecessary casts on bh->b_data
  ext3: Cleanup ext3_setup_super()
  quota: Fix issuing of warnings from dquot_transfer
  quota: fix dquot_disable vs dquot_transfer race v2
  jbd: Convert bitops to buffer fns
  ext3/jbd: Avoid WARN() messages when failing to write the superblock
  jbd: Use offset_in_page() instead of manual calculation
  jbd: Remove unnecessary goto statement
  jbd: Use printk_ratelimited() in journal_alloc_journal_head()
  ...

ext4,jbd2: convert tracepoints to use major/minor numbers

Unfortunately perf can't deal with anything other than direct structure
accesses in the TP_printk() section. It will drop dead when it sees
jbd2_dev_to_name() in the "print fmt" section of the tracepoint.

Addresses-Google-Bug: 3138508

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: optimize orphan_list handling for ext4_setattr

Surprisingly, chown() on ext4 is not an SMP-scalable operation.
The unconditional orphan_del(NULL, inode) in ext4_setattr()
results in significant performance overhead because of the global orphan
mutex, especially in no-journal mode (where orphan_add() is a no-op).
The explicit orphan_del can be skipped when it is not needed.
Results of an fchown() micro-benchmark in no-journal mode:
while (1) {
   iteration++;
   fchown(fd, uid, gid);
   fchown(fd, uid + 1, gid + 1);
}
measured: iterations per millisecond
| nr_tasks | w/o patch | with patch |
|        1 |       142 |        185 |
|        4 |       109 |        642 |

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Merge branch 'next' of git://git./linux/kernel/git/djbw/async_tx

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (48 commits)
  DMAENGINE: move COH901318 to arch_initcall
  dma: imx-dma: fix signedness bug
  dma/timberdale: simplify conditional
  ste_dma40: remove channel_type
  ste_dma40: remove enum for endianess
  ste_dma40: remove TIM_FOR_LINK option
  ste_dma40: move mode_opt to separate config
  ste_dma40: move channel mode to a separate field
  ste_dma40: move priority to separate field
  ste_dma40: add variable to indicate valid dma_cfg
  async_tx: make async_tx channel switching opt-in
  move async raid6 test to lib/Kconfig.debug
  dmaengine: Add Freescale i.MX1/21/27 DMA driver
  intel_mid_dma: change the slave interface
  intel_mid_dma: fix the WARN_ONs
  intel_mid_dma: Add sg list support to DMA driver
  intel_mid_dma: Allow DMAC2 to share interrupt
  intel_mid_dma: Allow IRQ sharing
  intel_mid_dma: Add runtime PM support
  DMAENGINE: define a dummy filter function for ste_dma40
  ...

Merge branch 'viafb-next' of git://github.com/schandinat/linux-2.6

* 'viafb-next' of git://github.com/schandinat/linux-2.6: (29 commits)
  viafb: add initial VX900 support
  viafb: fix hardware acceleration for suspend & resume
  viafb: make suspend and resume work (on all machines?)
  viafb: restore display on resume
  Minimal support for viafb suspend/resume
  viafb: use proper register for colour when doing fill ops
  viafb: add documentation for proc interface
  viafb: rename output devices
  viafb: add a mapping of supported output devices
  viafb: set sync polarity for all output devices
  viafb: add function to change sync polarity per device
  viafb: reduce I2C timeout and delay
  viafb: enable I2C for CRT
  viafb: fix i2c_transfer error handling
  viafb: vt1636 cleanup
  viafb: introduce per output device power management
  viafb: limit LCD code impact
  viafb: add interface for output device configuration
  viafb: merge the remaining output path with enable functions
  viafb: use new device routing
  ...

Merge git://git./linux/kernel/git/dhowells/linux-2.6-mn10300

* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300: (44 commits)
  MN10300: Save frame pointer in thread_info struct rather than global var
  MN10300: Change "Matsushita" to "Panasonic".
  MN10300: Create a defconfig for the ASB2364 board
  MN10300: Update the ASB2303 defconfig
  MN10300: ASB2364: Add support for SMSC911X and SMC911X
  MN10300: ASB2364: Handle the IRQ multiplexer in the FPGA
  MN10300: Generic time support
  MN10300: Specify an ELF HWCAP flag for MN10300 Atomic Operations Unit support
  MN10300: Map userspace atomic op regs as a vmalloc page
  MN10300: And Panasonic AM34 subarch and implement SMP
  MN10300: Delete idle_timestamp from irq_cpustat_t
  MN10300: Make various interrupt priority settings configurable
  MN10300: Optimise do_csum()
  MN10300: Implement atomic ops using atomic ops unit
  MN10300: Make the FPU operate in non-lazy mode under SMP
  MN10300: SMP TLB flushing
  MN10300: Use the [ID]PTEL2 registers rather than [ID]PTEL for TLB control
  MN10300: Make the use of PIDR to mark TLB entries controllable
  MN10300: Rename __flush_tlb*() to local_flush_tlb*()
  MN10300: AM34 erratum requires MMUCTR read and write on exception entry
  ...

Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: usb-audio: automatically detect feedback format
  ASoC: sound/wm9090: add missing __devexit marker
  ASoC: sound/max98088: add missing __devexit marker
  ASoC: sound/ad73311: add missing __devexit marker
  ASoC: fsl - fix build error in pcm030-audio-fabric.c
  sound/oss/sb_ess.c: delete double assignment
  ALSA: hda - Change BTL amp level on some HP notebooks

Merge branch 'perf-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (50 commits)
  perf python scripting: Add futex-contention script
  perf python scripting: Fixup cut'n'paste error in sctop script
  perf scripting: Shut up 'perf record' final status
  perf record: Remove newline character from perror() argument
  perf python scripting: Support fedora 11 (audit 1.7.17)
  perf python scripting: Improve the syscalls-by-pid script
  perf python scripting: print the syscall name on sctop
  perf python scripting: Improve the syscalls-counts script
  perf python scripting: Improve the failed-syscalls-by-pid script
  kprobes: Remove redundant text_mutex lock in optimize
  x86/oprofile: Fix uninitialized variable use in debug printk
  tracing: Fix 'faild' -> 'failed' typo
  perf probe: Fix format specified for Dwarf_Off parameter
  perf trace: Fix detection of script extension
  perf trace: Use $PERF_EXEC_PATH in canned report scripts
  perf tools: Document event modifiers
  perf tools: Remove direct slang.h include
  perf_events: Fix for transaction recovery in group_sched_in()
  perf_events: Revert: Fix transaction recovery in group_sched_in()
  perf, x86: Use NUMA aware allocations for PEBS/BTS/DS allocations
  ...

Merge branch 'module' of git://git./linux/kernel/git/rusty/linux-2.6-for-linus

* 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  NULL-terminate all pci_device_id tables
  (trivial) Fix compiler warning in kernel/modules.c

Merge branch 'akpm-incoming-2'

* akpm-incoming-2: (139 commits)
  epoll: make epoll_wait() use the hrtimer range feature
  select: rename estimate_accuracy() to select_estimate_accuracy()
  Remove duplicate includes from many files
  ramoops: use the platform data structure instead of module params
  kernel/resource.c: handle reinsertion of an already-inserted resource
  kfifo: fix kfifo_alloc() to return a signed int value
  w1: don't allow arbitrary users to remove w1 devices
  alpha: remove dma64_addr_t usage
  mips: remove dma64_addr_t usage
  sparc: remove dma64_addr_t usage
  fuse: use release_pages()
  taskstats: use real microsecond granularity for CPU times
  taskstats: split fill_pid function
  taskstats: separate taskstats commands
  delayacct: align to 8 byte boundary on 64-bit systems
  delay-accounting: reimplement -c for getdelays.c to report information on a target command
  namespaces Kconfig: move namespace menu location after the cgroup
  namespaces Kconfig: remove the cgroup device whitelist experimental tag
  namespaces Kconfig: remove pointless cgroup dependency
  namespaces Kconfig: make namespace a submenu
  ...

Merge branch 'x86-fixes-for-linus' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  percpu: Remove the multi-page alignment facility
  x86-32: Allocate irq stacks seperate from percpu area
  x86-32, mm: Remove duplicated #include
  x86, printk: Get rid of <0> from stack output
  x86, kexec: Make sure to stop all CPUs before exiting the kernel
  x86/vsmp: Eliminate kconfig dependency warning

proc_bus_pci_ioctl: remove pointless BKL usage

The BKL was pushed into this function when it was converted to use the
unlocked_ioctl interface, but nothing that the function touches is
actually protected by the BKL. So just remove the BKL entirely, so that
we finally can get a realistic system build without the BKL being
enabled at all.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ext4: fix compile error in ext4_fallocate()

When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got
the following compile error.

CC [M] fs/ext4/extents.o
fs/ext4/extents.c: In function 'ext4_fallocate':
fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function)
fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once
fs/ext4/extents.c:3772: error: for each function it appears in.)
make[2]: *** [fs/ext4/extents.o] Error 1

The patch fixes this problem.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static

These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them be static.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()

Fix a namespace leak from fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static

Fix a namespace leak by moving the function to the file where it is
used and making it static.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: rename {ext,idx}_pblock and inline small extent functions

Clean up namespace leaks from fs/ext4 and inline the trivial functions
ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock(). These
functions are so trivial that the code size actually shrinks when we
make them inline.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: make various ext4 functions be static

These functions have no need to be exported beyond file context.

No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.

(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()

This is a cleanup to avoid namespace leaks out of fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: fix kernel oops if the journal superblock has a non-zero j_errno

Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transferred to the file
system and then ext4_commit_super() is called to write this to the
disk.

But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters in the ext4 superblock structure.

The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.

Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.

Addresses-Google-Bug: #3054080

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: update writeback_index based on last page scanned

As pointed out in a prior patch, updating the mapping's
writeback_index based on pages written isn't quite right;
what the writeback index is really supposed to reflect is
the next page which should be scanned for writeback during
periodic flush.

As in write_cache_pages(), write_cache_pages_da() does
this scanning for us as we assemble the mpd for later
writeout. If we keep track of the next page after the
current scan, we can easily update writeback_index without
worrying about pages written vs. pages skipped, etc.

Without this, an fsync will reset writeback_index to
0 (its starting index) + however many pages it wrote, which
can mess up the progress of periodic flush.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: implement writeback livelock avoidance using page tagging

This is analogous to Jan Kara's commit,
f446daaea9d4a420d16c606f755f3689dcb2d0ce
mm: implement writeback livelock avoidance using page tagging

but since we forked write_cache_pages, we need to reimplement
it there (and in ext4_da_writepages, since range_cyclic handling
was moved to there)

If you start a large buffered IO to a file, and then issue an
fsync after it, you'll find that fsync does not complete
until the other IO stops.

If you continue re-dirtying the file (say, putting dd
with conv=notrunc in a loop), when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.

This then leads to problems with our writeback_index
update, since it advances it by pages written, and
essentially sets writeback_index off the end of the
file...

With the following patch, we only sync as much as was
dirty at the time of the sync.
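
The tagging idea described above can be illustrated with a small user-space sketch. This is not the kernel code: the flag names PG_DIRTY and PG_TOWRITE and both functions are hypothetical stand-ins, and a plain array replaces the page cache. The first pass snapshots the pages that are dirty when the sync starts; the second pass writes only those, so pages dirtied in the meantime are left for a later sync.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page flag bits for this sketch (not the kernel's tags). */
#define PG_DIRTY   0x1u
#define PG_TOWRITE 0x2u

/* Pass 1: tag every page that is dirty right now. Returns the number tagged. */
size_t tag_dirty_pages(unsigned int *flags, size_t npages)
{
	size_t tagged = 0;

	assert(flags != NULL);
	for (size_t i = 0; i < npages; i++) {
		if (flags[i] & PG_DIRTY) {
			flags[i] |= PG_TOWRITE;
			tagged++;
		}
	}
	return tagged;
}

/*
 * Pass 2: "write" only the tagged pages. After each write, page `redirty`
 * is marked dirty again to simulate a concurrent writer; pass npages (or
 * more) to disable that. Returns the number of pages written.
 */
size_t write_tagged_pages(unsigned int *flags, size_t npages, size_t redirty)
{
	size_t written = 0;

	assert(flags != NULL);
	for (size_t i = 0; i < npages; i++) {
		if (!(flags[i] & PG_TOWRITE))
			continue;	/* dirtied after the snapshot, or clean */
		flags[i] &= ~(PG_TOWRITE | PG_DIRTY);
		written++;
		if (redirty < npages)
			flags[redirty] |= PG_DIRTY;	/* dirty but untagged */
	}
	return written;
}
```

With two pages dirty at the start and a third dirtied during the sync, the sketch writes exactly two pages and leaves the third dirty for the next pass.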

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: tidy up a void argument in inode.c

This doesn't fix anything at all, it just removes a vestige
of prior use from __mpage_da_writepage()

__mpage_da_writepage() had a void * argument left over from
its previous life as a callback; make it reflect the actual type.

Fixing this up makes it slightly more obvious to read, and
enables proper typechecking.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: add batched_discard into ext4 feature list

Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: Add batched discard support for ext4

Walk through allocation groups and trim all free extents. It can be
invoked through the FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSDs
may suffer from performance loss after the whole device has been filled
(which does not mean that the fs is full!).

It searches for free extents in the allocation groups specified by the
Byte range start -> start+len. When a free extent is within this range,
its blocks are marked as used and then trimmed. Afterwards these blocks
are marked as free in the per-group bitmap.

Since fstrim is a long operation it is useful to be able to
interrupt it with a signal. This was added by Dmitry Monakhov.
Thanks, Dmitry.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

fs: Add FITRIM ioctl

Adds a filesystem-independent ioctl to allow implementation of file
system batched discard support. It takes an fstrim_range structure as an
argument. fstrim_range is defined in include/linux/fs.h and its
definition is as follows.

struct fstrim_range {
	__u64 start;
	__u64 len;
	__u64 minlen;
};

start - first Byte to trim
len - number of Bytes to trim from start
minlen - minimum extent length to trim, free extents shorter than this
number of Bytes will be ignored. This will be rounded up to fs
block size.

It is also possible to specify NULL as an argument. In this case the
fields default to the following values:

start = 0;
len = ULLONG_MAX;
minlen = 0;

So it will trim the whole file system at one run.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.
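
As a rough illustration of how user space might call this ioctl, here is a minimal sketch. It assumes a Linux system whose <linux/fs.h> provides FITRIM and struct fstrim_range; the helper names fstrim_whole_fs() and trim_mountpoint() are made up for this example. The range is filled in explicitly with the whole-file-system values listed above rather than relying on a NULL argument.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FITRIM, struct fstrim_range */

/* Hypothetical helper: fill a range that covers the whole file system. */
void fstrim_whole_fs(struct fstrim_range *r)
{
	assert(r != NULL);
	r->start = 0;
	r->len = UINT64_MAX;	/* same value as ULLONG_MAX for a 64-bit field */
	r->minlen = 0;
}

/*
 * Hypothetical helper: trim the file system mounted at `mountpoint`.
 * Returns 0 and stores the number of discarded Bytes in *trimmed,
 * or returns -1 with errno set (for example when the caller lacks
 * the required privilege or the fs does not support FITRIM).
 */
int trim_mountpoint(const char *mountpoint, uint64_t *trimmed)
{
	struct fstrim_range r;
	int fd, ret, saved_errno;

	fd = open(mountpoint, O_RDONLY);
	if (fd < 0)
		return -1;

	fstrim_whole_fs(&r);
	ret = ioctl(fd, FITRIM, &r);
	saved_errno = errno;
	if (ret == 0 && trimmed != NULL)
		*trimmed = r.len;	/* kernel writes back the discarded Bytes */

	close(fd);
	errno = saved_errno;	/* do not let close() hide the ioctl error */
	return ret;
}
```

A caller would typically pass a mount point such as the root of the mounted file system and report *trimmed on success.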

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: Use return value from sb_issue_discard()

Use the return value from sb_issue_discard() as the return value of
ext4_issue_discard(). Since sb_issue_discard() may fail with more
serious errors than just -EOPNOTSUPP, it is worth informing callers of
this function about them so that they can handle error cases properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: Check return value of sb_getblk() and friends

Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() is also likely to fail, so the call to
ext4_forget() should be skipped.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: use bio layer instead of buffer layer in mpage_da_submit_io

Call the block I/O layer directly instead of going through the buffer
layer. This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()

This massively simplifies the ext4_da_writepages() code path by
completely removing mpage_put_bnr_to_bhs(), which is almost 100 lines of
code iterating over a set of pages using pagevec_lookup(), and folds
that functionality into mpage_da_submit_io()'s existing
pagevec_lookup() loop.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: inline walk_page_buffers() into mpage_da_submit_io

Expand the call:

	if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
			      ext4_bh_delay_or_unwritten))
		goto redirty_page;

into mpage_da_submit_io().

This will allow us to merge in mpage_put_bnr_to_bhs() in the next
patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: inline ext4_writepage() into mpage_da_submit_io()

As a preparatory step to switching to bio_submit, inline
ext4_writepage() into mpage_da_submit_io() and then simplify things a
bit. This makes it clearer what mpage_da_submit_io() needs to do.

Also, move the ClearPageChecked(page) call into
__ext4_journalled_writepage(), as a minor bit of cleanup refactoring.

This also allows us to pull i_size_read() and
ext4_should_journal_data() out of the loop, which should be a very
minor CPU savings.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: simplify ext4_writepage()

The actual code in ext4_writepage() is unnecessarily convoluted.
Simplify it so it is easier to understand, but otherwise logically
equivalent.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: call mpage_da_submit_io() from mpage_da_map_blocks()

Eventually we need to completely reorganize the ext4 writepage
callpath, but for now, we simplify things a little by calling
mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
places where we call mpage_da_map_blocks() it is followed up by a call
to mpage_da_submit_io().

We're also a wee bit better with respect to error handling, but there
are still a number of issues where it's not clear what the right thing
to do is when ext4 functions deep in the writeback codepath fail.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: use KMEM_CACHE instead of kmem_cache_create

Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache. This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: use search_dirblock() in ext4_dx_find_entry()

Use the search_dirblock() in ext4_dx_find_entry(). It makes the code
easier to read, and it takes advantage of common code. It also saves
100 bytes or so of text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>

ext4: avoid uninitialized memory references in ext3_htree_next_block()

If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().

We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".

Thanks to Brad Spengler for pointing out a Clang warning that led me to
look more closely at this code. The warning was harmless, but it was
useful in pointing out code that was too ugly to live. This warning was
also reported by Roman Borisov.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>

ext4: remove unused ext4_sb_info members

Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: queue conversion after adding to inode's completed IO list

By queuing the io end on the unwritten workqueue before adding it
to our inode's list of completed IOs, I think we run the risk
of the work getting completed, and the IO freed, before we try
to add it to the inode's i_completed_io_list.

It should be safe to add it to the inode's list of completed
IOs, and -then- queue it for completion, I think.

Thanks to Dave Chinner for pointing out the race.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: don't use ext4_allocation_contexts for tracing

Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off.  In fact, 4 of 5 of these allocations
were only for tracing.  In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.

We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.

I tested this by turning on tracing and running through
xfstests on x86_64.  I did not actually do anything with
the trace output, however.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: fix oops in trace_ext4_mb_release_group_pa

Our QA reported an oops in the ext4_mb_release_group_pa tracing,
and Josef Bacik pointed out that it was because we may have a
non-null but uninitialized ac_inode in the allocation context.

I can reproduce it when running xfstests with ext4 tracepoints on,
on a CONFIG_SLAB_DEBUG kernel.

We call trace_ext4_mb_release_group_pa from 2 places,
ext4_mb_discard_group_preallocations and
ext4_mb_discard_lg_preallocations

In both cases we allocate an ac as a container just for tracing (!)
and never fill in the ac_inode. There's no reason to be assigning,
testing, or printing it as far as I can see, so just remove it from
the tracepoint.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: fix potential infinite loop in ext4_da_writepages()

On linux-2.6.36-rc2, if we execute the following script, we can hang
the system when the /bin/sync command is executed:

========================================================================
#!/bin/sh

echo -n "HANG UP TEST: "
/bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
/sbin/mkfs.ext4 -Fq /tmp/img
/bin/mount -o loop -t ext4 /tmp/img /mnt
/bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
/bin/sync
/bin/umount /mnt
echo "DONE"
exit 0
========================================================================

We can see the following backtrace if we get the kdump when this
hangup occurs:

======================================================================
kthread()
=> bdi_writeback_thread()
   => wb_do_writeback()
      => wb_writeback()
         => writeback_inodes_wb()
            => writeback_sb_inodes()
               => writeback_single_inode()
                  => ext4_da_writepages()  ---+
                                ^ infinite    |
                                |   loop      |
                                +-------------+
======================================================================

The reason why this hangup happens is described as follows:
1) We write the last extent block of the file whose size is the filesystem
   maximum size.
2) "BH_Delay" flag is set on the buffer_head of its block.
3) - the member "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
   - the member "m_len" of struct mpage_da_data is 1
  mpage_put_bnr_to_bhs(), which is called via ext4_da_writepages(),
  cannot clear the "BH_Delay" flag of the buffer_head because the type
  of m_lblk is ext4_lblk_t, so m_lblk + m_len overflows.

  Therefore an infinite loop occurs because ext4_da_writepages()
  cannot write the page (which corresponds to the block) since
  "BH_Delay" flag isn't cleared.
----------------------------------------------------------------------
static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
				 struct ext4_map_blocks *map)
{
	...
	int blocks = map->m_len;
	...
	do {
		// cur_logical = 4294967295
		// map->m_lblk = 4294967295
		// blocks = 1
		// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
		// (cur_logical >= map->m_lblk + blocks) => true
		if (cur_logical >= map->m_lblk + blocks)
			break;
----------------------------------------------------------------------
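
The wraparound in the excerpt above can be reproduced in user space. The following is a sketch under stated assumptions: lblk_t stands in for the 32-bit ext4_lblk_t, both function names are hypothetical, and the 64-bit variant only shows one possible way to avoid the wrap; it is not necessarily the change this patch made.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t lblk_t;	/* stand-in for the 32-bit ext4_lblk_t */

/*
 * Mirrors the quoted check: the end of the range is computed in
 * 32 bits, so UINT_MAX + 1 wraps around to 0.
 */
int past_extent_32bit(lblk_t cur_logical, lblk_t m_lblk, uint32_t blocks)
{
	lblk_t end = m_lblk + blocks;	/* may wrap */

	return cur_logical >= end;
}

/*
 * One possible way to avoid the wrap (an assumption for illustration):
 * widen to 64 bits before adding.
 */
int past_extent_64bit(lblk_t cur_logical, lblk_t m_lblk, uint32_t blocks)
{
	uint64_t end = (uint64_t)m_lblk + blocks;	/* cannot wrap */

	return (uint64_t)cur_logical >= end;
}
```

For the values in the excerpt (cur_logical = m_lblk = 4294967295, blocks = 1), the 32-bit check reports that the block is already past the range, while the widened check still treats it as inside the range.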

NOTE: Mounting with the nodelalloc option will avoid this codepath,
and thus avoid this hang.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: improve llseek error handling for overly large seek offsets

The llseek system call should return EINVAL if passed a seek offset
which results in a write error.  What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based.

If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be
written (write systemcall) is different from the maximum size which can be
sought (lseek systemcall).

For example, the following 2 cases demonstrate the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:

#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>

Table. the max file size which we can write or seek
       at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag|                               |                               |
|      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTENTS_FL        |
|case       \|                               |                               |
+------------+-------------------------------+-------------------------------+
| #1         |   write:      2194719883264   | write:       --------------   |
|            |   seek:       2199023251456   | seek:        --------------   |
+------------+-------------------------------+-------------------------------+
| #2         |   write:      4402345721856   | write:       17592186044415   |
|            |   seek:      17592186044415   | seek:        17592186044415   |
+------------+-------------------------------+-------------------------------+

The differences exist because ext4 has 2 maxbytes values: sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped
maxbytes).  However, generic_file_llseek() uses only the extent-mapped
maxbytes.  (The llseek method in ext4_file_operations is
generic_file_llseek(), which uses sb->s_maxbytes.)

Therefore we create an ext4-specific llseek function which uses both
maxbytes values.

The new function is derived from generic_file_llseek().  If the file
flag "EXT4_EXTENTS_FL" is not set, the function uses
EXT4_SB(inode->i_sb)->s_bitmap_maxbytes in place of
inode->i_sb->s_maxbytes.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>

ext4: don't update sb journal_devnum when RO dev

An ext4 filesystem on a read-only device, with an external journal
which is at a different device number than recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).

For example:
  - ext4 on a software raid which is in read-only mode
  - external journal on a read-write device which has changed device num
  - attempt to mount with -o journal_dev=<new_number>
  - hits BUG_ON(mddev->ro == 1) in md.c

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: use sb_issue_zeroout in ext4_ext_zeroout

Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
own approach to zero out extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: use sb_issue_zeroout in setup_new_group_blocks

Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of the old approach, which involves journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: add interface to advertise ext4 features in sysfs

User-space should have the opportunity to check which features ext4
supports in a particular build. This adds a simple interface by creating
a new "features" directory in /sys/fs/ext4/. Files advertising feature
names can be created in that directory.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

ext4: add support for lazy inode table initialization

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted.  However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possible.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.

This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it took
to zero out the last inode table by a wait multiplier, which can be set
with the (init_itable=n) mount option (default is 10).  We do this so
that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (the request list is empty) it frees the appropriate
structures and exits (and can be created again later by another
filesystem).
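
The throttling rule above amounts to a single multiplication. The sketch below only illustrates that arithmetic; the function names, the millisecond units and the DEFAULT_WAIT_MULT macro are assumptions made for this example, and the kernel's actual bookkeeping may differ.

```c
#include <stdint.h>

/* Default wait multiplier mentioned in the text (init_itable=n, n = 10). */
#define DEFAULT_WAIT_MULT 10u

/*
 * Wait before the next request: the time the last inode table zeroing
 * took, multiplied by the wait multiplier.
 */
uint64_t next_wait_ms(uint64_t last_zeroout_ms, unsigned int wait_mult)
{
	return last_zeroout_ms * wait_mult;
}

/* Absolute time at which the request should be scheduled again. */
uint64_t next_run_at_ms(uint64_t now_ms, uint64_t last_zeroout_ms,
			unsigned int wait_mult)
{
	return now_ms + next_wait_ms(last_zeroout_ms, wait_mult);
}
```

For example, if zeroing the last inode table took 30 ms, the default multiplier gives a 300 ms pause, so the thread spends roughly one part in eleven of its time issuing I/O.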

We do not disturb regular inode allocations in any way; the allocator
simply does not care whether the inode table is zeroed or not. But when
zeroing, we have to skip used inodes, obviously. We should also prevent
new inode allocations from the group while zeroing is under way. For
that we take the alloc_sem lock for writing in ext4_init_inode_table()
and for reading in ext4_claim_inode(), so when we are unlucky and the
allocator hits the group which is currently being zeroed, it just has
to wait.

This can be suppressed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Add helper function for blkdev_issue_zeroout (sb_issue_zeroout)

This is done the same way as helper sb_issue_discard for
blkdev_issue_discard.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

jbd2: Add sanity check for attempts to start handle during umount

An attempt to modify the file system during the call to
jbd2_destroy_journal() can lead to a system lockup. So add some
checking to make it much more obvious when this happens and to
determine where the offending code is located.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>