profile/common/kernel-common.git
11 years ago bcm63xx_enet: fix return value check in bcm_enet_shared_probe()
Wei Yongjun [Wed, 19 Jun 2013 02:32:32 +0000 (10:32 +0800)]
bcm63xx_enet: fix return value check in bcm_enet_shared_probe()

In case of error, the function devm_ioremap_resource() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().

Introduced by commit 0ae99b5fede6f3a8d252d50bb4aba29544295219
(bcm63xx_enet: split DMA channel register accesses)
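
The difference between a NULL test and IS_ERR() can be sketched in user
space. This is a toy model under stated assumptions: the helpers below only
imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() convention, and
toy_ioremap_resource() and toy_probe() are hypothetical stand-ins, not the
driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (imitates the kernel helper). */
static inline void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)ptr;
}

/* Error pointers occupy the last MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a function that never returns NULL on error. */
static void *toy_ioremap_resource(int fail)
{
    static char regs[16];
    return fail ? ERR_PTR(-ENOMEM) : (void *)regs;
}

/* Probe routine using the correct check; a NULL test would miss the error. */
static int toy_probe(int fail)
{
    void *base = toy_ioremap_resource(fail);

    if (IS_ERR(base))
        return (int)PTR_ERR(base);
    return 0;
}
```

An error pointer is non-NULL, so "if (!base)" would let the failure through
and the driver would go on to use an invalid mapping.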

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago staging/rtl8192u: convert skb->tail into skb_tail_pointer(skb)
Isaku Yamahata [Fri, 14 Jun 2013 08:58:35 +0000 (17:58 +0900)]
staging/rtl8192u: convert skb->tail into skb_tail_pointer(skb)

The change set of 27a884dc "[SK_BUFF]: Convert skb->tail to sk_buff_data_t"
converted skb->tail from a pointer into sk_buff_data_t.
Since skb->tail is therefore not always a pointer, the area it points to
should be accessed via skb_tail_pointer().

Found by inspection. Compile tested only.
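
Why the accessor matters can be shown with a toy model of the offset-based
layout. The struct and helpers below are hypothetical and only imitate the
case where NET_SKBUFF_DATA_USES_OFFSET is defined, so that the tail field is
an offset from the buffer start rather than an address.

```c
#include <assert.h>
#include <stddef.h>

/* In the offset-based layout the "data" type is an integer offset. */
typedef unsigned int toy_data_t;

struct toy_skb {
    unsigned char *head;  /* start of the buffer */
    toy_data_t tail;      /* offset from head, not an address */
};

/* Accessor that turns the stored offset back into a usable pointer. */
static unsigned char *toy_skb_tail_pointer(const struct toy_skb *skb)
{
    assert(skb->head != NULL);
    return skb->head + skb->tail;
}

/* Append one byte at the tail, always going through the accessor. */
static void toy_skb_put_byte(struct toy_skb *skb, unsigned char b)
{
    *toy_skb_tail_pointer(skb) = b;
    skb->tail++;
}
```

Code that dereferences the tail field directly only works where the field
happens to be a pointer; going through the accessor works in both layouts.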

Cc: Simon Horman <horms@verge.net.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: devel@driverdev.osuosl.org
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Reviewed-by: Simon Horman <horms@verge.net.au>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago pxa168_eth: convert skb->end into skb_end_pointer(skb)
Isaku Yamahata [Fri, 14 Jun 2013 08:58:34 +0000 (17:58 +0900)]
pxa168_eth: convert skb->end into skb_end_pointer(skb)

The change set of 4305b541, "[SK_BUFF]: Convert skb->end to sk_buff_data_t"
converted skb->end from pointer type to sk_buff_data_t.
The pointed value should be accessed via skb_end_pointer().

Since the arm arch doesn't define NET_SKBUFF_DATA_USES_OFFSET,
skb->end is effectively a pointer there, so this doesn't cause a real
problem. But this patch is good for consistency.

Found by inspection. Compile tested only.

Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago mv643xx_eth.c: convert skb->end into skb_end_pointer(skb)
Isaku Yamahata [Fri, 14 Jun 2013 08:58:33 +0000 (17:58 +0900)]
mv643xx_eth.c: convert skb->end into skb_end_pointer(skb)

The change set of 4305b541 "[SK_BUFF]: Convert skb->end to sk_buff_data_t"
converted skb->end from pointer to sk_buff_data_t.
The pointed value should be accessed via skb_end_pointer().

Since neither the arm nor the ppc arch defines NET_SKBUFF_DATA_USES_OFFSET,
skb->end is effectively a pointer there, so this doesn't cause a real
problem. But this patch is good for consistency.

Found by inspection. Compile tested only.

Cc: Simon Horman <horms@verge.net.au>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net, scsi/cxgb4i: convert skb->transport_header into skb_transport_header(skb)
Isaku Yamahata [Fri, 14 Jun 2013 08:58:32 +0000 (17:58 +0900)]
net, scsi/cxgb4i: convert skb->transport_header into skb_transport_header(skb)

The change set of 1a37e412, "net: Use 16bits for *_headers fields
of struct skbuff" converted skb->transport_header from sk_buff_data_t
into a 16-bit integer. So skb->transport_header needs to be converted to
skb_transport_header(skb).

Found by inspection. Compile tested only.

Cc: Simon Horman <horms@verge.net.au>
Cc: Li RongQing <roy.qing.li@gmail.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net, atm/ambassador: convert skb->tail into skb_tail_pointer(skb)
Isaku Yamahata [Fri, 14 Jun 2013 08:58:31 +0000 (17:58 +0900)]
net, atm/ambassador: convert skb->tail into skb_tail_pointer(skb)

The change set of 27a884dc, "[SK_BUFF]: Convert skb->tail to sk_buff_data_t"
converted skb->tail from pointer into sk_buff_data_t. It missed skb->tail
in drivers/atm/ambassador.c.
This patch converts skb->tail into skb_tail_pointer(skb).

Found by inspection. Compile tested only.

Cc: Simon Horman <horms@verge.net.au>
Cc: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: sctp: remove SCTP_STATIC macro
Daniel Borkmann [Mon, 17 Jun 2013 09:40:05 +0000 (11:40 +0200)]
net: sctp: remove SCTP_STATIC macro

SCTP_STATIC is just another define for the static keyword. Its use
is inconsistent in the SCTP code anyway and it was introduced in the
initial implementation of SCTP in 2.5. We have a regression suite in
lksctp-tools, but this is for user space only, so no one makes use of
this macro anymore. The kernel test suite for 2.5 is incompatible with
the current SCTP code anyway.

So simply remove it, to be more consistent with the rest of the kernel
code.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: sctp: get rid of t_new macro for kzalloc
Daniel Borkmann [Mon, 17 Jun 2013 09:40:04 +0000 (11:40 +0200)]
net: sctp: get rid of t_new macro for kzalloc

t_new rather obfuscates things where everyone else is using actual
function names instead of that macro, so replace it with kzalloc,
which is the function t_new wraps.
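
The readability argument can be illustrated in user space. This is only a
sketch under assumptions: toy_t_new() is a hypothetical macro shaped like a
thin allocation wrapper, calloc() stands in for kzalloc(), and struct
toy_chunk is made up for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper macro: hides which allocator is actually called. */
#define toy_t_new(type) ((type *)calloc(1, sizeof(type)))

struct toy_chunk {
    int id;
    void *payload;
};

/* Allocation through the macro: the reader must look up what it expands to. */
static struct toy_chunk *chunk_new_macro(void)
{
    return toy_t_new(struct toy_chunk);
}

/* The same allocation spelled out with the real function name. */
static struct toy_chunk *chunk_new_plain(void)
{
    return calloc(1, sizeof(struct toy_chunk));
}

/* Both variants return zero-initialized memory. */
static int chunk_is_zeroed(const struct toy_chunk *c)
{
    assert(c != NULL);
    return c->id == 0 && c->payload == NULL;
}
```

Both functions behave identically; the second one simply shows the allocator
at the call site, which is the point of the cleanup.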

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago fec: Add support to restart autonegotiate
Chris Healy [Mon, 17 Jun 2013 14:25:06 +0000 (07:25 -0700)]
fec: Add support to restart autonegotiate

Add ethtool operation to restart autonegotiation via the PHY.

Tested on i.MX28EVK.

Signed-off-by: Chris Healy <cphealy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago bonding: don't call alb_set_slave_mac_addr() while atomic
Veaceslav Falico [Mon, 17 Jun 2013 17:30:35 +0000 (19:30 +0200)]
bonding: don't call alb_set_slave_mac_addr() while atomic

alb_set_slave_mac_addr() sets the mac address in alb mode via
dev_set_mac_address(), which might sleep. It's called from
alb_handle_addr_collision_on_attach() in atomic context (under
read_lock(bond->lock)), thus triggering a bug.

Fix this by moving the lock inside alb_handle_addr_collision_on_attach().

v1->v2:
As Nikolay Aleksandrov noticed, we can drop the bond->lock completely.
Also, use bond_slave_has_mac(), when possible.

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tg3: Prevent system hang during repeated EEH errors.
Michael Chan [Mon, 17 Jun 2013 20:47:25 +0000 (13:47 -0700)]
tg3: Prevent system hang during repeated EEH errors.

The current tg3 code assumes the pci_error_handlers to be always called
in sequence.  In particular, during ->error_detected(), NAPI is disabled
and the device is shut down.  The device is later reset and NAPI
re-enabled in ->slot_reset() and ->resume().

In EEH, if more than 6 errors are detected in an hour, only
->error_detected() will be called.  This will leave the driver in an
inconsistent state as NAPI is disabled but netif_running state is still
true.  When the device is later closed, we'll try to disable NAPI again
and it will loop forever.

We fix this by closing the device if we encounter any error conditions
during the normal sequence of the pci_error_handlers.

v2: Remove the changes in tg3_io_resume() based on Benjamin Poirier's
    feedback.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago Merge branch 'tipc'
David S. Miller [Mon, 17 Jun 2013 22:53:09 +0000 (15:53 -0700)]
Merge branch 'tipc'

Paul Gortmaker says:

====================
This is a rework of the content sent earlier[1], with the following changes:

-drop the Kconfig --> modparam conversion patch; this was
 requested to be replaced[2] with a dynamic port quantity resizing.
 Ying and Erik were discussing how best to achieve this, and then
 vacation schedules got in the way, so implementing that will
 come (hopefully) in the next round.

-rework the sk_rcvbuf patch to allow memory resizing via sysctl
 as per what Ying and Neil discussed[3]

-add 4 more seemingly straightforward and relatively small changes
 from Ying (the last 4 in the series).

-add cosmetic UAPI comment update patch from Ying.

That said, the largest change is still the one where we make use of
the fact that linux supports kernel threads and do the server like
operations within kernel threads.  As Jon says:

   We remove the last remnants of the TIPC native API, to make it
   possible to simplify locking policy and solve a problem with lost
   topology events.

   First, we introduce a socket-based alternative to the native API.

   Second, we convert the two remaining users of the native API, the
   TIPC internal topology server and the configuration server, to use the
   new API.

   Third, we remove the remaining code pertaining to the native API.

I have re-tested this collection of commits between 32 and 64 bit x86
machines using the standard tipc test suite, and build tested for ppc.

[1] http://patchwork.ozlabs.org/patch/247687/
[2] http://patchwork.ozlabs.org/patch/247680/
[3] http://patchwork.ozlabs.org/patch/247688/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: remove dev_base_lock use from enable_bearer
Ying Xue [Mon, 17 Jun 2013 14:54:51 +0000 (10:54 -0400)]
tipc: remove dev_base_lock use from enable_bearer

Convert enable_bearer() to RCU locking with dev_get_by_name().

Based on a similar changeset in commit 840a185d ["aoe: remove
dev_base_lock use from aoecmd_cfg_pkts()"] -- quoting that:

  "dev_base_lock is the legacy way to lock the device list,
   and is planned to disappear. (writers hold RTNL, readers
   hold RCU lock)"

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: fix wrong return value for link_send_sections_long routine
Ying Xue [Mon, 17 Jun 2013 14:54:50 +0000 (10:54 -0400)]
tipc: fix wrong return value for link_send_sections_long routine

When skb buffer cannot be allocated in link_send_sections_long(),
-ENOMEM error code instead of -EFAULT should be returned to its
caller.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: make tipc_link_send_sections_fast exit earlier
Ying Xue [Mon, 17 Jun 2013 14:54:49 +0000 (10:54 -0400)]
tipc: make tipc_link_send_sections_fast exit earlier

Once message build request function returns invalid code, the
process of sending message cannot continue. So in case of message
build failure, tipc_link_send_sections_fast() should return
immediately.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: enhance priority of link protocol packet
Ying Xue [Mon, 17 Jun 2013 14:54:48 +0000 (10:54 -0400)]
tipc: enhance priority of link protocol packet

pfifo_fast is set as the default traffic class queueing discipline. This
queue has three so-called "bands". Within each band, FIFO rules apply.
However, as long as there are packets waiting in band 0, band 1 won't
be processed.

Currently the priority of TIPC packets is never set, that is, their
priority is 0, so they are all mapped to band 1 of the pfifo_fast
qdisc. But, especially during link congestion, if link protocol packets
can be sent out earlier than other types of packets, they arrive at the
peer endpoint in time, and the peer resets its link timeout timer soon
enough to keep the link alive. So raising the priority of link protocol
packets avoids unnecessary link resets due to transient link
congestion.
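
The band selection can be sketched as a small lookup. The table below mirrors
the priority-to-band mapping of pfifo_fast as far as I know it; the TOY_*
names are hypothetical stand-ins for the kernel's TC_PRIO_* constants, and
the assumption that link protocol packets get the "control" priority is mine,
not stated in this message.

```c
#include <assert.h>

#define TOY_PRIO_MAX        15
#define TOY_PRIO_BESTEFFORT 0   /* priority of a packet that never set one */
#define TOY_PRIO_CONTROL    7   /* assumed priority for link protocol packets */

/* Priority-to-band table: band 0 is always served before band 1 and 2. */
static const unsigned char toy_prio2band[TOY_PRIO_MAX + 1] = {
    1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Return the band a packet with the given priority is queued on. */
static int toy_band_for_priority(unsigned int priority)
{
    int band = toy_prio2band[priority & TOY_PRIO_MAX];

    assert(band >= 0 && band <= 2);
    return band;
}
```

With this mapping a packet of priority 0 lands in band 1, while a packet
marked with the control priority lands in band 0 and is dequeued first.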

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: cosmetic realignment of function arguments
Paul Gortmaker [Mon, 17 Jun 2013 14:54:47 +0000 (10:54 -0400)]
tipc: cosmetic realignment of function arguments

No runtime code changes here.  Just a realign of the function
arguments to start where the 1st one was, and fit as many args
as can be put in an 80 char line.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: save sock structure pointer instead of void pointer to tipc_port
Ying Xue [Mon, 17 Jun 2013 14:54:46 +0000 (10:54 -0400)]
tipc: save sock structure pointer instead of void pointer to tipc_port

Directly save sock structure pointer instead of void pointer to avoid
unnecessary cast conversions.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: convert config_lock from spinlock to mutex
Ying Xue [Mon, 17 Jun 2013 14:54:45 +0000 (10:54 -0400)]
tipc: convert config_lock from spinlock to mutex

As the configuration server is now running under process context,
it's unnecessary for us to have a spinlock serializing the TIPC
configuration process. Instead, we replace it with a mutex lock,
which gives us more freedom. For instance, we can now call
pre-emptable functions within the protected area.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: rename tipc_createport_raw to tipc_createport
Ying Xue [Mon, 17 Jun 2013 14:54:44 +0000 (10:54 -0400)]
tipc: rename tipc_createport_raw to tipc_createport

After the removal of the native API, there is now only one way to
create a TIPC port instance -- the function tipc_createport_raw().
We make it more readable by renaming it to tipc_createport().

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: remove user_port instance from tipc_port structure
Ying Xue [Mon, 17 Jun 2013 14:54:43 +0000 (10:54 -0400)]
tipc: remove user_port instance from tipc_port structure

After the native API has been completely removed, the 'user_port'
field in struct tipc_port becomes unused, and can be removed.
As a consequence, the "usrmem" argument in tipc_msg_build() is no
longer needed, and so we remove that one too.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: delete code orphaned by new server infrastructure
Ying Xue [Mon, 17 Jun 2013 14:54:42 +0000 (10:54 -0400)]
tipc: delete code orphaned by new server infrastructure

Having completed the conversion of the topology server and
configuration server to use the new server infrastructure,
the following functions become unused, and can be deleted:

   - tipc_createport()
   - port_wakeup_sh()
   - port_dispatcher()
   - port_dispatcher_sigh()
   - tipc_send_buf_fast()
   - tipc_send_buf2port()

Additionally, the following variables become orphaned,
and can be deleted:

   - tipc_msg_err_event
   - tipc_named_msg_err_event
   - tipc_conn_shutdown_event
   - tipc_msg_event
   - tipc_named_msg_event
   - tipc_conn_msg_event
   - tipc_continue_event
   - msg_queue_head
   - msg_queue_tail
   - queue_lock

Deletion is done here in a separate commit in order to allow
the actual conversion changes to be more easily viewed.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: convert configuration server to use new server facility
Ying Xue [Mon, 17 Jun 2013 14:54:41 +0000 (10:54 -0400)]
tipc: convert configuration server to use new server facility

As the new socket-based TIPC server infrastructure has been
introduced, we can now convert the configuration server to use
it.  Then we can take future steps to simplify the configuration
server locking policy.

Some minor reordering of initialization is done, due to the
dependency on having tipc_socket_init completed.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: convert topology server to use new server facility
Ying Xue [Mon, 17 Jun 2013 14:54:40 +0000 (10:54 -0400)]
tipc: convert topology server to use new server facility

As the new TIPC server infrastructure has been introduced, we can
now convert the TIPC topology server to it.  We get two benefits
from doing this:

1) It simplifies the topology server locking policy.  In the
original locking policy, we placed one spin lock pointer in the
tipc_subscriber structure to reuse the lock of the subscriber's
server port, controlling access to members of tipc_subscriber
instance.  That is, we only used one lock to ensure both
tipc_port and tipc_subscriber members were safely accessed.

Now we introduce a separate spin lock in the tipc_subscriber
structure that protects only its own members, giving a finer-grained
locking policy.  Moreover, the change will allow us to make the
topology server code more readable and maintainable.

2) It fixes a bug where sent subscription events may be lost when
the topology port is congested.  Using the new service, the
topology server now queues sent events into an outgoing buffer,
and then wakes up a sender process which has been blocked in
workqueue context.  The process will keep picking events from the
buffer and send them to their respective subscribers, using the
kernel socket interface, until the buffer is empty. Even if the
socket is congested during transmission there is no risk that
events may be dropped, since the sender process may block when
needed.

Some minor reordering of initialization is done, since we now
have a scenario where the topology server must be started after
socket initialization has taken place, as the former depends
on the latter.  And overall, we see a simplification of the
TIPC subscriber code in making this changeover.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: introduce new TIPC server infrastructure
Ying Xue [Mon, 17 Jun 2013 14:54:39 +0000 (10:54 -0400)]
tipc: introduce new TIPC server infrastructure

TIPC has two internal servers, one providing a subscription
service for topology events, and another providing the
configuration interface. These servers have previously been running
in BH context, accessing the TIPC-port (aka native) API directly.
Apart from these servers, even the TIPC socket implementation is
partially built on this API.

As this API may simultaneously be called via different paths and in
different contexts, a complex and costly lock policy is required
in order to protect TIPC internal resources.

To eliminate the need for this complex lock policy, we introduce
a new, generic service API that uses kernel sockets for message
passing instead of the native API. Once the topology and configuration
servers are converted to use this new service, all code pertaining
to the native API can be removed. This entails a significant
reduction in code amount and complexity, and opens up for a complete
rework of the locking policy in TIPC.

The new service also solves another problem:

As the current topology server works in BH context, it cannot easily
be blocked when sending of events fails due to congestion. In such
cases events may have to be silently dropped, something that is
unacceptable. Therefore, the new service keeps a dedicated outbound
queue receiving messages from BH context. Once messages are
inserted into this queue, we will immediately schedule a work from a
special workqueue. This way, messages/events from the topology server
are in reality sent in process context, and the server can block
if necessary.

Analogously, there is a new workqueue for receiving messages. Once a
notification about an arriving message is received in BH context, we
schedule a work from the receive workqueue to do the job of
receiving the message in process context.

As both sending and receiving of messages are now done in process
context, subscribed events cannot be dropped any more.
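
The queue-then-work pattern described above can be sketched in user space.
This is a toy model, not the TIPC server code: struct outq, outq_push(),
outq_drain() and toy_send() are hypothetical names, the queue is a plain
ring buffer, and "congestion" is simulated with a flag.

```c
#include <assert.h>
#include <stddef.h>

#define OUTQ_CAP 8

/* Outbound queue filled by the producer (the BH-context side in the text). */
struct outq {
    int ev[OUTQ_CAP];
    size_t head;   /* index of the oldest queued event */
    size_t len;    /* number of queued events */
};

/* Queue an event; returns 0 on success, -1 if the queue is full. */
static int outq_push(struct outq *q, int event)
{
    if (q->len == OUTQ_CAP)
        return -1;
    q->ev[(q->head + q->len) % OUTQ_CAP] = event;
    q->len++;
    return 0;
}

/* Simulated transport: refuses to send while toy_congested is set. */
static int toy_congested;
static int toy_send(int event)
{
    (void)event;
    return toy_congested ? -1 : 0;
}

/*
 * Worker side: send queued events in order. On congestion the event stays
 * at the head of the queue, so it is retried later instead of being dropped.
 * Returns the number of events sent in this pass.
 */
static size_t outq_drain(struct outq *q, int (*send)(int event))
{
    size_t sent = 0;

    assert(send != NULL);
    while (q->len > 0) {
        if (send(q->ev[q->head]) != 0)
            break;
        q->head = (q->head + 1) % OUTQ_CAP;
        q->len--;
        sent++;
    }
    return sent;
}
```

In the real design the worker can block in process context; the sketch
models that by simply leaving events queued until the next drain pass.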

As of this commit, this new server infrastructure is built, but
not actually yet called by the existing TIPC code, but since the
conversion changes required in order to use it are significant,
the addition is kept here as a separate commit.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: allow implicit connect for stream sockets
Erik Hugne [Mon, 17 Jun 2013 14:54:38 +0000 (10:54 -0400)]
tipc: allow implicit connect for stream sockets

TIPC's implied connect feature, aka piggyback connect, allows
applications to save one syscall and all SYN/SYN-ACK signalling
overhead when setting up a connection.  Until now, this has only
been supported for SEQPACKET sockets.  Here, we make it possible
to use this feature even with stream sockets.

At the connecting side, the connection is completed when the
first data message arrives from the accepting peer.  This means
that we must allow the connecting user to call blocking recv()
before the socket has reached state SS_CONNECTED.  So we must
relax the state machine check at recv_stream(), and allow the
recv() call even if socket is in state SS_CONNECTING.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: change socket buffer overflow control to respect sk_rcvbuf
Ying Xue [Mon, 17 Jun 2013 14:54:37 +0000 (10:54 -0400)]
tipc: change socket buffer overflow control to respect sk_rcvbuf

As per feedback from the netdev community, we change the buffer
overflow protection algorithm in receiving sockets so that it
always respects the nominal upper limit set in sk_rcvbuf.

Instead of scaling up from a small sk_rcvbuf value, which leads to
violation of the configured sk_rcvbuf limit, we now calculate the
weighted per-message limit by scaling down from a much bigger value,
still in the same field, according to the importance priority of the
received message.

To allow for administrative tunability of the socket receive buffer
size, we create a tipc_rmem sysctl variable to allow the user to
configure an even bigger value via the sysctl command.  It is an array
of three values (min/default/max), to be consistent with things like
tcp_rmem.

By default, the value initialized in tipc_rmem[1] is equal to the
receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
This value is also set as the default value of sk_rcvbuf.
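
The scaling-down idea can be sketched as follows. The shift-based formula
and the four importance levels (0 to 3, with "critical" as the highest) are
my assumptions about how such a weighted limit could be computed; the names
are hypothetical and this is not taken from the patch itself.

```c
#include <assert.h>

/* Assumed importance levels, lowest to highest. */
enum toy_importance {
    TOY_LOW = 0,
    TOY_MEDIUM = 1,
    TOY_HIGH = 2,
    TOY_CRITICAL = 3
};

/*
 * Per-message limit derived from the configured receive buffer size.
 * The limit is scaled down from rcvbuf, so it never exceeds rcvbuf:
 * critical messages may use all of it, low importance ones one eighth.
 */
static unsigned int toy_rcvbuf_limit(unsigned int rcvbuf,
                                     enum toy_importance imp)
{
    assert((int)imp >= TOY_LOW && (int)imp <= TOY_CRITICAL);
    return (rcvbuf >> TOY_CRITICAL) << imp;
}
```

Each step up in importance doubles the allowance, and the highest level
reaches exactly the configured size instead of overshooting it.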

Originally-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
[Ying: added sysctl variation to Jon's original patch]
Signed-off-by: Ying Xue <ying.xue@windriver.com>
[PG: don't compile sysctl.c if not config'd; add Documentation]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago tipc: update code comments to reflect new uapi header path
Ying Xue [Mon, 17 Jun 2013 14:54:36 +0000 (10:54 -0400)]
tipc: update code comments to reflect new uapi header path

Files tipc.h and tipc_config.h were moved to uapi directory, but
the corresponding comments were not updated at the same time.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: add socket option for low latency polling
Eliezer Tamir [Fri, 14 Jun 2013 13:33:57 +0000 (16:33 +0300)]
net: add socket option for low latency polling

Add a socket option for low latency polling.
This allows overriding the global sysctl value with a per-socket one.
Unexport sysctl_net_ll_poll since for now it's not needed in modules.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: remove NET_LL_RX_POLL config menu
Eliezer Tamir [Fri, 14 Jun 2013 13:33:46 +0000 (16:33 +0300)]
net: remove NET_LL_RX_POLL config menu

Remove NET_LL_RX_POLL from the config menu.
Change default to y.
Busy polling still needs to be enabled at run time.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: convert low latency sockets to sched_clock()
Eliezer Tamir [Fri, 14 Jun 2013 13:33:35 +0000 (16:33 +0300)]
net: convert low latency sockets to sched_clock()

Use sched_clock() instead of get_cycles().
We can use sched_clock() because we don't care much about accuracy.
Remove the dependency on X86_TSC

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: change sysctl_net_ll_poll into an unsigned int
Eliezer Tamir [Fri, 14 Jun 2013 13:33:25 +0000 (16:33 +0300)]
net: change sysctl_net_ll_poll into an unsigned int

There is no reason for sysctl_net_ll_poll to be an unsigned long.
Change it into an unsigned int.
Fix the proc handler.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: sctp: sctp_association_init: put refs in reverse order
Daniel Borkmann [Fri, 14 Jun 2013 16:24:07 +0000 (18:24 +0200)]
net: sctp: sctp_association_init: put refs in reverse order

In case we need to bail out for whatever reason during assoc
init, we call sctp_endpoint_put() and then sock_put(). However,
we took both refs in that same order, i.e. first
sctp_endpoint_hold() and then sock_hold(), so the release is not
symmetric to the acquisition.

Reverse this, so that in an error case we have sock_put() and then
sctp_endpoint_put(). Actually shouldn't matter too much, since both
cleanup paths do the right thing, but that way, it is more consistent
with the rest of the code.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: sctp: minor: remove variable in sctp_init_sock
Daniel Borkmann [Fri, 14 Jun 2013 16:24:06 +0000 (18:24 +0200)]
net: sctp: minor: remove variable in sctp_init_sock

It's only used at this one time, so we could remove it as well.
This is valid and also makes it more explicit/obvious that in case
of error the sp->ep is NULL here, i.e. for the sctp_destroy_sock()
check that was recently added.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: sctp: sctp_sf_do_prm_asoc: do SCTP_CMD_INIT_CHOOSE_TRANSPORT first
Daniel Borkmann [Fri, 14 Jun 2013 16:24:05 +0000 (18:24 +0200)]
net: sctp: sctp_sf_do_prm_asoc: do SCTP_CMD_INIT_CHOOSE_TRANSPORT first

While this currently cannot trigger any NULL pointer dereference in
sctp_seq_dump_local_addrs(), it is better to change the order of commands
to prevent a future bug. Although we first add SCTP_CMD_NEW_ASOC
and then set SCTP_CMD_INIT_CHOOSE_TRANSPORT, this is okay for now,
since this primitive is only called by sctp_connect() or sctp_sendmsg()
with sctp_assoc_add_peer() set first. However, let's take this precaution
and first set the transport and then add it to the association hashlist,
so that nothing can trigger this in the future.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: sctp: sideeffect: throw BUG if primary_path is NULL
Daniel Borkmann [Fri, 14 Jun 2013 16:24:04 +0000 (18:24 +0200)]
net: sctp: sideeffect: throw BUG if primary_path is NULL

This clearly states a BUG somewhere in the SCTP code as e.g. fixed once
in f28156335 ("sctp: Use correct sideffect command in duplicate cookie
handling"). If this ever happens, throw a trace in the sideeffect engine
where assocs clearly must have a primary_path assigned.

When in sctp_seq_dump_local_addrs(), also throw a WARN and bail out, since
we do not need to panic for printing this one asterisk. This also avoids
the not so obvious case where the primary != NULL test passes and a NULL
pointer dereference caused by primary is triggered at a later point in time.
While at it, also fix up the white space.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
David S. Miller [Fri, 14 Jun 2013 22:31:22 +0000 (15:31 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch

Jesse Gross says:

====================
A few miscellaneous improvements and cleanups before the GRE tunnel
integration series. Intended for net-next/3.11.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago openvswitch: Simplify interface ovs_flow_metadata_from_nlattrs()
Pravin B Shelar [Thu, 13 Jun 2013 18:11:32 +0000 (11:11 -0700)]
openvswitch: Simplify interface ovs_flow_metadata_from_nlattrs()

This is not a functional change; it is just a code cleanup.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years ago openvswitch: make skb->csum consistent with rest of networking stack.
Pravin B Shelar [Thu, 13 Jun 2013 18:11:44 +0000 (11:11 -0700)]
openvswitch: make skb->csum consistent with rest of networking stack.

The following patch keeps skb->csum correct across ovs.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years ago openvswitch: Fix struct comment.
Pravin B Shelar [Wed, 12 Jun 2013 22:57:10 +0000 (15:57 -0700)]
openvswitch: Fix struct comment.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years ago openvswitch: Fix misspellings in comments and docs.
Andy Hill [Fri, 7 Jun 2013 23:53:50 +0000 (16:53 -0700)]
openvswitch: Fix misspellings in comments and docs.

Flagged with: https://github.com/lyda/misspell-check
Run with: git ls-files | misspellings -f -

Signed-off-by: Andy Hill <hillad@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years ago openvswitch: fix variable names in comment
Lorand Jakab [Mon, 3 Jun 2013 17:01:14 +0000 (10:01 -0700)]
openvswitch: fix variable names in comment

Signed-off-by: Lorand Jakab <lojakab@cisco.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years agoopenvswitch: Unify vport error stats handling.
Pravin B Shelar [Mon, 13 May 2013 15:22:34 +0000 (08:22 -0700)]
openvswitch: Unify vport error stats handling.

This patch changes the vport->send return type so that the vport
layer can do error accounting.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years agoopenvswitch: Remove unused get_config vport op.
Jesse Gross [Mon, 13 May 2013 15:16:29 +0000 (08:16 -0700)]
openvswitch: Remove unused get_config vport op.

The get_config vport op is left over from old compatibility code;
it is neither used nor implemented any more.

Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years agoopenvswitch: Immediately exit on error in ovs_vport_cmd_set().
Jesse Gross [Mon, 13 May 2013 15:15:26 +0000 (08:15 -0700)]
openvswitch: Immediately exit on error in ovs_vport_cmd_set().

It is an error to try to change the type of a vport using the set
command. However, while we check that this is an error, we still
proceed to allocate memory which then gets freed immediately.
This stops processing after noticing the error, which does not
actually fix a bug but is more correct.

Signed-off-by: Jesse Gross <jesse@nicira.com>
11 years agonet/mlx4: Add VF link state support
Rony Efraim [Thu, 13 Jun 2013 10:19:11 +0000 (13:19 +0300)]
net/mlx4: Add VF link state support

Add support to change the link state of VF (vPort)

Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet/core: Add VF link state control
Rony Efraim [Thu, 13 Jun 2013 10:19:10 +0000 (13:19 +0300)]
net/core: Add VF link state control

Add netlink directives and an ndo entry to allow for controlling
the VF link, which can be in one of three states:

Auto - VF link state reflects the PF link state (default)

Up - VF link state is up, traffic from VF to VF works even if
the actual PF link is down

Down - VF link state is down, no traffic from/to this VF, can be of
use while configuring the VF
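
The three states above can be sketched as a small decision function. This
is only an illustration of the described semantics: the enum and function
names are hypothetical and are not the identifiers the patch adds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's identifiers. */
enum vf_link_state {
    VF_LINK_AUTO,  /* follow the PF link state (default) */
    VF_LINK_UP,    /* report up even if the PF link is down */
    VF_LINK_DOWN   /* report down regardless of the PF link */
};

/* Return the link status a VF would observe under each policy. */
static bool vf_link_is_up(enum vf_link_state state, bool pf_link_up)
{
    switch (state) {
    case VF_LINK_UP:
        return true;
    case VF_LINK_DOWN:
        return false;
    case VF_LINK_AUTO:
    default:
        return pf_link_up;
    }
}
```

Only the Auto state consults the PF link; the other two are fixed
administrative overrides.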

Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agobcm63xx_enet: add support Broadcom BCM6345 Ethernet
Florian Fainelli [Wed, 12 Jun 2013 19:53:05 +0000 (20:53 +0100)]
bcm63xx_enet: add support Broadcom BCM6345 Ethernet

This patch adds support for the Broadcom BCM6345 SoC Ethernet. BCM6345
has a slightly different and older DMA engine which requires the
following modifications:

- the width of the DMA channels on BCM6345 is 64 bytes vs 16 bytes,
  which means that the helpers enet_dma{c,s} need to account for this
  channel width and we can no longer use macros

- BCM6345 DMA engine does not have any internal SRAM for transferring
  buffers

- BCM6345 buffer allocation and flow control is not per-channel but
  global (done in RSET_ENETDMA)

- the DMA engine bits are right-shifted by 3 compared to other DMA
  generations

- the DMA enable/interrupt masks are a little different (we need to
  enable more bits for 6345)

- some registers have the same meaning but are at different offsets in
  the ENET_DMAC space, so a lookup table is required to return the
  proper offset

The MAC itself is identical and requires no modifications to work.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agohtb: reorder struct htb_class fields for performance
Eric Dumazet [Thu, 13 Jun 2013 14:58:30 +0000 (07:58 -0700)]
htb: reorder struct htb_class fields for performance

htb_class structures are big, and a source of false sharing on SMP.

By carefully splitting them in two parts, we can improve performance.

I got a 9% performance increase on a 24-thread machine, with 200
concurrent netperf instances in TCP_RR mode, using an HTB hierarchy
of 4 classes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet-rps: fixes for rps flow limit
Willem de Bruijn [Thu, 13 Jun 2013 19:29:38 +0000 (15:29 -0400)]
net-rps: fixes for rps flow limit

Caught by sparse:
- __rcu: missing annotation to sd->flow_limit
- __user: direct access in cpumask_scnprintf

Also
- add endline character when printing bitmap if room in buffer
- avoid bucket overflow by reducing FLOW_LIMIT_HISTORY

The last item warrants some explanation. The hashtable buckets are
subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
to bucket size, since all packets may end up in a single bucket. The
current (rather arbitrary) history value of 256 happens to match the
number of values a bucket counter (u8) can hold.

As a result, with a single flow, the first 128 packets are accepted
(correct), the second 128 packets dropped (correct) and then the
history[] array has filled, so that each subsequent new packet
causes an increment in the bucket for new_flow plus a decrement
for old_flow: a steady state.

This is fine if packets are dropped, as the steady state goes away
as soon as a mix of traffic reappears. But, because the 256th packet
overflowed the bucket to 0: no packets are dropped.

Instead of explicitly adding an overflow check, this patch changes
FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.
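
The overflow described above can be reproduced with a small userspace
model. This is a sketch under assumptions, not the kernel code: the
function name and the two-bucket table are hypothetical, and a single
flow always hashes to bucket 1. With a history of 256 entries the u8
counter wraps from 255 to 0 on the 256th packet and dropping stops; with
a history of 128 it cannot wrap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_HISTORY 256

/*
 * Count how many of n packets from one flow are dropped for a given
 * history length (a power of two, at most MAX_HISTORY).
 */
static unsigned int single_flow_drops(unsigned int history_len,
                                      unsigned int n)
{
    uint8_t history[MAX_HISTORY];
    uint8_t buckets[2] = { 0, 0 };
    const uint8_t new_flow = 1;     /* the single flow's bucket */
    unsigned int head = 0, drops = 0;

    memset(history, 0, sizeof(history));
    for (unsigned int i = 0; i < n; i++) {
        uint8_t old_flow = history[head];

        history[head] = new_flow;
        head = (head + 1) & (history_len - 1);
        if (buckets[old_flow])
            buckets[old_flow]--;
        buckets[new_flow]++;        /* u8: 255 + 1 wraps to 0 */
        if (buckets[new_flow] > history_len / 2)
            drops++;
    }
    return drops;
}
```

In this model a history of 256 stops dropping after the wrap, while a
history of 128 keeps dropping every packet beyond the first 64.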

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
(first item)

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agotcp: properly send new data in fast recovery in first RTT
Yuchung Cheng [Tue, 11 Jun 2013 22:35:32 +0000 (15:35 -0700)]
tcp: properly send new data in fast recovery in first RTT

Linux sends new unsent data during the disorder and recovery states if
all (suspected) lost packets have been retransmitted (RFC5681, section
3.2 step 1 & 2, RFC3517 section 4, NextSeg() Rule 2).  One requirement
is to keep the receive window about twice the estimated sender's
congestion window (tcp_rcv_space_adjust()), assuming the fast
retransmits repair the losses in the next round trip.

But currently this is not the case on the first round trip in either
normal or Fast Open connections, because the initial receive window
is identical to the (expected) sender's initial congestion window. The
fix is to double it.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agosh_eth: remove '__maybe_unused' annotations
Sergei Shtylyov [Tue, 11 Jun 2013 23:07:29 +0000 (03:07 +0400)]
sh_eth: remove '__maybe_unused' annotations

Now that the SoC-specific support is no longer done with the help of #ifdef'fery,
we no longer need the '__maybe_unused' annotations on sh_eth_select_mii() and
sh_eth_set_duplex()...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: Convert uses of typedef ctl_table to struct ctl_table
Joe Perches [Wed, 12 Jun 2013 06:04:25 +0000 (23:04 -0700)]
net: Convert uses of typedef ctl_table to struct ctl_table

Reduce the uses of this unnecessary typedef.

Done via perl script:

$ git grep --name-only -w ctl_table net | \
  xargs perl -p -i -e '\
sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
        s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

Reflow the modified lines that now exceed 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: make all team port device link events urgent
Flavio Leitner [Tue, 11 Jun 2013 21:09:29 +0000 (23:09 +0200)]
net: make all team port device link events urgent

Since team functionality relies heavily on a userspace daemon, we need to
deliver events to userspace via Netlink as quickly as possible. So make all
team port device link events urgent.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: ping_check_bind_addr() etc. can be static
Wu Fengguang [Wed, 12 Jun 2013 13:04:16 +0000 (21:04 +0800)]
net: ping_check_bind_addr() etc. can be static

net/ipv4/ping.c:286:5: sparse: symbol 'ping_check_bind_addr' was not declared. Should it be static?
net/ipv4/ping.c:355:6: sparse: symbol 'ping_set_saddr' was not declared. Should it be static?
net/ipv4/ping.c:370:6: sparse: symbol 'ping_clear_saddr' was not declared. Should it be static?

net/ipv6/ping.c:60:5: sparse: symbol 'dummy_ipv6_recv_error' was not declared. Should it be static?
net/ipv6/ping.c:64:5: sparse: symbol 'dummy_ip6_datagram_recv_ctl' was not declared. Should it be static?
net/ipv6/ping.c:69:5: sparse: symbol 'dummy_icmpv6_err_convert' was not declared. Should it be static?
net/ipv6/ping.c:73:6: sparse: symbol 'dummy_ipv6_icmp_error' was not declared. Should it be static?
net/ipv6/ping.c:75:5: sparse: symbol 'dummy_ipv6_chk_addr' was not declared. Should it be static?
net/ipv6/ping.c:201:5: sparse: symbol 'ping_v6_seq_show' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agocxgb4: Do not set net_device::dev_id to VI index
Ben Hutchings [Mon, 10 Jun 2013 16:34:13 +0000 (17:34 +0100)]
cxgb4: Do not set net_device::dev_id to VI index

net_device::dev_id should not be used merely to indicate a VI index,
as it affects the way the local part of IPv6 addresses is normally
generated.

This field was intended for use where multiple devices may share a
single assigned MAC address and need to have different IPv6 addresses.
T4 VIs each have their own MAC address.
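
Why dev_id changes the generated address can be sketched as follows. This
is a userspace model based on my recollection of how the kernel derives
an interface identifier from a 48-bit MAC address, so treat the exact
byte layout as an assumption; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build an 8-byte interface identifier from a 6-byte MAC address.
 * dev_id == 0: modified EUI-64 (insert FF:FE, flip the U/L bit).
 * dev_id != 0: the two middle bytes carry dev_id instead (assumed
 * behaviour, intended for devices sharing one MAC address).
 */
static void ifid_from_mac(uint8_t eui[8], const uint8_t mac[6],
                          uint16_t dev_id)
{
    memcpy(eui, mac, 3);
    memcpy(eui + 5, mac + 3, 3);
    if (dev_id) {
        eui[3] = (dev_id >> 8) & 0xFF;
        eui[4] = dev_id & 0xFF;
    } else {
        eui[3] = 0xFF;
        eui[4] = 0xFE;
        eui[0] ^= 2;
    }
}
```

Under this model, a driver that stores a plain index in dev_id gets a
non-standard interface identifier even though its MAC address is unique.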

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agomacvtap: fix uninitialized return value macvtap_ioctl_set_queue()
Jason Wang [Thu, 13 Jun 2013 06:23:36 +0000 (14:23 +0800)]
macvtap: fix uninitialized return value macvtap_ioctl_set_queue()

Return -EINVAL on illegal flag instead of uninitialized value. This fixes the
kbuild test warning.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agomacvtap: slient sparse warnings
Jason Wang [Thu, 13 Jun 2013 06:23:35 +0000 (14:23 +0800)]
macvtap: slient sparse warnings

This patch silences the following sparse warnings:

drivers/net/macvtap.c:98:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:98:9:    expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:98:9:    got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:120:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:120:9:    expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:120:9:    got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:151:22: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:233:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:243:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:247:15: error: incompatible types in comparison expression
(different address spaces)
  CC [M]  drivers/net/macvtap.o
drivers/net/macvlan.c:232:24: error: incompatible types in comparison expression
(different address spaces)

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agosctp: Correct byte order of access to skb->{network, transport}_header
Simon Horman [Thu, 13 Jun 2013 07:04:33 +0000 (16:04 +0900)]
sctp: Correct byte order of access to skb->{network, transport}_header

Corrects a byte order conflict introduced by
158874cac61245b84e939c92c53db7000122b7b0
("sctp: Correct access to skb->{network, transport}_header").
The values in question are host byte order.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonetlink: make compare exist all the time
Gao feng [Thu, 13 Jun 2013 02:05:38 +0000 (10:05 +0800)]
netlink: make compare exist all the time

Commit da12c90e099789a63073fc82a19542ce54d4efb9
"netlink: Add compare function for netlink_table"
only sets compare when the kernel netlink socket is created,
and resets compare to NULL when the netlink socket is finally
released, but netlink_lookup() needs compare to exist at all
times.

So set compare right after nl_table is allocated and never
reset it, so that compare exists all the time.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: add doc for ip_early_demux sysctl
Cong Wang [Tue, 11 Jun 2013 10:54:39 +0000 (18:54 +0800)]
net: add doc for ip_early_demux sysctl

Commit 6648bd7e0e62c0c8c03b (ipv4: Add sysctl knob to control
early socket demux) introduced this sysctl but did not document it
in Documentation/networking/ip-sysctl.txt. This patch adds the doc.

The text is taken from the descriptions of commit 41063e9dd11956f2d285
(ipv4: Early TCP socket demux.) and the above commit.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agotun: Turn tun_flow_init() into void fn
Pavel Emelyanov [Tue, 11 Jun 2013 13:01:08 +0000 (17:01 +0400)]
tun: Turn tun_flow_init() into void fn

This routine doesn't fail since 9fdc6bef (tuntap: dont use a private kmem_cache)
so it makes sense to compact the code a little bit.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agotun: Report "persist" flag to userspace
Pavel Emelyanov [Tue, 11 Jun 2013 10:41:24 +0000 (14:41 +0400)]
tun: Report "persist" flag to userspace

The TUN_PERSIST flag is not reported at all: both TUNGETIFF and the sysfs
"flags" attribute skip it. Knowing whether a device is persistent or not
is critical for checkpoint-restore, so I propose adding the read-only
IFF_PERSIST flag for this.

Setting this new IFF_PERSIST is hardly possible, as TUNSETIFF doesn't check
that unknown flags are zero, so they may contain garbage.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoudp: fix two sparse errors
Eric Dumazet [Wed, 12 Jun 2013 21:31:39 +0000 (14:31 -0700)]
udp: fix two sparse errors

Commit ba418fa357a7b3c ("soreuseport: UDP/IPv4 implementation")
added the following sparse errors:

net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:433:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:433:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:514:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:514:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agogro: remove a sparse error
Eric Dumazet [Wed, 12 Jun 2013 21:23:15 +0000 (14:23 -0700)]
gro: remove a sparse error

Fix the following sparse error:

net/ipv4/af_inet.c:1410:59: warning: restricted __be16 degrades to
integer

added in commit db8caf3dbc77599
("gro: should aggregate frames without DF")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
From: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville...
David S. Miller [Wed, 12 Jun 2013 21:23:41 +0000 (14:23 -0700)]
Merge branch 'for-davem' of git://git./linux/kernel/git/linville/wireless-next

John W. Linville says:

====================
This pull request is intended for the 3.11 stream...

One big highlight is the cw1200 driver for the ST-E CW1100 & CW1200
WLAN chipsets.  This one has been lingering for a while, lacking
some review comments.  Once it started getting pulled into linux-next,
it got a bit more attention and a number of improvements were made
over the initial cut.  No doubt there will be more changes ahead,
but I think it is looking alright at this point.

Along with that, there is the usual flurry of updates to the mac80211
core and the iwlwifi, mwifiex, ath9k, rt2x00, wil6210, and other
drivers.  A few of the highlights are some rt2x00 refactoring/cleanup
by Gabor Juhos, some rt2800 hardware support enhancements by Stanislaw
Gruszka, some iwlwifi power management updates from Alexander Bondar,
some enhanced bcma SPROM support from Rafał Miłecki, and a variety
of other things here and there.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agosh_eth: split 'sh_eth_netdev_ops'
Sergei Shtylyov [Wed, 12 Jun 2013 20:55:34 +0000 (00:55 +0400)]
sh_eth: split 'sh_eth_netdev_ops'

Commit 9f86134155047720a3685cda21467f68695152d2 (sh_eth: remove SH_ETH_HAS_TSU)
removes 'const' from 'sh_eth_netdev_ops' and modifies it in case TSU registers
are present. I originally suggested to Iwamatsu-san to split this structure
in two instead, and afterwards Dave M. suggested doing the same.
Split 'sh_eth_netdev_ops_tsu' from 'sh_eth_netdev_ops', making both 'const', and
assign 'ndev->netdev_ops' depending on the presence of TSU registers.

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoigmp: fix new sparse errors
Eric Dumazet [Wed, 12 Jun 2013 21:11:16 +0000 (14:11 -0700)]
igmp: fix new sparse errors

Fix the following sparse errors:

net/ipv4/igmp.c:1222:25: warning: cast from restricted __be32
net/ipv4/igmp.c:1234:31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c:1234:31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c:1234:31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:1250:31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c:1250:31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c:1250:31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:2380:37: warning: cast from restricted __be32

These were added by commit e9897071350bd9
("igmp: hash a hash table to speedup ip_check_mc_rcu()")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agogianfar: Add backwards compatible Single Queue mode polling
Claudiu Manoil [Mon, 10 Jun 2013 17:19:48 +0000 (20:19 +0300)]
gianfar: Add backwards compatible Single Queue mode polling

Older Single Queue (SQ_SG_MODE) devices like TSEC (i.e. mpc83xx)
don't feature the frame receive indication bits (RXF) in RSTAT.
For these and for the rest of the SQ_SG_MODE devices, provide the
appropriate polling routine that handles a single pair of Rx/Tx
BD rings, removing the overhead incurred by the multiple queues/
multiple interrupt group devices (veTSEC/ eTSEC2.0 devices).
So this is primarily a fix for the TSEC devices.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agosfc: Store port number in private data, not net_device::dev_id
Ben Hutchings [Mon, 10 Jun 2013 17:03:17 +0000 (18:03 +0100)]
sfc: Store port number in private data, not net_device::dev_id

We should not use net_device::dev_id to indicate the port number, as
this affects the way the local part of IPv6 addresses is normally
generated.

This field was intended for use where multiple devices may share a
single assigned MAC address and need to have different IPv6 addresses.
Siena's two ports each have their own MAC addresses.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoipv4: remove is_data also from ip_options documentation.
Rami Rosen [Mon, 10 Jun 2013 15:58:16 +0000 (18:58 +0300)]
ipv4: remove is_data also from ip_options documentation.

commit ef722495c8867aacc1db0675a6737e5cf1e72e07
( [IPV4]: Remove unused ip_options->is_data) removed the unused is_data
member from ip_options struct.

This patch removes is_data also from the documentation of the ip_options struct.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoteam: remove synchronize_rcu() called during port disable
Jiri Pirko [Mon, 10 Jun 2013 15:42:25 +0000 (17:42 +0200)]
team: remove synchronize_rcu() called during port disable

Check the unlikely case of team->en_port_count == 0 before modulo
operation.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoteam: use kfree_rcu instead of synchronize_rcu in team_port_dev
Jiri Pirko [Mon, 10 Jun 2013 15:42:24 +0000 (17:42 +0200)]
team: use kfree_rcu instead of synchronize_rcu in team_port_dev

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoteam: remove synchronize_rcu() called during queue override change
Jiri Pirko [Mon, 10 Jun 2013 15:42:23 +0000 (17:42 +0200)]
team: remove synchronize_rcu() called during queue override change

This patch removes synchronize_rcu() from function
__team_queue_override_port_del(). That can be done because it is ok to
do list_del_rcu() and list_add_tail_rcu() on the same list_head member
without calling synchronize_rcu() in between. A bit of refactoring
needed to be done because INIT_LIST_HEAD needed to be removed (to not
kill the forward pointer) as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agodoc:networking: Update comment for dev_id field in netdevice.h
Narendra K [Mon, 10 Jun 2013 14:04:03 +0000 (19:34 +0530)]
doc:networking: Update comment for dev_id field in netdevice.h

This patch updates the comment for 'dev_id' field in
'include/linux/netdevice.h' to reflect the intended
usage of 'dev_id'.

References: http://marc.info/?l=linux-netdev&m=136992115300526&w=2
References: http://marc.info/?l=linux-netdev&m=137062569014612&w=2

Signed-off-by: Narendra K <narendra_k@dell.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: can: Convert to use devm_ioremap_resource
Tushar Behera [Mon, 10 Jun 2013 11:35:07 +0000 (17:05 +0530)]
net: can: Convert to use devm_ioremap_resource

Commit 75096579c3ac ("lib: devres: Introduce devm_ioremap_resource()")
introduced devm_ioremap_resource() and deprecated the use of
devm_request_and_ioremap().
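
One practical consequence of the conversion is the error check: the older
helper reported failure with NULL, while devm_ioremap_resource() returns
an ERR_PTR() value, so callers test with IS_ERR(). The userspace sketch
below imitates that pointer-encoded error convention. The three helpers
mirror the kernel's names but are re-implemented here for illustration,
and map_region() is a hypothetical stand-in for the mapping call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top MAX_ERRNO addresses, i.e. is an error. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical mapping call that reports failure via ERR_PTR(). */
static void *map_region(int ok)
{
    static char region[16];

    return ok ? (void *)region : ERR_PTR(-ENOMEM);
}
```

A NULL test on such a return value never fires, because the error
pointer is non-NULL; that is why converted drivers switch to IS_ERR()
and PTR_ERR().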

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: netdev@vger.kernel.org
CC: linux-can@vger.kernel.org
CC: Marc Kleine-Budde <mkl@pengutronix.de>
CC: Wolfgang Grandegger <wg@grandegger.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: emaclite: Convert to use devm_ioremap_resource
Tushar Behera [Mon, 10 Jun 2013 11:35:06 +0000 (17:05 +0530)]
net: emaclite: Convert to use devm_ioremap_resource

Commit 75096579c3ac ("lib: devres: Introduce devm_ioremap_resource()")
introduced devm_ioremap_resource() and deprecated the use of
devm_request_and_ioremap().

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: netdev@vger.kernel.org
CC: "David S. Miller" <davem@davemloft.net>
CC: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: fec: Convert to use devm_ioremap_resource
Tushar Behera [Mon, 10 Jun 2013 11:35:05 +0000 (17:05 +0530)]
net: fec: Convert to use devm_ioremap_resource

Commit 75096579c3ac ("lib: devres: Introduce devm_ioremap_resource()")
introduced devm_ioremap_resource() and deprecated the use of
devm_request_and_ioremap().

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: netdev@vger.kernel.org
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago3c59x: consolidate error cleanup in vortex_init_one()
Sergei Shtylyov [Sun, 9 Jun 2013 20:16:52 +0000 (00:16 +0400)]
3c59x: consolidate error cleanup in vortex_init_one()

The PCI driver's probe() method duplicates the error cleanup code each time it
has to exit on error. Consolidate the error cleanup code in one place and use
*goto* to jump to the right places.
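
The consolidated cleanup style described above can be sketched in
userspace C. This is a generic illustration of the goto-unwind idiom, not
the 3c59x code: the structure, the function name and the fail_at
parameter are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device state holding three resources. */
struct dev_sketch {
    void *regs;
    void *ring;
    void *stats;
};

/*
 * Acquire three resources in order; on failure, unwind in reverse
 * through a single chain of labels. fail_at selects which step is
 * made to fail (0 = none).
 */
static int probe_sketch(struct dev_sketch *d, int fail_at)
{
    int rc = -ENOMEM;

    d->regs = (fail_at == 1) ? NULL : malloc(16);
    if (!d->regs)
        goto out;

    d->ring = (fail_at == 2) ? NULL : malloc(16);
    if (!d->ring)
        goto free_regs;

    d->stats = (fail_at == 3) ? NULL : malloc(16);
    if (!d->stats)
        goto free_ring;

    return 0;

free_ring:
    free(d->ring);
    d->ring = NULL;
free_regs:
    free(d->regs);
    d->regs = NULL;
out:
    return rc;
}
```

Each label releases exactly one resource and falls through to the next,
so every error exit shares the same cleanup code.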

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Steffen Klassert <klassert@mathematik.tu-chemnitz.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoveth: remove redundant call of dev_alloc_name
Hong zhi guo [Sun, 9 Jun 2013 12:15:20 +0000 (20:15 +0800)]
veth: remove redundant call of dev_alloc_name

dev_alloc_name() is called in the subsequent register_netdevice(), so
there is no need to call it here.
Tested with "ip link add type veth" and "ip link add xxx%d type veth".

Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agopktgen: ipv6: numa: consolidate skb allocation to pktgen_alloc_skb
Daniel Borkmann [Sat, 8 Jun 2013 12:18:16 +0000 (14:18 +0200)]
pktgen: ipv6: numa: consolidate skb allocation to pktgen_alloc_skb

We currently allow for numa-node aware skb allocation only within the
fill_packet_ipv4() path, but not in fill_packet_ipv6(). Consolidate that
code to a common allocation helper to enable numa-node aware skb
allocation for ipv6, and use it in both paths. This also makes both
functions a bit more readable.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: udp4: move GSO functions to udp_offload
Daniel Borkmann [Sat, 8 Jun 2013 10:56:03 +0000 (12:56 +0200)]
net: udp4: move GSO functions to udp_offload

Similarly to TCP offloading and UDPv6 offloading, move all related
UDPv4 functions to udp_offload.c to make things more explicit. Also,
by this, we can make those functions static.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoigmp: remove unnecessary in_device member zeroing
Shawn Bohrer [Fri, 7 Jun 2013 17:34:43 +0000 (12:34 -0500)]
igmp: remove unnecessary in_device member zeroing

ip_mc_init_dev() is passed a freshly kzalloc'd in_device so it is
unnecessary to explicitly zero out the members.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoigmp: hash a hash table to speedup ip_check_mc_rcu()
Eric Dumazet [Fri, 7 Jun 2013 15:48:57 +0000 (08:48 -0700)]
igmp: hash a hash table to speedup ip_check_mc_rcu()

After IP route cache removal, multicast applications using
a lot of multicast addresses hit O(N) behavior in ip_check_mc_rcu().

Add a per in_device hash table to get faster lookup.

This hash table is created only if the number of items in mc_list is
above 4.

Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet_sched: htb: do not setup default rate estimators
Eric Dumazet [Thu, 6 Jun 2013 21:53:16 +0000 (14:53 -0700)]
net_sched: htb: do not setup default rate estimators

With a thousand htb classes, est_timer() spends ~5 million cpu cycles
and throws out cpu cache, because each htb class has a default
rate estimator (est 4sec 16sec).

Most users do not use default rate estimators, so switch htb
to not set them up.

Add a module parameter (htb_rate_est) so that users relying
on this default rate estimator can revert the behavior.

echo 1 >/sys/module/sch_htb/parameters/htb_rate_est

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet_sched: psched_ratecfg_precompute() improvements
Eric Dumazet [Thu, 6 Jun 2013 20:56:19 +0000 (13:56 -0700)]
net_sched: psched_ratecfg_precompute() improvements

Before allowing 64-bit byte rates, refactor
psched_ratecfg_precompute() to get better comments
and increased accuracy.

rate_bps field is renamed to rate_bytes_ps, as we only
have to worry about bytes per second.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wirel...
John W. Linville [Tue, 11 Jun 2013 18:48:32 +0000 (14:48 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/linville/wireless-next into for-davem

Conflicts:
drivers/net/wireless/ath/ath9k/debug.c
net/mac80211/iface.c

11 years agocw1200: Fix an assorted pile of checkpatch warnings.
Solomon Peachy [Tue, 11 Jun 2013 13:49:40 +0000 (09:49 -0400)]
cw1200: Fix an assorted pile of checkpatch warnings.

Signed-off-by: Solomon Peachy <pizza@shaftnet.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
11 years agocw1200: Eliminate the ETF debug/engineering code.
Solomon Peachy [Tue, 11 Jun 2013 13:49:39 +0000 (09:49 -0400)]
cw1200: Eliminate the ETF debug/engineering code.

This is only really useful for people who are bringing up new hardware
designs and have access to the proprietary vendor tools that interface
with this mode.

It'll live out of tree until it's rewritten to use a less kludgy interface.

Signed-off-by: Solomon Peachy <pizza@shaftnet.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
11 years agocw1200: Remove "ITP" debug subsystem.
Solomon Peachy [Tue, 11 Jun 2013 13:49:38 +0000 (09:49 -0400)]
cw1200: Remove "ITP" debug subsystem.

This can live on as an out-of-tree patch for those that care.

Signed-off-by: Solomon Peachy <pizza@shaftnet.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
11 years agonet_sched: add 64bit rate estimators
Eric Dumazet [Thu, 6 Jun 2013 15:43:22 +0000 (08:43 -0700)]
net_sched: add 64bit rate estimators

struct gnet_stats_rate_est contains u32 fields, so the bytes per second
field can wrap at 34360Mbit.

Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
and switch the kernel to use this structure natively.

This structure is dumped to user space as a new attribute:

TCA_STATS_RATE_EST64

An old tc command will now display the bps capped at 34360Mbit instead
of wrapped values, and an updated tc command will display correct
information.

Old tc command output, after patch :

eric:~# tc -s -d qd sh dev lo
qdisc pfifo 8001: root refcnt 2 limit 1000p
 Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
 rate 34360Mbit 189696pps backlog 0b 0p requeues 0
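
The 34360Mbit figure follows from the width of the old field: the largest
u32 value, taken as bytes per second and multiplied by 8, is about
34360 * 10^6 bits per second. A minimal sketch of capping versus
wrapping, with hypothetical function names and with rounding to the
nearest megabit chosen only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit bytes-per-second rate into a 32-bit field. */
static uint32_t legacy_bps(uint64_t bps64)
{
    return bps64 > UINT32_MAX ? UINT32_MAX : (uint32_t)bps64;
}

/* Convert bytes per second to megabits per second (10^6 bits). */
static uint64_t bytes_ps_to_mbit(uint64_t bytes_ps)
{
    return (bytes_ps * 8 + 500000) / 1000000;
}
```

A plain truncating cast of a large rate would instead yield a small,
misleading value, which is the wrapping the patch avoids for old tc
binaries.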

This patch carefully reorganizes "struct Qdisc" layout to get optimal
performance on SMP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years agonet: pass correct parameter to skb_headers_offset_update()
Peter Pan(潘卫平) [Thu, 6 Jun 2013 13:27:21 +0000 (21:27 +0800)]
net: pass correct parameter to skb_headers_offset_update()

Since commit 1a37e412a022 (net: Use 16bits for *_headers fields of
struct skbuff), the skb->*_header fields are relative to skb->head,
so copy_skb_header() should no longer call skb_headers_offset_update(),
and we should pass the correct parameter to skb_headers_offset_update()
in pskb_expand_head() and skb_copy_expand().
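
To illustrate why head-relative offsets need adjusting only when the
headroom changes, here is a user-space sketch. The toy_* types and
helpers are hypothetical, and the assumption that the adjustment equals
the change in headroom is mine; the message above does not spell out
the exact parameter.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy buffer: header positions are stored as offsets from head. */
struct toy_buf {
	unsigned char *head;		/* start of the allocation */
	unsigned char *data;		/* start of the packet data */
	uint16_t network_header;	/* offset from head */
	uint16_t transport_header;	/* offset from head */
};

/* Shift every head-relative header offset by 'off' bytes. */
static void toy_headers_offset_update(struct toy_buf *b, int off)
{
	b->network_header = (uint16_t)(b->network_header + off);
	b->transport_header = (uint16_t)(b->transport_header + off);
}

/*
 * Copy 'len' bytes of packet data into 'storage', leaving 'new_headroom'
 * bytes in front. The offsets move by the change in headroom only.
 */
static void toy_copy_expand(struct toy_buf *dst, unsigned char *storage,
			    const struct toy_buf *src, size_t new_headroom,
			    size_t len)
{
	size_t old_headroom = (size_t)(src->data - src->head);

	dst->head = storage;
	dst->data = storage + new_headroom;
	memcpy(dst->data, src->data, len);
	dst->network_header = src->network_header;
	dst->transport_header = src->transport_header;
	toy_headers_offset_update(dst, (int)new_headroom - (int)old_headroom);
}
```

A plain copy that keeps the same headroom would pass an adjustment of
zero, which matches the point that a pure header copy needs no update.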

Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago netlink: Add compare function for netlink_table
Gao feng [Thu, 6 Jun 2013 06:49:11 +0000 (14:49 +0800)]
netlink: Add compare function for netlink_table

As we know, netlink sockets are a private resource of a net
namespace; they can communicate with each other only when they
are in the same net namespace. This works well until we try to
add namespace support for other subsystems which use netlink.

Unlike ipv4 and the route table, some of these subsystems are
not suited to belong to a net namespace. The audit and crypto
subsystems, for example, are better suited to a user
namespace.

So we must be able to let netlink sockets in the same user
namespace communicate with each other.

This patch adds a new function pointer "compare" to
netlink_table. Through this table-defined compare function,
we can decide whether netlink sockets may communicate with
each other.

The behavior isn't changed if we don't provide the compare
function for netlink_table.
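
A minimal user-space sketch of the idea, with hypothetical toy_* types
standing in for the kernel structures: a per-table compare hook is
consulted when present, and the same-net-namespace check is used
otherwise.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_ns { int id; };			/* stand-in for a namespace */
struct toy_sock { const struct toy_ns *net; };	/* socket's net namespace */

/* Per-protocol table with an optional, table-defined compare hook. */
struct toy_table {
	bool (*compare)(const struct toy_ns *net, const struct toy_sock *sk);
};

/* Default rule: the socket must live in the given net namespace. */
static bool toy_default_compare(const struct toy_ns *net,
				const struct toy_sock *sk)
{
	return sk->net == net;
}

/* Example hook for a subsystem that is not tied to a net namespace. */
static bool toy_compare_any(const struct toy_ns *net,
			    const struct toy_sock *sk)
{
	(void)net;
	(void)sk;
	return true;
}

/* Use the table's hook if it provides one, else the default rule. */
static bool toy_match(const struct toy_table *t, const struct toy_ns *net,
		      const struct toy_sock *sk)
{
	if (t->compare)
		return t->compare(net, sk);
	return toy_default_compare(net, sk);
}
```

A table whose compare pointer is NULL behaves exactly as before, which
is the unchanged-behavior case the message describes.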

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago xen-netfront: use skb_partial_csum_set() to simplify the codes
Li RongQing [Thu, 6 Jun 2013 06:35:18 +0000 (14:35 +0800)]
xen-netfront: use skb_partial_csum_set() to simplify the codes

Use skb_partial_csum_set() to simplify the code.

Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago Merge branch 'bridge_flags'
David S. Miller [Tue, 11 Jun 2013 09:04:43 +0000 (02:04 -0700)]
Merge branch 'bridge_flags'

Vlad Yasevich says:

====================
The following series adds 2 new flags to bridge.  One flag allows
the user to control whether mac learning is performed on the interface
or not.  By default mac learning is on.
The other flag allows the user to control whether unicast traffic
is flooded (sent without an fdb entry) to a given unicast port.
Default is on.

Changes since v4:
 - Implemented Stephen's suggestions.

Changes since v2:
 - removed unused "unlock" tag.

Changes since v1:
 - Integrated suggestion from MST to not impact RTM_NEWNEIGH and to
   skip lookups when learning is disabled.

Vlad Yasevich (2):
  bridge: Add flag to control mac learning.
  bridge: Add a flag to control unicast packet flood.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago bridge: Add a flag to control unicast packet flood.
Vlad Yasevich [Wed, 5 Jun 2013 14:08:01 +0000 (10:08 -0400)]
bridge: Add a flag to control unicast packet flood.

Add a flag to control flood of unicast traffic.  By default, flood is
on and the bridge will flood unicast traffic if it doesn't know
the destination.  When the flag is turned off, unicast traffic
without an FDB will not be forwarded to the specified port.
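
The forwarding rule described above can be sketched in user space as
follows. The flag name, the port numbering and the function are
hypothetical; this is an illustration of the rule, not the bridge code.

```c
#include <stdint.h>

#define TOY_PORT_FLOOD 0x1u	/* hypothetical per-port "flood" flag bit */

/*
 * Return a bitmask of egress ports for a unicast frame.
 * known_port is the port from the FDB lookup, or -1 when the
 * destination is unknown.
 */
static uint32_t toy_egress_mask(const uint32_t *port_flags, int nports,
				int in_port, int known_port)
{
	uint32_t mask = 0;

	/* Known destination: forward to that port regardless of the flag. */
	if (known_port >= 0)
		return known_port == in_port ? 0 : (1u << known_port);

	/* Unknown destination: flood only to ports with the flag set. */
	for (int p = 0; p < nports; p++)
		if (p != in_port && (port_flags[p] & TOY_PORT_FLOOD))
			mask |= 1u << p;
	return mask;
}
```

With four ports where only port 2 has the flag cleared, an unknown
destination arriving on port 0 goes to ports 1 and 3, while a frame
whose destination is known to be on port 2 is still delivered there.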

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago bridge: Add flag to control mac learning.
Vlad Yasevich [Wed, 5 Jun 2013 14:08:00 +0000 (10:08 -0400)]
bridge: Add flag to control mac learning.

Allow user to control whether mac learning is enabled on the port.
By default, mac learning is enabled.  Disabling mac learning will
cause new dynamic FDB entries to not be created for a particular port.
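
As a sketch of the effect, assuming a tiny fixed-size table and
hypothetical toy_* names (not the bridge implementation), learning
becomes a no-op when it is disabled for the ingress port:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_FDB_SIZE 8

struct toy_fdb_entry {
	uint8_t mac[6];
	int port;
	bool used;
};

struct toy_fdb {
	struct toy_fdb_entry e[TOY_FDB_SIZE];
};

/* Return the port a MAC was learned on, or -1 if there is no entry. */
static int toy_fdb_lookup(const struct toy_fdb *fdb, const uint8_t mac[6])
{
	for (int i = 0; i < TOY_FDB_SIZE; i++)
		if (fdb->e[i].used && memcmp(fdb->e[i].mac, mac, 6) == 0)
			return fdb->e[i].port;
	return -1;
}

/*
 * Learn the source MAC of a frame seen on 'port'. Nothing is created
 * when learning is disabled for that port. Returns true if a new
 * entry was added.
 */
static bool toy_fdb_learn(struct toy_fdb *fdb, const uint8_t mac[6],
			  int port, bool learning_enabled)
{
	if (!learning_enabled)
		return false;
	if (toy_fdb_lookup(fdb, mac) >= 0)
		return false;	/* already known; a real bridge would refresh it */

	for (int i = 0; i < TOY_FDB_SIZE; i++) {
		if (!fdb->e[i].used) {
			memcpy(fdb->e[i].mac, mac, 6);
			fdb->e[i].port = port;
			fdb->e[i].used = true;
			return true;
		}
	}
	return false;	/* table full */
}
```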

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago net: remove last caller of skb_tail_offset() and itself
Cong Wang [Wed, 5 Jun 2013 12:14:10 +0000 (20:14 +0800)]
net: remove last caller of skb_tail_offset() and itself

Similar to the following commits:

commit 00f97da17a0c8d656d0c9 (netpoll: fix position of network header)
commit 525cebedb32a87fa48584 (pktgen: Fix position of ip and udp header)

Using skb_tail_offset() does not seem correct, since the offset
is based on the head pointer.

With the last caller removed, skb_tail_offset() can be killed
finally.

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Daniel Borkmann <dborkmann@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago Merge branch 'll_poll'
David S. Miller [Tue, 11 Jun 2013 04:23:57 +0000 (21:23 -0700)]
Merge branch 'll_poll'

Eliezer Tamir says:

====================
This patch set adds the ability for the socket layer code to
poll directly on an Ethernet device's RX queue.
This eliminates the cost of the interrupt and context switch
and with proper tuning allows us to get very close to the HW latency.

This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from
last year
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.

Performance numbers:
     setup                         TCP_RR           UDP_RR
kernel  Config     C3/6 rx-usecs tps cpu% S.dem  tps cpu% S.dem
patched optimized  on   100      87k 3.13 11.4   94k 3.17 10.7
patched optimized  on   0        71k 3.12 14.0   84k 3.19 12.0
patched optimized  on   adaptive 80k 3.13 12.5   90k 3.46 12.2
patched typical    on   100      72k 3.13 14.0   79k 3.17 12.8
patched typical    on   0        60k 2.13 16.5   71k 3.18 14.0
patched typical    on   adaptive 67k 3.51 16.7   75k 3.36 14.5
3.9     optimized  on   adaptive 25k 1.0  12.7   28k 0.98 11.2
3.9     typical    off  0        48k 1.09  7.3   52k 1.11 4.18
3.9     typical    off  adaptive 35k 1.12 4.08   38k 0.65 5.49
3.9     optimized  off  adaptive 40k 0.82 4.83   43k 0.70 5.23
3.9     optimized  off  0        57k 1.17 4.08   62k 1.04 3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical
NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down
config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive,
100 us
When C3/6 states were turned on (via BIOS) the performance governor
was used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution
of the patches and has never varied by more than 1%.

Design:
A global hash table that allows us to look up a struct napi by a
unique id was added.

A napi_id field was added both to struct sk_buff and struct sock.
This is used to track which NAPI we need to poll for a specific
socket.

The device driver marks every incoming skb with this id.
This is propagated to the sk when the socket is looked up in the
protocol handler.
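
The id lookup and propagation described above can be modelled in user
space roughly as follows. All toy_* names are hypothetical, and the
sketch deliberately leaves out the locking and RCU protection that a
real implementation needs.

```c
#include <stddef.h>

#define TOY_HASH_SIZE 16u

struct toy_napi { unsigned int id; struct toy_napi *next; };
struct toy_skb  { unsigned int napi_id; };
struct toy_sock { unsigned int napi_id; };

static struct toy_napi *toy_hash[TOY_HASH_SIZE];
static unsigned int toy_next_id = 1;	/* 0 means "no NAPI recorded" */

/* Give the napi a unique id and insert it into the global hash. */
static void toy_napi_hash_add(struct toy_napi *n)
{
	unsigned int bucket;

	n->id = toy_next_id++;
	bucket = n->id % TOY_HASH_SIZE;
	n->next = toy_hash[bucket];
	toy_hash[bucket] = n;
}

/* Look a napi up by id; returns NULL if the id is not registered. */
static struct toy_napi *toy_napi_by_id(unsigned int id)
{
	struct toy_napi *n;

	for (n = toy_hash[id % TOY_HASH_SIZE]; n; n = n->next)
		if (n->id == id)
			return n;
	return NULL;
}

/* Driver side: tag each incoming packet with the receiving napi's id. */
static void toy_mark_skb(struct toy_skb *skb, const struct toy_napi *n)
{
	skb->napi_id = n->id;
}

/* Protocol handler side: remember on the socket where its data arrives. */
static void toy_mark_sock(struct toy_sock *sk, const struct toy_skb *skb)
{
	sk->napi_id = skb->napi_id;
}
```

Storing an id rather than a pointer means a stale socket entry resolves
to NULL after the napi is removed, instead of pointing at freed memory.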

When the socket code does not find any more data on the socket queue,
it now may call ndo_ll_poll which will crank the device's rx queue and
feed incoming packets to the stack directly from the context of the
socket.

A sysctl value (net.core.low_latency_poll) controls how many
microseconds we busy-wait before giving up. (Setting it to 0 disables
busy-polling globally.)
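
A rough model of the budgeted busy-wait, using a fake clock so that it
can be exercised in user space. The toy_* names, and the assumption
that each poll costs one microsecond, are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fake clock and device state so the loop can run in a test. */
struct toy_poll_ctx {
	uint64_t now_us;	/* current time in the model */
	uint64_t data_at_us;	/* time at which a packet shows up */
	unsigned int polls;	/* number of device polls performed */
};

static bool toy_queue_has_data(const struct toy_poll_ctx *ctx)
{
	return ctx->now_us >= ctx->data_at_us;
}

/* One device poll; in this model it costs one microsecond. */
static void toy_device_poll(struct toy_poll_ctx *ctx)
{
	ctx->polls++;
	ctx->now_us += 1;
}

/*
 * Poll the device until data arrives or 'budget_us' microseconds have
 * passed. A budget of 0 disables busy-polling: no device poll is made.
 * Returns true if data is available on return.
 */
static bool toy_busy_wait(struct toy_poll_ctx *ctx, uint64_t budget_us)
{
	uint64_t end = ctx->now_us + budget_us;

	if (budget_us == 0)
		return toy_queue_has_data(ctx);

	while (!toy_queue_has_data(ctx)) {
		if (ctx->now_us >= end)
			return false;	/* give up; caller sleeps as usual */
		toy_device_poll(ctx);
	}
	return true;
}
```

With a budget of 50 and data arriving at 30 us, the loop returns true
after 30 polls; if the data arrives at 80 us instead, it gives up after
50 polls and the caller falls back to the normal blocking path.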

Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and
ndo_ll_poll is highly device / configuration dependent, we do this
inside the Ethernet driver.
For example, when packets for high priority connections are sent to
separate rx queues, you might not need locking between napi poll and
ndo_ll_poll at all.

For ixgbe we only lock the RX queue.
ndo_ll_poll does not touch the interrupt state or the TX queues.
(earlier versions of this patchset did touch them,
but this design is simpler and works better.)

If a queue is actively polled by a socket (on another CPU) napi poll
will not service it, but will wait until the queue can be locked
and cleaned before doing a napi_complete().
If a socket can't lock the queue because another CPU has it,
either from napi or from another socket polling on the queue,
the socket code can busy wait on the socket's skb queue.
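
The ownership rule in the paragraph above can be sketched with a
single atomic owner field. This is a simplified user-space model with
hypothetical toy_* names, not the ixgbe implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum toy_owner { TOY_IDLE, TOY_NAPI, TOY_POLL };

struct toy_rxq {
	atomic_int owner;	/* who currently services the RX queue */
};

static void toy_rxq_init(struct toy_rxq *q)
{
	atomic_init(&q->owner, TOY_IDLE);
}

/* Take the queue for 'who' only if nobody else holds it. */
static bool toy_rxq_trylock(struct toy_rxq *q, enum toy_owner who)
{
	int expected = TOY_IDLE;

	return atomic_compare_exchange_strong(&q->owner, &expected, who);
}

static void toy_rxq_unlock(struct toy_rxq *q)
{
	atomic_store(&q->owner, TOY_IDLE);
}

/*
 * NAPI side: returns true when the queue was cleaned and the poll may
 * complete; false means a socket holds the queue, so the poll must be
 * retried later instead of completing.
 */
static bool toy_napi_poll(struct toy_rxq *q)
{
	if (!toy_rxq_trylock(q, TOY_NAPI))
		return false;
	/* clean the RX ring here */
	toy_rxq_unlock(q);
	return true;
}

/*
 * Socket side: returns true if it polled the device itself; false
 * means another CPU holds the queue and the caller should simply
 * watch the socket's own skb queue.
 */
static bool toy_socket_poll(struct toy_rxq *q)
{
	if (!toy_rxq_trylock(q, TOY_POLL))
		return false;
	/* pull packets from the RX ring into the stack here */
	toy_rxq_unlock(q);
	return true;
}
```

Neither side blocks on the other: the loser of the trylock falls back
to its own retry path, which is the behavior the paragraph describes.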

ndo_ll_poll does not have preferential treatment for the data from the
calling socket vs. data from others, so if another CPU is polling,
you will see your data on this socket's queue when it arrives.

ndo_ll_poll is called with local BHs disabled, so it won't race on
the same CPU with net_rx_action, which calls the napi poll method.

2. Napi_hash
The napi hash mechanism uses RCU.
napi_by_id() must be called under rcu_read_lock().
After a call to napi_hash_del(), the caller must take care to wait for
an RCU grace period before freeing the memory containing the napi
struct. (ixgbe already did this, because the queue vector structure
uses RCU to protect the statistics counters in it.)

How to test:

1. The patchset should apply cleanly to net-next.
(Don't forget to configure NET_LL_RX_POLL.)

2. The ethtool -c setting for rx-usecs should be on the order of 100.

3. Use ethtool -K to disable GRO and LRO.
(You are encouraged to try it both ways. If you find that your
workload does better with GRO on, do tell us.)

4. Sysctl value net.core.low_latency_poll controls how long
(in us) to busy-wait for more data. You are encouraged to play
with this and see what works for you. The default is now 0, so you
need to set it to turn the feature on. I recommend a value around 50.

5. The benchmark thread and IRQ should be bound to separate cores.
Both cores should be on the same CPU NUMA node as the NIC.
When the app and the IRQ run on the same CPU you get a small penalty.
If interrupt coalescing is set to a low value this penalty can be very
large.

6. If you suspect that your machine is not configured properly,
use numademo to make sure that the CPU to memory BW is OK.
numademo 128m memcpy local copy numbers should be more than
8GB/s on a properly configured machine.

Change log:
v10
- removed select/poll support. (we will work on this some more and try again)
v9
- correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai.
- more int -> bool changes, reported by Eric Dumazet.
- better mask testing in sock_poll(), reported by Eric Dumazet.

v8
- split out udp and select/poll into separate patches.
  what used to be patch 2/5 is now three patches.
- type corrections from Amir Vadai and Cong Wang:
  one unsigned long that was left when changing to cycles_t
  int -> bool
- more detailed patch descriptions.

v7
- suggested by Ben Hutchings and Eric Dumazet:
  type fixes, static for globals in net/core.c,
  avoid napi_id collisions in napi_hash_add()

v6
- many small fixes suggested by Eric Dumazet:
  data locality, typos, documentation
  protect napi_hash insert/delete with a spinlock (napi_gen_id is no
  longer atomic_t since it's only accessed with the spinlock held.)
- added IPv6 TCP and UDP support (only minimally tested)

v5
- corrections suggested by Ben Hutchings:
  fixed typos, moved the config option and sysctl value from IPv4 to net
- moved sk_mark_ll() to the protocol handlers
- removed global id mechanism, replaced with a hashed napi_id.
  based on code sample from Eric Dumazet
  Note that ixgbe_free_q_vector() already waits an rcu grace period
  before freeing the q_vector, so nothing additional needs to be done
  when adding a call to napi_hash_del().
- simple poll/select support

v4
- removed separate config option for TCP as suggested by Eric Dumazet.
- added linux mib counter for packets received through the low latency path,
  as suggested by Andi Kleen.
- re-allow module unloading, remove module param, use a global generation id
  instead to prevent the use of a stale napi pointer, as suggested
  by Eric Dumazet
- updated Documentation/networking/ip-sysctl.txt text

v3
- coding style changes suggested by Dave Miller

v2
- the sysctl knob is now in microseconds. The default value is now 0 (off).
- for now the code depends at configure time on CONFIG_X86_TSC
- the napi reference in struct sk_buff is now a union with the dma cookie
  since the former is only used on RX and the latter on TX,
  as suggested by Eric Dumazet.
- we do a better job at honoring non-blocking operations.
- removed busy-polling support for tcp_read_sock()
- remove dynamic disabling of GRO
- coding style fixes
- disallow unloading the device module after the feature has been used

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
Mike Polehn, Anil Vasudevan, Don Wood
Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
11 years ago ixgbe: add extra stats for ndo_ll_poll
Eliezer Tamir [Mon, 10 Jun 2013 08:40:31 +0000 (11:40 +0300)]
ixgbe: add extra stats for ndo_ll_poll

Add additional statistics to the ixgbe driver for ndo_ll_poll.
They are defined under LL_EXTENDED_STATS.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>