Use percpu allocator for stats due to objection to stats array.
But percpu allocator is not designed for high churn allocation/
deallcation. so we need to avoid allocating percpu flow for
short lived flows. One cheaper way to detect flow is by checking
if 5-tuple used in RSS are masked or not. if any one of them is
masked, flow is likely shared across CPU where percpu stat
should be more scalable. And that flow should be relatively
long lived flow.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Flow lookup can happen either in packet processing context or userspace
context but it was annotated as requiring RCU read lock to be held. This
also allows OVS mutex to be held without causing warnings.
Reported-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
API changes only for code readability. No functional chnages.
This patch removes the underscored version. Added a new API ovs_flow_tbl_lookup_stats() that
returns the n_mask_hits.
Reported by: Ben Pfaff <blp@nicira.com>
Reviewed-by: Thomas Graf <tgraf@redhat.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Widen TCP flags handling from 7 bits (uint8_t) to 12 bits (uint16_t).
The kernel interface remains at 8 bits, which makes no functional
difference now, as none of the higher bits is currently of interest
to the userspace.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
With mega flow implementation ovs flow can be shared between
multiple CPUs which makes stats updates highly contended
operation. Following patch allocates separate stats for each
CPU to make stats update scalable.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Collect mega flow mask stats. ovs-dpctl show command can be used to
display them.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Hides mega-flow implementation in flow_table.c rather than
datapath.c.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
ovs-flow rehash does not touch mega flow list. Following patch
moves it dp struct datapath. Avoid one extra indirection for
accessing mega-flow list head on every packet receive.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Over the time datapath.c and flow.c has became pretty large files.
Following patch restructures functionality of component into three
different components:
flow.c: contains flow extract.
flow_netlink.c: netlink flow api.
flow_table.c: flow table api.
Diffstat is showing wrong count. This patch mostly restructures code
without changing logic.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
OVS has its own workq implementation for coupe of reasons. first
was to avoid system freeze due to ovs-flow rehash softlockup.
We have moved out rehash from workq, So this problem does not exist.
second was related bugs in kernel workq implementation in pre-2.6.32
kernel. But we have dropped support for older kernel.
So there is no reason to keep ovs-workq around. Following patch
removes it.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Rehashing in ovs-workqueue can cause ovs-mutex lock contentions
in case of heavy flow setups where both needs ovs-mutex. So by
moving rehashing to flow-setup we can eliminate contention.
This also simplify ovs locking and reduces dependence on
workqueue.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Rather than having compile time check in datapath.c, its better
to check kernel version at configuration step.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This makes datapath module much close to upstream datapath and
make code easy to understand.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This patch adds support for rewriting SCTP src,dst ports similar to the
functionality already available for TCP/UDP.
Rewriting SCTP ports is expensive due to double-recalculation of the
SCTP checksums; this is performed to ensure that packets traversing OVS
with invalid checksums will continue to the destination with any
checksum corruption intact.
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Generic tunnel rcv and send function are only used by lisp tunneling
module, so It make sense to move them to lisp module.
CC: Lori Jakab <lojakab@cisco.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Lorand Jakab <lojakab@cisco.com>
Acked-by: Jesse Gross <jesse@nicira.com>
OVS locking was recently changed to have private OVS lock which
simplified overall locking. Therefore there is no need to have
another global genl lock to protect OVS data structures. Following
patch uses of parallel_ops genl family for OVS. This also allows
more granual OVS locking using ovs_mutex for protecting OVS data
structures, which gives more concurrencey. E.g multiple genl
operations OVS_PACKET_CMD_EXECUTE can run in parallel, etc.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The external symbols in the OVS kernel module are prefixed with
'ovs_' with the exception of ipv4_tun_to/from_nlattr(). This adds
the prefix and makes the out of tree version consistent with
upstream.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
In certain cases we need to ensure we save off skb->cb before
calling __skb_gso_segment() since in kernels >= 3.9 skb->cb is
used by this routine.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Kyle Mestery <kmestery@cisco.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
After a mask is assigned to a flow, it will not change for the life of
the flow. Since flow access is protected by RCU lock, access to
flow->mask after getting a flow is always safe.
Suggested-by: Jesse Gross <jesse@nicira.com>
Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
It is important to validate flow actions to ensure that they do
not try to write off the end of a packet. The mechanism to do this
is to ensure that a flow is precise enough to describe valid vs.
invalid packets and only allowing actions on valid flows.
The introduction of megaflows broke this by using a narrow base
flow but a potentially wide match. This meant that while the
original flow was properly validated, later packets might not
conform to that flow and could be truncated. This switches to
using the masked flow instead, effectively requiring that all
possible matching packets be valid in order for a flow's actions
to be accepted.
This change only affects the flow setup path - executed packets
have always used the flow extracted from the packet and therefore
were properly validated.
Signed-off-by: Jesse Gross <jesse@nicira.com>
RCUfy dp-dump operation which is already read-only. This
makes all ovs dump operations lockless.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
When kernel rejects a netlink message, it usually returns EINVAL
error code to the userspace. The actual reason for rejecting the
netlinke message is not available, making it harder to debug netlink
issues. This patch adds kernel log messages whenever a netlink message
is rejected with reasons. Those messages are logged at the info level.
Those messages are logged only once per message, to keep kernel log noise
level down. Reload the kernel module to re-enable already logged
messages.
The messages are meant to help developers to debug userspace and kernel
intergration issues. The actual message may change or be removed over time.
These messages are not expected to show up in a production environment.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Flow dump operation is read-only operation. There is no need to
take ovs-lock. Following patch use rcu-lock for dumping flows.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
OVS_ACTION_ATTR_USERSPACE action was sending the key from the matching
flow. This works for exact match flows because flow keys are the
same as packet keys. However, it does not work with wildcarded flows as
the packet keys may be different than the flow keys. This patch uses
the packet keys carried in OVS_CB(skb) when calling output_userspace().
Bug #18163
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Add wildcarded flow support in kernel datapath.
Wildcarded flow can improve OVS flow set up performance by avoid sending
matching new flows to the user space program. The exact performance boost
will largely dependent on wildcarded flow hit rate.
In case all new flows hits wildcard flows, the flow set up rate is
within 5% of that of linux bridge module.
Pravin has made significant contributions to this patch. Including API
clean ups and bug fixes.
Co-authored-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
[jesse: Additional documentation, fix memory leak, and improve validation.]
Signed-off-by: Jesse Gross <jesse@nicira.com>
It is an error to try to change the type of a vport using the set
command. However, while we check that this is an error, we still
proceed to allocate memory which then gets freed immediately.
This stops processing after noticing the error, which does not
actually fix a bug but is more correct.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
The only user is get_dpifindex(), no need to redirect via the port
operations.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
There is a inconsistent ordering in function ovs_vport_cmd_set()
between upstream and out of tree ovs module. Following patch
fixes it by releasing lock before calling ovs_notify.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Currently OVS uses combination of genl and rtnl lock to protect
datapath state. This was done due to networking stack locking.
But this has complicated locking and there are few lock ordering
issues with new tunneling protocols.
Following patch simplifies locking by introducing new ovs mutex
and now this lock is used to protect entire ovs state.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Kills the FLOW_BUFSIZE constant which needs to be calculated manually
and replaces it with key_attr_size() based on nla_total_size().
Calculates the size of datapath messages instead of relying on
NLMSG_DEFAULT_SIZE and moves the existing message size calculations
into own functions for clarity.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Less error prone as it takes into account the length of both the
destination buffer and the source attribute and documents when
data is copied from an attribute.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Specifying the minimal length in the policy makes it reuseable
and documents the interface.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Allocation of the Netlink notification skb can potentially fail
after changing vport configuration. In general, we try to avoid
this by undoing any change we made but that is difficult for existing
objects. This avoids the problem by preallocating the buffer (which
is fixed size).
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Header caching used to store a precomputed flow along with the skb
but no longer exists. There were a few remaining checks for those
flows, which this removes. It simplifies the code slightly and brings
us closer to upstream.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Since userspace flow based tunneling code is checked in, the kernel
port based tunneling code can be removed.
Patch removes following components:
- tunnel ports hash table and moved tunnel ports list to individual
vports.
- Cleaned per tnl-port config.
- OVS_KEY_ATTR_TUN_ID action is removed.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #15078
This reverts commit 82b0d75509.
This patch introduced bug by calling vfree() from interrupt context.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The switch to flow based tunneling increased the size of each output
action in the flow action list. In extreme cases, this can result
in the action list exceeding the maximum buffer size.
This doubles the maximum buffer size to compensate for the increase
in action size. In the common case, most allocations will be
less than a page and those uses kmalloc. Therefore, for the majority
of situations, this will have no impact.
Bug #15203
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>