Add wildcarded flow support in kernel datapath.
Wildcarded flow can improve OVS flow set up performance by avoid sending
matching new flows to the user space program. The exact performance boost
will largely dependent on wildcarded flow hit rate.
In case all new flows hits wildcard flows, the flow set up rate is
within 5% of that of linux bridge module.
Pravin has made significant contributions to this patch. Including API
clean ups and bug fixes.
Co-authored-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
[jesse: Additional documentation, fix memory leak, and improve validation.]
Signed-off-by: Jesse Gross <jesse@nicira.com>
It is an error to try to change the type of a vport using the set
command. However, while we check that this is an error, we still
proceed to allocate memory which then gets freed immediately.
This stops processing after noticing the error, which does not
actually fix a bug but is more correct.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
The only user is get_dpifindex(), no need to redirect via the port
operations.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
There is a inconsistent ordering in function ovs_vport_cmd_set()
between upstream and out of tree ovs module. Following patch
fixes it by releasing lock before calling ovs_notify.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Currently OVS uses combination of genl and rtnl lock to protect
datapath state. This was done due to networking stack locking.
But this has complicated locking and there are few lock ordering
issues with new tunneling protocols.
Following patch simplifies locking by introducing new ovs mutex
and now this lock is used to protect entire ovs state.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Kills the FLOW_BUFSIZE constant which needs to be calculated manually
and replaces it with key_attr_size() based on nla_total_size().
Calculates the size of datapath messages instead of relying on
NLMSG_DEFAULT_SIZE and moves the existing message size calculations
into own functions for clarity.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Less error prone as it takes into account the length of both the
destination buffer and the source attribute and documents when
data is copied from an attribute.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Specifying the minimal length in the policy makes it reuseable
and documents the interface.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Allocation of the Netlink notification skb can potentially fail
after changing vport configuration. In general, we try to avoid
this by undoing any change we made but that is difficult for existing
objects. This avoids the problem by preallocating the buffer (which
is fixed size).
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Header caching used to store a precomputed flow along with the skb
but no longer exists. There were a few remaining checks for those
flows, which this removes. It simplifies the code slightly and brings
us closer to upstream.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Since userspace flow based tunneling code is checked in, the kernel
port based tunneling code can be removed.
Patch removes following components:
- tunnel ports hash table and moved tunnel ports list to individual
vports.
- Cleaned per tnl-port config.
- OVS_KEY_ATTR_TUN_ID action is removed.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #15078
This reverts commit 82b0d75509.
This patch introduced bug by calling vfree() from interrupt context.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
The switch to flow based tunneling increased the size of each output
action in the flow action list. In extreme cases, this can result
in the action list exceeding the maximum buffer size.
This doubles the maximum buffer size to compensate for the increase
in action size. In the common case, most allocations will be
less than a page and those uses kmalloc. Therefore, for the majority
of situations, this will have no impact.
Bug #15203
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Commit e995e3df57 (Allow
OVS_USERSPACE_ATTR_USERDATA to be variable length.) introduced an
open coded version of nla_len() in queue_userspace_packet(). This
replaces it with the equivalent function call.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Until now, the optional OVS_USERSPACE_ATTR_USERDATA attribute had to be
exactly 64 bits long, if it was present. However, 64 bits is not enough
space to associate as much information with a flow as would be convenient
for some userspace features now under development. This commit generalizes
the attribute, allowing it to be any length.
This generalization is backward-compatible: if userspace only uses 64-bit
attributes, then it will not see any change in behavior.
CC: Romain Lenglet <rlenglet@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Without genlmsg_end the upcall message ends (according to nlmsg_len) after the
struct ovs_header.
Signed-off-by: Rich Lane <rlane@bigswitch.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
This bug was introduced in 1fc7083d (datapath: Remove vport MAC address
configuration.)
Signed-off-by: Rich Lane <rlane@bigswitch.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
If the pointer does not represent an error then the PTR_ERR macro may still
return a nonzero value. The fix is the same as in ovs_vport_cmd_set.
Signed-off-by: Rich Lane <rlane@bigswitch.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
skb_gso_segment() is almost always called in tx path,
except for openvswitch. It calls this function when
it receives the packet and tries to queue it to user-space.
In this special case, the ->ip_summed check inside
skb_gso_segment() is no longer true, as ->ip_summed value
has different meanings on rx path.
This patch adjusts skb_gso_segment() so that we can at least
avoid such warnings on checksum.
Cc: Jesse Gross <jesse@nicira.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[jesse: backport to kernels before 3.9 and add to tunnel.c]
Signed-off-by: Jesse Gross <jesse@nicira.com>
The ability to retrieve and set MAC addresses on vports is only
necessary for tunnel ports (the addresses for actual devices can be
retrieved through direct Linux mechanisms). Tunnel ports only used
the information for the purpose of generating path MTU discovery
packets, which has now been removed. Current userspace code already
reflects these changes, so this drops the functionality from the
kernel.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Kyle Mestery <kmestery@cisco.com>
Currently, if there isn't enough space to store the actions in a
flow during a dump we return -ENOMEM. However, the standard error
in this situation is -EMSGSIZE so this changes the behavior to match.
This issue was introduced in 354d4c98a8
(datapath: Fix nelink attribute size for flow.).
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
After commit 9b405f1aa8 (datapath: More
flexible kernel/userspace tunneling attribute.), it was possible for a
flow dump to return a partial action list. It's better to return no
action list at all in this situation since then userspace will know
that it should request the full thing if it wants rather than have
incorrect results. Therefore, this prevents those partial lists in
situations where we have a very large number of actions.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
This reverts commit 00c7faf3e5.
In general, it should not be possible have a NULL return value from
skb_gso_segment() since we're not actually trying to verify the
header integrity. No other callers with similar needs have NULL
checks. The actual cause of the problem was LRO packets, which
OVS isn't equipped to handle. The commit
33e031e99c (datapath: Move LRO check
from transmit to receive.) solves that problem by fixing the LRO
check. In order to avoid possibly masking any other problems, this
reverts the GSO check which should no longer be needed.
Signed-off-by: Jesse Gross <jesse@nicira.com>
skb_gso_segment() has the following comment:
* It may return NULL if the skb requires no segmentation. This is
* only possible when GSO is used for verifying header integrity.
Somehow queue_gso_packets() has never hit this case before, but some
failures have suddenly been reported. This commit should fix the problem.
Additional commentary by Jesse: We shouldn't normally be hitting this case
because we're actually trying to do GSO, not header validation. However, I
guess the guest/backend must be generating a packet with an MSS, which
tricks us into thinking that it's GSO, but no GSO is actually requested.
In the case of the bridge, header validation does take place so the
situation is handled already. It seems not ideal that the network backend
doesn't sanitize these packets but it's probably good that we handle
it in any case.
Bug #14772.
Reported-by: Deepesh Govindan <dgovindan@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Following patch adds null check while inserting new netlink attribute.
This was introduced by commit 9b405f1aa8 (datapath: More
flexible kernel/userspace tunneling attribute.)
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Bug #14767
Following patch breaks down single ipv4_tunnel netlink attribute into
individual member attributes. It will help when we extend tunneling
parameters in future.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #14611
Add Linux 3.8 kernel to the range of supported kernel versions.
Signed-off-by: James Page <james.page@ubuntu.com>
[jesse: Update NEWS and FAQ]
Signed-off-by: Jesse Gross <jesse@nicira.com>
This reduces the number of valid "no such device" error values that
need special attention by the caller.
Userspace code will need to keep on checking for both ENODEV and
ENOENT as long as older kernel modules are around.
Signed-off-by: Jarno Rajahalme <jarno.rajahalme@nsn.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Currently brcompat does not work on master due to recent
datapath changes. We have decided to remove it as it is
not used very widely.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
This patch adds support for skb mark matching and set action.
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
just use more faster this_cpu_ptr instead of per_cpu_ptr(p, smp_processor_id());
Signed-off-by: Shan Wei <davidshan@tencent.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Header caching previously required the ability to maintain the lifetime
of flows across RCU boundaries. However, now that header caching is
gone we can simplfy the code and make it match the upstream version.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Following patch adds start offset for sw_flow-key, so that we can
skip tunneling information in key for non-tunnel flows.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Kyle Mestery <kmestery@cisco.com>
Acked-by: Ben Pfaff <blp@nicira.com>
This is a first pass at providing a tun_key which can be
used as the basis for flow-based tunnelling. The
tun_key includes and replaces the tun_id in both struct
ovs_skb_cb and struct sw_tun_key.
This patch allows all existing tun_id behaviour to still work. Existing
users of tun_id are redirected to tun_key->tun_id to retain compatibility.
However, when the userspace code is updated to make use of the new
tun_key, the old behaviour will be deprecated and removed.
NOTE: With these changes, the tunneling code no longer assumes input and
output keys are symmetric. If they are not, PMTUD needs to be disabled
for tunneling to work.
Signed-off-by: Kyle Mestery <kmestery@cisco.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
The upstream version of the module always has the version of the running kernel
but for out-of-tree modules it can be difficult to tell the current version.
This adds the information to the module where it can be read using modinfo for
the on-disk version or from /sys/module/openvswitch/version for the currently
loaded module.
Suggested-by: Duffie Cooley <dcooley@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Kyle Mestery <kmestery@cisco.com>
If a datapath fails to initialze fully (likely due to out-of-memory)
then it's possible that we can take a reference to a network
namespace but never release it. This fixes the problem by releasing
any resources in the event of an error.
Found by code inspection, it's likely to be extremely rare in practice.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
When installing a flow with an action to set a particular field we
need to validate that the packets that are part of the flow actually
contain that header. With IP we use zeroed addresses and with TCP/UDP
the check is for zeroed ports. This check is overly broad and can catch
packets like DHCP requests that have a zero source address in a
legitimate header. This changes the check to look for a zeroed protocol
number for IP or for both ports be zero for TCP/UDP before considering
the header to not exist.
Bug #12769
Reported-by: Ethan Jackson <ethan@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
At the point where it was used, skb_shinfo(skb)->gso_type referred to a
post-GSO sk_buff. Thus, it would always be 0. We want to know the pre-GSO
gso_type, so we need to obtain it before segmenting.
Before this change, the kernel would pass inconsistent data to userspace:
packets for UDP fragments with nonzero offset would be passed along with
flow keys that indicate a zero offset (that is, the flow key for "later"
fragments claimed to be "first" fragments). This inconsistency tended
to confuse Open vSwitch userspace, causing it to log messages about
"failed to flow_del" the flows with "later" fragments.
Bug #12394.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:
perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
When the kernel validates set TCP/UDP port actions, it looks at
the ports in the existing flow to make sure that the L4 header exists.
However, these actions always use the IPv4 version of the struct.
Following patch fixes this by checking for flow ip protocol first.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Bug #11205