mirror of https://github.com/openvswitch/ovs synced 2025-10-17 14:28:02 +00:00
196 Commits

Author SHA1 Message Date
Joe Stringer
038e34abaa datapath: Allow matching on conntrack label
Allow matching and setting the ct_label field. As with ct_mark, this is
populated by executing the CT action. The label field may be modified by
specifying a label and mask nested under the CT action. It is stored as
metadata attached to the connection. Label modification occurs after
lookup, and will only persist when the conntrack entry is committed by
providing the COMMIT flag to the CT action. Labels are currently fixed
to 128 bits in size.
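
The label-and-mask update described above can be sketched roughly as follows. This is an illustration in Python, not the kernel code; the helper name `apply_ct_label` is hypothetical, and the only facts taken from the message are the mask semantics and the 128-bit width.

```python
LABEL_BITS = 128  # labels are fixed at 128 bits, per the message above
LABEL_MAX = (1 << LABEL_BITS) - 1


def apply_ct_label(current: int, label: int, mask: int) -> int:
    """Return the label after a masked update (hypothetical helper)."""
    for value in (current, label, mask):
        if not 0 <= value <= LABEL_MAX:
            raise ValueError("label values must fit in 128 bits")
    # Keep the unmasked bits of the current label and take the masked
    # bits from the requested label.
    return (current & ~mask & LABEL_MAX) | (label & mask)
```

Bits outside the mask are left as they were, so several users can share one label without overwriting each other's bits.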

Upstream: c2ac667 "openvswitch: Allow matching on conntrack label"
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2015-12-03 17:17:25 -08:00
Joe Stringer
a94ebc3999 datapath: Add conntrack action
Expose the kernel connection tracker via OVS. Userspace components can
make use of the CT action to populate the connection state (ct_state)
field for a flow. This state can be subsequently matched.

Exposed connection states are OVS_CS_F_*:
- NEW (0x01) - Beginning of a new connection.
- ESTABLISHED (0x02) - Part of an existing connection.
- RELATED (0x04) - Related to an established connection.
- INVALID (0x20) - Could not track the connection for this packet.
- REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
- TRACKED (0x80) - This packet has been sent through conntrack.

When the CT action is executed by itself, it will send the packet
through the connection tracker and populate the ct_state field with one
or more of the connection state flags above. The CT action will always
set the TRACKED bit.
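
As a small illustration of the flag values listed above, the following Python sketch decodes a ct_state value into flag names. The class and function names are made up for this example; only the bit values come from the message.

```python
from enum import IntFlag


class CtState(IntFlag):
    """ct_state bits as listed in the commit message (OVS_CS_F_*)."""
    NEW = 0x01
    ESTABLISHED = 0x02
    RELATED = 0x04
    INVALID = 0x20
    REPLY_DIR = 0x40
    TRACKED = 0x80


def describe_ct_state(value: int) -> list[str]:
    """List the names of the flags set in a ct_state value."""
    return [flag.name for flag in CtState if value & flag]
```

For instance, `describe_ct_state(0x82)` names the ESTABLISHED and TRACKED bits.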

When the COMMIT flag is passed to the conntrack action, this specifies
that information about the connection should be stored. This allows
subsequent packets for the same (or related) connections to be
correlated with this connection. Sending subsequent packets for the
connection through conntrack allows the connection tracker to consider
the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.

The CT action may optionally take a zone to track the flow within. This
allows connections with the same 5-tuple to be kept logically separate
from connections in other zones. If the zone is specified, then the
"ct_zone" match field will be subsequently populated with the zone id.

IP fragments are handled by transparently assembling them as part of the
CT action. The maximum received unit (MRU) size is tracked so that
refragmentation can occur during output.

IP frag handling contributed by Andy Zhou.

Based on original design by Justin Pettit.

Upstream: 7f8a436 "openvswitch: Add conntrack action"
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Justin Pettit <jpettit@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2015-12-03 17:17:25 -08:00
Pravin B Shelar
e23775f20e datapath: Add support for lwtunnel
This patch adds lwtunnel support to the OVS datapath.
With this change the OVS datapath detects lwtunnel support and
uses the new APIs if available. On older kernels that lack this
support, the backported tunnel modules are used.
These backported tunnel devices act as lwtunnel devices.
I tried to keep the backported modules the same as upstream to make
backporting bug fixes easier. Since STT and LISP are not upstream, OVS
always needs to use the respective modules from the tunnel compat layer.
To make this work on kernel 4.3 I have converted the STT and LISP
modules to the lwtunnel API model.

lwtunnel uses skb-dst to pass tunnel information to the
tunnel module. On older kernels this is not possible, so in that
case a metadata ref is stored in OVS_CB and the respective tunnel
vport modules call the tunnel transmit function directly.
Similarly, on the receive side the tunnel recv path directly
calls netdev-vport-receive to pass the skb to OVS.

Major backported components include:
Geneve, GRE, VXLAN, ip_tunnel, udp-tunnels GRO.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Jesse Gross <jesse@kernel.org>
2015-12-03 16:30:21 -08:00
Joe Stringer
c0cddcec39 datapath: Add support for 4.1 kernel.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2015-09-18 13:27:24 -07:00
Alexander Duyck
935fc58209 datapath: Use eth_proto_is_802_3.
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto).
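
For readers unfamiliar with the check being replaced, here is a rough Python equivalent of the original expression. It is a sketch only: the kernel helper works on a big-endian integer rather than a bytes object, and `0x0600` is the value of ETH_P_802_3_MIN.

```python
import struct

ETH_P_802_3_MIN = 0x0600  # smallest value treated as an EtherType


def eth_proto_is_802_3(proto_be: bytes) -> bool:
    """Mirror the replaced check: ntohs(proto) >= ETH_P_802_3_MIN.

    proto_be is the 2-byte type/length field as it appears on the wire.
    """
    (proto,) = struct.unpack("!H", proto_be)  # network to host byte order
    return proto >= ETH_P_802_3_MIN
```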

Backport of upstream commit 6713fc9b8fa33444aa000f0f31076f6a859ccb34:
"openvswitch: Use eth_proto_is_802_3"

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2015-07-30 16:42:07 -07:00
Thomas Graf
4b1632249f datapath: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
Backport of upstream commit:

    openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()

    Also factors out Geneve validation code into a new separate function
    validate_and_copy_geneve_opts().

    A subsequent patch will introduce VXLAN options. Rename the existing
    GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
    tunnel metadata options.

    Signed-off-by: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Upstream: d91641d ("openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2015-02-03 21:56:51 +01:00
Thomas Graf
efd8a18e8d datapath: Account for "rename vlan_tx_* helpers since "tx" is misleading there"
Upstream commit:
    net: rename vlan_tx_* helpers since "tx" is misleading there

    The same macros are used for rx as well. So rename it.

    Signed-off-by: Jiri Pirko <jiri@resnulli.us>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Upstream: df8a39d ("net: rename vlan_tx_* helpers since "tx" is misleading there")
Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2015-02-03 21:55:38 +01:00
Ben Pfaff
9e917416f8 datapath: Consistently include VLAN header in flow and port stats.
Until now, when VLAN acceleration was in use, the bytes of the VLAN header
were not included in port or flow byte counters.  They were however
included when VLAN acceleration was not used.  This commit corrects the
inconsistency, by always including the VLAN header in byte counters.
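
The accounting rule can be sketched as below. This is a Python illustration under the assumption that an accelerated tag is carried outside the packet data; the function name is hypothetical.

```python
VLAN_HLEN = 4  # bytes in an 802.1Q tag


def counted_bytes(frame_len: int, vlan_accelerated: bool) -> int:
    """Bytes to add to a flow or port counter for one packet (sketch).

    frame_len is the length of the packet data as handed to the datapath.
    With VLAN acceleration the tag lives outside the packet data, so its
    4 bytes are added back to keep both cases consistent.
    """
    return frame_len + (VLAN_HLEN if vlan_accelerated else 0)
```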

Previous discussion at
http://openvswitch.org/pipermail/dev/2014-December/049521.html

Already committed to upstream Linux netdev tree as
24cc59d1ebaac54d933dc0b30abcd8bd86193eef.

Reported-by: Motonori Shindo <mshindo@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Reviewed-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2015-01-06 09:06:10 -08:00
Pravin B Shelar
7d16c8478e datapath: fix coding style.
Kernel datapath code has diverged from upstream code.  This
makes porting patches between these two code bases harder
than it needs to be. Following patch fixes this by fixing
coding style issues on this branch.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-09 20:03:33 -08:00
Pravin B Shelar
2baf0e0c6c datapath: Fix few mpls issues.
Found during MPLS upstreaming.  Also sync-up MPLS header files
with upstream code.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-09 20:03:33 -08:00
Pravin B Shelar
af465b67a9 datapath: Fix comment style.
Use netdev comment style.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
2014-10-23 19:09:23 -07:00
Li RongQing
10173ceaa9 datapath: fix a use after free
pskb_may_pull(), called by arphdr_ok(), can change skb->data, so move the arp
assignment after arphdr_ok() to avoid using freed memory.

Fixes: 0714812134d7d ("openvswitch: Eliminate memset() from flow_extract.")
Cc: Jesse Gross <jesse@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-10-17 14:51:37 -07:00
Jarno Rajahalme
9233cef706 datapath: Add support for OVS_FLOW_ATTR_PROBE.
This new flag is useful for suppressing error logging while probing
for datapath features using flow commands.  For backwards
compatibility reasons the commands are executed normally, but error
logging is suppressed.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-10-03 13:31:07 -07:00
Thomas Graf
f1f60b8583 datapath: Constify various function arguments
Help produce better optimized code.

Signed-off-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-09-23 14:47:58 -07:00
Pravin B Shelar
e74d48171e datapath: Remove pkt_key from OVS_CB.
OVS keeps a pointer to the packet key in skb->cb, but the packet key is
stored on the stack. This can make the code a bit tricky, so it is better to
get rid of the pointer.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
2014-09-20 19:45:56 -07:00
Andreea-Cristina Bernat
bc36fcd3d3 datapath: Replace rcu_dereference() with rcu_access_pointer()
The "rcu_dereference()" call is used directly in a condition.
Since its return value is never dereferenced it is recommended to use
"rcu_access_pointer()" instead of "rcu_dereference()".
Therefore, this patch makes the replacement.

The following Coccinelle semantic patch was used:
@@
@@

(
 if(
 (<+...
- rcu_dereference
+ rcu_access_pointer
  (...)
  ...+>)) {...}
|
 while(
 (<+...
- rcu_dereference
+ rcu_access_pointer
  (...)
  ...+>)) {...}
)

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-09-12 14:28:59 -07:00
Andy Zhou
e0a42ae29a datapath: Update flow key before recirc
When the flow key becomes invalid due to push or pop actions, the current
implementation leaves it invalid and only rebuilds the flow key used
for recirculation.

This works, but is less efficient when there are multiple recirc
actions: each recirc action has to re-extract
its own flow key.

This patch updates the original flow key as soon as the first recirc
action is encountered, avoiding an expensive flow extract call for any
future recirc actions as long as the flow key remains valid.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-08-12 10:35:08 -07:00
Pravin B Shelar
fb66fbd15b datapath: Use tun_info only for egress tunnel path.
Currently tun_info is used for passing tunnel information
on both the ingress and egress paths, which causes confusion. This
patch removes its use on the ingress path, making it an egress-only parameter.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
2014-08-06 22:12:25 -07:00
Pravin B Shelar
7f45215aec datapath: Avoid using wrong metadata for recic action.
The recirc action needs to extract the flow key from the packet, and it uses
tun_info from OVS_CB to set the tunnel metadata in the flow key. But tun_info
can be overwritten by the tunnel send action, which would result in a wrong
flow key for the recirculation.
This patch copies the flow-key metadata from the OVS_CB packet key
itself, which avoids this bug.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
2014-08-06 18:03:19 -07:00
Pravin B Shelar
c135bba1a9 datapath: refactor ovs flow extract API.
OVS flow extract is called on packet receive or packet
execute code path.  Following patch defines separate API
for extracting flow-key in packet execute code path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>
2014-08-06 18:03:17 -07:00
Simon Horman
ccf4378615 datapath: Add basic MPLS support to kernel
Allow datapath to recognize and extract MPLS labels into flow keys
and execute actions which push, pop, and set labels on packets.

Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.

Cc: Ravi K <rkerur@gmail.com>
Cc: Leo Alterman <lalterman@nicira.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-06-24 16:02:02 -07:00
Jesse Gross
c1fc1411d2 datapath: Add support for Geneve tunneling.
This adds support for Geneve - Generic Network Virtualization
Encapsulation. The protocol is documented at
http://tools.ietf.org/html/draft-gross-geneve-00

The kernel implementation is completely agnostic to the options
that are in use and can handle newly defined options without
further work. It does this by simply matching on a byte array
of options and allowing userspace to setup flows on this array.

Userspace currently implements only basic support for
Geneve. It can work with the base header (including the VNI) and
is capable of parsing options but does not currently support any
particular option definitions. Over time, the intention is to
allow options to be matched through OpenFlow without requiring
explicit support in OVS userspace.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-06-20 15:19:35 -07:00
Jesse Gross
f0cd669f19 datapath: Wrap struct ovs_key_ipv4_tunnel in a new structure.
Currently, the flow information that is matched for tunnels and
the tunnel data passed around with packets is the same. However,
as additional information is added this is not necessarily desirable,
as in the case of pointers.

This adds a new structure for tunnel metadata which currently contains
only the existing struct. This change is purely internal to the kernel
since the current OVS_KEY_ATTR_IPV4_TUNNEL is simply a compressed version
of OVS_KEY_ATTR_TUNNEL that is translated at flow setup.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-06-19 18:33:28 -07:00
Jesse Gross
9cef26ac6a datapath: Eliminate memset() from flow_extract.
As new protocols are added, the size of the flow key tends to
increase although few protocols care about all of the fields. In
order to optimize this for hashing and matching, OVS uses a variable
length portion of the key. However, when fields are extracted from
the packet we must still zero out the entire key.

This is no longer necessary now that OVS implements masking. Any
fields (or holes in the structure) which are not part of a given
protocol will be by definition not part of the mask and zeroed out
during lookup. Furthermore, since masking already uses variable
length keys this zeroing operation automatically benefits as well.

In principle, the only thing that needs to be done at this point
is to remove the memset() at the beginning of flow extraction. However, some
fields assume that they are initialized to zero, which now must be
done explicitly. In addition, in the event of an error we must also
zero out corresponding fields to signal that there is no valid data
present. These increase the total amount of code but very little of
it is executed in non-error situations.

Removing the memset() reduces the profile of ovs_flow_extract()
from 0.64% to 0.56% when tested with large packets on a 10G link.

Suggested-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-06-19 18:33:27 -07:00
Ben Pfaff
a5b8d49bc6 datapath: Fix tracking of flags seen in TCP flows.
Flow statistics need to take into account the TCP flags from the packet
currently being processed (in 'key'), not the TCP flags matched by the
flow found in the kernel flow table (in 'flow').

This bug made the Open vSwitch userspace fin_timeout action have no effect
in many cases.

Bug #1219516.
Reported-by: Len Gao <leng@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2014-04-08 15:39:18 -07:00
Jarno Rajahalme
4bb90bea0c datapath/flow: Fix ovs_flow_stats_get/clear RCU dereference.
For ovs_flow_stats_get() using ovsl_dereference() was wrong, since
flow dumps call this with RCU read lock.

ovs_flow_stats_clear() is always called with ovs_mutex, so can use
ovsl_dereference().

Also, make the ovs_flow_stats_get() 'flow' argument const to make
later patches cleaner.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-04-02 11:14:58 -07:00
Jarno Rajahalme
aa91700611 datapath: Clarify locking.
Remove unnecessary locking from functions that are always called with
appropriate locking.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Thomas Graf <tgraf@redhat.com>
2014-03-25 09:12:44 -07:00
Jarno Rajahalme
708fb4c50a datapath: Compact sw_flow_key.
Minimize padding in sw_flow_key and move 'tp' to the main struct.
These changes simplify code when accessing the transport port numbers
and the tcp flags, and make the sw_flow_key 8 bytes smaller on 64-bit
systems (128->120 bytes).  These changes also make the keys for IPv4
packets fit in one cache line.

There is a valid concern for safety of packing the struct
ovs_key_ipv4_tunnel, as it would be possible to take the address of
the tun_id member as a __be64 * which could result in unaligned access
in some systems. However:

- sw_flow_key itself is 64-bit aligned, so the tun_id within is always
  64-bit aligned.
- We never make arrays of ovs_key_ipv4_tunnel (which would force every
  second tun_key to be misaligned).
- We never take the address of the tun_id into a __be64 *.
- Wherever we use struct ovs_key_ipv4_tunnel outside the sw_flow_key,
  it is on the stack (in tunnel input functions), where the compiler has
  full control of the alignment.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-03-24 10:45:47 -07:00
Jarno Rajahalme
9ae5ab3c35 datapath: Use TCP flags in the flow key for stats.
We already extract the TCP flags for the key, might as well use that
for stats.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-03-24 09:52:03 -07:00
Ben Pfaff
4029c0435a datapath: Correctly report flow used times for first 5 minutes after boot.
The kernel starts out its "jiffies" timer as 5 minutes below zero, as
shown in include/linux/jiffies.h:

  /*
   * Have the 32 bit jiffies value wrap 5 minutes after boot
   * so jiffies wrap bugs show up earlier.
   */
  #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

The loop in ovs_flow_stats_get() starts out with 'used' set to 0, then
takes any "later" time.  This means that for the first five minutes after
boot, flows will always be reported as never used, since 0 is greater than
any time already seen.
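
The effect can be reproduced with a small Python model. This is a sketch under stated assumptions: a 32-bit unsigned long, HZ = 1000, and a comparison modelled on the kernel's wrap-safe time_after(). The "fixed" variant shows one possible way to avoid the problem, by treating 0 as "no time seen yet"; it is not claimed to be the exact change made by this commit.

```python
HZ = 1000                 # assumed tick rate, for illustration only
MASK32 = 0xFFFFFFFF       # model a 32-bit unsigned long
INITIAL_JIFFIES = (-300 * HZ) & MASK32


def time_after(a: int, b: int) -> bool:
    """Wrap-safe "a is later than b", modelled on 32-bit arithmetic."""
    diff = (b - a) & MASK32
    return diff >= 0x80000000  # same as (long)(b - a) < 0


def latest_used_buggy(samples: list[int]) -> int:
    """Start from 0 and keep any later time, as the message describes."""
    used = 0
    for t in samples:
        if time_after(t, used):
            used = t
    return used


def latest_used_fixed(samples: list[int]) -> int:
    """Treat 0 as "no time seen yet" so the first sample always wins."""
    used = 0
    for t in samples:
        if used == 0 or time_after(t, used):
            used = t
    return used
```

With a sample taken shortly after boot, such as `(INITIAL_JIFFIES + 10) & MASK32`, the first variant keeps returning 0 while the second returns the sample.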

Bug #1192516.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
2014-02-28 15:14:25 -08:00
Joe Perches
982a47ecea datapath: Use ether_addr_copy
It's slightly smaller/faster for some architectures.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-02-16 08:31:45 -08:00
Jarno Rajahalme
9ac56358de datapath: Per NUMA node flow stats.
Keep kernel flow stats for each NUMA node rather than each (logical)
CPU.  This avoids using the per-CPU allocator and removes most of the
kernel-side OVS locking overhead otherwise on the top of perf reports
and allows OVS to scale better with higher number of threads.

With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup
rate doubles on a server with two hyper-threaded physical CPUs (16
logical cores each) compared to the current OVS master.  Tested with
non-trivial flow table with a TCP port match rule forcing all new
connections with unique port numbers to OVS userspace.  The IP
addresses are still wildcarded, so the kernel flows are not considered
as exact match 5-tuple flows.  This type of flows can be expected to
appear in large numbers as the result of more effective wildcarding
made possible by improvements in OVS userspace flow classifier.

Perf results for this test (master):

Events: 305K cycles
+   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
+   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
+   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
+   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
+   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
+   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
+   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
+   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
+   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
+   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
+   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
+   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
+   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
...

And after this patch:

Events: 356K cycles
+   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
+   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
+   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
+   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
+   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
+   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
+   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
+   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
+   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
+   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
+   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
+   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
+   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
+   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
+   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
...

There is a small increase in kernel spinlock overhead due to the same
spinlock being shared between multiple cores of the same physical CPU,
but that is barely visible in the netperf TCP_CRR test performance
(maybe ~1% performance drop, hard to tell exactly due to variance in
the test results), when testing for kernel module throughput (with no
userspace activity, handful of kernel flows).

On flow setup, a single stats instance is allocated (for the NUMA node
0).  As CPUs from multiple NUMA nodes start updating stats, new
NUMA-node specific stats instances are allocated.  This allocation on
the packet processing code path is made to never block or look for
emergency memory pools, minimizing the allocation latency.  If the
allocation fails, the existing preallocated stats instance is used.
Also, if only CPUs from one NUMA-node are updating the preallocated
stats instance, no additional stats instances are allocated.  This
eliminates the need to pre-allocate stats instances that will not be
used, also relieving the stats reader from the burden of reading stats
that are never used.
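
The allocation policy in the paragraph above can be modelled roughly as follows. This is a simplified Python sketch, not the kernel implementation: the class names, the `alloc` callback (which may return None to stand in for a failed non-blocking allocation) and the `last_writer` bookkeeping are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    packets: int = 0
    bytes: int = 0


class FlowStats:
    """Per-NUMA-node stats with lazy allocation (illustrative only)."""

    def __init__(self, num_nodes, alloc=Stats):
        # Only the node 0 instance is allocated at flow setup.
        self.nodes = [Stats()] + [None] * (num_nodes - 1)
        self.alloc = alloc        # returns a Stats, or None on failure
        self.last_writer = None   # node that last wrote the shared instance

    def update(self, node, nbytes):
        stats = self.nodes[node]
        if stats is None:
            if self.last_writer in (None, node):
                # Only one node has been writing so far: keep sharing.
                stats = self.nodes[0]
            else:
                stats = self.alloc()
                if stats is None:
                    stats = self.nodes[0]  # fall back, never block
                else:
                    self.nodes[node] = stats
        if stats is self.nodes[0]:
            self.last_writer = node
        stats.packets += 1
        stats.bytes += nbytes

    def totals(self):
        live = [s for s in self.nodes if s is not None]
        return Stats(sum(s.packets for s in live), sum(s.bytes for s in live))
```

A reader only has to sum the instances that exist, which is the saving the message refers to when it mentions stats that are never used.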

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-02-18 09:56:55 -08:00
Jarno Rajahalme
df65fec117 datapath: Remove 5-tuple optimization.
The 5-tuple optimization becomes unnecessary with a later per-NUMA
node stats patch.  Remove it first to make the changes easier to
grasp.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-18 09:07:03 -08:00
Jarno Rajahalme
ac3e564e43 datapath: Read tcp flags only then the tranport header is present.
Only the first IP fragment can have a TCP header, check for this.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-02-10 08:52:25 -08:00
Pravin B Shelar
9d73c9cac7 datapath: Fix deadlock during stats update.
Stats-read needs to lock the stats, but the same lock is taken by the stats
update in irq context. Stats-read therefore needs to disable irqs to
avoid the following deadlock:

BUG: soft lockup - CPU#1 stuck for 23s! [ovs-vswitchd:1425]
CPU 1
Pid: 1425, comm: ovs-vswitchd Tainted: G           O 3.2.39-server-nn23 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffff8103db22>]  [<ffffffff8103db22>] __ticket_spin_lock+0x22/0x30
RSP: 0018:ffff88003fd03b30  EFLAGS: 00000297
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000050
RDX: 0000000000000002 RSI: ffff88003d0a9900 RDI: ffff88003ae19598
RBP: ffff88003fd03b30 R08: 0000000000000000 R09: ffff88003ad44048
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88003fd03aa8
R13: ffffffff8164e5de R14: ffff88003fd03b30 R15: ffff88003ae19580
FS:  00007ffb0b428940(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3c0ef94000 CR3: 00000000250e2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ovs-vswitchd (pid: 1425, threadinfo ffff88002514a000, task ffff8800250aae00)
Stack:
 ffff88003fd03b40 ffffffff8164596e ffff88003fd03b70 ffffffffa027622d
 ffff88003d0a9900 ffffe8ffffd03800 ffff8800297f5a80 ffff88003fd03ba8
 ffff88003fd03c60 ffffffffa02759af ffff88003fd03de0 ffff88003fd03e4c
Call Trace:
 <IRQ>
 [<ffffffff8164596e>] _raw_spin_lock+0xe/0x20
 [<ffffffffa027622d>] ovs_flow_stats_update+0x5d/0x100 [openvswitch]
 [<ffffffffa02759af>] ovs_dp_process_received_packet+0x8f/0x130 [openvswitch]
 [<ffffffffa027c0ca>] ovs_vport_receive+0x2a/0x30 [openvswitch]
 [<ffffffffa027db18>] netdev_frame_hook+0xb8/0x120 [openvswitch]
 [<ffffffffa027da60>] ? free_port_rcu+0x30/0x30 [openvswitch]
 [<ffffffff81539318>] __netif_receive_skb+0x1c8/0x620
 [<ffffffff8153a4c0>] netif_receive_skb+0x80/0x90
 [<ffffffff8115f14c>] ? ksize+0x1c/0xc0
 [<ffffffff8153a610>] napi_skb_finish+0x50/0x70
 [<ffffffff8153ac15>] napi_gro_receive+0xf5/0x140
 [<ffffffffa00368ae>] vmxnet3_rq_rx_complete+0x51e/0x7c0 [vmxnet3]
 [<ffffffff8101ac90>] ? nommu_map_sg+0xe0/0xe0
 [<ffffffffa0036da5>] vmxnet3_poll_rx_only+0x45/0xc0 [vmxnet3]
 [<ffffffff8153ae64>] net_rx_action+0x134/0x290
 [<ffffffff8103db0d>] ? __ticket_spin_lock+0xd/0x30
 [<ffffffff8106e1a8>] __do_softirq+0xa8/0x210
 [<ffffffff8164596e>] ? _raw_spin_lock+0xe/0x20
 [<ffffffff8164fd6c>] call_softirq+0x1c/0x30
 [<ffffffff81016215>] do_softirq+0x65/0xa0
 [<ffffffff8106e58e>] irq_exit+0x8e/0xb0
 [<ffffffff81650633>] do_IRQ+0x63/0xe0
 [<ffffffff81645e2e>] common_interrupt+0x6e/0x6e

-----------
Bug #21853
Reported-by: Pawan Shukla <shuklap@vmware.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-12-15 20:37:07 -08:00
Pravin B Shelar
b0f3a2feef datapath: Use percpu allocator for flow-stats.
Use the percpu allocator for stats due to objections to the stats array.
But the percpu allocator is not designed for high-churn allocation/
deallocation, so we need to avoid allocating percpu stats for
short-lived flows. One cheap way to detect such flows is to check
whether the 5-tuple fields used in RSS are masked or not. If any one of
them is masked, the flow is likely shared across CPUs, where percpu stats
should be more scalable, and the flow is likely to be relatively
long-lived.
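
A minimal Python sketch of this heuristic is shown below. It assumes IPv4, reads "masked" as "wildcarded", and invents the type and function names; it illustrates the decision only, not the kernel code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveTupleMask:
    """Mask bits for the fields RSS hashes on; all-ones means exact match."""
    ip_src: int = 0xFFFFFFFF
    ip_dst: int = 0xFFFFFFFF
    ip_proto: int = 0xFF
    tp_src: int = 0xFFFF
    tp_dst: int = 0xFFFF


_EXACT = FiveTupleMask()


def wants_percpu_stats(mask: FiveTupleMask) -> bool:
    """True if any 5-tuple field is wildcarded, fully or partly."""
    return mask != _EXACT
```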

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-12-03 08:57:56 -08:00
Jarno Rajahalme
dc235f7fbc TCP flags matching support.
tcp_flags=flags/mask
        Bitwise match on TCP flags.  The flags and mask are 16-bit
        numbers written in decimal or in hexadecimal prefixed by 0x.  Each
        1-bit in mask requires that the corresponding bit in flags must
        match.  Each 0-bit in mask causes the corresponding bit to be
        ignored.

        The TCP protocol currently defines 9 flag bits, and an additional 3
        bits are reserved (must be transmitted as zero), see RFCs 793,
        3168, and 3540.  The flag bits are, numbering from the least
        significant bit:

        0: FIN No more data from sender.

        1: SYN Synchronize sequence numbers.

        2: RST Reset the connection.

        3: PSH Push function.

        4: ACK Acknowledgement field significant.

        5: URG Urgent pointer field significant.

        6: ECE ECN Echo.

        7: CWR Congestion Window Reduced.

        8: NS  Nonce Sum.

        9-11:  Reserved.

        12-15: Not matchable, must be zero.
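
The flags/mask semantics above can be illustrated with a short Python sketch. The helper names are hypothetical; the bit positions are the ones listed in the text.

```python
TCP_FLAG_BITS = {
    "FIN": 0, "SYN": 1, "RST": 2, "PSH": 3, "ACK": 4,
    "URG": 5, "ECE": 6, "CWR": 7, "NS": 8,
}
MATCHABLE_MASK = 0x0FFF  # bits 12-15 are not matchable


def flag_value(*names: str) -> int:
    """Combine flag names into a numeric value."""
    value = 0
    for name in names:
        value |= 1 << TCP_FLAG_BITS[name]
    return value


def tcp_flags_match(packet_flags: int, flags: int, mask: int) -> bool:
    """Apply tcp_flags=flags/mask to the flags seen in a packet."""
    if (flags | mask) & ~MATCHABLE_MASK:
        raise ValueError("bits 12-15 must be zero")
    # 1-bits in mask must match; 0-bits are ignored.
    return (packet_flags & mask) == (flags & mask)
```

For example, flags 0x002 with mask 0x012 matches packets that have SYN set and ACK clear, whatever the other bits are.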

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-10-29 09:43:59 -07:00
Jarno Rajahalme
a66733a8bc Widen TCP flags handling.
Widen TCP flags handling from 7 bits (uint8_t) to 12 bits (uint16_t).
The kernel interface remains at 8 bits, which makes no functional
difference now, as none of the higher bits is currently of interest
to the userspace.
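
As an illustration of the two widths, the sketch below reads the flags from a raw TCP header in Python. It relies on the standard TCP header layout (the 16-bit word at byte offset 12 holds the data offset in its top 4 bits); the function names are made up for this example.

```python
import struct


def tcp_flags_12bit(tcp_header: bytes) -> int:
    """Return the low 12 bits of the 16-bit word at offset 12."""
    if len(tcp_header) < 14:
        raise ValueError("truncated TCP header")
    (word,) = struct.unpack_from("!H", tcp_header, 12)
    return word & 0x0FFF  # drop the 4-bit data offset


def tcp_flags_8bit(tcp_header: bytes) -> int:
    """Narrow view that keeps only the low 8 flag bits."""
    return tcp_flags_12bit(tcp_header) & 0xFF
```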

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-10-29 09:40:19 -07:00
Pravin B Shelar
b0b906ccf4 datapath: Per cpu flow stats.
With the mega flow implementation, an OVS flow can be shared between
multiple CPUs, which makes stats updates a highly contended
operation. This patch allocates separate stats for each
CPU to make stats updates scalable.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-10-21 08:42:20 -07:00
Pravin B Shelar
a097c0b230 datapath: Restructure datapath.c and flow.c
Over time, datapath.c and flow.c have become pretty large files.
This patch restructures their functionality into three
different components:

flow.c: contains flow extract.
flow_netlink.c: netlink flow api.
flow_table.c: flow table api.

Diffstat is showing wrong count. This patch mostly restructures code
without changing logic.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-10-01 17:11:16 -07:00
Daniel Borkmann
eba93614ba datapath: flow: fix potential illegal memory access in __parse_flow_nlattrs
In function __parse_flow_nlattrs(), we check for condition
(type > OVS_KEY_ATTR_MAX) and if true, print an error, but we do
not return from this function as in other checks. It seems this
has been forgotten, as otherwise, we could access beyond the
memory of ovs_key_lens, which is of ovs_key_lens[OVS_KEY_ATTR_MAX + 1].
Hence, a maliciously prepared nla_type from user space could access
beyond this upper limit.
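
The missing early return can be illustrated in Python as follows. The table contents and the value of `OVS_KEY_ATTR_MAX` here are made up for the example; only the shape of the check follows the description above.

```python
OVS_KEY_ATTR_MAX = 4  # illustrative value, not the real constant
# Hypothetical length table with OVS_KEY_ATTR_MAX + 1 entries.
KEY_LENS = [0, 4, 2, 12, 8]


def parse_attrs(attrs):
    """Validate (type, payload) pairs; return {type: payload} or None."""
    parsed = {}
    for attr_type, payload in attrs:
        if not 0 <= attr_type <= OVS_KEY_ATTR_MAX:
            # Return right away: falling through would index past KEY_LENS.
            return None
        if len(payload) != KEY_LENS[attr_type]:
            return None
        parsed[attr_type] = payload
    return parsed
```

In C, falling through after the first check would read beyond the end of the array rather than raise an error, which is what makes the early return matter.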

Introduced by 03f0d916a ("openvswitch: Mega flow implementation").

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-09-09 13:27:27 -07:00
Pravin B Shelar
3025a772a1 datapath: Remove skb->mark compat code.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-09-06 09:51:31 -07:00
Jesse Gross
9a6fe4c1ae datapath: Fix alignment of struct sw_flow_key.
sw_flow_key alignment was declared as "__aligned(__alignof__(long))".
However, this breaks on the m68k architecture where long is 32 bits in
size but 16-bit aligned by default. This aligns to the size of a long to
ensure that we can always do comparisons in full long-sized chunks. It
also adds an additional build check to catch any reduction in alignment.

CC: Andy Zhou <azhou@nicira.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-09-05 13:03:51 -07:00
Andy Zhou
c2dd5e999f datapath: optimize flow compare and mask functions
Make sure the sw_flow_key structure and valid mask boundaries are always
machine word aligned. Optimize the flow compare and mask operations
using machine word size operations. This patch improves throughput on
average by 15% when CPU is the bottleneck of forwarding packets.
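
The idea of working one machine word at a time can be sketched in Python as below. This only models the logic: the 8-byte word size, the byte-string keys and the function names are assumptions, and Python gains none of the speed-up that the C code does.

```python
WORD = 8  # assumed machine word size in bytes


def _check_range(length: int, start: int, end: int) -> None:
    if start % WORD or end % WORD or not 0 <= start <= end <= length:
        raise ValueError("range must be word aligned and inside the key")


def mask_key(key: bytes, mask: bytes, start: int, end: int) -> bytes:
    """Return a copy of key with [start, end) ANDed with mask, zero elsewhere."""
    _check_range(min(len(key), len(mask)), start, end)
    out = bytearray(len(key))
    for off in range(start, end, WORD):
        k = int.from_bytes(key[off:off + WORD], "little")
        m = int.from_bytes(mask[off:off + WORD], "little")
        out[off:off + WORD] = (k & m).to_bytes(WORD, "little")
    return bytes(out)


def keys_equal(a: bytes, b: bytes, start: int, end: int) -> bool:
    """Compare [start, end) of two keys one word at a time."""
    _check_range(min(len(a), len(b)), start, end)
    diff = 0
    for off in range(start, end, WORD):
        wa = int.from_bytes(a[off:off + WORD], "little")
        wb = int.from_bytes(b[off:off + WORD], "little")
        diff |= wa ^ wb  # accumulate differences, test once at the end
    return diff == 0
```

Keeping both the key structure and the valid mask range word aligned is what lets every iteration use a full word without special-casing the edges.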

This patch is inspired by ideas and code from a patch submitted by Peter
Klausler titled "replace memcmp() with specialized comparator".
However, the original patch only optimizes for architectures that
support unaligned machine word access. This patch optimizes for all
architectures.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-27 12:53:12 -07:00
Jesse Gross
0154b18b84 datapath: Remove redundant EtherType check in vlan translation.
This was supposed to be part of the previous patch but I forgot
to commit it before pushing.

Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-22 13:30:45 -07:00
Andy Zhou
799fe14776 datapath: More strict vlan encap netlink check
Only parse the encap key field if eth_type is 802.1Q and
VLAN_TAG_PRESENT bit is set. Add a few more error checks and logs.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-22 13:25:25 -07:00
Andy Zhou
cc611f66a6 datapath: Rename key_len to key_end
Key_end is a better name describing the ending boundary than key_len.
Rename those variables to make it less confusing.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-22 11:39:53 -07:00
Joe Stringer
10f72e3da9 datapath: Add SCTP support
This patch adds support for rewriting SCTP src,dst ports similar to the
functionality already available for TCP/UDP.

Rewriting SCTP ports is expensive due to double-recalculation of the
SCTP checksums; this is performed to ensure that packets traversing OVS
with invalid checksums will continue to the destination with any
checksum corruption intact.
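
One way to get this behaviour is sketched below in Python. It is a simplified model, not the datapath code: the checksum is computed over a plain byte string standing in for the SCTP packet, details such as zeroing the checksum field and byte order are ignored, and the function names are hypothetical. SCTP's checksum algorithm is CRC-32C.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), the checksum algorithm SCTP uses."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def checksum_after_rewrite(old_body: bytes, new_body: bytes, old_csum: int) -> int:
    """Return the checksum to store after rewriting old_body to new_body.

    Two CRCs are computed, one before and one after the rewrite. XOR-ing
    both into the received checksum yields the correct new value when the
    old one was correct, and keeps the same error bits when it was not.
    """
    old_correct = crc32c(old_body)
    new_correct = crc32c(new_body)
    return old_csum ^ old_correct ^ new_correct
```

The two full CRC passes per rewritten packet are the cost the message describes as expensive.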

Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Ben Pfaff <blp@nicira.com>
2013-08-22 09:29:39 -07:00
Justin Pettit
7d8777cdf4 datapath: Remove old argument description in flow.c.
Signed-off-by: Justin Pettit <jpettit@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
2013-08-19 17:45:54 -07:00
pritesh
e821558d0a datapath: Fix typos in Netlink debugging messages.
Signed-off-by: pritesh <pritesh.kothari@cisco.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-19 17:42:57 -07:00