This header only defines sockaddr_pkt, which this source file doesn't use.
This was the only user of net/if_packet.h, so also remove the
configure-time test for it (which netdev-linux wasn't using anyway).
Reported-by: Andre McCurdy <armccurdy@gmail.com>
Reported-at: https://github.com/openvswitch/ovs/pull/253
Signed-off-by: Ben Pfaff <blp@ovn.org>
ifi_flags in struct netdev_linux is an unsigned int, therefore use
unsigned int for variables which will hold ifi_flags values.
Signed-off-by: Andre McCurdy <armccurdy@gmail.com>
The macros are hard to read. This makes it a little more readable.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
If the kernel reported a value of 0 for the second value in
/proc/net/psched, it would cause a division-by-zero fault in
read_psched(). I don't know of a kernel that would actually do that, but
it's still better to be safe.
Found by clang static analyzer.
Reported-by: Bhargava Shastry <bshastry@sect.tu-berlin.de>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
A bissect shows that commit d22f892 ("netdev-linux: monitor and offload
LAG slaves to TC") introduced netdev_linux_update_lag(), which is now
triggering a crash in the "datapath - ping over bond" test in
system-userspace-testsuite:
(gdb) bt
#0 0x00000000009762e7 in netdev_linux_update_lag (change=0x7ffdff013750) at lib/netdev-linux.c:728
728 if (is_netdev_linux_class(master_netdev->netdev_class)) {
This fixes the crash by simply returning in case netdev_from_name()
returns NULL, as this should indicate the master is not attached to the
bridge.
Additionally, netdev_linux_update_lag() isn't "clearing" the netdev
reference it gets from netdev_from_name(), meaning its ref_cnt is
incremented but never decremented. Thus, also call netdev_close() before
returning.
CC: John Hurley <john.hurley@netronome.com>
Fixes: d22f8927 ("netdev-linux: monitor and offload LAG slaves to TC")
Signed-off-by: Tiago Lam <tiago.lam@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
A LAG slave cannot be added directly to an OvS bridge, nor can a OvS
bridge port be added to a LAG dev. However, LAG masters can be added to
OvS.
Use TC blocks to indirectly offload slaves when their master is attached
as a linux-netdev to an OvS bridge. In the kernel TC datapath, blocks link
together netdevs in a similar way to LAG devices. For example, if a filter
is added to a block then it is added to all block devices, or if stats are
incremented on 1 device then the stats on the entire block are incremented.
This mimics LAG devices in that if a rule is applied to the LAG master
then it should be applied to all slaves etc.
Monitor LAG slaves via the netlink socket in netdev-linux and, if their
master is attached to the OvS bridge and has a block id, add the slave's
qdisc to the same block. Similarly, if a slave is freed from a master,
remove the qdisc from the masters block.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Assign block ids to LAG masters that are added to OvS as linux-netdevs and
offloaded via offload API calls. Only LAG masters are assigned to blocks.
To ensure uniqueness, the block ids are determined by the netdev ifindex.
Implement a get_block_id op for linux netdevs to achieve this.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
If a linux netdev is added to OvS that is a LAG master (for example, a
bond or team netdev) then record this in bool form in the dev struct. Use
the link info extracted from rtnetlink calls to determine this.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Add a new class op for netdevs to get the block_id if one exists. The
block_id is used in offload ops to group multiple qdiscs together.
Stub calls are made to the new class op (implementation to follow in
further patches). The default block_id of 0 (no block) will be used in
these cases.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Blocks, in tc classifiers, allow the grouping of multiple qdiscs with an
associated block id. Whenever a filter is added to/removed from this
block, the filter is added to/removed from all associated qdiscs.
Extend TC offload functions to take a block id as a parameter. If the id
is zero then the dqisc is not considered part of a block.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Assume the device is present if it can be opened.
Reported-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
If the 'openvswitch' kernel module is not loaded, the API is not
available and the userspace will keep retrying. This approach is
not ideal for the netdev datapath type.
This patch disables network netns support if the error code returned
indicates that the API is not available.
Reported-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
Tap device is not added to the kernel datapath, so there is
no way to get netns information.
Reported-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Tested-by: Eelco Chaudron <echaudro@redhat.com>
If the caller provides a non-NULL qfill pointer and the netdev
implemementation supports reading the rx queue fill level, the rxq_recv()
function returns the remaining number of packets in the rx queue after
reception of the packet burst to the caller. If the implementation does
not support this, it returns -ENOTSUP instead. Reading the remaining queue
fill level should not substantilly slow down the recv() operation.
A first implementation is provided for ethernet and vhostuser DPDK ports
in netdev-dpdk.c.
This output parameter will be used in the upcoming commit for PMD
performance metrics to supervise the rx queue fill level for DPDK
vhostuser ports.
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
When the netdev is in another namespace and the operation doesn't
support network namespaces, return the correct error.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Internal ports may be moved to another network namespace
and when that happens, the vswitch stops receiving netlink
notifications.
This patch enables the vswitch to listen to all network
namespaces that have a nsid assigned into the network
namespace where the socket has been opened.
It requires kernel 4.2 or newer.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
The ioctl interface doesn't support network namespaces, so
try updating the netdev using netlink message instead.
To provide backwards compatibility, fall back to the previous
method if netlink isn't supported or fails.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Recent kernels provide the network namespace ID of a port,
so use that to discover where the port currently is.
A network device in another network namespace could have the
same name, so once the socket starts listening to other network
namespaces, it is necessary to confirm the netnsid.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
The netlink notification's ancillary data contains the network
namespace id (netnsid) needed to identify the device correctly.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
When mac addr of ports on bridge has been changed, for example,
$ ip link set dev eth0 address 00:11:22:33:44:55
we should reconfigure the datapath id and mac addr of local port.
But now openvswitch dont do that as expected.
A simple example of how to reproduce it:
$ ovs-vsctl add-br br0
$ ifconfig br0 # for example, mac is c6:c6:d7:46:b4:4b
$ ip link set dev br0 address 00:11:22:33:44:55
$ ifconfig br0 # mac of br0 will be 00:11:22:33:44:55
then repeat:
$ ip link set dev br0 address 00:11:22:33:44:55
$ ifconfig br0 # mac of br0 will be c6:c6:d7:46:b4:4b
This patch reports the mac changed event when ports changed, then
openvswitch will reconfigure the datapath id and mac addr of local
port.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Today OVS pushes packets to the TAP interface ignoring its
current state. That works because the kernel will return -EIO
when it's not UP and OVS will just ignore that as it is not
an OVS issue.
However, it causes a huge impact when broadcasts happen when
using userspace datapath accelerated with DPDK (e.g.: action
NORMAL). This patch improves the situation by checking the
TAP's interface state before issueing any syscall.
However, there might be use-cases moving interfaces to other
networking namespaces and in that case, OVS can't retrieve
the iface state (sets it to DOWN). That would stop the traffic
breaking the use-case. This patch relies on netlink notifications
to find out if the device is local or not. When it's local, the
device state is checked otherwise it will behave as before.
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
- New get_custom_stats interface function is added to netdev. It
allows particular netdev implementation to expose custom
counters in dictionary format (counter name/counter value).
- New statistics are retrieved using experimenter code and
are printed as a result to ofctl dump-ports.
- New counters are available for OpenFlow 1.4+.
- New statistics are printed to output via ofctl only if those
are present in reply message.
- New statistics definition is added to include/openflow/intel-ext.h.
- Custom statistics are implemented only for dpdk-physical
port type.
- DPDK-physical implementation uses xstats to collect statistics.
Only dropped and error counters are exposed.
Co-authored-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Michal Weglicki <michalx.weglicki@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
FreeBSD insists that <sys/types.h> be included before <netinet/in.h> and
that <netinet/in.h> be included before <arpa/inet.h>. This adds guards to
the "sparse" headers to yield a warning if this order is violated. This
commit also adjusts the order of many #includes to suit this requirement.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Not needed anymore because 'may_steal' already handled on
dpif-netdev layer and always true.
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com
Use DP_PACKET_BATCH_FOR_EACH macro and dp_packet_batch_size() API
in netdev_linux_sock_batch_send().
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Use DP_PACKET_BATCH_FOR_EACH macro in netdev_linux_tap_batch_send().
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Poll-loop is the core to implement main loop. It should be available in
libopenvswitch.
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
When max-rate is less than 8bit, the hc->max_rate will be set
as htb->max_rate mistakenly instead of mtu of netdev.
Fixes: 13c1637 ("smap: New function smap_get_ullong().")
Signed-off-by: Kaige Fu <fukaige@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
These are normal and unavoidable, because the vifs
disappear from the kernel before they are removed them from the OVS
database.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Sendmmsg can reduce cpu cycles in sending packets to kernel.
Replace sendmsg with sendmmsg in function netdev_linux_send to send
batch packets if sendmmsg is available.
If kernel side doesn't support sendmmsg, will fallback to sendmsg.
netserver
|------------|
| |
| container |
|----veth----|
|
| |------------|
|---veth-| dpdk-ovs | netperf
| | |--------------|
|----dpdk----| | bare-metal |
| |--------------|
| |
| |
pnic-----------pnic
Netperf was consumed to test the performance:
1)cmd:netperf -H remote-container -t UDP_STREAM -l 60 -- -m 1400
result: netserver received 2383.21Mb(sendmsg)/2551.64Mb(sendmmsg)
2)cmd:netperf -H remote-container -t UDP_STREAM -l 60 -- -m 60
result: netserver received 109.72Mb(sendmsg)/115.18Mb(sendmmsg)
Sendmmsg show about 6% improvement in netperf UDP testing.
Signed-off-by: Zhenyu Gao <sysugaozhenyu@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Shadowing is when a variable with a given name in an inner scope hides a
different variable with the same name in a surrounding scope. This is
generally undesirable because it can confuse programmers. This commit
eliminates most of it.
Found with -Wshadow=local in GCC 7. The repo is not really ready to enable
this option by default because of a few cases that are harder to fix, and
harmless, such as nested use of CMAP_FOR_EACH.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Andy Zhou <azhou@ovn.org>
In netdev_gre_build_header(), GRE protocol and VXLAN next_potocol is set based
on packet_type of flow. If it's about an Ethernet packet, it is set to
ETP_TYPE_TEB. Otherwise, if the name space is OFPHTN_ETHERNET, it is set
according to the name space type.
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Fix minor style variations and unnecessary includes.
Signed-off-by: Joe Stringer <joe@ovn.org>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Greg Rose <gvrose8192@gmail.com>
netdev vports are backed by actualy netdev at the kernel
level, so they can use the common netdev-tc offloads interface
for flow offloading (if enabled).
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Notify as not supported. Otherwise the ingress qdisc is being removed and
offload rules will be removed.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Add a new API interface for offloading dpif flows to netdev.
The API consist on the following:
flow_put - offload a new flow
flow_get - query an offloaded flow
flow_del - delete an offloaded flow
flow_flush - flush all offloaded flows
flow_dump_* - dump all offloaded flows
In upcoming commits we will introduce an implementation of this
API for netdev-linux.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Add tc module to expose tc operations to be used by other modules.
Move some tc related functions from netdev-linux.c to tc.c
This patch doesn't change any functionality.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Co-authored-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Roi Dayan <roid@mellanox.com>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Refactor tc_make_request and tc_add_del_ingress_qdisc to accept
ifindex instead of netdev struct.
We later want to move those outside netdev-linux module to be
used by other modules.
This patch doesn't change any functionality.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Co-authored-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
It is important to maintain the original state when
the device already exists in the system.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
When using data path type "netdev", bridge port is a tun device
and when OVS restarts, that device and its network configuration
is lost.
This patch enables the tap device to persist instead.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
We don't need to initialize sock,msg and sll before calling sendmsg each
time. Just initialize them before the loop, it can reduce cpu cycles.
Signed-off-by: Zhenyu Gao <sysugaozhenyu@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
This commit adds a packet_type attribute to the structs dp_packet and flow
to explicitly carry the type of the packet as prepration for the
introduction of the so-called packet type-aware pipeline (PTAP) in OVS.
The packet_type is a big-endian 32 bit integer with the encoding as
specified in OpenFlow verion 1.5.
The upper 16 bits contain the packet type name space. Pre-defined values
are defined in openflow-common.h:
enum ofp_header_type_namespaces {
OFPHTN_ONF = 0, /* ONF namespace. */
OFPHTN_ETHERTYPE = 1, /* ns_type is an Ethertype. */
OFPHTN_IP_PROTO = 2, /* ns_type is a IP protocol number. */
OFPHTN_UDP_TCP_PORT = 3, /* ns_type is a TCP or UDP port. */
OFPHTN_IPV4_OPTION = 4, /* ns_type is an IPv4 option number. */
};
The lower 16 bits specify the actual type in the context of the name space.
Only name spaces 0 and 1 will be supported for now.
For name space OFPHTN_ONF the relevant packet type is 0 (Ethernet).
This is the default packet_type in OVS and the only one supported so far.
Packets of type (OFPHTN_ONF, 0) are called Ethernet packets.
In name space OFPHTN_ETHERTYPE the type is the Ethertype of the packet.
A packet of type (OFPHTN_ETHERTYPE, <Ethertype>) is a standard L2 packet
whith the Ethernet header (and any VLAN tags) removed to expose the L3
(or L2.5) payload of the packet. These will simply be called L3 packets.
The Ethernet address fields dl_src and dl_dst in struct flow are not
applicable for an L3 packet and must be zero. However, to maintain
compatibility with the large code base, we have chosen to copy the
Ethertype of an L3 packet into the the dl_type field of struct flow.
This does not mean that it will be possible to match on dl_type for L3
packets with PTAP later on. Matching must be done on packet_type instead.
New dp_packets are initialized with packet_type Ethernet. Ports that
receive L3 packets will have to explicitly adjust the packet_type.
Signed-off-by: Jean Tourrilhes <jt@labs.hpe.com>
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Co-authored-by: Zoltan Balogh <zoltan.balogh@ericsson.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Under Linux, when users create bridge named "default" or "all", although
ovs-vsctl fails but vswitchd in the background will keep retrying it,
causing the systemd-udev to reach 100% cpu utilization. The patch prevents
any attempt to create or open a netdev named "default" or "all" because
these two names are reserved on Linux due to
/proc/sys/net/ipv4/conf/ always contains directories by these names.
The reason for high CPU utilization is due to frequent calls into kernel's
register_netdevice function, which will invoke several kernel elements who
has registered on the netdevice notifier chain. And due to creation failed,
OVS wakes up and re-recreate the device, which ends up as a high CPU loop.
VMWare-BZ: #1842388
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Greg Rose <gvrose8192@gmail.com>
One common use case of 'struct dp_packet_batch' is to process all
packets in the batch in order. Add an iterator for this use case
to simplify the logic of calling sites,
Another common use case is to drop packets in the batch, by reading
all packets, but writing back pointers of fewer packets. Add macros
to support this use case.
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Jarno Rajahalme <jarno@ovn.org>