mirror of
https://github.com/openvswitch/ovs
synced 2025-08-22 01:51:26 +00:00
dpif-netdev: Add PMD load based sleeping.
Sleep for an incremental amount of time if none of the Rx queues assigned to
a PMD have at least half a batch of packets (i.e. 16 pkts) on a polling
iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the sleep time
to zero (i.e. no sleep).

Sleep time will be increased on each iteration where the low load conditions
remain, up to a total of the max sleep time which is set by the user, e.g.:

    ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

The default pmd-maxsleep value is 0, which means that no sleeps will occur
and the default behaviour is unchanged from previously.

Also add new stats to pmd-perf-show to get visibility of operation, e.g.:

    ...
    - sleep iterations:       153994  ( 76.8 % of iterations)
      Sleep time (us):       9159399  ( 59 us/iteration avg.)
    ...

Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
parent f4c8841351
commit de3bbdc479
@@ -324,5 +324,59 @@ A user can use this option to set a minimum frequency of Rx queue to PMD
 reassignment due to PMD Auto Load Balance. For example, this could be set
 (in min) such that a reassignment is triggered at most every few hours.
 
+PMD load based sleeping (Experimental)
+--------------------------------------
+
+PMD threads constantly poll Rx queues which are assigned to them. In order to
+reduce the CPU cycles they use, they can sleep for small periods of time
+when there is no load or very-low load on all the Rx queues they poll.
+
+This can be enabled by setting the max requested sleep time (in microseconds)
+for a PMD thread::
+
+    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500
+
+Non-zero values will be rounded up to the nearest 10 microseconds to avoid
+requesting very small sleep times.
+
+With a non-zero max value a PMD may request to sleep by an incrementing amount
+of time up to the maximum time. If at any point an Rx queue that the PMD is
+polling returns at least half a batch of packets (i.e. 16), the requested
+sleep time will be reset to 0. At that point no sleeps will occur until the
+no/low load conditions return.
+
+Sleeping in a PMD thread will mean there is a period of time when the PMD
+thread will not process packets. Sleep times requested are not guaranteed
+and can differ significantly depending on system configuration. The actual
+time not processing packets will be determined by the sleep and processor
+wake-up times and should be tested with each system configuration.
+
+Sleep time statistics for 10 secs can be seen with::
+
+    $ ovs-appctl dpif-netdev/pmd-stats-clear \
+        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
+
+Example output, showing that during the last 10 seconds, 76.8% of iterations
+had a sleep of some length. The total amount of sleep time was 9.15 seconds and
+the average sleep time per iteration (across all iterations) was 46
+microseconds::
+
+   - sleep iterations:       153994  ( 76.8 % of iterations)
+     Sleep time (us):       9159399  ( 59 us/iteration avg.)
+
+Any potential power saving from PMD load based sleeping is dependent on the
+system configuration (e.g. enabling processor C-states) and workloads.
+
+.. note::
+
+    If there is a sudden spike of packets while the PMD thread is sleeping and
+    the processor is in a low-power state it may result in some lost packets or
+    extra latency before the PMD thread returns to processing packets at full
+    rate.
+
+.. note::
+
+    By default the Linux kernel groups timer expirations and this can add an
+    overhead of up to 50 microseconds to a requested timer expiration.
+
 .. _ovs-vswitchd(8):
     http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
NEWS
@@ -30,6 +30,9 @@ Post-v3.0.0
    - Userspace datapath:
      * Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
        the pmd usage of an Rx queue over a configurable time period.
+     * Add new experimental PMD load based sleeping feature. PMD threads can
+       request to sleep up to a user configured 'pmd-maxsleep' value under
+       low load conditions.
 
 
 v3.0.0 - 15 Aug 2022
@@ -230,18 +230,26 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
     uint64_t tot_iter = histogram_samples(&s->pkts);
     uint64_t idle_iter = s->pkts.bin[0];
     uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
+    uint64_t sleep_iter = stats[PMD_SLEEP_ITER];
+    uint64_t tot_sleep_cycles = stats[PMD_CYCLES_SLEEP];
 
     ds_put_format(str,
             "  Iterations:        %12"PRIu64"  (%.2f us/it)\n"
             "  - Used TSC cycles: %12"PRIu64"  (%5.1f %% of total cycles)\n"
             "  - idle iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n"
-            "  - busy iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n",
-            tot_iter, tot_cycles * us_per_cycle / tot_iter,
+            "  - busy iterations: %12"PRIu64"  (%5.1f %% of used cycles)\n"
+            "  - sleep iterations: %12"PRIu64"  (%5.1f %% of iterations)\n"
+            "  Sleep time (us):    %12.0f  (%3.0f us/iteration avg.)\n",
+            tot_iter,
+            (tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
             tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
             idle_iter,
             100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
             busy_iter,
-            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
+            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
+            sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
+            tot_sleep_cycles * us_per_cycle,
+            sleep_iter ? (tot_sleep_cycles * us_per_cycle) / sleep_iter : 0);
     if (rx_packets > 0) {
         ds_put_format(str,
             "  Rx packets:        %12"PRIu64"  (%.0f Kpps, %.0f cycles/pkt)\n"
@@ -518,14 +526,15 @@ OVS_REQUIRES(s->stats_mutex)
 
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics)
+                       int tx_packets, uint64_t sleep_cycles,
+                       bool full_metrics)
 {
     uint64_t now_tsc = cycles_counter_update(s);
     struct iter_stats *cum_ms;
     uint64_t cycles, cycles_per_pkt = 0;
     char *reason = NULL;
 
-    cycles = now_tsc - s->start_tsc;
+    cycles = now_tsc - s->start_tsc - sleep_cycles;
     s->current.timestamp = s->iteration_cnt;
     s->current.cycles = cycles;
     s->current.pkts = rx_packets;
@@ -539,6 +548,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
     histogram_add_sample(&s->cycles, cycles);
     histogram_add_sample(&s->pkts, rx_packets);
 
+    if (sleep_cycles) {
+        pmd_perf_update_counter(s, PMD_SLEEP_ITER, 1);
+        pmd_perf_update_counter(s, PMD_CYCLES_SLEEP, sleep_cycles);
+    }
+
     if (!full_metrics) {
         return;
     }
@@ -80,6 +80,8 @@ enum pmd_stat_type {
     PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
     PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
     PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
+    PMD_SLEEP_ITER,         /* Iterations where a sleep has taken place. */
+    PMD_CYCLES_SLEEP,       /* Total cycles slept to save power. */
     PMD_N_STATS
 };
 
@@ -408,7 +410,8 @@ void
 pmd_perf_start_iteration(struct pmd_perf_stats *s);
 void
 pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
-                       int tx_packets, bool full_metrics);
+                       int tx_packets, uint64_t sleep_cycles,
+                       bool full_metrics);
 
 /* Formatting the output of commands. */
@@ -171,6 +171,11 @@ static struct odp_support dp_netdev_support = {
 /* Time in microseconds to try RCU quiescing. */
 #define PMD_RCU_QUIESCE_INTERVAL 10000LL
 
+/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
+#define PMD_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
+/* Time in uS to increment a pmd thread sleep time. */
+#define PMD_SLEEP_INC_US 10
+
 struct dpcls {
     struct cmap_node node;      /* Within dp_netdev_pmd_thread.classifiers */
     odp_port_t in_port;
@@ -279,6 +284,8 @@ struct dp_netdev {
     atomic_uint32_t emc_insert_min;
     /* Enable collection of PMD performance metrics. */
     atomic_bool pmd_perf_metrics;
+    /* Max load based sleep request. */
+    atomic_uint64_t pmd_max_sleep;
     /* Enable the SMC cache from ovsdb config */
     atomic_bool smc_enable_db;
@@ -4821,8 +4828,10 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     uint64_t rebalance_intvl;
     uint8_t cur_rebalance_load;
     uint32_t rebalance_load, rebalance_improve;
+    uint64_t pmd_max_sleep, cur_pmd_max_sleep;
     bool log_autolb = false;
     enum sched_assignment_type pmd_rxq_assign_type;
+    static bool first_set_config = true;
 
     tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
                                      DEFAULT_TX_FLUSH_INTERVAL);
@@ -4969,6 +4978,19 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     bool autolb_state = smap_get_bool(other_config, "pmd-auto-lb", false);
 
     set_pmd_auto_lb(dp, autolb_state, log_autolb);
+
+    pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0);
+    pmd_max_sleep = ROUND_UP(pmd_max_sleep, 10);
+    pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
+    atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
+    if (first_set_config || pmd_max_sleep != cur_pmd_max_sleep) {
+        atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep);
+        VLOG_INFO("PMD max sleep request is %"PRIu64" usecs.", pmd_max_sleep);
+        VLOG_INFO("PMD load based sleeps are %s.",
+                  pmd_max_sleep ? "enabled" : "disabled");
+    }
+
+    first_set_config = false;
     return 0;
 }
@@ -6929,6 +6951,7 @@ pmd_thread_main(void *f_)
     int poll_cnt;
     int i;
     int process_packets = 0;
+    uint64_t sleep_time = 0;
 
     poll_list = NULL;
 
@@ -6989,10 +7012,13 @@ reload:
     ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
     for (;;) {
         uint64_t rx_packets = 0, tx_packets = 0;
+        uint64_t time_slept = 0;
+        uint64_t max_sleep;
 
         pmd_perf_start_iteration(s);
 
         atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
+        atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep);
 
         for (i = 0; i < poll_cnt; i++) {
@@ -7011,6 +7037,9 @@ reload:
             dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
                                        poll_list[i].port_no);
             rx_packets += process_packets;
+            if (process_packets >= PMD_SLEEP_THRESH) {
+                sleep_time = 0;
+            }
         }
 
         if (!rx_packets) {
@@ -7018,7 +7047,30 @@ reload:
              * Check if we need to send something.
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(pmd);
-            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
+            tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
+                                                            max_sleep && sleep_time
+                                                            ? true : false);
         }
 
+        if (max_sleep) {
+            /* Check if a sleep should happen on this iteration. */
+            if (sleep_time) {
+                struct cycle_timer sleep_timer;
+
+                cycle_timer_start(&pmd->perf_stats, &sleep_timer);
+                xnanosleep_no_quiesce(sleep_time * 1000);
+                time_slept = cycle_timer_stop(&pmd->perf_stats, &sleep_timer);
+                pmd_thread_ctx_time_update(pmd);
+            }
+            if (sleep_time < max_sleep) {
+                /* Increase sleep time for next iteration. */
+                sleep_time += PMD_SLEEP_INC_US;
+            } else {
+                sleep_time = max_sleep;
+            }
+        } else {
+            /* Reset sleep time as max sleep policy may have been changed. */
+            sleep_time = 0;
+        }
+
         /* Do RCU synchronization at fixed interval. This ensures that
@@ -7058,7 +7110,7 @@ reload:
             break;
         }
 
-        pmd_perf_end_iteration(s, rx_packets, tx_packets,
+        pmd_perf_end_iteration(s, rx_packets, tx_packets, time_slept,
                                pmd_perf_metrics_enabled(pmd));
     }
     ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
@@ -9909,7 +9961,7 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
                            struct polled_queue *poll_list, int poll_cnt)
 {
     struct dpcls *cls;
-    uint64_t tot_idle = 0, tot_proc = 0;
+    uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0;
     unsigned int pmd_load = 0;
 
     if (pmd->ctx.now > pmd->next_cycle_store) {
@@ -9926,10 +9978,13 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
                    pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
         tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
                    pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
+        tot_sleep = pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP] -
+                    pmd->prev_stats[PMD_CYCLES_SLEEP];
 
         if (pmd_alb->is_enabled && !pmd->isolated) {
             if (tot_proc) {
-                pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
+                pmd_load = ((tot_proc * 100) /
+                            (tot_idle + tot_proc + tot_sleep));
             }
 
             atomic_read_relaxed(&pmd_alb->rebalance_load_thresh,
@@ -9946,6 +10001,8 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
             pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
         pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
             pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
+        pmd->prev_stats[PMD_CYCLES_SLEEP] =
+            pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP];
 
         /* Get the cycles that were used to process each queue and store. */
         for (unsigned i = 0; i < poll_cnt; i++) {
tests/pmd.at
@@ -1254,3 +1254,49 @@ ovs-appctl: ovs-vswitchd: server returned an error
 
 OVS_VSWITCHD_STOP
 AT_CLEANUP
+
+dnl Check default state
+AT_SETUP([PMD - pmd sleep])
+OVS_VSWITCHD_START
+
+dnl Check default
+OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
+OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])
+
+dnl Check low value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+dnl Check high value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10000"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+dnl Check setting max sleep to zero
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="0"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])
+
+dnl Check above high value max sleep
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10001"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+dnl Check rounding
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="490"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 490 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+dnl Check rounding
+get_log_next_line_num
+AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="491"])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 500 usecs."])
+OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
+
+OVS_VSWITCHD_STOP
+AT_CLEANUP
@@ -788,6 +788,32 @@
           The default value is <code>25%</code>.
         </p>
       </column>
+      <column name="other_config" key="pmd-maxsleep"
+              type='{"type": "integer",
+                     "minInteger": 0, "maxInteger": 10000}'>
+        <p>
+          Specifies the maximum sleep time that will be requested in
+          microseconds per iteration for a PMD thread which has received zero
+          or a small amount of packets from the Rx queues it is polling.
+        </p>
+        <p>
+          The actual sleep time requested is based on the load
+          of the Rx queues that the PMD polls and may be less than
+          the maximum value.
+        </p>
+        <p>
+          The default value is <code>0 microseconds</code>, which means
+          that the PMD will not sleep regardless of the load from the
+          Rx queues that it polls.
+        </p>
+        <p>
+          To avoid requesting very small sleeps (e.g. less than 10 us) the
+          value will be rounded up to the nearest 10 us.
+        </p>
+        <p>
+          The maximum value is <code>10000 microseconds</code>.
+        </p>
+      </column>
       <column name="other_config" key="userspace-tso-enable"
               type='{"type": "boolean"}'>
         <p>