
dpif-netdev: Add PMD load based sleeping.

Sleep for an incremental amount of time if none of the Rx queues
assigned to a PMD have at least half a batch of packets (i.e. 16 pkts)
on a polling iteration of the PMD.

Upon detecting the threshold of >= 16 pkts on an Rxq, reset the
sleep time to zero (i.e. no sleep).

Sleep time will be increased on each iteration where the low load
conditions remain, up to the maximum sleep time set by the user, e.g.:
ovs-vsctl set Open_vSwitch . other_config:pmd-maxsleep=500

The default pmd-maxsleep value is 0, which means that no sleeps
will occur and the behaviour is unchanged from previous releases.

Also add new stats to pmd-perf-show to give visibility of the
operation, e.g.:
...
   - sleep iterations:       153994  ( 76.8 % of iterations)
   Sleep time (us):         9159399  ( 59 us/iteration avg.)
...

Reviewed-by: Robin Jarry <rjarry@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Kevin Traynor 2023-01-11 09:35:01 +00:00 committed by Ilya Maximets
parent f4c8841351
commit de3bbdc479
7 changed files with 213 additions and 10 deletions


@@ -324,5 +324,59 @@ A user can use this option to set a minimum frequency of Rx queue to PMD
reassignment due to PMD Auto Load Balance. For example, this could be set
(in min) such that a reassignment is triggered at most every few hours.
PMD load based sleeping (Experimental)
--------------------------------------

PMD threads constantly poll the Rx queues assigned to them. To reduce the CPU
cycles they use, they can sleep for small periods of time when there is no
load or very low load on all the Rx queues they poll.

This can be enabled by setting the max requested sleep time (in microseconds)
for a PMD thread::

    $ ovs-vsctl set open_vswitch . other_config:pmd-maxsleep=500

Non-zero values will be rounded up to the nearest 10 microseconds to avoid
requesting very small sleep times.
With a non-zero max value, a PMD may request to sleep for an incrementing
amount of time, up to the maximum time. If at any point an Rx queue that the
PMD is polling receives at least half a batch of packets (i.e. 16), the
requested sleep time is reset to 0 and no sleeps will occur until the no/low
load conditions return.
Sleeping in a PMD thread means there is a period of time when the PMD thread
will not process packets. Requested sleep times are not guaranteed and can
differ significantly depending on system configuration. The actual time spent
not processing packets is determined by the sleep and processor wake-up times
and should be tested with each system configuration.

Sleep time statistics for the last 10 seconds can be seen with::

    $ ovs-appctl dpif-netdev/pmd-stats-clear \
        && sleep 10 && ovs-appctl dpif-netdev/pmd-perf-show
Example output, showing that during the last 10 seconds, 76.8% of iterations
included a sleep of some length. The total sleep time was roughly 9.16 seconds
and the average sleep per sleeping iteration was 59 microseconds
(9159399 us / 153994 iterations)::

    - sleep iterations:       153994  ( 76.8 % of iterations)
    Sleep time (us):         9159399  ( 59 us/iteration avg.)
Any potential power saving from PMD load based sleeping is dependent on the
system configuration (e.g. enabling processor C-states) and workloads.

.. note::

    If there is a sudden spike of packets while the PMD thread is sleeping and
    the processor is in a low-power state, it may result in some lost packets
    or extra latency before the PMD thread returns to processing packets at
    full rate.

.. note::

    By default the Linux kernel groups timer expirations, and this can add an
    overhead of up to 50 microseconds to a requested timer expiration.

.. _ovs-vswitchd(8):
    http://openvswitch.org/support/dist-docs/ovs-vswitchd.8.html
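
As a reader aid (not part of the patch), the backoff behaviour described above
can be condensed into a small standalone C sketch. The function and variable
names here are hypothetical; only the 16-packet threshold and the 10 us
increment mirror the patch:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NETDEV_MAX_BURST 32
#define PMD_SLEEP_THRESH (NETDEV_MAX_BURST / 2)   /* 16 packets. */
#define PMD_SLEEP_INC_US 10                        /* Sleep increment in us. */

/* Return the sleep time (us) to request on the next iteration, given the
 * largest packet count seen on any polled Rx queue in this iteration. */
static uint64_t
next_sleep_request(uint64_t cur_sleep_us, int rxq_pkts, uint64_t max_sleep_us)
{
    if (!max_sleep_us || rxq_pkts >= PMD_SLEEP_THRESH) {
        /* Feature disabled, or enough load detected: no sleeping. */
        return 0;
    }
    if (cur_sleep_us + PMD_SLEEP_INC_US < max_sleep_us) {
        return cur_sleep_us + PMD_SLEEP_INC_US;   /* Back off a bit more. */
    }
    return max_sleep_us;                          /* Clamp at the maximum. */
}

int
main(void)
{
    int pkts[] = { 0, 0, 0, 0, 20, 0, 0 };   /* Simulated per-iteration load. */
    uint64_t sleep_us = 0;

    for (size_t i = 0; i < sizeof pkts / sizeof pkts[0]; i++) {
        sleep_us = next_sleep_request(sleep_us, pkts[i], 50);
        printf("iteration %zu: rx=%2d -> next sleep request %" PRIu64 " us\n",
               i, pkts[i], sleep_us);
    }
    return 0;
}

In the actual pmd_thread_main() changes below, the reset to zero happens as
soon as any polled Rx queue returns PMD_SLEEP_THRESH or more packets, and each
sleep is timed with a cycle timer so it can be reported in the PMD statistics.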

NEWS

@@ -30,6 +30,9 @@ Post-v3.0.0
- Userspace datapath:
* Add '-secs' argument to appctl 'dpif-netdev/pmd-rxq-show' to show
the pmd usage of an Rx queue over a configurable time period.
* Add new experimental PMD load based sleeping feature. PMD threads can
request to sleep up to a user configured 'pmd-maxsleep' value under
low load conditions.
v3.0.0 - 15 Aug 2022


@@ -230,18 +230,26 @@ pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
uint64_t tot_iter = histogram_samples(&s->pkts);
uint64_t idle_iter = s->pkts.bin[0];
uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
uint64_t sleep_iter = stats[PMD_SLEEP_ITER];
uint64_t tot_sleep_cycles = stats[PMD_CYCLES_SLEEP];
ds_put_format(str,
" Iterations: %12"PRIu64" (%.2f us/it)\n"
" - Used TSC cycles: %12"PRIu64" (%5.1f %% of total cycles)\n"
" - idle iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
" - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n",
tot_iter, tot_cycles * us_per_cycle / tot_iter,
" - busy iterations: %12"PRIu64" (%5.1f %% of used cycles)\n"
" - sleep iterations: %12"PRIu64" (%5.1f %% of iterations)\n"
" Sleep time (us): %12.0f (%3.0f us/iteration avg.)\n",
tot_iter,
(tot_cycles + tot_sleep_cycles) * us_per_cycle / tot_iter,
tot_cycles, 100.0 * (tot_cycles / duration) / tsc_hz,
idle_iter,
100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
busy_iter,
100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles,
sleep_iter, tot_iter ? 100.0 * sleep_iter / tot_iter : 0,
tot_sleep_cycles * us_per_cycle,
sleep_iter ? (tot_sleep_cycles * us_per_cycle) / sleep_iter : 0);
if (rx_packets > 0) {
ds_put_format(str,
" Rx packets: %12"PRIu64" (%.0f Kpps, %.0f cycles/pkt)\n"
@@ -518,14 +526,15 @@ OVS_REQUIRES(s->stats_mutex)
void
pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
int tx_packets, bool full_metrics)
int tx_packets, uint64_t sleep_cycles,
bool full_metrics)
{
uint64_t now_tsc = cycles_counter_update(s);
struct iter_stats *cum_ms;
uint64_t cycles, cycles_per_pkt = 0;
char *reason = NULL;
cycles = now_tsc - s->start_tsc;
cycles = now_tsc - s->start_tsc - sleep_cycles;
s->current.timestamp = s->iteration_cnt;
s->current.cycles = cycles;
s->current.pkts = rx_packets;
@@ -539,6 +548,11 @@ pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
histogram_add_sample(&s->cycles, cycles);
histogram_add_sample(&s->pkts, rx_packets);
if (sleep_cycles) {
pmd_perf_update_counter(s, PMD_SLEEP_ITER, 1);
pmd_perf_update_counter(s, PMD_CYCLES_SLEEP, sleep_cycles);
}
if (!full_metrics) {
return;
}
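
An illustrative calculation (not from the patch) of the accounting above:
pmd_perf_end_iteration() subtracts the slept cycles before an iteration is
binned as idle or busy, while pmd_perf_format_overall_stats() adds them back
only for the printed us/it average. Assuming a 2 GHz TSC:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const double us_per_cycle = 1e6 / 2e9;  /* Assumed 2 GHz TSC. */
    uint64_t iter_cycles = 240000;          /* Whole iteration: 120 us. */
    uint64_t sleep_cycles = 200000;         /* Of which slept: 100 us. */

    /* pmd_perf_end_iteration() records only the non-sleeping part. */
    uint64_t counted = iter_cycles - sleep_cycles;

    /* pmd_perf_format_overall_stats() adds the sleep back for the "us/it"
     * average, so the printed figure reflects wall time. */
    printf("counted %.0f us, wall %.0f us\n",
           counted * us_per_cycle,
           (counted + sleep_cycles) * us_per_cycle);
    return 0;
}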


@@ -80,6 +80,8 @@ enum pmd_stat_type {
PMD_CYCLES_ITER_IDLE, /* Cycles spent in idle iterations. */
PMD_CYCLES_ITER_BUSY, /* Cycles spent in busy iterations. */
PMD_CYCLES_UPCALL, /* Cycles spent processing upcalls. */
PMD_SLEEP_ITER, /* Iterations where a sleep has taken place. */
PMD_CYCLES_SLEEP, /* Total cycles slept to save power. */
PMD_N_STATS
};
@@ -408,7 +410,8 @@ void
pmd_perf_start_iteration(struct pmd_perf_stats *s);
void
pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
int tx_packets, bool full_metrics);
int tx_packets, uint64_t sleep_cycles,
bool full_metrics);
/* Formatting the output of commands. */


@@ -171,6 +171,11 @@ static struct odp_support dp_netdev_support = {
/* Time in microseconds to try RCU quiescing. */
#define PMD_RCU_QUIESCE_INTERVAL 10000LL
/* Number of pkts Rx on an interface that will stop pmd thread sleeping. */
#define PMD_SLEEP_THRESH (NETDEV_MAX_BURST / 2)
/* Time in uS to increment a pmd thread sleep time. */
#define PMD_SLEEP_INC_US 10
struct dpcls {
struct cmap_node node; /* Within dp_netdev_pmd_thread.classifiers */
odp_port_t in_port;
@@ -279,6 +284,8 @@ struct dp_netdev {
atomic_uint32_t emc_insert_min;
/* Enable collection of PMD performance metrics. */
atomic_bool pmd_perf_metrics;
/* Max load based sleep request. */
atomic_uint64_t pmd_max_sleep;
/* Enable the SMC cache from ovsdb config */
atomic_bool smc_enable_db;
@@ -4821,8 +4828,10 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
uint64_t rebalance_intvl;
uint8_t cur_rebalance_load;
uint32_t rebalance_load, rebalance_improve;
uint64_t pmd_max_sleep, cur_pmd_max_sleep;
bool log_autolb = false;
enum sched_assignment_type pmd_rxq_assign_type;
static bool first_set_config = true;
tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
DEFAULT_TX_FLUSH_INTERVAL);
@@ -4969,6 +4978,19 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
bool autolb_state = smap_get_bool(other_config, "pmd-auto-lb", false);
set_pmd_auto_lb(dp, autolb_state, log_autolb);
pmd_max_sleep = smap_get_ullong(other_config, "pmd-maxsleep", 0);
pmd_max_sleep = ROUND_UP(pmd_max_sleep, 10);
pmd_max_sleep = MIN(PMD_RCU_QUIESCE_INTERVAL, pmd_max_sleep);
atomic_read_relaxed(&dp->pmd_max_sleep, &cur_pmd_max_sleep);
if (first_set_config || pmd_max_sleep != cur_pmd_max_sleep) {
atomic_store_relaxed(&dp->pmd_max_sleep, pmd_max_sleep);
VLOG_INFO("PMD max sleep request is %"PRIu64" usecs.", pmd_max_sleep);
VLOG_INFO("PMD load based sleeps are %s.",
pmd_max_sleep ? "enabled" : "disabled" );
}
first_set_config = false;
return 0;
}
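
For reference (not part of the patch), the rounding and capping above behave
like the sketch below, with ROUND_UP() and MIN() written out explicitly; a
request of 1 becomes 10, 491 becomes 500 and 10001 becomes 10000, matching the
tests added further down:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PMD_RCU_QUIESCE_INTERVAL 10000   /* Upper bound in us, as in the patch. */

/* Round up to the nearest 10 us, then cap at the RCU quiesce interval,
 * mirroring the ROUND_UP()/MIN() calls in dpif_netdev_set_config(). */
static uint64_t
clamp_max_sleep(uint64_t requested_us)
{
    uint64_t rounded = (requested_us + 9) / 10 * 10;

    return rounded < PMD_RCU_QUIESCE_INTERVAL
           ? rounded : PMD_RCU_QUIESCE_INTERVAL;
}

int
main(void)
{
    uint64_t samples[] = { 0, 1, 490, 491, 10000, 10001 };

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        printf("pmd-maxsleep=%-5" PRIu64 " -> %" PRIu64 " us\n",
               samples[i], clamp_max_sleep(samples[i]));
    }
    return 0;
}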
@@ -6929,6 +6951,7 @@ pmd_thread_main(void *f_)
int poll_cnt;
int i;
int process_packets = 0;
uint64_t sleep_time = 0;
poll_list = NULL;
@@ -6989,10 +7012,13 @@ reload:
ovs_mutex_lock(&pmd->perf_stats.stats_mutex);
for (;;) {
uint64_t rx_packets = 0, tx_packets = 0;
uint64_t time_slept = 0;
uint64_t max_sleep;
pmd_perf_start_iteration(s);
atomic_read_relaxed(&pmd->dp->smc_enable_db, &pmd->ctx.smc_enable_db);
atomic_read_relaxed(&pmd->dp->pmd_max_sleep, &max_sleep);
for (i = 0; i < poll_cnt; i++) {
@@ -7011,6 +7037,9 @@ reload:
dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
poll_list[i].port_no);
rx_packets += process_packets;
if (process_packets >= PMD_SLEEP_THRESH) {
sleep_time = 0;
}
}
if (!rx_packets) {
@@ -7018,7 +7047,30 @@ reload:
* Check if we need to send something.
* There was no time updates on current iteration. */
pmd_thread_ctx_time_update(pmd);
tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
tx_packets = dp_netdev_pmd_flush_output_packets(pmd,
max_sleep && sleep_time
? true : false);
}
if (max_sleep) {
/* Check if a sleep should happen on this iteration. */
if (sleep_time) {
struct cycle_timer sleep_timer;
cycle_timer_start(&pmd->perf_stats, &sleep_timer);
xnanosleep_no_quiesce(sleep_time * 1000);
time_slept = cycle_timer_stop(&pmd->perf_stats, &sleep_timer);
pmd_thread_ctx_time_update(pmd);
}
if (sleep_time < max_sleep) {
/* Increase sleep time for next iteration. */
sleep_time += PMD_SLEEP_INC_US;
} else {
sleep_time = max_sleep;
}
} else {
/* Reset sleep time as max sleep policy may have been changed. */
sleep_time = 0;
}
/* Do RCU synchronization at fixed interval. This ensures that
@@ -7058,7 +7110,7 @@ reload:
break;
}
pmd_perf_end_iteration(s, rx_packets, tx_packets,
pmd_perf_end_iteration(s, rx_packets, tx_packets, time_slept,
pmd_perf_metrics_enabled(pmd));
}
ovs_mutex_unlock(&pmd->perf_stats.stats_mutex);
@@ -9909,7 +9961,7 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
struct polled_queue *poll_list, int poll_cnt)
{
struct dpcls *cls;
uint64_t tot_idle = 0, tot_proc = 0;
uint64_t tot_idle = 0, tot_proc = 0, tot_sleep = 0;
unsigned int pmd_load = 0;
if (pmd->ctx.now > pmd->next_cycle_store) {
@@ -9926,10 +9978,13 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
tot_sleep = pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP] -
pmd->prev_stats[PMD_CYCLES_SLEEP];
if (pmd_alb->is_enabled && !pmd->isolated) {
if (tot_proc) {
pmd_load = ((tot_proc * 100) / (tot_idle + tot_proc));
pmd_load = ((tot_proc * 100) /
(tot_idle + tot_proc + tot_sleep));
}
atomic_read_relaxed(&pmd_alb->rebalance_load_thresh,
@@ -9946,6 +10001,8 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
pmd->prev_stats[PMD_CYCLES_SLEEP] =
pmd->perf_stats.counters.n[PMD_CYCLES_SLEEP];
/* Get the cycles that were used to process each queue and store. */
for (unsigned i = 0; i < poll_cnt; i++) {
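
An illustrative calculation (not from the patch) of why the slept cycles are
added to the denominator above: without them, a mostly-sleeping PMD would
report a much higher load to the PMD auto load balancer than it really has:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* Assumed cycle counters sampled over one auto-load-balance interval. */
    uint64_t tot_proc  = 30;   /* Busy (packet processing) cycles. */
    uint64_t tot_idle  = 10;   /* Idle polling cycles. */
    uint64_t tot_sleep = 60;   /* Cycles spent sleeping. */

    unsigned old_load = tot_proc * 100 / (tot_idle + tot_proc);
    unsigned new_load = tot_proc * 100 / (tot_idle + tot_proc + tot_sleep);

    /* Prints "load without sleeps: 75%, with sleeps: 30%": counting the
     * sleeps keeps a lightly loaded PMD from looking 75% busy. */
    printf("load without sleeps: %u%%, with sleeps: %u%%\n",
           old_load, new_load);
    return 0;
}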


@@ -1254,3 +1254,49 @@ ovs-appctl: ovs-vswitchd: server returned an error
OVS_VSWITCHD_STOP
AT_CLEANUP
dnl Check default state
AT_SETUP([PMD - pmd sleep])
OVS_VSWITCHD_START
dnl Check default
OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
OVS_WAIT_UNTIL([tail ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])
dnl Check low value max sleep
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="1"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
dnl Check high value max sleep
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10000"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
dnl Check setting max sleep to zero
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="0"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 0 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are disabled."])
dnl Check above high value max sleep
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="10001"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 10000 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
dnl Check rounding
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="490"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 490 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
dnl Check rounding
get_log_next_line_num
AT_CHECK([ovs-vsctl set open_vswitch . other_config:pmd-maxsleep="491"])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD max sleep request is 500 usecs."])
OVS_WAIT_UNTIL([tail -n +$LINENUM ovs-vswitchd.log | grep "PMD load based sleeps are enabled."])
OVS_VSWITCHD_STOP
AT_CLEANUP


@@ -788,6 +788,32 @@
The default value is <code>25%</code>.
</p>
</column>
<column name="other_config" key="pmd-maxsleep"
type='{"type": "integer",
"minInteger": 0, "maxInteger": 10000}'>
<p>
Specifies the maximum sleep time that will be requested in
microseconds per iteration for a PMD thread which has received zero
or a small number of packets from the Rx queues it is polling.
</p>
<p>
The actual sleep time requested is based on the load
of the Rx queues that the PMD polls and may be less than
the maximum value.
</p>
<p>
The default value is <code>0 microseconds</code>, which means
that the PMD will not sleep regardless of the load from the
Rx queues that it polls.
</p>
<p>
To avoid requesting very small sleeps (e.g. less than 10 us) the
value will be rounded up to the nearest 10 us.
</p>
<p>
The maximum value is <code>10000 microseconds</code>.
</p>
</column>
<column name="other_config" key="userspace-tso-enable"
type='{"type": "boolean"}'>
<p>