mir/bind

mirror of https://gitlab.isc.org/isc-projects/bind9 synced 2025-08-28 13:08:06 +00:00

Author	SHA1	Message	Date
Kevin Chen	0cdf85d204	Several serve-stale improvements Commit a83c8cb0afd88d54b9cf67239f2495c9b0391e97 updated masterdump so that stale records in "rndc dumpdb" output no longer show 0 TTLs. In this commit we change the name of the `rdataset->stale_ttl` field to `rdataset->expired` to make its purpose clearer, and set it to zero in cases where it's unused. Add 'rbtdb->serve_stale_ttl' to various checks so that stale records are not purged from the cache when they've been stale for RBTDB_VIRTUAL (300) seconds. Increment 'ns_statscounter_usedstale' when a stale answer is used. Note: There was a question of whether 'overmem_purge' should be purging ancient records, instead of stale ones. It is left as purging stale records, since stale records could take up the majority of the cache. This submission is copyrighted Akamai Technologies, Inc. and provided under an MPL 2.0 license. This commit was originally authored by Kevin Chen, and was updated by Matthijs Mekking to match recent serve-stale developments.	2021-05-30 11:45:35 -07:00
Matthijs Mekking	c0dc5937c7	Reset DNS_FETCHOPT_TRYSTALE_ONTIMEOUT on resume Once we resume a query, we should clear DNS_FETCHOPT_TRYSTALE_ONTIMEOUT from the options to prevent triggering the stale-answer-client-timeout on subsequent fetches. If we don't, this may cause a crash, for example when prefetch is triggered after a query restart.	2021-05-30 00:03:51 -07:00
Evan Hunt	8bd8e995f1	clean up query correctly if already answered by serve-stale when a serve-stale answer has been sent, the client continues waiting for a proper answer. if a final completion event for the client does arrive, it can just be cleaned up without sending a response, similar to a canceled fetch.	2021-05-27 10:35:48 -07:00
Mark Andrews	715a2c7fc1	Add missing initialisations configuring with --enable-mutex-atomics flagged these incorrectly initialised variables on systems where pthread_mutex_init doesn't just zero out the structure.	2021-05-26 08:15:08 +00:00
Ondřej Surý	2db5290579	Fix the sizeof() for array holding the pointers to clientmgr The size of the array holding the pointers to clientmgr was created so big it could hold the actual clientmgr objects, not just the pointer. This commit fixes the size to be just the ncpus * sizeof(pointer).	2021-05-26 10:03:52 +02:00
Ondřej Surý	50270de8a0	Refactor the interface handling in the netmgr The isc_nmiface_t type was holding just a single isc_sockaddr_t, so we got rid of the datatype and use plain isc_sockaddr_t in place where isc_nmiface_t was used before. This means less type-casting and shorter path to access isc_sockaddr_t members. At the same time, instead of keeping the reference to the isc_sockaddr_t that was passed to us when we start listening, we will keep a local copy. This prevents the data race on destruction of the ns_interface_t objects where pending nmsockets could reference the sockaddr of already destroyed ns_interface_t object.	2021-05-26 09:43:12 +02:00
Ondřej Surý	28b65d8256	Reduce the number of clientmgr objects created Previously, as a way of reducing the contention between threads, a clientmgr object would be created for each interface/IP address. With tasks being more strictly bound to netmgr workers, this is no longer needed and we can just create one clientmgr object per worker queue (ncpus). Each clientmgr object then would have a single task and single memory context.	2021-05-24 20:44:54 +02:00
Ondřej Surý	0be7ea78be	Reduce the number of client tasks and bind them to netmgr queues Since a client object is bound to a netmgr handle, each client will always be processed by the same netmgr worker, so we can simplify the code by binding client->task to the same thread as the client. Since ns__client_request() now runs in the same event loop as client->task events, it is no longer necessary to pause the task manager before launching them. Also removed some functions in isc_task that were not used.	2021-05-24 20:02:20 +02:00
Ondřej Surý	c07f8c5a43	Reduce the number of tasks in the clientmgr We now use one task per CPU per dispatchmgr (that's still a lot).	2021-05-24 20:02:20 +02:00
Ondřej Surý	0719f032e1	Reduce the number of mctx created in clientmgr The number of memory contexts created in the clientmgr was enormous. It could easily create thousands of memory contexts because the formula was: nprotocols * ncpus * ninterfaces * CLIENT_NMCTXS_PERCPU (8) The original goal was to reduce the contention when allocating the memory, but after a while nobody noticed that the amount of memory context allocated would not reduce contention at all. This commit removes the whole mctxpool and just uses the mctx from clientmgr as the contention will be reduced directly in the allocator.	2021-05-24 20:02:20 +02:00
Evan Hunt	b0aadaac8e	rename dns_name_copynf() to dns_name_copy() dns_name_copy() is now the standard name-copying function.	2021-05-22 00:37:27 -07:00
Ondřej Surý	ce3e1abc1d	Use dns_name_copynf() with dns_message_gettempname() when needed dns_message_gettempname() returns an initialized name with a dedicated buffer, associated with a dns_fixedname object. Using dns_name_copynf() to write a name into this object will actually copy the name data from a source name. dns_name_clone() merely points target->ndata to source->ndata, so it is faster, but it can lead to a use-after-free if the source is freed before the target object is released via dns_message_puttempname(). In a few places, clone was being used where copynf should have been; this is now fixed. As a side note, no memory was lost, because the ndata buffer used in the dns_fixedname_t is internal to the structure, and is freed when the dns_fixedname_t is freed regardless of the .ndata contents.	2021-05-21 21:28:10 -07:00
Evan Hunt	e31cc1eeb4	use a fixedname buffer in dns_message_gettempname() dns_message_gettempname() now returns a pointer to an initialized name associated with a dns_fixedname_t object. it is no longer necessary to allocate a buffer for temporary names associated with the message object.	2021-05-20 20:41:29 +02:00
Ondřej Surý	365c6a9851	ensure interlocked netmgr events run on worker[0] Network manager events that require interlock (pause, resume, listen) are now always executed in the same worker thread, mgr->workers[0], to prevent races. "stoplistening" events no longer require interlock.	2021-05-07 14:28:32 -07:00
Ondřej Surý	b5bf58b419	Destroy netmgr before destroying taskmgr With taskmgr running on top of netmgr, the ordering of how the tasks and netmgr shutdown interacts was wrong as previously isc_taskmgr_destroy() was waiting until all tasks were properly shutdown and detached. This responsibility was moved to netmgr, so we now need to do the following: 1. shutdown all the tasks - this schedules all shutdown events onto the netmgr queue 2. shutdown the netmgr - this also makes sure all the tasks and events are properly executed 3. Shutdown the taskmgr - this now waits for all the tasks to finish running before returning 4. Shutdown the netmgr - this call waits for all the netmgr netievents to finish before returning This solves the race when the taskmgr object would be destroyed before all the tasks were finished running in the netmgr loops.	2021-05-07 14:28:30 -07:00
Ondřej Surý	a011d42211	Add new isc_managers API to simplify <*>mgr create/destroy Previously, netmgr, taskmgr, timermgr and socketmgr all had their own isc_<*>mgr_create() and isc_<*>mgr_destroy() functions. The new isc_managers_create() and isc_managers_destroy() fold all four into a single function and make sure the objects are created and destroyed in the correct order. Especially now, when taskmgr runs on top of netmgr, the correct order is important, and when the code was duplicated in many places it was easy to make a mistake. The former isc_<*>mgr_create() and isc_<*>mgr_destroy() functions were made private and a single call to isc_managers_create() and isc_managers_destroy() is required at the program startup / shutdown.	2021-05-07 10:19:05 -07:00
Matthijs Mekking	66f2cd228d	Use isdigit instead of checking character range When looking for key files, we could use isdigit rather than checking if the character is within the range [0-9]. Use (unsigned char) cast to ensure the value is representable in the unsigned char type (as suggested by the isdigit manpage). Change " & 0xff" occurrences to the recommended (unsigned char) type cast.	2021-05-05 19:15:33 +02:00
Ondřej Surý	c37ff5d188	Add nanosleep and usleep Windows shims This commit adds POSIX nanosleep() and usleep() shim implementation for Windows to help implementors use less #ifdef _WIN32 in the code.	2021-05-03 20:22:54 +02:00
Mark Andrews	c1190a3fe0	Handle DNAME lookup via itself When answering a query, named should never attempt to add the same RRset to the ANSWER section more than once. However, such a situation may arise when chasing DNAME records: one of the DNAME records placed in the ANSWER section may turn out to be the final answer to a client query, but there is no way to know that in advance. Tweak the relevant INSIST assertion in query_respond() so that it handles this case properly. qctx->rdataset is freed later anyway, so there is no need to clean it up in query_respond().	2021-04-29 10:30:00 +02:00
Mark Andrews	29126500d2	Reduce nsec3 max iterations to 150	2021-04-29 17:18:26 +10:00
Matthijs Mekking	104b676235	Serve-stale nit fixes While working on the serve-stale backports, I noticed the following oddities: 1. In the serve-stale system test, in one case we keep track of the time how long it took for dig to complete. In commit aaed7f9d8c2465790d769221dfe8378c7147f5eb, the code removed the exception to check for result == ISC_R_SUCCESS on stale found answers, and adjusted the test accordingly. This failed to update the time tracking accordingly. Move the t1/t2 time track variables back around the two dig commands to ensure the lookups resolved faster than the resolver-query-timeout. 2. We can remove the setting of NS_QUERYATTR_STALEOK and DNS_RDATASETATTR_STALE_ADDED on the "else if (stale_timeout)" code path, because they are added later when we know we have actually found a stale answer on a stale timeout lookup. 3. We should clear the NS_QUERYATTR_STALEOK flag from the client query attributes instead of DNS_RDATASETATTR_STALE_ADDED (that flag is set on the rdataset attributes). 4. In 'bin/named/config.c' we should set the configuration options in alphabetical order. 5. In the ARM, in the backports we have added "(stale)" between "cached" and "RRset" to make it clearer that a stale RRset may be returned in this scenario.	2021-04-28 12:24:24 +02:00
Ondřej Surý	b540722bc3	Refactor taskmgr to run on top of netmgr This commit changes the taskmgr to run the individual tasks on the netmgr internal workers. While an effort has been put into keeping the taskmgr interface intact, a couple of changes have been made: * The taskmgr has no concept of universal privileged mode - rather the tasks are either privileged or unprivileged (normal). The privileged tasks are run as a first thing when the netmgr is unpaused. There are now four different queues in the netmgr: 1. priority queue - netievents on the priority queue are run even when the taskmgr enters exclusive mode and netmgr is paused. This is needed to properly start listening on the interfaces, free resources and resume. 2. privileged task queue - only privileged tasks are queued here and this is the first queue that gets processed when network manager is unpaused using isc_nm_resume(). All netmgr workers need to clean the privileged task queue before they all proceed to normal operation. Both task queues are processed when the workers are finished. 3. task queue - only (traditional) tasks are scheduled here and this queue along with the privileged task queue is processed when the netmgr workers are finishing. This is needed to process the task shutdown events. 4. normal queue - this is the queue with netmgr events, e.g. reading, sending, callbacks and pretty much everything is processed here. * The isc_taskmgr_create() now requires initialized netmgr (isc_nm_t) object. * The isc_nm_destroy() function now waits for indefinite time, but it will print out the active objects when in tracing mode (-DNETMGR_TRACE=1 and -DNETMGR_TRACE_VERBOSE=1), the netmgr has been made a little bit more asynchronous and it might take longer time to shutdown all the active networking connections. * Previously, the isc_nm_stoplistening() was a synchronous operation. This has been changed and the isc_nm_stoplistening() just schedules the child sockets to stop listening and exits.
This was needed to prevent a deadlock as the (traditional) tasks are now executed on the netmgr threads. * The socket selection logic in isc__nm_udp_send() was flawed, but fortunately, it was broken, so we never hit the problem where we created uvreq_t on a socket from nmhandle_t, but then a different socket could be picked up and then we were trying to run the send callback on a socket that had different threadid than currently running.	2021-04-20 23:22:28 +02:00
Matthijs Mekking	3d3a6415f7	If RPZ config'd, bail stale-answer-client-timeout When we are recursing, RPZ processing is not allowed. But when we are performing a lookup due to "stale-answer-client-timeout", we are still recursing. This effectively means that RPZ processing is disabled on such a lookup. In this case, bail the "stale-answer-client-timeout" lookup and wait for recursion to complete, as we can't perform the RPZ rewrite rules reliably.	2021-04-02 10:02:40 +02:00
Matthijs Mekking	839df94190	Rename "staleonly" The dboption DNS_DBFIND_STALEONLY caused confusion because it implies we are looking for stale data only and ignore any active RRsets in the cache. Rename it to DNS_DBFIND_STALETIMEOUT as it is more clear the option is related to a lookup due to "stale-answer-client-timeout". Rename other usages of "staleonly", instead use "lookup due to...". Also rename related function and variable names.	2021-04-02 10:02:40 +02:00
Matthijs Mekking	3f81d79ffb	Restore the RECURSIONOK attribute after staleonly When doing a staleonly lookup we don't want to fallback to recursion. After all, there are obviously problems with recursion, otherwise we wouldn't do a staleonly lookup. When resuming from recursion however, we should restore the RECURSIONOK flag, allowing future required lookups for this client to recurse.	2021-04-02 10:02:40 +02:00
Matthijs Mekking	aaed7f9d8c	Remove result exception on staleonly lookup When implementing "stale-answer-client-timeout", we decided that we should only return positive answers prematurely to clients. A negative response is not useful, and in that case it is better to wait for the recursion to complete. To do so, we check the result and if it is not ISC_R_SUCCESS, we decide that it is not good enough. However, there are more return codes that could lead to a positive answer (e.g. CNAME chains). This commit removes the exception and now uses the same logic that other stale lookups use to determine if we found a useful stale answer (stale_found == true). This means we can simplify two test cases in the serve-stale system test: nodata.example is no longer treated differently than data.example.	2021-04-02 10:02:40 +02:00
Matthijs Mekking	3d5429f61f	Remove INSIST on NS_QUERYATTR_ANSWERED The NS_QUERYATTR_ANSWERED attribute is to prevent sending a response twice. Without the attribute, this may happen if a staleonly lookup found a useful answer and sends a response to the client, and later recursion ends and also tries to send a response. The attribute was also used to mask adding a duplicate RRset. This is considered harmful. When we created a response to the client with a stale only lookup (regardless if we actually have send the response), we should clear the rdatasets that were added during that lookup. Mark such rdatasets with the a new attribute, DNS_RDATASETATTR_STALE_ADDED. Set a query attribute NS_QUERYATTR_STALEOK if we may have added rdatasets during a stale only lookup. Before creating a response on a normal lookup, check if we can expect rdatasets to have been added during a staleonly lookup. If so, clear the rdatasets from the message with the attribute DNS_RDATASETATTR_STALE_ADDED set.	2021-04-02 09:15:07 +02:00
Matthijs Mekking	48b0dc159b	Simplify when to detach the client With stale-answer-client-timeout, we may send a response to the client, but we may want to hold on to the network manager handle, because recursion is going on in the background, or we need to refresh a stale RRset. Simplify the setting of 'nodetach': * During a staleonly lookup we should not detach the nmhandle, so just set it prior to 'query_lookup()'. * During a staleonly "stalefirst" lookup set the 'nodetach' to true if we are going to refresh the RRset. Now there is no longer the need to clear the 'nodetach' if we go through the "dbfind_stale", "stale_refresh_window", or "stale_only" paths.	2021-04-02 09:14:09 +02:00
Matthijs Mekking	92f7a67892	Refactor stale lookups, ignore active RRsets When doing a staleonly lookup, ignore active RRsets from cache. If we don't, we may add a duplicate RRset to the message, and hit an assertion failure in query.c because adding the duplicate RRset to the ANSWER section failed. This can happen on a race condition. When a client query is received, the recursion is started. When 'stale-answer-client-timeout' triggers around the same time the recursion completes, the following sequence of events may happen: 1. Queue the "try stale" fetch_callback() event to the client task. 2. Add the RRsets from the authoritative response to the cache. 3. Queue the "fetch complete" fetch_callback() event to the client task. 4. Execute the "try stale" fetch_callback(), which retrieves the just-inserted RRset from the database. 5. In "ns_query_done()" we are still recursing, but the "staleonly" query attribute has already been cleared. In other words, the query will resume when recursion ends (it already has ended but is still on the task queue). 6. Execute the "fetch complete" fetch_callback(). It finds the answer from recursion in the cache again and tries to add the duplicate to the answer section. This commit changes the logic for finding stale answers in the cache, such that on "stale_only" lookups actually only stale RRsets are considered. It refactors the code so that code paths for "dbfind_stale", "stale_refresh_window", and "stale_only" are more clear. First we call some generic code that applies in all three cases, formatting the domain name for logging purposes, incrementing the trystale stats, and checking if we actually found stale data that we can use. The "dbfind_stale" lookup will return SERVFAIL if we didn't find a usable answer, otherwise we will continue with the lookup (query_gotanswer()). This is no different from before the introduction of "stale-answer-client-timeout" and "stale-refresh-time".
The "stale_refresh_window" lookup is similar to the "dbfind_stale" lookup: return SERVFAIL if we didn't find a usable answer, otherwise continue with the lookup (query_gotanswer()). Finally the "stale_only" lookup. If the "stale_only" lookup was triggered because of an actual client timeout (stale-answer-client-timeout > 0), and if database lookup returned a stale usable RRset, trigger a response to the client. Otherwise return and wait until the recursion completes (or the resolver query times out). If the "stale_only" lookup is a "stale-answer-client-timeout 0" lookup, we prefer stale data over a lookup. In this case if there was no stale data, or the data was not a positive answer, retry the lookup with the stale options cleared, a.k.a. a normal lookup. Otherwise, continue with the lookup (query_gotanswer()) and refresh the stale RRset. This will trigger a response to the client, but will not detach the handle because a fetch will be created to refresh the RRset.	2021-04-02 09:14:09 +02:00
Matthijs Mekking	fee164243f	Keep track of allow client detach The stale-answer-client-timeout feature introduced a dependency on when a client may be detached from the handle. The dboption DNS_DBFIND_STALEONLY was reused to track this attribute. This overloads the meaning of this database option, and actually introduced a bug because the option was checked in other places. In particular, in 'ns_query_done()' there is a check for 'RECURSING(qctx->client) && (!QUERY_STALEONLY(&qctx->client->query) || ...' and the condition is satisfied because recursion has not completed yet and DNS_DBFIND_STALEONLY is already cleared by that time (in query_lookup()), because we found a useful answer and we should detach the client from the handle after sending the response. Add a new boolean to the client structure to keep track of whether client detach from handle is allowed or not. It is only disallowed if we are in a staleonly lookup and we didn't find a useful answer.	2021-04-02 09:14:09 +02:00
Ondřej Surý	36ddefacb4	Change the isc_nm_(get|set)timeouts() to work with milliseconds The RFC7828 specifies the keepalive interval to be 16-bit, specified in units of 100 milliseconds and the configuration options tcp-*-timeouts follow suit. The units of 100 milliseconds are very unintuitive and while we can't change the configuration and presentation format, we should not follow this weird unit in the API. This commit changes the isc_nm_(get|set)timeouts() functions to work with milliseconds and convert the values to milliseconds before passing them to the function, not just internally.	2021-03-18 16:37:57 +01:00
Matthijs Mekking	87591de6f7	Fix servestale fetchlimits crash When we query the resolver for a domain name that is in the same zone for which there are already one or more fetches outstanding, we could potentially hit the fetch limits. If so, recursion fails immediately for the incoming query and if serve-stale is enabled, we may try to return a stale answer. If the resolver is also authoritative for the parent zone (for example the root zone), first a delegation is found, but we first check the cache for a better response. Nothing is found in the cache, so we try to recurse to find the answer to the query. Because of fetch-limits 'dns_resolver_createfetch()' returns an error, which 'ns_query_recurse()' propagates to the caller, 'query_delegation_recurse()'. Because serve-stale is enabled, 'query_usestale()' is called, setting 'qctx->db' to the cache db, but leaving 'qctx->version' untouched. Now 'query_lookup()' is called to search for stale data in the cache database with a non-NULL 'qctx->version' (which is set to a zone db version), and thus we hit an assertion in rbtdb. This crash was introduced in 'main' by commit 8bcd7fe69e5343071fc917738d6092a8b974ef3f.	2021-03-11 12:16:14 +01:00
Evan Hunt	88752b1121	refactor outgoing HTTP connection support - style, cleanup, and removal of unnecessary code. - combined isc_nm_http_add_endpoint() and isc_nm_http_add_doh_endpoint() into one function, renamed isc_http_endpoint(). - moved isc_nm_http_connect_send_request() into doh_test.c as a helper function; remove it from the public API. - renamed isc_http2 and isc_nm_http2 types and functions to just isc_http and isc_nm_http, for consistency with other existing names. - shortened a number of long names. - the caller is now responsible for determining the peer address in isc_nm_httpconnect(); this eliminates the need to parse the URI and the dependency on an external resolver. - the caller is also now responsible for creating the SSL client context, for consistency with isc_nm_tlsdnsconnect(). - added setter functions for HTTP/2 ALPN. instead of setting up ALPN in isc_tlsctx_createclient(), we now have a function isc_tlsctx_enable_http2client_alpn() that can be run from isc_nm_httpconnect(). - refactored isc_nm_httprequest() into separate read and send functions. when isc_nm_send() or isc_nm_read() is called on an http socket, it will be stored until a corresponding isc_nm_read() or _send() arrives; when we have both halves of the pair the HTTP request will be initiated. - isc_nm_httprequest() is renamed isc__nm_http_request() for use as an internal helper function by the DoH unit test. (eventually doh_test should be rewritten to use read and send, and this function should be removed.) - added implementations of isc__nm_tls_settimeout() and isc__nm_http_settimeout(). - increased NGHTTP2 header block length for client connections to 128K. - use isc_mem_t for internal memory allocations inside nghttp2, to help track memory leaks. - send "Cache-Control" header in requests and responses. (note: currently we try to bypass HTTP caching proxies, but ideally we should interact with them: https://tools.ietf.org/html/rfc8484#section-5.1)	2021-03-05 13:29:26 +02:00
Ondřej Surý	a0181056a8	Change the isc_thread_self() return type to uintptr_t The pthread_self(), thrd_current() or GetCurrentThreadId() could actually be a pointer, so we should rather convert the value into uintptr_t instead of unsigned long.	2021-02-25 16:21:10 +01:00
Matthijs Mekking	f8b7b597e9	Don't servfail on staleonly lookups When a staleonly lookup doesn't find a satisfying answer, it should not try to respond to the client. This is not true when the initial lookup is staleonly (that is when 'stale-answer-client-timeout' is set to 0), because no resolver fetch has been created at this point. In this case continue with the lookup normally.	2021-02-25 11:32:17 +01:00
Matthijs Mekking	9e061faaae	Don't allow recursion on staleonly lookups Fix a crash that can happen in the following scenario: A client request is received. There is no data for it in the cache, (not even stale data). A resolver fetch is created as part of recursion. Some time later, the fetch still hasn't completed, and stale-answer-client-timeout is triggered. A staleonly lookup is started. It will also find no data in the cache. So 'query_lookup()' will call 'query_gotanswer()' with ISC_R_NOTFOUND, so this will call 'query_notfound()' and this will start recursion. We will eventually end up in 'ns_query_recurse()' and that requires the client query fetch to be NULL: REQUIRE(client->query.fetch == NULL); If the previously started fetch is still running this assertion fails. The crash is easily prevented by not requiring recursion for staleonly lookups. Also remove a redundant setting of the staleonly flag at the end of 'query_lookup_staleonly()' before destroying the query context. Add a system test to catch this case.	2021-02-25 11:32:17 +01:00
Ondřej Surý	0f44139145	Bump the maximum number of hazard pointers in tests On 24-core machine, the tests would crash because we would run out of the hazard pointers. We now adjust the number of hazard pointers to be in the <128,256> interval based on the number of available cores. Note: This is just a band-aid and needs a proper fix.	2021-02-18 19:32:55 +01:00
Michal Nowak	fa505bfb0e	Record skipped unit test as skipped in Automake framework	2021-02-15 11:18:03 +01:00
Michal Nowak	613be8706e	Drop AddressSanitizer constraint from libns unit tests The AddressSanitizer constraint in some libns unit tests does not seem to be necessary anymore, these tests run fine under AddressSanitizer.	2021-02-10 09:54:32 +00:00
Matthijs Mekking	8bcd7fe69e	Use stale on error also when unable to recurse The 'query_usestale()' function was only called when in 'query_gotanswer()' and an unexpected error occurred. This may have been "quota reached", and thus we were in some cases returning stale data on fetch-limits (and if serve-stale enabled of course). But we can also hit fetch-limits when recursing because we are following a referral (in 'query_notfound()' and 'query_delegation_recurse()'). Here we should also check for using stale data in case an error occurred. Specifically don't check for using stale data when refetching a zero TTL RRset from cache. Move the setting of DNS_DBFIND_STALESTART into the 'query_usestale()' function to avoid code duplication.	2021-02-08 15:17:09 +01:00
Evan Hunt	fe99484e14	support "tls ephemeral" with https	2021-02-03 12:06:17 +01:00
Artem Boldariev	08da09bc76	Initial support for DNS-over-HTTP(S) This commit completes the support for DNS-over-HTTP(S) built on top of nghttp2 and plugs it into the BIND. Support for both GET and POST requests is present, as required by RFC8484. Both encrypted (via TLS) and unencrypted HTTP/2 connections are supported. The latter are mostly there for debugging/troubleshooting purposes and for the means of encryption offloading to third-party software (as might be desirable in some environments to simplify TLS certificates management).	2021-02-03 12:06:17 +01:00
Matthijs Mekking	aabdedeae3	Only start stale refresh window when resuming If we did not attempt a fetch due to fetch-limits, we should not start the stale-refresh-time window. Introduce a new flag DNS_DBFIND_STALESTART to differentiate between a resolver failure and unexpected error. If we are resuming, this indicates a resolver failure, then start the stale-refresh-time window, otherwise don't start the stale-refresh-time window, but still fall back to using stale data. (This commit also wraps some docstrings to 80 characters width)	2021-01-28 16:38:34 +01:00
Matthijs Mekking	c6fd02aed5	Use stale data also if we are not resuming Before this change, BIND will only fallback to using stale data if there was an actual attempt to resolve the query. Then on a timeout, the stale data from cache becomes eligible. This commit changes this so that on any unexpected error stale data becomes eligible (you would still have to have 'stale-answer-enable' enabled of course). If there is no stale data, this may result in an error again, so don't loop on stale data lookup attempts. If the DNS_DBFIND_STALEOK flag is set, this means we already tried to lookup stale data, so if that is the case, don't use stale again.	2021-01-28 16:36:46 +01:00
Matthijs Mekking	fa0c9280d2	Update code flow in query.c wrt stale data First of all, there was a flaw in the code related to the 'stale-refresh-time' option. If stale answers are enabled, and we returned stale data, then it was assumed that it was because we were in the 'stale-refresh-time' window. But now we could also have returned stale data because of a 'stale-answer-client-timeout'. To fix this, introduce a rdataset attribute DNS_RDATASETATTR_STALE_WINDOW to indicate whether the stale cache entry was returned because the 'stale-refresh-time' window is active. Second, remove the special case handling when the result is DNS_R_NCACHENXRRSET. This can be done more generically in the code block when dealing with stale data. Putting all stale case handling in the code block when dealing with stale data makes the code easier to follow. Update documentation to be more verbose and to match the new code flow.	2021-01-25 10:48:16 -03:00
Diego Fronza	966060c03b	Extracted common function from query_lookup and query_refresh_rrset Both functions employed the same code lines to allocate query context buffers, which are used to store query results, so this shared portion of code was extracted out to a new function, qctx_prepare_buffers. Also, this commit uses qctx_init to initialize the query context within query_refresh_rrset function.	2021-01-25 10:48:16 -03:00
Diego Fronza	f89ac07b28	Small optimization in query_usestale This commit makes the code in query_usestale easier to follow, it also doesn't attach/detach to the database if stale answers are not enabled.	2021-01-25 10:48:16 -03:00
Diego Fronza	e219422575	Allow stale data to be used before name resolution This commit allows stale RRset to be used (if available) for responding a query, before an attempt to refresh an expired, or otherwise resolve an unavailable RRset in cache is made. For that to work, a value of zero must be specified for stale-answer-client-timeout statement. To better understand the logic implemented, there are three flags being used during database lookup and other parts of code that must be understood: . DNS_DBFIND_STALEOK: This flag is set when BIND fails to refresh a RRset due to timeout (resolver-query-timeout), its intent is to try to look for stale data in cache as a fallback, but only if stale answers are enabled in configuration. This flag is also used to activate stale-refresh-time window, since it is the only way the database knows that a resolution has failed. . DNS_DBFIND_STALEENABLED: This flag is used as a hint to the database that it may use stale data. It is always set during query lookup if stale answers are enabled, but only effectively used during stale-refresh-time window. Also during this window, the resolver will not try to resolve the query, in other words no attempt to refresh the data in cache is made when the stale-refresh-time window is active. . DNS_DBFIND_STALEONLY: This new introduced flag is used when we want stale data from the database, but not due to a failure in resolution, it also doesn't require stale-refresh-time window timer to be active. As long as there is a stale RRset available, it should be returned. It is mainly used in two situations: 1. When stale-answer-client-timeout timer is triggered: in that case we want to know if there is stale data available to answer the client. 2. When stale-answer-client-timeout value is set to zero: in that case, we also want to know if there is some stale RRset available to promptly answer the client. 
We must also discern between three situations that may happen when resolving a query after the addition of the stale-answer-client-timeout statement, and how to handle them: 1. Are we running query_lookup() due to the stale-answer-client-timeout timer being triggered? In this case, we look for stale data, making use of the DNS_DBFIND_STALEONLY flag. If a stale RRset is available then respond to the client with the data found and mark this query as answered (query attribute NS_QUERYATTR_ANSWERED), so when the fetch completes the client won't be answered twice. We must also take care not to detach from the client, as a fetch will still be running in the background; this is handled by the following snippet: if (!QUERY_STALEONLY(&client->query)) { isc_nmhandle_detach(&client->reqhandle); } This basically tests whether the DNS_DBFIND_STALEONLY flag is set, which means we are here due to a stale-answer-client-timeout timer expiration. 2. Are we running query_lookup() due to resolver-query-timeout being triggered? In this case, the DNS_DBFIND_STALEOK flag will be set and an attempt to look for stale data will be made. As already explained, this flag is also used to activate the stale-refresh-time window, as it means that we failed to refresh a RRset due to timeout. It is ok in this situation to detach from the client, as the fetch is already completed. 3. Are we running query_lookup() for the first time, looking for a RRset in cache, and the stale-answer-client-timeout value is set to zero? In this case, if stale answers are enabled (probably), we must do an initial database lookup with the DNS_DBFIND_STALEONLY flag set, to indicate to the database that we want stale data. If we find an active RRset, proceed as normal, answer the client and the query is done. If we find a stale RRset we respond to the client and mark the query as answered, but don't detach from the client yet, as an attempt to refresh the RRset will still be made by means of the newly introduced function 'query_resolve'. 
If no active or stale RRset is available, begin resolution as usual.	2021-01-25 10:47:14 -03:00
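
The three situations described in the commit message above can be sketched as a small flag-selection routine. This is only an illustrative model, not BIND's code: the `SKETCH_*` constants, `lookup_reason_t`, `sketch_lookup_flags()` and `sketch_should_detach()` are hypothetical names, and the bit values are assumptions that merely stand in for DNS_DBFIND_STALEOK, DNS_DBFIND_STALEENABLED and DNS_DBFIND_STALEONLY.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the DNS_DBFIND_* flags; values are made up. */
enum {
    SKETCH_STALEOK      = 1 << 0, /* resolution failed, fall back to stale data */
    SKETCH_STALEENABLED = 1 << 1, /* hint: stale answers are enabled */
    SKETCH_STALEONLY    = 1 << 2  /* want stale data without a failed fetch */
};

/* Why the (hypothetical) lookup is being run. */
typedef enum {
    REASON_FIRST_LOOKUP,     /* initial cache lookup for the query */
    REASON_CLIENT_TIMEOUT,   /* stale-answer-client-timeout timer fired */
    REASON_RESOLVER_TIMEOUT  /* resolver-query-timeout fired */
} lookup_reason_t;

/* Pick lookup flags for the three situations listed in the commit message. */
unsigned int
sketch_lookup_flags(lookup_reason_t reason, bool stale_enabled,
                    unsigned int client_timeout_ms)
{
    unsigned int flags = 0;

    if (!stale_enabled) {
        return flags; /* no stale flags at all */
    }

    /* Always set during lookup when stale answers are enabled. */
    flags |= SKETCH_STALEENABLED;

    switch (reason) {
    case REASON_CLIENT_TIMEOUT:
        flags |= SKETCH_STALEONLY; /* situation 1 */
        break;
    case REASON_RESOLVER_TIMEOUT:
        flags |= SKETCH_STALEOK;   /* situation 2 */
        break;
    case REASON_FIRST_LOOKUP:
        if (client_timeout_ms == 0) {
            flags |= SKETCH_STALEONLY; /* situation 3 */
        }
        break;
    }
    return flags;
}

/* Models only the quoted test: detach unless this is a stale-only lookup. */
bool
sketch_should_detach(unsigned int flags)
{
    return (flags & SKETCH_STALEONLY) == 0;
}
```

The sketch deliberately ignores the case in situation 3 where an active RRset is found and the query simply completes; it only mirrors the `QUERY_STALEONLY` test quoted in the message.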
Diego Fronza	171a5b7542	Add stale-answer-client-timeout option The general logic behind the addition of this new feature works as follows: When a client query arrives, the basic path (query.c / ns_query_recurse) was to create a fetch, waiting for completion in fetch_callback. With the introduction of stale-answer-client-timeout, a new event of type DNS_EVENT_TRYSTALE may invoke fetch_callback, whenever stale answers are enabled and the fetch took longer than stale-answer-client-timeout to complete. When an event of type DNS_EVENT_TRYSTALE triggers fetch_callback, we must ensure that the following happens: 1. Set up a new query context with the sole purpose of looking up stale-RRset-only data; for that purpose a new flag, 'DNS_DBFIND_STALEONLY', was added for use in database lookups. . If a stale RRset is found, mark the original client query as answered (with a new query attribute named NS_QUERYATTR_ANSWERED), so when the fetch completion event is received later, we avoid answering the client twice. . If a stale RRset is not found, clean up and wait for the normal fetch completion event. 2. In ns_query_done, we must change this part: /* * If we're recursing then just return; the query will * resume when recursion ends. */ if (RECURSING(qctx->client)) { return (qctx->result); } To this: if (RECURSING(qctx->client) && !QUERY_STALEONLY(qctx->client)) { return (qctx->result); } Otherwise we would not proceed to answer the client if it happened that a stale answer was found when looking up stale-only data. When an event of type DNS_EVENT_FETCHDONE triggers fetch_callback, we proceed as before, resuming the query, updating stats, etc., but a few exceptions had to be added, the most important of which are two: 1. Before answering the client (ns_client_send), check that the query wasn't already answered. 2. Before detaching a client, e.g. 
isc_nmhandle_detach(&client->reqhandle), ensure that this is the fetch completion event, and not the one triggered due to stale-answer-client-timeout, so a correct call would be: if (!QUERY_STALEONLY(client)) { isc_nmhandle_detach(&client->reqhandle); } Other than these notes, comments were added in code in an attempt to make these updates easier to follow.	2021-01-25 10:47:14 -03:00
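
The two-event flow described in the commit message above (a try-stale event possibly followed by the fetch completion event) can be modelled with a tiny state machine. This is a hedged sketch under assumptions, not the actual fetch_callback: `sketch_client_t`, `sketch_event_t` and `sketch_fetch_callback()` are hypothetical names, the `answered` field only stands in for NS_QUERYATTR_ANSWERED, and `attached` stands in for holding `client->reqhandle`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for DNS_EVENT_TRYSTALE / DNS_EVENT_FETCHDONE. */
typedef enum { EV_TRYSTALE, EV_FETCHDONE } sketch_event_t;

/* Minimal model of the client state the commit message talks about. */
typedef struct {
    bool answered;      /* models the NS_QUERYATTR_ANSWERED attribute */
    bool attached;      /* models still holding client->reqhandle */
    int  responses_sent; /* how many answers went to the client */
} sketch_client_t;

/*
 * Handle one event. 'stale_found' says whether the stale-only lookup
 * returned a stale RRset; it is ignored for EV_FETCHDONE.
 */
void
sketch_fetch_callback(sketch_client_t *c, sketch_event_t ev, bool stale_found)
{
    assert(c != NULL);

    if (ev == EV_TRYSTALE) {
        if (stale_found && !c->answered) {
            c->responses_sent++; /* answer with stale data */
            c->answered = true;  /* remember, to avoid a second answer */
        }
        /* Do not detach: the fetch is still running in the background. */
        return;
    }

    /* EV_FETCHDONE: answer only if no stale answer was sent earlier. */
    if (!c->answered) {
        c->responses_sent++;
        c->answered = true;
    }
    c->attached = false; /* fetch is complete, safe to detach now */
}
```

In this model a client that received a stale answer on the try-stale event gets exactly one response in total and is detached only when the completion event arrives.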
Diego Fronza	74840ec50b	Added dns_view_staleanswerenabled() function Since it takes a couple lines of code to check whether stale answers are enabled for a given view, code was extracted out to a proper function.	2021-01-25 10:47:14 -03:00


626 Commits