parser: update state machine README

Update the state machine readme to better reflect how the chfa is encoded and works. It still needs a lot more, but this fixes several errors in the doc and adds some info about state differential encoding, oobs, and comb compression.

Signed-off-by: John Johansen <john.johansen@canonical.com>
2025-08-31 22:35:35 +00:00 · 2024-05-30 00:08:51 -07:00
parent cf5be7d356
commit 5d6f875676
1 changed files with 152 additions and 63 deletions
--- a/parser/libapparmor_re/README
+++ b/parser/libapparmor_re/README
@@ -10,37 +10,65 @@ aare_rules.{h,cc} - code to that binds parse -> expr-tree -> hfa generation
                    -> chfa generation into a basic interface for converting
 		    rules to a runtime ready state machine.
-Regular Expression Scanner Generator
+Notes on the compressed hfa file format (chfa)
-====================================
+==============================================
 Notes in the scanner File Format
 --------------------------------
 The file format used is based on the GNU flex table file format
 (--tables-file option; see Table File Format in the flex info pages and
 the flex sources for documentation). The magic number used in the header
 is set to 0x1B5E783D instead of 0xF13C57B1 though, which is meant to
 indicate that the file format logically is not the same: the YY_ID_CHK
-(check) and YY_ID_DEF (default) tables are used differently.
+(check), YY_ID_DEF (default), and YY_ID_BASE (base) tables are used differently.
-Flex uses state compression to store only the differences between states
+The YY_ID_ACCEPTX tables either encode permissions directly, or are an
-for states that are similar. The amount of compression influences the parse
+index into an external table.
-speed.
+
 There are two DFA table formats to support different size state machines
 DFA16
  default/next/check - are 16 bit tables
 DFA32
  default/next/check - are 32 bit tables
 In both DFA16 and DFA32
   base and accept are 32 bit tables.
 State 0 is always used as the trap state. Its accept, base and default
 fields should be 0.
 State 1 is the default start state. Alternate start states are stored
 external to the state machine.
 The base table uses the lower 24 bits as index into the next/check tables,
 and the upper 8 bits are used as flags.
 The currently defined flags are
 #define MATCH_FLAG_DIFF_ENCODE 0x80000000
 #define MARK_DIFF_ENCODE 0x40000000
 #define MATCH_FLAG_OOB_TRANSITION 0x20000000
 Note the default[state] is used in two different ways.
 1. When diff_encode is set, the state stores the difference to another
   state defined by default. The next field will only store the
   transitions that are unique to this state. Those transitions may mask
   transitions in the state that the current state is relative to. Also
   note that the state that this state is relative to might itself be
   relative to another state. Cycles are forbidden and checked for by the
   verifier. The exact algorithm used to build these state differences
   will be discussed in another section.
 The following two states could be stored as in the tables outlined
 below:
 States and transitions on specific characters to next states
 ------------------------------------------------------------
 1: ('a' => 2, 'b' => 3, 'c' => 4)
 2: ('a' => 2, 'b' => 3, 'd' => 5)
-Flex-like table format
+Table format - where D in base represents the diff encode flag
 ----------------------
 index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
-    2: (      1,  256)
+    2: (      1, D  256)
  index: (next, check)
      0: (   0,     0)  <== unused entry
@@ -55,66 +83,74 @@ index: (default, base)
 Here, state 2 is described as ('c' => 0, 'd' => 5), and everything else
 as in state 1. The matching algorithm is as follows.
-Flex-like scanner algorithm
+Scanner algorithm
 ---------------------------
  /* current state is in <state>, input character <c> */
-  while (check[base[state] + c] != state)
+
-    state = default[state];
+  while (check[base[state] + c] != state) {
-  state = next[state];
+      diff = (FLAGS(base[state]) & diff_encode);
      state = default[state];
      if (!diff)
         goto done;
  }
  state = next[base[state] + c];
  done:
  /* continue with the next input character */
-This state compression algorithm performs well, except when there are
+2. When diff_encode is NOT set, the default state is used to represent
-many inverted or wildcard matches ("[^x]", "."). Each input character
+   all non-matching transitions (i.e. check[base[state] + c] != state).
-may cause several iterations in the while loop.
+   The dfa build will compute the next state targeted by the most transitions
   and use that for the default state. ie.
   if we have
       1: ('a' => 2)
          ("[^a]" => 0)
   then 0 will be used as the default state
   if we have
       1: ("[^a]" => 2)
          ('a' => 0)
   then 2 will be used as the default state, and the only transition
   encoded in the next/check tables will be for 'a'.
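The choice of default state for a non-diff encoded state can be sketched
as a simple majority count over the 256 input bytes. This is only an
illustration of the rule described above, not the parser's actual code:
pick_default(), ALPHABET and MAX_STATES are hypothetical names, and the
tie-break (lowest state number wins) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum { ALPHABET = 256, MAX_STATES = 16 };   /* hypothetical sizes */

/*
 * Given the full (uncompressed) transition row of one state, return the
 * target state reached by the largest number of input bytes. That target
 * becomes default[state]; only the remaining transitions need entries in
 * the next/check tables. Ties go to the lowest state number (assumption).
 */
static uint16_t pick_default(const uint16_t trans[ALPHABET])
{
    unsigned count[MAX_STATES] = { 0 };
    uint16_t best = 0;

    for (int c = 0; c < ALPHABET; c++) {
        assert(trans[c] < MAX_STATES);
        count[trans[c]]++;
    }
    for (uint16_t s = 1; s < MAX_STATES; s++) {
        if (count[s] > count[best])
            best = s;
    }
    return best;
}
```

For the row ('a' => 2, "[^a]" => 0) this returns 0, and for the row
("[^a]" => 2, 'a' => 0) it returns 2, matching the two cases above.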
 The combination of the diff-encoded and non-diff encoded states performs
 well even when there are many inverted or wildcard matches ("[^x]", ".").
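A minimal sketch of the combined walk, using the two example states from
this document (state 2 diff encoded against state 1 with base 256). The
table sizes, the names add_trans(), build_example() and step(), and the
BASE_IDX_MASK constant are assumptions made for the example. The mask
follows from the statement that the lower 24 bits of base hold the index.

```c
#include <assert.h>
#include <stdint.h>

#define MATCH_FLAG_DIFF_ENCODE 0x80000000u
#define BASE_IDX_MASK          0x00ffffffu  /* assumption: lower 24 bits */

enum { NSTATES = 6, TABLE_LEN = 512 };      /* hypothetical sizes */

static uint32_t base_tbl[NSTATES];
static uint16_t def_tbl[NSTATES];
static uint16_t next_tbl[TABLE_LEN];
static uint16_t check_tbl[TABLE_LEN];

/* Store one explicit transition for <state> on input byte <c>. */
static void add_trans(uint16_t state, unsigned char c, uint16_t to)
{
    uint32_t pos = (base_tbl[state] & BASE_IDX_MASK) + c;

    assert(pos < TABLE_LEN);
    next_tbl[pos] = to;
    check_tbl[pos] = state;
}

/* 1: ('a' => 2, 'b' => 3, 'c' => 4); 2: diff of 1 with ('c' => 0, 'd' => 5) */
static void build_example(void)
{
    base_tbl[1] = 0;
    def_tbl[1] = 0;
    add_trans(1, 'a', 2);
    add_trans(1, 'b', 3);
    add_trans(1, 'c', 4);

    base_tbl[2] = 256 | MATCH_FLAG_DIFF_ENCODE;
    def_tbl[2] = 1;
    add_trans(2, 'c', 0);   /* masks 'c' => 4 inherited from state 1 */
    add_trans(2, 'd', 5);
}

/* Advance one input byte. Terminates because diff chains have no cycles. */
static uint16_t step(uint16_t state, unsigned char c)
{
    for (;;) {
        uint32_t b = base_tbl[state];
        uint32_t pos = (b & BASE_IDX_MASK) + c;

        if (check_tbl[pos] == state)
            return next_tbl[pos];       /* explicit transition */
        state = def_tbl[state];
        if (!(b & MATCH_FLAG_DIFF_ENCODE))
            return state;               /* plain default transition */
        /* diff encoded: retry the lookup in the state we are relative to */
    }
}
```

After build_example(), step(2, 'a') follows the diff chain into state 1
and yields 2, while step(2, 'c') is answered by state 2 itself and
yields the trap state 0.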
-We will have many inverted character classes ("[^/]") that wouldn't
+Simplified Regexp scanner algorithm for non-diff encoded state (note
-compress very well. Therefore, the regexp matcher uses no state
+that the diff encode algorithm above works as well)
 compression, and uses the check and default tables differently. The
 above states could be stored as follows:
 Regexp table format
 -------------------
 index: (default, base)
    0: (      0,    0)  <== dummy state (nonmatching)
    1: (      0,    0)
    2: (      1,    3)
  index: (next, check)
      0: (   0,     0)  <== unused entry
 	 (   0,     0)  <== ord('a') identical, unused entries
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  3+'c': (   0,     0)  <== entry is unused
  3+'d': (   5,     2)
 	 (   0,     0)  <== (255 - ord('d')) identical, unused entries
 All the entries with 0 in check (except the first entry, which is
 deliberately reserved) are still available for other states that
 fit in there.
 Regexp scanner algorithm
 ------------------------
  /* current state is in <state>, matching character <c> */
  if (check[base[state] + c] == state)
-    state = next[state];
+    state = next[base[state] + c];
  else
    state = default[state];
  /* continue with the next input character */
 This representation and algorithm allow states which match more
 characters than they do not match to be represented as their inverse.
 For example, a third state that accepts everything other than 'a' can
 be added to the tables as one entry in (default, base) and one entry in
 (next, check):
-State
+Each input character may cause several iterations in the while loop,
-----
+but due to guarantees in the build at most 2n states will be
- 3: ('a' => 0, everything else => 5)
+transitioned for n input characters.  The expected number of states
 walked is much closer to n, and in practice, due to cache locality, the
 diff encoded state machine is usually faster than a non-diff encoded
 state machine that walks exactly n states for n input characters.
 Comb Compression
 -----------------
 The next/check tables of states are only used to encode transitions
 not covered by the default transition. The input byte is indexed off
 the base value, covering 256 positions within the next/check
 tables. However a state may only encode a few transitions within that
 range, leaving holes.  These holes are filled by other states'
 transitions whose ranges overlap.
   1: ('a' => 2, 'b' => 3, 'c' => 4)
   2: ('a' => 2, 'b' => 3, 'd' => 5)
   3: ('a' => 0, everything else => 5)
 Regexp tables
 -------------
@@ -132,12 +168,65 @@ index: (default, base)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
-  3+'c': (   0,     0)  <== entry is unused
+  3+'c': (   0,     0)  <== entry is unused, hole that could be filled
  3+'d': (   5,     2)
  7+'a': (   0,     3)
 	 (   0,     0)  <== (255 - ord('a')) identical, unused entries
-While the current code does not implement any form of state compression,
+
-the flex state compression representation could be combined by
+Regexp tables comb compressed
-remembering (in a bit per state, for example) which default entries
+-------------
-refer to inverted matches, and which refer to parent states.
+index: (default, base)
    0: (      0,    0)
    1: (      0,    0)
    2: (      1,    3)
    3: (      5,    5)
  index: (next, check)
      0: (   0,     0)
 	 (   0,     0)
  0+'a': (   2,     1)
  0+'b': (   3,     1)
  0+'c': (   4,     1)
  3+'a': (   2,     2)
  3+'b': (   3,     2)
  5+'a': (   0,     3)  <== entry was previously at 7+'a'
  3+'d': (   5,     2)
 	 (   0,     0)  <== (255 - ord('a')) identical, unused entries
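One way to picture comb compression is a first-fit search: for each
state, slide its base along the next/check tables until every transition
it needs lands on a free entry (check == 0, excluding the reserved first
entry). The sketch below is an assumption about how such a packer could
look, not a description of the parser's actual placement strategy;
comb_insert(), struct trans and COMB_LEN are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { COMB_LEN = 1024 };                   /* hypothetical table size */

static uint16_t comb_next[COMB_LEN];
static uint16_t comb_check[COMB_LEN];

struct trans {
    unsigned char c;    /* input byte */
    uint16_t to;        /* target state */
};

/*
 * Place the <n> non-default transitions of <state> at the lowest base
 * where all of them fit into holes. Returns the chosen base, or -1 if
 * the table is full.
 */
static int comb_insert(uint16_t state, const struct trans *t, size_t n)
{
    assert(state != 0);     /* check == 0 marks a free entry */

    for (int base = 0; base + 255 < COMB_LEN; base++) {
        size_t i;

        for (i = 0; i < n; i++) {
            int pos = base + t[i].c;

            if (pos == 0 || comb_check[pos] != 0)
                break;      /* reserved or already owned by another state */
        }
        if (i < n)
            continue;       /* collision, try the next base */

        for (i = 0; i < n; i++) {
            comb_next[base + t[i].c] = t[i].to;
            comb_check[base + t[i].c] = state;
        }
        return base;
    }
    return -1;
}
```

Inserting states 1, 2 and 3 from the example in that order gives bases
0, 3 and 5 with this sketch, which agrees with the comb compressed
tables above: state 3's single 'a' entry drops into the hole at 3+'c'.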
 Out of Band Transitions (oobs)
 ---------------------------------
 Out of band transitions (oobs) allow a state to have transitions that
 cannot be triggered by input. Any state that has oobs must have the
 OOB flag set. An oob is triggered by subtracting the oob number from
 the base index value to find the next and check values. Currently
 only a single oob is supported.
  if ((FLAGS(base[state]) & OOB) && check[base[state] - oob] == state)
    state = next[base[state] - oob];
 oobs might be expressed as a negative number, e.g. -1 for the first
 oob, in which case the oob transition above uses + oob instead.
 If more oobs are needed a second oob flag can be allocated, and if
 used in combination with the original, would allow a state to have
 up to 3 oobs
  00 - none
  01 - 1
  10 - 2
  11 - 3
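A small sketch of the single oob lookup, with the oob expressed as a
positive number that is subtracted from the base index. The table
layout, the names oob_step() and oob_example(), and the choice to return
-1 when no oob transition exists are all assumptions for illustration;
this document does not say what a caller should do in that case.

```c
#include <assert.h>
#include <stdint.h>

#define MATCH_FLAG_OOB_TRANSITION 0x20000000u
#define BASE_IDX_MASK             0x00ffffffu   /* assumption: lower 24 bits */

enum { OOB_STATES = 4, OOB_LEN = 512 };         /* hypothetical sizes */

static uint32_t oob_base[OOB_STATES];
static uint16_t oob_next[OOB_LEN];
static uint16_t oob_check[OOB_LEN];

/*
 * Try to take out of band transition number <oob> from <state>.
 * Returns the next state, or -1 if the state has no such transition.
 */
static int oob_step(uint16_t state, uint32_t oob)
{
    uint32_t b, idx;

    assert(state < OOB_STATES);
    b = oob_base[state];
    idx = b & BASE_IDX_MASK;

    if (!(b & MATCH_FLAG_OOB_TRANSITION))
        return -1;                      /* state has no oobs at all */
    if (idx < oob || idx - oob >= OOB_LEN)
        return -1;                      /* would index outside the table */
    if (oob_check[idx - oob] != state)
        return -1;                      /* entry belongs to another state */
    return oob_next[idx - oob];
}

/* State 1 has oob 1 leading to state 3; state 2 has no oob flag. */
static void oob_example(void)
{
    oob_base[1] = 10 | MATCH_FLAG_OOB_TRANSITION;
    oob_check[10 - 1] = 1;
    oob_next[10 - 1] = 3;
    oob_base[2] = 300;
}
```

Because the oob entry sits below the base index, it can never be reached
by adding an input byte (0..255) to that state's base.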
 Diff Encode Spanning Tree
 ============================================
 To build the state machine with diff encoded states and still meet
 the run time guarantee of traversing no more than 2n states for n
 input characters, a spanning tree is used.
 * TODO *