mirror of
https://gitlab.com/apparmor/apparmor
synced 2025-08-30 22:05:27 +00:00
parser: update state machine README
Update the state machine readme to better reflect how the chfa is encoded and works. It still needs a lot more work, but this fixes several errors in the doc and adds some info about state differential encoding, oobs, and comb compression.

Signed-off-by: John Johansen <john.johansen@canonical.com>
@@ -10,37 +10,65 @@ aare_rules.{h,cc} - code to that binds parse -> expr-tree -> hfa generation
-> chfa generation into a basic interface for converting
rules to a runtime ready state machine.

Regular Expression Scanner Generator
====================================

Notes in the scanner File Format
--------------------------------
Notes on the compressed hfa file format (chfa)
==============================================

The file format used is based on the GNU flex table file format
(--tables-file option; see Table File Format in the flex info pages and
the flex sources for documentation). The magic number used in the header
is set to 0x1B5E783D instead of 0xF13C57B1 though, which is meant to
indicate that the file format logically is not the same: the YY_ID_CHK
(check) and YY_ID_DEF (default) tables are used differently.
(check), YY_ID_DEF (default), and YY_ID_BASE (base) tables are used differently.

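The magic number check can be sketched as below. This is only an
illustration: the helper names are hypothetical, and reading the value as
big-endian is an assumption based on the flex table file format, not
something this README states.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLEX_TABLE_MAGIC 0xF13C57B1u /* magic of a stock flex tables file */
#define CHFA_TABLE_MAGIC 0x1B5E783Du /* magic used by the chfa format */

/* Assemble a 32 bit value from 4 bytes, most significant byte first.
 * The byte order is an assumption made for this sketch. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: true if the header starts with the chfa magic. */
static bool is_chfa_header(const unsigned char *hdr)
{
    return read_be32(hdr) == CHFA_TABLE_MAGIC;
}
```

A loader written this way would reject an unmodified flex table file, whose
tables are laid out the same way but interpreted differently.
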
Flex uses state compression to store only the differences between states
for states that are similar. The amount of compression influences the parse
speed.
The YY_ID_ACCEPTX tables either encode permissions directly, or are an
index into an external table.

There are two DFA table formats, to support state machines of different sizes:
  DFA16
    default/next/check are 16 bit tables
  DFA32
    default/next/check are 32 bit tables

In both DFA16 and DFA32, base and accept are 32 bit tables.

State 0 is always used as the trap state. Its accept, base and default
fields should be 0.

State 1 is the default start state. Alternate start states are stored
external to the state machine.

The base table uses the lower 24 bits as an index into the next/check tables,
and the upper 8 bits are used as flags.

The currently defined flags are
  #define MATCH_FLAG_DIFF_ENCODE 0x80000000
  #define MARK_DIFF_ENCODE 0x40000000
  #define MATCH_FLAG_OOB_TRANSITION 0x20000000

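A minimal sketch of splitting a base entry into its index and flag parts,
following the 24 bit / 8 bit layout described above. The mask names and
helper functions are hypothetical; only the three flag values come from
this document.

```c
#include <stdbool.h>
#include <stdint.h>

#define MATCH_FLAG_DIFF_ENCODE    0x80000000u
#define MARK_DIFF_ENCODE          0x40000000u
#define MATCH_FLAG_OOB_TRANSITION 0x20000000u

/* Hypothetical mask names for the two halves of a base entry. */
#define BASE_INDEX_MASK 0x00ffffffu /* lower 24 bits: next/check index */
#define BASE_FLAGS_MASK 0xff000000u /* upper 8 bits: flags */

static uint32_t base_index(uint32_t b) { return b & BASE_INDEX_MASK; }
static uint32_t base_flags(uint32_t b) { return b & BASE_FLAGS_MASK; }

static bool base_is_diff_encoded(uint32_t b)
{
    return (b & MATCH_FLAG_DIFF_ENCODE) != 0;
}

static bool base_has_oob(uint32_t b)
{
    return (b & MATCH_FLAG_OOB_TRANSITION) != 0;
}
```
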
Note the default[state] is used in two different ways.

1. When diff_encode is set, the state stores the difference to another
state defined by default. The next field will only store the
transitions that are unique to this state. Those transitions may mask
transitions in the state that the current state is relative to. Also
note that the state that this state is relative to might itself be
relative to another state. Cycles are forbidden and checked for by the
verifier. The exact algorithm used to build these state differences
will be discussed in another section.

The following two states could be stored as in the tables outlined
below:

States and transitions on specific characters to next states
------------------------------------------------------------
1: ('a' => 2, 'b' => 3, 'c' => 4)
2: ('a' => 2, 'b' => 3, 'd' => 5)

Flex-like table format
Table format - where D in base represents Diff encode flag
----------------------
index: (default, base)
0: ( 0, 0) <== dummy state (nonmatching)
1: ( 0, 0)
2: ( 1, 256)
2: ( 1, D 256)

index: (next, check)
0: ( 0, 0) <== unused entry
@@ -55,66 +83,74 @@ index: (default, base)
Here, state 2 is described as ('c' => 0, 'd' => 5), and everything else
as in state 1. The matching algorithm is as follows.

Flex-like scanner algorithm
Scanner algorithm
---------------------------
/* current state is in <state>, input character <c> */
while (check[base[state] + c] != state)
    state = default[state];
state = next[state];

while (check[base[state] + c] != state) {
    diff = (FLAGS(base[state]) & diff_encode);
    state = default[state];
    if (!diff)
        goto done;
}
state = next[base[state] + c];
done:

/* continue with the next input character */

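The diff-encoded walk can be tried out with the small C sketch below. It
hard-codes the example states 1 and 2 from above (state 2 stored relative
to state 1 at base 256) and masks the flag bits off the base entry before
indexing. The table names, the mask name and dfa_step() are hypothetical
and only meant for illustration; this is not the actual matcher code.

```c
#include <stdint.h>

#define MATCH_FLAG_DIFF_ENCODE 0x80000000u
#define BASE_INDEX_MASK        0x00ffffffu /* lower 24 bits, hypothetical name */

/* Example tables: state 1 at base 0, state 2 diff encoded against state 1
 * at base 256. States 3-5 exist only as targets and have no transitions. */
static const uint32_t ex_base[6] = {
    0, 0, MATCH_FLAG_DIFF_ENCODE | 256u, 0, 0, 0
};
static const uint16_t ex_default[6] = { 0, 0, 1, 0, 0, 0 };
static const uint16_t ex_next[512] = {
    ['a'] = 2, ['b'] = 3, ['c'] = 4,  /* state 1 */
    [256 + 'c'] = 0, [256 + 'd'] = 5, /* state 2: masks 'c', adds 'd' */
};
static const uint16_t ex_check[512] = {
    ['a'] = 1, ['b'] = 1, ['c'] = 1,
    [256 + 'c'] = 2, [256 + 'd'] = 2,
};

/* Consume one input byte and return the next state. */
static uint32_t dfa_step(uint32_t state, unsigned char c)
{
    for (;;) {
        uint32_t b = ex_base[state];
        uint32_t pos = (b & BASE_INDEX_MASK) + c;

        if (ex_check[pos] == state)
            return ex_next[pos];     /* explicit transition */
        state = ex_default[state];
        if (!(b & MATCH_FLAG_DIFF_ENCODE))
            return state;            /* plain default transition */
        /* diff encoded: retry the lookup in the state we are relative to */
    }
}
```

With these tables, dfa_step(2, 'a') falls through to state 1 and returns 2,
while dfa_step(2, 'c') returns 0 because state 2 masks the 'c' transition
of state 1.
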
This state compression algorithm performs well, except when there are
many inverted or wildcard matches ("[^x]", "."). Each input character
may cause several iterations in the while loop.
2. When diff_encode is NOT set, the default state is used to represent
all non-matching transitions (i.e. check[base[state] + c] != state).
The dfa build will compute the next state reached by the most
transitions and use that for the default state, i.e.

if we have
1: ('a' => 2)
("[^a]" => 0)
then 0 will be used as the default state

if we have
1: ("[^a]" => 2)
('a' => 0)
then 2 will be used as the default state, and the only transition encoded
in the next/check tables will be for 'a'

The combination of the diff-encoded and non-diff encoded states performs
well even when there are many inverted or wildcard matches ("[^x]", ".").


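Choosing the default state for a non-diff encoded state can be sketched as
picking the most common target among the 256 byte transitions. The function
below is a hypothetical illustration, not the dfa build code; the tie-break
(the first target to reach the highest count wins) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * trans[c] is the next state for input byte c. Returns the target that
 * occurs most often, and stores in *explicit_count how many transitions
 * would still need an entry in the next/check tables.
 */
static uint32_t pick_default(const uint32_t trans[256], size_t *explicit_count)
{
    uint32_t best = trans[0];
    size_t best_hits = 0;

    for (size_t i = 0; i < 256; i++) {
        size_t hits = 0;

        /* count how many bytes share the target of byte i */
        for (size_t j = 0; j < 256; j++)
            if (trans[j] == trans[i])
                hits++;
        if (hits > best_hits) {
            best_hits = hits;
            best = trans[i];
        }
    }
    *explicit_count = 256 - best_hits;
    return best;
}
```

For the second example above ("[^a]" => 2, 'a' => 0) this returns 2 and
leaves a single explicit transition, the one for 'a'.
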
We will have many inverted character classes ("[^/]") that wouldn't
compress very well. Therefore, the regexp matcher uses no state
compression, and uses the check and default tables differently. The
above states could be stored as follows:
Simplified Regexp scanner algorithm for non-diff encoded state (note
diff encode algorithm above works as well)

Regexp table format
-------------------

index: (default, base)
0: ( 0, 0) <== dummy state (nonmatching)
1: ( 0, 0)
2: ( 1, 3)

index: (next, check)
0: ( 0, 0) <== unused entry
( 0, 0) <== ord('a') identical, unused entries
0+'a': ( 2, 1)
0+'b': ( 3, 1)
0+'c': ( 4, 1)
3+'a': ( 2, 2)
3+'b': ( 3, 2)
3+'c': ( 0, 0) <== entry is unused
3+'d': ( 5, 2)
( 0, 0) <== (255 - ord('d')) identical, unused entries

All the entries with 0 in check (except the first entry, which is
deliberately reserved) are still available for other states that
fit in there.

Regexp scanner algorithm
------------------------
/* current state is in <state>, matching character <c> */
if (check[base[state] + c] == state)
    state = next[state];
    state = next[base[state] + c];
else
    state = default[state];
/* continue with the next input character */

This representation and algorithm allows states which match more
characters than they do not match to be represented as their inverse.
For example, a third state that accepts everything other than 'a' can
be added to the tables as one entry in (default, base) and one entry in
(next, check):

State
-----
3: ('a' => 0, everything else => 5)
Each input character may cause several iterations in the while loop,
but due to guarantees in the build at most 2n states will be
traversed for n input characters. The expected number of states
walked is much closer to n, and in practice, due to cache locality, the
diff encoded state machine is usually faster than a non-diff encoded
state machine with a strict walk of n states for n input characters.


Comb Compression
-----------------

The next/check tables of states are only used to encode transitions
not covered by the default transition. The input byte is indexed off
the base value, covering 256 positions within the next/check
tables. However a state may only encode a few transitions within that
range, leaving holes. These holes are filled by other states'
transitions whose range will overlap.

1: ('a' => 2, 'b' => 3, 'c' => 4)
2: ('a' => 2, 'b' => 3, 'd' => 5)
3: ('a' => 0, everything else => 5)

Regexp tables
-------------
@@ -132,12 +168,65 @@ index: (default, base)
0+'c': ( 4, 1)
3+'a': ( 2, 2)
3+'b': ( 3, 2)
3+'c': ( 0, 0) <== entry is unused
3+'c': ( 0, 0) <== entry is unused, hole that could be filled
3+'d': ( 5, 2)
7+'a': ( 0, 3)
( 0, 0) <== (255 - ord('a')) identical, unused entries

While the current code does not implement any form of state compression,
the flex state compression representation could be combined by
remembering (in a bit per state, for example) which default entries
refer to inverted matches, and which refer to parent states.

Regexp tables comb compressed
-----------------------------
index: (default, base)
0: ( 0, 0)
1: ( 0, 0)
2: ( 1, 3)
3: ( 5, 5)

index: (next, check)
0: ( 0, 0)
( 0, 0)
0+'a': ( 2, 1)
0+'b': ( 3, 1)
0+'c': ( 4, 1)
3+'a': ( 2, 2)
3+'b': ( 3, 2)
5+'a': ( 0, 3) <== entry was previously at 7+'a'
3+'d': ( 5, 2)
( 0, 0) <== (255 - ord('a')) identical, unused entries


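One simple way to find a base that fills such holes is a first-fit search,
sketched below. This is an assumption made for illustration and not
necessarily the placement strategy the parser uses; struct comb,
COMB_SIZE and comb_insert() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define COMB_SIZE 1024 /* table size for this sketch, hypothetical */

struct comb {
    uint16_t next[COMB_SIZE];
    uint16_t check[COMB_SIZE];
};

/*
 * Place the n explicit transitions (chars[i] => targets[i]) of <state>
 * at the lowest base where every needed slot is free. A slot is free
 * when its check entry is 0; slot 0 stays reserved.
 * Returns the chosen base, or -1 if the state does not fit.
 */
static long comb_insert(struct comb *t, uint16_t state,
                        const unsigned char *chars, const uint16_t *targets,
                        size_t n)
{
    for (size_t base = 0; base + 255 < COMB_SIZE; base++) {
        size_t i;

        for (i = 0; i < n; i++) {
            size_t pos = base + chars[i];

            if (pos == 0 || t->check[pos] != 0)
                break; /* slot reserved or owned by another state */
        }
        if (i < n)
            continue; /* collision, try the next base */
        for (i = 0; i < n; i++) {
            t->next[base + chars[i]] = targets[i];
            t->check[base + chars[i]] = state;
        }
        return (long)base;
    }
    return -1;
}
```

Inserting the explicit transitions of states 1, 2 and 3 in that order into
an empty table yields the bases 0, 3 and 5 shown above: the single 'a'
entry of state 3 lands in the hole at 3+'c', which is the same slot as
5+'a'.
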
Out of Band Transitions (oobs)
---------------------------------

Out of band transitions (oobs) allow for a state to have transitions
that cannot be triggered by input. Any state that has oobs must have
the OOB flag set on the state. An oob is triggered by subtracting the
oob number from the base index value, to find the next and check
value. Currently only a single oob is supported.

if ((FLAGS(base[state]) & OOB) && check[base[state] - oob] == state)
    state = next[base[state] - oob]

oobs might be expressed as a negative number, e.g. -1 for the first
oob. In which case the oob transition above uses a + oob instead.

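The oob lookup can be sketched in C as follows, with the oob number given
as a positive value. The function name, the mask name and the choice to
leave the state unchanged when no oob transition exists are assumptions
made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MATCH_FLAG_OOB_TRANSITION 0x20000000u
#define BASE_INDEX_MASK           0x00ffffffu /* hypothetical name */

/*
 * Try to take out of band transition number <oob> (1 for the first oob)
 * from *state. Returns true and updates *state if the transition exists,
 * otherwise returns false and leaves *state untouched.
 */
static bool dfa_oob_step(const uint32_t *base, const uint16_t *next,
                         const uint16_t *check, uint32_t *state, uint32_t oob)
{
    uint32_t b = base[*state];
    uint32_t idx = b & BASE_INDEX_MASK;

    /* the state must carry the OOB flag, and idx - oob must not underflow */
    if (!(b & MATCH_FLAG_OOB_TRANSITION) || idx < oob)
        return false;
    if (check[idx - oob] != *state)
        return false;
    *state = next[idx - oob];
    return true;
}
```

Because the oob slot sits below the base index, it can never collide with
one of the 256 input byte slots of the same state.
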
If more oobs are needed a second oob flag can be allocated, and if
used in combination with the original, would allow a state to have
up to 3 oobs:

00 - none
01 - 1
10 - 2
11 - 3


Diff Encode Spanning Tree
============================================
To build the state machine with diff encoded states and to still meet
run time guarantees about traversing no more than 2n states for n input
a spanning tree is used.

* TODO *
