.EQ
delim $$
.EN
.TL
The Diskless Fileserver
.AU
Erik Quanstrom
quanstro@coraid.com
.AB
The Plan 9 Fileserver is structured as a multilevel cache for
direct-attached WORM storage.  I describe how the Fileserver is being
adapted for modern hardware using network-attached storage (AoE) over
10Gbps Ethernet.  This structure allows for good performance and high
reliability.  In addition it separates storage maintenance from
Fileserver maintenance and provides automatic offsite backup without
performance penalty.
.AE
.NH
Introduction
.LP
In order to meet our growing performance and reliability demands, I am
in the process of rolling out a diskless Fileserver.  The system
consists of a diskless Intel-based Fileserver, a local AoE target and
an offsite AoE target.  A backup Fileserver in “standby” mode is
available in case the main Fileserver should fail.  The AoE targets
are stock
.I SR1521
machines with added 10Gbps Ethernet cards.  This configuration is
pictured in \*(Fn.
.F1
.PS
scale=10
u=2.5
gap=19
define sr | [ box "\f2SR1521\f1" ht 3*u wid 19 ] |
define fs | [ box $1 ht u wid 19 ] |
A: fs("Fileserver")
B: [sr] at A+(0, -4*u)
C: spline <-> " 10Gbe" from B.nw+(0,-3/2*u) left .75*u then up 4*u then right .5*u to A.sw+(0,u*.5)
D: fs("backup Fileserver") at A+(gap+19, 0)
E: [sr] at A+(gap+19, -4*u)
F: spline <-> "Wireless" from A.ne+(0,-1/2*u) right gap/2 to E.nw+(0,-3*u/2)
G: spline <-> "10Gbe " from E.ne+(0,-3/2*u) right .75*u then up 4*u then left .5*u to D.se+(0,u*.5)
.PE
.F2
.F3
.PP
The configuration string[2] for this Fileserver is
.P1
.CW "filsys main ce565.0{e565.1e545.1}" .
.P2
The configuration string for the backup Fileserver is
.P1
.CW "filsys main ce545.0e545.1" .
.P2
The targets
.CW e565.\f2x\fP
are connected to the Fileserver by a point-to-point 10Gbps Ethernet
link.  Except during a dump or in the event of a failure of
.CW e565.1 ,
all I/O is performed over this link.  The target
.CW e545.0
is in another building, connected by a shared 54Mbps wireless link.
.PP
The AoE targets are managed independently from the Fileserver.
Maintenance tasks, like replacing failed drives, reconfiguring or
adding storage, do not require knowledge of the Fileserver and may be
performed without shutting down the Fileserver.  Conversely, the
Fileserver does not require knowledge of how to perform maintenance on
the AoE targets.
.NH
Fileserver Basics
.LP
The Fileserver serves files via the Plan 9 file protocol, 9P2000.
Requests that cannot be directly satisfied by the in-memory Block
Cache are resolved by devices.  The Block Cache is indexed by device
and device address.  The
.CW cw
device serves the WORM filesystem.  It is composed of three on-disk
devices: cache, read-only WORM and cached WORM.  These devices are
known as
.CW c ,
.CW w ,
and
.CW cw .
All Blocks have a
.CW w-address
and a cache state.  Blocks not in the cache are state
.CW none .
Freshly written blocks are state
.CW write .
Blocks on the
.CW w
device that are rewritten are state
.CW dirty .
A “dump,” a permanent snapshot of the filesystem, is taken by
converting modified blocks to state
.CW dump .
This process takes just a few seconds.  Other activity on the
Fileserver is halted during the dump.  Copying takes place in the
background and does not impact the performance of the Fileserver.
Once state
.CW dump
blocks are copied to the WORM, their state is changed to
.CW read ,
or to
.CW none
if the block is dropped from the cache.  The copying phase of any
number of dumps may overlap.
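.PP
The cache states form a small state machine.  The following standard C
sketch restates the two transitions just described, marking modified
blocks at dump time and retiring them once they have been copied to
the WORM.  It is illustrative only; the names are not taken from the
Fileserver's source.
.P1
#include <stdio.h>

/* cache states of a block, per the description above */
typedef enum { Cnone, Cread, Cwrite, Cdirty, Cdump } State;

/* at dump time, modified blocks are converted to state dump */
State
dumpstate(State s)
{
	if(s == Cwrite || s == Cdirty)
		return Cdump;
	return s;
}

/* after a dump block is copied to the WORM it becomes read,
   or none if it has been dropped from the cache */
State
copiedstate(State s, int dropped)
{
	if(s != Cdump)
		return s;
	return dropped ? Cnone : Cread;
}

int
main(void)
{
	State s;

	s = copiedstate(dumpstate(Cwrite), 0);
	printf("final state %d (Cread is %d)\n", s, Cread);
	return 0;
}
.P2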
.LP
The implemented Fileserver has a Block Cache of 402,197 8192-byte
blocks (3137MB), a cache device,
.CW e565.0 ,
of 3,276,800 blocks (25GB) and a WORM device,
.CW "{e565.1e545.1}" ,
of 1.5TB.  The WORM device is a loose mirror of AoE targets
.CW e565.1
and
.CW e545.1 .
Writes are performed on the mirrored devices sequentially but data is
read from the first device only.  Thus the wireless connection, which
limits dumps to ~1MB/s, is not part of the client's I/O path.
.LP
The WORM filesystem is fully described in [1], [2] and [3].
.NH
“Standby” Mode
.LP
It is not possible to use both Fileservers at the same time.  Both
will try to allocate
.CW w-addresses
without respect to the other.  To solve this problem a configuration
item and command, both named
.CW dumpctl ,
were added.  The main Fileserver is configured with
.CW "dumpctl yes"
and the backup Fileserver is configured with
.CW "dumpctl no" .
To prevent writes, attaches may be disallowed.  In the event that the
main Fileserver fails, the command
.CW "dumpctl yes"
is executed on the backup Fileserver's console and, if they were
disallowed, attaches are allowed again.
.LP
While the backup Fileserver is running, it will not see the new data
written by the dump process on the main Fileserver.  The backup
Fileserver must be halted each day after the dump on the main
Fileserver and the command
.CW "recover main"
must be typed at the
.CW config
prompt.  This will cause the cache to be flushed and the filesystem to
be initialized from the new dump.
.NH
Changed Assumptions
.LP
In the fifteen-odd years since the Fileserver was developed, a few of
its assumptions have ceased to hold.  The most obvious is that the
.CW worm
device is no longer a WORM.  Even if we were to use WORM storage, disk
space is inexpensive enough that it would be practical to keep an
entire copy of the WORM on magnetic storage for performance reasons.
This means that the cache and the WORM devices have the same
performance.  Therefore it no longer makes sense to copy blocks in
state
.CW Cread
to the cache device.  Blocks in state
.CW Cread
have been read from the
.CW worm
device but not modified[3].  A new option,
.CW conf.fastworm ,
inhibits copying these blocks to the disk cache.
.PP
A less obvious difference is in the structure of the cache.  The cache
device is structured as a hash table.  The hash function is simply
modulo the number of hash lines and the lines are written sequentially
to disk.  If we let
.I n
be the number of rows and
.I l
be the number of columns in our hash, the function is
.P1
row = w % n
c = column + row*l,
.P2
and the blocks are linearized onto the disk in the following order
.P1
$0, n, 2n, ..., (l-1)n, 1, 1+n, 1+2n, ...$
.P2
Suppose that two blocks $w$ and $w+1$ are written to the Fileserver
with an empty cache.  Suppose further that $w+1~%~n~≠~0$.  Then blocks
$w$ and $w+1$ map to disk blocks $c$ and $c+l$; $l$ is
.CW CEPERBK ,
the number of cache entries per bucket.  With a block size of 8192
bytes, current Fileserver parameters and 512-byte disk sectors, this
works out to 1072 sectors between “sequential” blocks.
.PP
With disk drives of the same era as the original Fileserver, disk
transfer rates were limited by hardware buffer sizes and interface
bandwidth[5].  Assuming a transfer rate of 1MB/s and a seek time of
15ms, it would take 8ms to transfer 8192 bytes from the disk and less
than 15ms to seek to another track, or about 347KB/s.  On modern SATA
drives, it would take 26µs to transfer 8192 bytes from the disk and up
to 9ms to seek.  This would only yield 890KB/s.  During testing about
2MB/s was observed.  If this same ratio of calculated versus actual
seek time were to hold for older drives, the older drives would
operate at near rated bandwidth.
.PP
When the formula was changed to
.P1
row = w % n
c = column*n + row,
.P2
the blocks are linearized onto disk in the following order
.P1
$0, 1, ..., l-1, l, l+1, ...$,
.P2
changing from row- to column-major ordering, and performance increased
to ~25MB/s.  Note that not caching blocks in state
.CW Cread
ensures that $w$ and $w+1$ will be stored sequentially on disk, as
.CW column
will be the same for $w$ and $w+1$ unless $w+1~%~n~=~0$.  However, in
this case the blocks will also be stored sequentially because row $r$
and row $r+1$ are also sequential.
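.PP
The effect of the two layouts on disk stride can be seen in a few
lines of standard C.  The sketch below is illustrative only and is not
the Fileserver's cache code; the number of rows is invented, and the
67 entries per bucket and 16 sectors per block are implied by the
1072-sector figure above.
.P1
#include <stdio.h>

enum {
	Nrows	= 3001,	/* number of hash lines; an invented value */
	Ceperbk	= 67,	/* cache entries per bucket, implied by 1072/16 */
	Secperblk = 16,	/* an 8192-byte block is 16 512-byte sectors */
};

/* old layout: hash lines written sequentially to disk (row-major) */
long
oldaddr(long w, long column)
{
	return column + (w % Nrows)*Ceperbk;
}

/* new layout: column-major, so consecutive w-addresses are adjacent */
long
newaddr(long w, long column)
{
	return column*Nrows + w % Nrows;
}

int
main(void)
{
	long w;

	w = 1234;	/* any w with (w+1) % Nrows != 0 */
	printf("old stride: %ld sectors\n", (oldaddr(w+1, 0) - oldaddr(w, 0))*Secperblk);
	printf("new stride: %ld sectors\n", (newaddr(w+1, 0) - newaddr(w, 0))*Secperblk);
	return 0;
}
.P2
Run with these parameters, the old layout gives a 1072-sector stride
between consecutive
.CW w-addresses
and the new layout gives 16 sectors, i.e., the blocks are adjacent.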
.NH
Assumptions Redux
.LP
.DS I
If a cat can kill a rat in a minute, how long would it be killing
60,000 rats?  Ah, how long, indeed!  My private opinion is that the
rats would kill the cat.
.br
– Lewis Carroll
.DE
The Fileserver's read-ahead system consists of a queue of blocks to be
read and a set of processes which read them into the cache.  Although
the original paper on the Fileserver only lists one
.CW rah
process, the earliest Fileserver on the Labs' WORM started four.  The
Fourth Edition Fileserver again started one
.CW rah
process but attempted to sort the blocks by
.CW w-\fRaddress\fR
before processing.  This approach probably makes sense on slow,
partitioned disks.  However, it has the disadvantage of processing
blocks serially.  The more parallelism one can achieve among or within
the Fileserver's devices, the greater the performance penalty of the
sequential approach.
.PP
To test this idea, a 1GB file was created on the Fileserver on AoE
storage.  The AoE driver has a maximum of 24 outstanding frames per
target.  After rebooting the Fileserver to flush the Block Cache, it
took 25.5s to read the file.  Subsequent reads took an average of
13.72s.  After changing the read-ahead algorithm to use 10 independent
.CW rah
processes, the test was rerun.  It took 15.74s to read the file.
Increasing the number of
.CW rah
processes to 20 reduced the uncached read time to 13.75s, the same as
the cached read time.  Two concurrent readers can each read the entire
file in 15.17s, so the throughput appears to be limited by
.CW 9P/IL
latency.
.NH
Core Improvements
.LP
The
.CW port
directory underwent some housecleaning.  The
.CW 9p1
protocol was removed.  The console code was rewritten to use the
.CW 9p2
code.  The time zone code was replaced with the offset pairs from the
CPU kernel to allow for arbitrary time zones.  A CEC console was added
to allow access without a serial console.
.PP
More significantly,
.CW Lock s
were changed from queueing locks to spin locks.  Since a significant
use of spin locks is to lock queues to add work and wake consumers,
.CW unlock
reschedules if the current process no longer holds any locks and has
woken processes while it held locks.  Also, the scheduler takes care
not to preempt a process with locks held.  This improved the
throughput of single-threaded reads by 25%.  These ideas were taken
from the CPU kernel.
.PP
Networking was changed to allow interfaces with jumbo MTUs.  This is
not currently used by the IL code as it has no MTU discovery
mechanism.
.NH
PC Architecture Improvements
.LP
By far the largest change in the PC architecture was to memory
handling.  The primary goal was to be able to handle most of the
bottom 4GB of memory.  Thus the definition of
.CW KZERO
needed to be changed.  The PC port inherited its memory layout from
the MIPS port.  On the MIPS processor, the high bit indicated kernel
mode.  Thus Fileserver memory was mapped from
.CW 0x80000000
to the top of memory.  Converting between a physical and virtual
address was done by inverting the high bit.  While simple, this scheme
allows for a maximum of only 2GB.  Lowering
.CW KZERO
to
.CW 0x30000000
and mapping PCI space to
.CW 0x20000000
allows for 3328MB of memory.
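.PP
The address translation under the two schemes can be sketched in a few
lines of C.  This is illustrative only and is not taken from the
Fileserver's source; the old scheme toggles the high bit as described
above, while the add-and-subtract form of the new scheme is an
assumption based on the constants given here.
.P1
#include <stdio.h>

typedef unsigned long ulong;

#define OLDKZERO	0x80000000UL
#define KZERO		0x30000000UL

/* old scheme, inherited from MIPS: the high bit marks kernel
   addresses, so translation toggles that bit; at most 2GB fits */
void*
oldkaddr(ulong pa)
{
	return (void*)(pa ^ OLDKZERO);
}

ulong
oldpaddr(void *va)
{
	return (ulong)va ^ OLDKZERO;
}

/* new scheme: kernel memory starts at KZERO = 0x30000000, so
   translation adds or subtracts KZERO, leaving the 3328MB window
   0x30000000-0xFFFFFFFF for physical memory */
void*
kaddr(ulong pa)
{
	return (void*)(pa + KZERO);
}

ulong
paddr(void *va)
{
	return (ulong)va - KZERO;
}

int
main(void)
{
	ulong pa;

	pa = 0x00100000UL;	/* physical address of the 1MB boundary */
	printf("old %p new %p\n", oldkaddr(pa), kaddr(pa));
	return 0;
}
.P2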
.PP
Unfortunately, being able to recognize more memory puts us in greater
danger of running into PCI space while sizing memory, so another
method is needed.  A BIOS
.CW 0xe820
scan was chosen.  Unfortunately, the processor must be in Real mode to
perform the scan and the processor is already in Protected mode when
the Fileserver kernel is started.  So, instead of switching back to
Real mode,
.CW 9load
was modified to perform the scan before turning on paging[8].
.PP
Surprisingly, the preceding changes were not enough to enable more
memory.  The Fileserver faulted when building page tables.  It turned
out this is because the temporary page tables built by
.CW 9load ,
which map only 4MB, were not enough.  The BIOS scan of the testing
machine yielded 3326MB of accessible memory.  This would require
3.25MB of page tables.  Since the bottom megabyte of memory is
unusable, that leaves no room in the mapped 4MB for both the kernel
and the page tables.  The solution was to use 4MB pages.  This
eliminates the need for page tables, as the 1024-entry page directory
has enough space to map 4GB of memory.
.PP
On 64-bit processors, it would be relatively easy to fill in more
memory from above 4GB by using the 40-bit extensions to 4MB pages.
.NH
The AoE Driver
.DS I
If you were plowing a field which would you rather use, 2 strong oxen
or 1024 chickens?
.br
– Seymour Cray
.DE
.LP
This is the Fileserver's raison d'être.  The AoE driver is based on
the Plan 9 driver.  It is capable of sending jumbo or standard AoE
frames.  It allows up to 24 outstanding frames per target.  It also
allows a many-to-many relationship between local interfaces and target
interfaces.
.PP
When the AoE driver gets an I/O request, a
.CW Srb
structure is allocated with
.CW mballoc .
Then the request is chopped up into
.CW Frame
structures as they become available.  Each is sized to the MTU of the
chosen link.  A link is chosen in round-robin fashion, first among
local interfaces which can see the target and then among the target's
MAC addresses.  MTUs may be freely mixed.  The frames are sent and the
number of outstanding frames is appropriately incremented.  The driver
then sleeps on the
.CW Srb .
When it is awoken, the process is repeated until all the bytes in the
request have been transferred.
.PP
When an AoE frame is received that corresponds to I/O, the frame is
copied into the buffer of the
.CW Srb
and the number of outstanding frames is decremented.  If there are no
outstanding frames remaining, the
.CW Srb
is woken.
.PP
Since the Myricom 10Gbe cards have an MTU of 9000 bytes, an entire
8192-byte block and the AoE header fit into a single frame.  Thus
sequential read performance depends on frame latency.  Performance was
measured with a process running the following code
.P1
static void
devcopy(Dcopy *d)
{
	Iobuf *b;

	for(d->p = d->start; d->p < d->lim; d->p++){
		b = getbuf(d->from, d->p, Bread);
		if(b == 0)
			continue;
		putbuf(b);
	}
}
.P2
The latency for a frame with 8192 data bytes is 79µs, giving 12,500pps
or 103MB/s, while two concurrent reads yield 201MB/s.  Testing beyond
this level of performance has not been performed.
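.PP
The frame segmentation described earlier in this section can also be
sketched in standard C.  The structures, the header overhead and the
two-interface configuration below are invented for the example; only
the policy, chopping the request into frames sized to the MTU of a
link chosen in round-robin fashion and counting the frames
outstanding, follows the description above.
.P1
#include <stdio.h>

enum { Aoeoverhead = 36 };	/* assumed per-frame header overhead */

typedef struct {
	char	*name;
	int	mtu;
} Link;

/* chop one request into frames: each frame is sized to the MTU of a
   link chosen round-robin among the interfaces that reach the target */
int
sendframes(Link *l, int nl, long nbytes)
{
	int i, out;
	long n;

	out = 0;
	for(i = 0; nbytes > 0; i = (i+1) % nl){
		n = l[i].mtu - Aoeoverhead;
		if(n > nbytes)
			n = nbytes;
		printf("frame %d: %ld bytes via %s\n", out, n, l[i].name);
		nbytes -= n;
		out++;		/* outstanding frames for this Srb */
	}
	return out;
}

int
main(void)
{
	/* MTUs may be freely mixed */
	Link l[] = { {"10Gbe (jumbo)", 9000}, {"1Gbe (standard)", 1500} };

	printf("outstanding frames: %d\n", sendframes(l, 2, 2*8192));
	return 0;
}
.P2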
.NH
System Performance
.LP
I measured both latency and throughput of reading and writing bytes
between two processes for a number of different paths.  2007
measurements were made using an
.I SR1521
AoE target, an Intel Xeon-5000-based CPU server with a 1.6GHz
processor and a Xeon-5000-based Fileserver with a 3.0GHz processor.
1993 measurements are from [7].  The latency is measured as the round
trip time for a byte sent from one process to another and back again.
Throughput is measured using 16k writes from one process to another.
.ps -2
.DS C
.TS
box, tab(:);
c s s s s
c | c | c | c | c
a | n | n | n | n.
Table 1 – Performance
_
test:93 throughput:93 latency:07 throughput:07 latency
:MB/s:µs:MB/s:µs
_
pipes:8.15:255:2500:19
_
IL/ether:1.02:1420:78:72
_
URP/Datakit:0.22:1750:N/A\&:N/A\&
_
Cyclone; AoE:3.2:375:≥250:49
.TE
.DE
.NL
.LP
Random I/O was not tested for two reasons.  First, ~3GB of recent
reads and writes are stored in the Block Cache and when new or newly
modified files are reread from the cache, they are reread
sequentially.  It is expected that the working set of the Fileserver
fits in the Block Cache.  Second, since a single IL connection is
latency limited, reads of highly fragmented files like
.CW /sys/log/auth
from the WORM are not meaningfully slower (59MB/s) than reads from the
cache (62MB/s).
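.PP
The latency column in Table 1 was obtained as the round-trip time for
a single byte bounced between two processes.  The following is a
minimal standalone sketch of such a ping-pong measurement over a pipe,
written against POSIX for brevity; it is illustrative only and is not
the program used to produce Table 1.
.P1
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

enum { N = 100000 };

int
main(void)
{
	int ptoc[2], ctop[2], i;
	char c;
	double us;
	struct timeval t0, t1;

	if(pipe(ptoc) < 0 || pipe(ctop) < 0)
		return 1;
	switch(fork()){
	case -1:
		return 1;
	case 0:
		/* child: echo each byte straight back */
		while(read(ptoc[0], &c, 1) == 1)
			write(ctop[1], &c, 1);
		_exit(0);
	}
	gettimeofday(&t0, 0);
	for(i = 0; i < N; i++){
		c = 'p';
		write(ptoc[1], &c, 1);	/* one byte to the other process... */
		read(ctop[0], &c, 1);	/* ...and back again */
	}
	gettimeofday(&t1, 0);
	us = (t1.tv_sec - t0.tv_sec)*1e6 + (t1.tv_usec - t0.tv_usec);
	printf("%.2f microseconds round trip\n", us/N);
	return 0;
}
.P2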
.NH
Discussion
.LP
Decoupling storage from the Fileserver with AoE allows for automatic
offsite backup and affords good availability, scalability and
performance.  The Fileserver is not involved in storage management.
It is possible to grow the existing WORM to 9TB without restarting the
Fileserver.  By reconfiguring the Fileserver, essentially unlimited
storage may be added.
.LP
The sizes of the WORM and the Block Cache have scaled by a factor of
1000 since [3] and single IL connections have scaled by a factor of
200 since [7].  The Block Cache is currently at a practical maximum
for a kernel with 32-bit memory addresses.  A kernel with 64-bit
memory addresses is the next logical step.
.LP
The disk cache has not been scaled to the same extent, since an
increased number of cache buckets would put more pressure on the Block
Cache and would not provide much benefit.  With the
.CW conf.fastworm
option, the cache only needs to be large enough to hold the free list
and any blocks in state
.CW dirty
or
.CW write .
Eliminating the cache device may make sense in the future.  The cache
device could be replaced with the address of the current Superblock.
Addresses below the current Superblock would be read only.  The
disadvantage of such a scheme is that it would give up the (currently
unused) opportunity the dump process provides to optimize the ordering
of
.CW w-addresses .
.NH
References
.IP [1]
K. Thompson, “The Plan 9 File Server”, Plan 9 Programmer's Manual,
Second Edition, volume 2, AT&T Bell Laboratories, Murray Hill, NJ,
1995.
.IP [2]
K. Thompson, G. Collyer, “The 64-Bit Standalone Plan 9 File Server”,
Plan 9 Programmer's Manual, Fourth Edition, volume 2, AT&T Bell
Laboratories, Murray Hill, NJ, 2002.
.IP [3]
S. Quinlan, “A Cached WORM File System”, \f2Software — Practice and
Experience\f1, volume 21, number 12, pp. 1289-1299.
.IP [4]
S. Hopkins, B. Coile, “ATA over Ethernet”, published online at
http://www.coraid.com/documents/AoEr10.txt
.IP [5]
A. Tanenbaum, \f2Operating Systems: Design and Implementation\f1,
Prentice Hall, Englewood Cliffs, New Jersey, 1987, p. 272.
.IP [6]
Diskless Fileserver source code at /n/sources/contrib/quanstro/src/myfs.
.IP [7]
D. Presotto, P. Winterbottom, “The Organization of Networks in Plan 9”,
\f2Proc. of the Winter 1993 USENIX Conf.\f1, pp. 271-280, San Diego,
CA.
.IP [8]
Modified 9load source code at /n/sources/contrib/quanstro/src/9loadaoe.