The Plan 9 file protocol, 9P, defines per-connection state such as open files and pending requests. When a connection is lost due to network interruptions or machine failures, that state is lost. The Plan 9 kernels make no attempt to reestablish per-connection state. Instead, each application must, if it wishes to continue running across a network failure, remount its file servers and reopen any files.
.PP
Modifying every program to accommodate connection failures would be difficult and error-prone, and in environments such as the Internet, where connections fail frequently, the problem cannot be ignored. 9P, the standard protocol for exporting file systems in Plan 9, is stateful: for each client connection there is state kept on the server. When the connection breaks this state is lost, and an application using the file system cannot continue to use it.
.CW Recover
is a program originally written by Russ Cox for the Plan 9 second edition. It interposes itself between a 9P client and a 9P server, on the client side, in order to recover broken connections and to help clients survive server failures. It works by decoupling the client from the server: it keeps account of the state held by the server and pushes that state back if the connection breaks. Instead of seeing an "i/o on hungup channel" error, the client experiences only a momentary blocking of its file system operations while the connection recovers.
.AE
.SH
Introduction
.LP
Plan 9 [Pike95] is a flexible distributed system. It owes its versatility to three simple principles. First, resources are named and accessed like files in a hierarchical file system. Second, there is a standard protocol, called 9P, for accessing these resources. Third, the disjoint hierarchies provided by different services are joined together into a single private hierarchical file name space. Since resources are represented as files and universally accessed through 9P, recover was written as an interposer between a 9P server and a 9P client.
.PP
A 9P
.I server ,
described in
.I intro (5)¹,
.FS
¹ From now on, a number in parentheses beside a word, like
.I proc (3),
is a reference to the entry for that word in the corresponding section of the Plan 9 manual [9man].
.FE
is an agent that provides one or more hierarchical file systems \(em file trees \(em that may be accessed by clients. A server responds to requests by clients to navigate the hierarchy, and to create, remove, read, and write files. The prototypical server is a separate machine that stores large numbers of user files on permanent media; such a machine is called, somewhat confusingly, a
.I file
.I server .
Another possibility for a server is to synthesize files on demand, perhaps based on information from data structures inside the kernel; the
.I proc (3)
kernel device is a part of the Plan 9 kernel that does this. User programs can also act as servers.
.PP
A
.I connection
to a server is a bidirectional communication path from the client to the server. There may be a single client or multiple clients sharing the same connection.
.PP
9P2000 is the most recent version of 9P, the Plan 9 distributed resource protocol. It is a typical client/server protocol with request/response semantics for each operation (or transaction). 9P can be used over any reliable, in-order transport. While the most common usage is over pipes on the same machine or over TCP/IP to remote machines, it has been used over a variety of media and encapsulated in several different protocols.
.PP
9P has 12 basic operations, all of which are initiated by the clients. Each request (or T-message) is satisfied by a single associated response (or R-message). In the case of an error, a special response (R-error) is returned to the client containing a variable length error string. The operations summarized in the following table fall into three categories: session management, file operations, and meta-data operations.
.DS
.TS
box, center;
cb | cb | cb
a | a | a .
class	op-code	desc
=
session	version	version & parameter negotiation
management	auth	authentication
	attach	establish a connection
	flush	abort a request
	error	return an error
_
file	walk	lookup files and directories
operations	open	open a file
	create	create and open a file
	read	transfer data from a file
	write	transfer data to a file
	clunk	release a file
_
metadata	stat	read file attributes
operations	wstat	modify file attributes
.TE
.DE
.PP
The combined acts of transmitting a request of a particular type and receiving its reply is called a transaction of that type.
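.PP
To make the message flow concrete, the following trace shows the transactions a client might perform to attach to a server and read a file. The trace is illustrative: the formatting loosely follows the style printed by the
.I fcall (2)
conversion routines, and the qids, counts and file names are invented for the example.
.P1
Tversion tag 65535 msize 8192 version '9P2000'
Rversion tag 65535 msize 8192 version '9P2000'
Tattach tag 1 fid 0 afid -1 uname glenda aname ''
Rattach tag 1 qid (0000000000000000 0 d)
Twalk tag 2 fid 0 newfid 1 nwname 2 0:lib 1:profile
Rwalk tag 2 nwqid 2 ...
Topen tag 3 fid 1 mode 0
Ropen tag 3 qid (00000000000000ff 1 ) iounit 8192
Tread tag 4 fid 1 offset 0 count 8192
Rread tag 4 count 532 ...
Tclunk tag 5 fid 1
Rclunk tag 5
.P2
.PP
Note that the tag names the transaction and the fid names the file; these two identifier spaces are exactly what recover has to keep track of.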
.PP
The 9P protocol is stateful. This state is represented by an abstraction called a fid, a 32-bit unsigned integer that the client uses in a T-message to identify the ``current file'' on the server. Fids are somewhat like file descriptors in a user process, but they are not restricted to files open for I/O: directories being examined, files being accessed by
.I stat (2)
calls, and so on \(em all file system elements being manipulated by the operating system \(em are identified by fids. Fids are chosen by the client. All requests on a connection share the same fid space; when several clients share a connection, the agent managing the sharing must arrange that no two clients choose the same fid. Fids are used as handles for navigating the hierarchy and, in general, to multiplex the channel. Each fid has its own state on the server, including its current path, whether it is open, and the current offset if it is an open file or directory. This makes the implementation easier and faster, as only the current operation relative to the fid context needs to be communicated to the server.
.PP
While statefulness has many benefits, it also has an important downside. If a connection breaks, all the state of the fids is lost. Discarding the state of dead connections is essential for garbage collection on the servers, but it means that when a connection is broken or a server reboots, all the clients are left working on a context which is no longer valid. Any application using a file system in this state sees an "i/o on hungup channel" error in Plan 9 and has to be restarted. A diskless machine whose root file system was reached through such a connection has the same problem and needs to reboot.
.PP
9P is being used in environments requiring a higher degree of robustness than the research environments it was originally designed for. As such, a mechanism for re-establishing broken connections and recovering their state (particularly in the face of server errors) is important. Clients must be able to keep running in the face of a network failure or a server reboot. The ability to fail a connection over between redundant file servers is also desirable.
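.PP
The per-fid state that must be recorded to survive a reconnection is small and explicit. The following structure is a minimal sketch of that bookkeeping; the field names are ours, for illustration, and do not reproduce recover's actual source.
.P1
#include <u.h>
#include <libc.h>

/* Per-fid state to replay after a reconnection (illustrative). */
typedef struct Fid Fid;
struct Fid
{
	u32int	fid;	/* client-chosen identifier */
	char	*path;	/* accumulated walk from the root */
	int	omode;	/* open mode, or -1 if not open */
	vlong	offset;	/* implicit offset of an open directory */
};
.P2
.PP
Replaying a fid after a reconnection amounts to walking from the root to
.CW path
again and, if
.CW omode
says the file was open, reopening it. As discussed later, the directory offset is the one piece of server state that cannot simply be pushed back.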
.PP
We needed a 9P interposer capable of recording the state of the fids of a connection and reestablishing that state in the case of any error. The interposer would run on the client side and should work with an unmodified server and client. Ideally the client would just block while the recovery is taking place, and in the normal case of no errors the performance penalty added by the interposer should be negligible. Recover is a program meant to provide this interposer.
.SH
Architecture
.LP
As mentioned previously, recover interposes itself on a 9P connection, on the client side. It dials a network connection to the server and posts a file in /srv from which the original file system can be mounted through recover.
.PP
Recover is composed of two processes:
.I listensrv
and
.I listennet .
A shared lock arbitrates access to their common resources.
.I Listensrv
listens for T-messages from the client via the srv file and forwards them to the server through the network connection.
.I Listennet ,
on the other side, listens for R-messages from the server via the network connection and sends them to the client through the srv file. Each T-R transaction corresponds to a Request structure. When a T-message arrives, it is processed and the corresponding Request is allocated, with the tag of the message as its identifier. When the response comes, the Request can be looked up in a hash table keyed by the tag.
.KF
.PS
copy "arch.pic"
.PE
.Cs
Figure 1. Recover Architecture
.Ce
.KE
.PP
There are two different kinds of requests: internal and external. External requests are generated by the client and forwarded to the server. Internal requests are generated by recover itself in the event of a connection failure. The functions internalresponse and externalresponse are called by listennet when it reads an R-message from the server; which one is called is determined by the isinternal flag in the Request structure.
.PP
When
.I listensrv
wants to send a request, it calls
.I queuereq .
.I Queuereq
tries to send the request, unless the fid is not ready, which means that the connection is down or that a recovery is in process. If the connection is down,
.I listennet
will eventually find out and call
.I redial ,
the function used to reconnect. When
.I redial
succeeds, all the queued requests which could not be transmitted are retransmitted. Before the retransmission, though, the remote fid state (the state associated with the fid in the connection between recover and the server) has to be looked up to see if the fid has been rewalked. If it has not, internal requests are sent to rewalk the saved fid path and the operation stays queued. Once the Rwalks are received, all the external requests relative to the fid are restarted, until all operations complete.
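.PP
The following sketch shows the shape of this bookkeeping. It is illustrative C written for this paper in the style of the description above; the structure layout and the helper names are ours, not recover's actual source.
.P1
#include <u.h>
#include <libc.h>
#include <fcall.h>

typedef struct Req Req;
struct Req
{
	Fcall	t;		/* parsed T-message */
	int	isinternal;	/* generated by recover, not by a client */
	Req	*qnext;		/* queue of requests waiting to be (re)sent */
	Req	*hnext;		/* hash chain, keyed by t.tag */
};

enum { Nhash = 64 };
static Req *hash[Nhash];
static QLock lk;		/* the shared lock arbitrating access */

/* Remember an outstanding request so its R-message can find it. */
void
addreq(Req *r)
{
	qlock(&lk);
	r->hnext = hash[r->t.tag % Nhash];
	hash[r->t.tag % Nhash] = r;
	qunlock(&lk);
}

/* Look up and unlink the request matching a received R-message. */
Req*
findreq(uint tag)
{
	Req **l, *r;

	qlock(&lk);
	for(l = &hash[tag % Nhash]; (r = *l) != nil; l = &r->hnext)
		if(r->t.tag == tag){
			*l = r->hnext;
			break;
		}
	qunlock(&lk);
	return r;
}
.P2
.PP
With these helpers,
.I queuereq
reduces to appending the Req to the queue and transmitting it immediately only when the connection is up and its fid has been rewalked; otherwise the queue is drained by
.I redial
once the rewalks complete.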
.PP
Normally on
.I redial ,
after the
.CW version ,
.CW attach
and
.CW auth
messages exchanged to initiate the session, the only extra work needed is to rewalk the fids and reopen the ones which were open on the lost connection. After that, the outstanding requests can be sent. There is a special case, though. Directories cannot be seeked in 9P. As a consequence, there is implicit state in the server associated with a directory fid which has to be pushed back into it: the point the last read got to. There are two possible solutions. As it is not possible to start a read at an arbitrary offset, one solution would be to read from the start, throwing the results away, until the old offset is restored; this can get complicated if the directory has changed in the meantime. We opted for the other solution, which is to reread the complete directory. This is done by rewriting the offset coming from the client to make it zero on the server. The only problem this approach could generate, repeated entries in the resent reads, is already solved: binds in Plan 9 already produce duplicate directory entries, so clients have to cope with them anyway.
.PP
Another interesting exception is ORCLOSE files, which the server removes when their fid is clunked. If the connection breaks down, they disappear. Instead of having ORCLOSE files disappear under us, we rewrite the open and create messages in order to make them normal files; when the clunk for the fid comes, we remove the file ourselves. We also forbid opening exclusive-use files under recover in order to prevent deadlock, which can be a tough problem, for example, with mail boxes.
.PP
9P admits a specifier for
.CW attach
operations, which includes a user name and a string naming a file tree on the server. Only one mount specifier works with recover at the moment. In order to support more than one, each new
.CW attach
would have to be processed, including its authentication. Implementing this is not easy, because the dialog with factotum would have to be pushed into a per-fid state machine. All the authentication is now done at startup, before any client can mix its operations with ours: we send the
.CW auth
message, negotiate the keys with factotum, do the auth RPCs over the authentication fid and finally make the
.CW attach
transaction. Once started, on receiving an
.CW attach
from a client, we just convert it to a
.CW walk
that clones the root fid. Doing the
.CW attach
and
.CW auth
only at startup is much easier because it is simple to send a request and read the connection for the reply without worrying about transactions from other fids getting in between. For multiple specifiers this dialog would have to be multiplexed over the different fids, mixed with other operations, or at least one pair of listensrv and listennet processes would have to be run per specifier, which then creates the problem of managing and communicating the processes.
.PP
New
.CW attach
messages for the same specifier are rewritten as
.CW walk
messages in order to create a new fid for the new client. As long as no new specifier is needed, this works well.
.SH
Debugging
.PP
Probably the most difficult part of writing any software is debugging. This is especially true for something like recover, because the network can fail at any point. If we view the recover server as a state machine, the failure can come in any of the states of the machine, and the number of possible states is very big. We kept finding bugs in states the file system had not been through before. This is compounded by the fact that recover does its job when things fail, and normally things do not fail. Two simple observations helped us develop a debugging test environment which has made recover very stable with very little effort. The first is that the state machine of recover is big, but its state is pushed onto the server by the client. The second is that the quantum in which this state is pushed is a 9P message. So the only thing we have to do to drive the software through many of its representative states is to perform simple operations on the file system and break the connection after every N messages. We can do this more than once, breaking the connection, by closing the connection file, after every (n1, n2, n3...) messages, where each of these numbers ranges from one message to the number of messages in the operation we are trying to test.
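.PP
This idea fits in a very small program. The sketch below is an illustrative fault injector written for this paper, not recover's actual test code: it forwards one direction of a 9P conversation, counting messages by their four-byte size prefix, and hangs up after the Nth message.
.P1
#include <u.h>
#include <libc.h>

/* Copy 9P messages from fd0 to fd1; hang up after nmsg messages. */
void
breakafter(int fd0, int fd1, int nmsg)
{
	uchar hdr[4], *buf;
	uint n, size;

	for(n = 0; n < nmsg; n++){
		if(readn(fd0, hdr, 4) != 4)
			break;
		/* size[4] is little-endian and includes itself */
		size = hdr[0] | hdr[1]<<8 | hdr[2]<<16 | hdr[3]<<24;
		if(size < 4 || (buf = malloc(size)) == nil)
			break;
		memmove(buf, hdr, 4);
		if(readn(fd0, buf+4, size-4) != size-4 ||
		    write(fd1, buf, size) != size){
			free(buf);
			break;
		}
		free(buf);
	}
	close(fd0);		/* simulate the network failure */
	close(fd1);
}
.P2
.PP
Running the operation under test once for every cutoff value then exercises the recovery paths deterministically.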
We ended up with a vector of message numbers which we can apply consistently, with a freshly started recover, to a test operation. We applied this environment with (n1, n2), for every possible value of the tuple, to all the file system operations, like open, read and stat, and to some compound operations. We found some bugs which were very easy to correct, as they were completely deterministic.
.SH
Performance
.PP
We have two versions of recover which share almost all the code: one for Plan 9 and one for Plan9ports [P9ports], a port of most of the software running on Plan 9 to Unix-like systems. We measured the performance of both using postmark [Postmark], which we ported to Plan 9. All the measurements used 16384 transactions.
.PP
In Plan 9 the results were what we expected. Using recover on the loopback we got roughly a factor of two degradation in latency for each operation. This is because most of the time is spent on context switches between kernel and user space; as recover runs in user space, the number of context switches doubles and the performance is halved. These measurements can be seen in figure 2. Over the network, the latency added by the network hides the latency added by recover: the effect of using recover is a performance degradation of around ten percent or less. This is shown in figure 3, which depicts measurements for 100 Mbps and gigabit Ethernet.
.PP
On Linux, on the other hand, the performance was worse: we got about one third of the performance when we added recover on the loopback. We looked more deeply into the matter and came up with figure 4. The first measurement in figure 4 stands for postmark run over a directory mounted through the loopback from a server running on the same host. The second stands for a measurement of the same server, but mounted over a named pipe which serves the loopback connection; this is the normal way to use the network in Plan9ports, and it emulates the behaviour of the srv file system in Plan 9. The third is the measurement of postmark run through recover, with recover going directly through the network. The two measurements which are equivalent, in that both go through a named pipe, are the second and the third, so we take a twenty percent performance loss because of recover. The huge difference between the first and the second measurement points to a problem in the way the named pipe is managed. It could be argued that these results, as in the case of Plan 9, would be hidden by a fast network, so we did the same measurements over gigabit Ethernet. The results, shown in figure 5, demonstrate that the problem persists even there, and recover gets even worse results in these measurements too. It has to be taken into account that this performance loss does not appear in Plan 9, where recover shares almost all the code; this points to a problem in some of the libraries or system infrastructure of Plan9ports. We profiled the system using oprofile [Oprofile] and found that it spends most of the time in locks, so the problem is probably in the thread library, though this issue has to be investigated more thoroughly.
.WS 1
.KF
.BP p9_local.eps 2.04i
.Cs
Figure 2. Postmark with and without recover on the loopback on Plan 9
.CE
.KE
.WS 1
.KF
.BP p9.eps 2.04i
.Cs
Figure 3. Postmark with and without recover over the network on Plan 9
.CE
.KE
.WS 1
.KF
.BP linux_local.eps 2.04i
.Cs
Figure 4. Postmark with and without recover over the loopback on Linux
.CE
.KE
.WS 1
.KF
.BP linux_giga.eps 2.04i
.Cs
Figure 5. Postmark with and without recover on a gigabit network on Linux
.CE
.KE
.SH
Related Work
.PP
PhilW wrote modifications to the Plan 9 kernel that try to reconnect broken connections from inside the kernel.
.PP
.I Aan (8)
tunnels traffic between a client and a server through a persistent network connection. If the connection breaks, the aan client re-establishes it by redialing the server. Aan uses its own protocol to make sure no data is ever lost, even when the connection breaks: after a reconnection, aan retransmits all unacknowledged data between client and server. Aan requires a modified server to establish the other end of the tunnel; as a consequence, it cannot be run against unmodified file servers. Aan also works at the network level, so it does not understand the meaning of the file operations passing over it. As a consequence, it does not work in the event of the server hanging or rebooting, because the state of the aan connection is lost, and for the same reason it cannot do failover either.
.PP
Redirfs is a program which serves a 9P connection out of a mounted file system, with the same purpose as recover. Some of the applications we use recover with do not have a Plan 9 kernel on the client side, just a lightweight library kernel and a 9P connection to the server. We needed a 9P-to-9P interposer, so redirfs did not work as we needed.
.SH
Future Work
.LP
Some synthetic file systems cannot be used with recover as it is now, especially in the event of a server reboot. One example is /net; see, for example,
.I ip (3).
In /net, some operations, like creating a connection, are built from many file operations which taken separately do not mean anything. Also, some files, like those representing IP connections, are not replaceable. We are trying to figure out ways of doing this for the file systems we use. Some of the ideas behind Plan B's /net [Net] may provide a solution.
.PP
Recover is now a user space program. It could be integrated into the kernel to make it faster. Given the results obtained in Plan 9, we do not think integrating recover into the kernel would be necessary for normal users: recover is normally used over a network whose latency is high enough that the performance gain would not be worth it. Users who run recover through the loopback and need very high performance may be interested in doing it, because it would probably double the performance. On Linux, some other issues have to be dealt with first, so that the performance of recover becomes comparable to that on Plan 9.
.PP
In some cases, the applications may need to know that a reconnection has happened. How this should be done is not clear. One way would be to return an error, together with a library wrapper that hides the error from legacy applications and lets interested applications wait for it through a specific interface.
.SH
Conclusion
.LP
9P is stateful, which makes it simpler and more efficient. Recover removes the downside of this approach by providing high availability and failover for file systems in the case of a server shutdown or a broken connection. It provides a safety layer which effectively isolates the client from the loss of state at the server.