.\"<-xtx-*> refer/refer -n -e -p Refs ugrid.ms | tbl | troff -mpm -mpictures | lp -d stdout .nr dT 4 \" delta tab stop for programs .de EX .nr x \\$1v \\!h0c n \\nx 0 .. .de FG \" start figure caption: .FG filename.ps verticalsize .KF .BP \\$1 \\$2 .sp .5v .EX \\$2v .ps -1 .vs -1 .. .de fg \" end figure caption (yes, it is clumsy) .ps .vs .br \l'1i' .KE .. .FP palatino .TL ...Grids Visible and Invisible Making Infernal Grids Usable .AU C H Forsyth .AI Vita Nuova 3 Innovation Close York Science Park York YO10 5ZF .SP 6 January 2006 .br .CW forsyth@vitanuova.com .AB We regard a `grid' as a specialised application of distributed systems. We describe the construction of several production grids implemented using the Inferno distributed operating system. The underlying technology was originally a convenient means to an end: both systems were developed as commercial projects, designed in close cooperation with the piper-paying end users, and that contributed more than the technology to the results being regarded as eminently usable by them. The technology did, however, help to speed development, and makes even its low-level interfaces easier to document, understand, and maintain. It represents all grid functionality, including the scheduler, as directory hierarchies, accessed through normal file IO operations, and imported and exported through the Internet as required. .AE .NH 1 Introduction .LP Vita Nuova applies novel technology to the construction of distributed systems. The background of the founders includes the successful development during the 1990s of systems for real-time auctions (of pigs and cars, separately) and on-line Internet game playing. These were significant distributed systems produced on commercial terms using conventional technology (C and Unix) in unconventional ways. For instance, the interfaces between the various components of the auction system were specified in the Promela protocol specification language, verified automatically using the SPIN protocol verifier, and the implementation as distributed communicating processes was derived directly from them. We later formed Vita Nuova to develop and exploit some (then) new systems originally developed by the Bell Labs research centre that produced Unix thirty years ago, Plan 9 and Inferno. Unlike Unix (and Unix clones) they were designed from the start with a networked world in mind. .LP In the past few years, Vita Nuova has undertaken several successful projects to build both resource-sharing and computational grids, for end-users. The aim was not to produce a toolkit, but usable deployed systems: `end users' in our case meant the researchers and scientists, not users of a programming interface. Nevertheless, some scientists need a programming interface, and we want that to be straightforward. Various aspects of the underlying technology can help. .LP I shall describe three systems we built and the implementation process, and draw some conclusions. All three systems were implemented using the Inferno system. We do feel that its technological model made the systems easier to design and build, quickly, to good standards, and I shall suggest some reasons why. More important though is a key aspect of the development process: in each case we worked closely with the customers — the scholars and scientists that were the end users — to create a system that met their needs, that they could deem `usable'. .NH 1 Inferno .LP Inferno .[ inferno operating system .] .[ inferno programmers manual .] 
is an operating system that provides an alternative universe for building distributed systems.
The environment presented to applications is adopted from Plan 9,
.[
plan 9 name spaces
.]
and unorthodox.
All resources are represented in a hierarchical name space that can be assembled dynamically to per-process granularity.
.[
styx architecture distributed systems
.]
Resources include conventional file systems, but also devices, network interfaces, protocols, windowing graphics, and system services such as resource discovery and name resolution.
Inferno runs `native' on bare hardware but, unusually, it can also run `hosted' as an application under another operating system, including Windows and Unix-like systems.
It provides a virtual operating system for distributed applications, including a virtual network, rather than acting as conventional `middleware'.
.LP
Inferno applications access all types of resources using the file I/O operations: open, read, write and close.
An application providing a service to be shared amongst many clients acts as a file server.
A
.I "file server"
in this context is simply a program that interprets a common network-independent file service protocol, called Styx™ in Inferno and 9P2000 in Plan 9.
.[
inferno programmers manual
.]
.[
ubiquitous file server
.]
Distribution is achieved transparently by importing and exporting resources from one place to another using that protocol, across any suitable transport.
An application runs in a local name space that contains the resources it needs, imported as required, unaware of their location.
Resources can readily be securely imported across administrative boundaries.
When running hosted, Inferno provides the same name space representation of resources of the host system (including devices and network interfaces) as it would provide running native.
Thus, for example, one can import the network interfaces of a Linux box into an application on Windows, to use Linux as a gateway.
In all cases, Inferno presents the same narrow system call interface to its applications.
.LP
The Styx protocol itself is language and system independent.
It has a dozen operations, mainly the obvious open, read, write and close, augmented by operations to navigate and operate on the name space and its metadata.
It has been implemented on small devices, such as a Lego™ programmable brick,
.[
styx on a brick locke
.]
and has C, Java and Python implementations amongst others.
A Styx server implementation in FPGA is being developed to support pervasive computing.
It has been proposed for use in wireless sensor networks.
.[
file system abstraction sense respond
.]
Blower et al.
.[
blower styx grid services
.]
provide Grid Services including work flow using their Java Styx implementation.
The
.B v9fs
module of the Linux 2.6 kernel series provides direct access to 9P2000 and thus Styx through normal Linux system calls.
.[
grave robbers outer space
.]
.NH 1
Infernal grids
.LP
We applied Inferno and Styx to grid building.
Perhaps one impetus was reading papers reporting research into things we had been doing quite casually with Plan 9 and Inferno for years, such as running interactive pipelines with components securely crossing authentication and administrative boundaries, or securely accessing the resources of many systems at once through the Internet.
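.LP
To give a flavour of that style of working, the fragment below imports another machine's resources and then uses them with ordinary commands.
It is a sketch only: the machine name, service name and paths are invented for illustration, and are not taken from any of the systems described below.
.P1
# attach a remote Inferno (or Plan 9) machine's name space under /n/remote;
# mount authenticates with the local keyring unless told otherwise
mount tcp!remote.example.com!styx /n/remote

# make the remote machine's network interfaces visible as our own,
# so that later connections are dialled from there
bind -b /n/remote/net /net

# an ordinary pipeline whose input happens to live on the other machine
grep error /n/remote/log/grid | wc -l
.P2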
I shall describe three: a demonstrator, which is still of interest for the range of resources it accessed; a data-sharing grid, which creates a secure logical data store from a dispersed collection; and a more traditional computational grid, which spreads a computational load across many computing elements, Condor-like. .[ condor .] Despite their different tasks, they are all built using the uniform Inferno mechanisms described above: all the resources are represented by Styx file servers, and can be accessed by traditional commands and system calls. .NH 2 The demonstrator .LP The first system was a small experiment in building a simple `grid' (at least in our own terms), that would later act as a demonstrator, for instance at Grid shows and when touting for business. .[ michael jeffrey inferno grid applications .] We had a collection of resources on different machines running different operating systems, and aimed to give access to them to anyone who connected to the system. `Resources' included data, devices, and services, all connected and accessible remotely (securely). We wanted a good variety to emphasise that such a range could be handled by Inferno's single resource representation, and thus distributed. Amongst other things, we offered the following: .IP • a Lego™ Mindstorm™ programmable brick, built into a working clock .[ styx brick locke .] .IP • a Kodak™ digital camera under computer control .IP • an ODBC™ database .IP • collaborative activities: a shared whiteboard and multi-player card games .IP • a distributed computation service for image processing (`cpu service') .IP • a resource registry .LP It took three people about four weeks to put the system together, including writing the cpu service, the resource registry, the image processing demos, and the user interfaces. With some experience, it was later revised and refined, and some manual pages written, but retained the same implementation structure. Also deployed as part of the demonstration was an Internet Explorer plug-in we had written much earlier, that embedded the Inferno operating system environment on a web page, allowing web browsers to run Inferno to access our little demonstration `grid' without having to install special software explicitly, and without our having to compromise our demonstration by using an HTTP gateway. A PC running native Inferno ran the gateway to the set of services, which were behind our firewall. Inferno used its public-key system to authenticate callers, who could then add their own machines to the pool of cpu servers in the registry to increase the processing power of the demo `grid'. .LP The system was demonstrated at several conferences, including a UK academic Grid conference in June 2003. .[ jeffrey inferno grid applications .] The system was not intended for production use; it was not sufficiently fault-tolerant, for example. Even so, it allowed us to experiment with different ways of representing computational resources. It also showed that a realistic range of resources could be collected and accessed using Inferno's single method for resource representation and access: all the resources and services listed above were represented by file servers, and accessed by clients using open, read, write and close. The resources themselves were provided by different machines running different native operating systems, but all of them were also running Inferno. 
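.LP
Although the demonstrator's name spaces are not reproduced here, a client session had roughly the following shape from the Inferno shell.
The paths and the control message are invented for illustration; the real services differed in detail.
.P1
mount tcp!gateway!styx /n/demo         # attach the demonstration services
ls /n/demo                             # see which resources are on offer
cat /n/demo/registry/index             # read the resource registry
echo snap >/n/demo/camera/ctl          # send a control message to a device
cp /n/demo/camera/image /tmp/pic.jpg   # collect the resulting data
.P2
Whether the resource was a device, a service or plain data, a client saw only files and the same few operations on them.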
.NH 2 Data grid for the humanities .LP The Center for Electronic Texts in the Humanities at Rutgers wished to use a distributed computing system to allow disparate collections of literary resources to be represented and accessed as a uniform whole, with the collection spanning different machines, platforms and administrations. The aim was to allow secure ``interactive collaboration toward common goals as if the users were working on a single large virtual computer''. .[ hancock humanities data grid .] Tasks to be supported included not just viewing existing archives, but managing and modifying them, and adding new material, including annotations and links. .LP We were approached by one of the principals and, given the non-profit nature of the organisations involved, agreed to create the system `at cost'. Some preliminary requirements capture and specification was done using e-mail and telephone before one member of staff was despatched to Rutgers to do the work. He spent a few days on site, developed the final specification in discussions with the end-users, implemented the supporting Inferno applications, refined it, installed the system on the relevant machines, and provided some training. The application is still in use today. .LP The implementation is straightforward: one or more archive browser applications run on a client, each building a name space suitable for the given task, incorporating name spaces from other servers or clients as needed to support the current collaboration. Since all resources look like files, the archive browser looks like a file browser, even though no physical file hierarchy corresponds to the view presented. It is important that machines are easy to add, that access is secure, and that a collaborator sees only the files that are part of the collaboration. Furthermore, the applications that use the resources are not available in source, and must remain unchanged. .NH 2 Computational grid .LP The computational grid is perhaps more interesting, because it required significant new components. It was a full commercial development, and much larger scale. Even so, the same approach was followed: requirements capture, system specification, implementation, deployment, testing, and refinement. .LP The customer made extensive use of an existing set of scientific applications. They used them for their own R&D, but also to provide an analysis service to other companies. The application could consume as much computing power as it was offered. Fortunately, as often happens, the input data could be split into smaller chunks, processed independently, and those separate results collated or aggregated to yield the final result. Across its various departments, the company had many hundreds of Windows PCs, and a handful of servers, and they wished to use the spare cycles on the PCs to build a big virtual computational system. .LP Several members of Vita Nuova's development staff, including a project manager, worked with knowledgeable staff of the customer to refine and expand their initial set of requirements. They had previous experience with grid systems, and their initial requirements were perhaps therefore more precise than might otherwise have been the case. .LP Specification and implementation was actually in two stages, with minuted meetings and summary reports throughout. First, a proof-of-concept system was agreed, and produced by modifying the original demonstration software. 
We demonstrated to customer scientists (and management) that we could run existing Windows applications on a small collection of machines in our office, and that the performance improvement claimed could be achieved.
We could then discuss the requirements and specification of the real system in detail, decide deliverable items, design an implementation and carry that out.
.LP
Of course, had the small demonstrator failed in fundamental ways, the project could easily have ended then, at small cost to the customer.
In our experience, that is the usual approach for commercial projects.
Of course, it does not account for many uncertainties: for instance, would the `production' system scale up to many hundreds of nodes?
It is not unusual for money to be kept back until the customer has received and accepted the deliverables of the contract.
(I belabour these points to emphasise that there is considerable pressure on the software team to deliver something the customer does indeed find usable.)
.LP
The resulting system, now called `Owen', has a simple model, that of a labour exchange,
.[
owen labour exchange
.]
forming the heart of a virtual time-sharing system.
Worker nodes with time available present relevant attributes and qualifications to a scheduler at the exchange, which offers work from its queue.
If a task is suitable, a worker does it, and submits the results, which are checked at the exchange (eg, for completeness or validity) before the corresponding task is finally marked `done'.
The scheduler accounts for faulty nodes and tasks.
The largest unit of work submission and management is a
.I job ,
divided into one or more atomic
.I tasks .
Each job has an associated
.I "task generator"
that splits the work into tasks based on some specified criteria peculiar to a given application.
Usually the task split is determined by properties of the input data, and known before the job starts, but tasks can be created dynamically by the task generator if required.
Workers request work as they fall idle, much as processors collect work on a shared multiprocessor, rather than work being pushed to them in response to an estimated load.
.LP
The programming team had three people: one for the scheduler, one for the worker, and one to do the various graphical interface programs for grid control.
It took six to eight weeks to produce the final deliverable, and install and test it.
Subsequently it has undergone further development, to add capabilities not relevant to this discussion.
.NH 1
User and programmer interfaces
.LP
File-based representation as the lowest visible access level somehow gives access to the resources a solid `feel'.
It is easy to build different types of interface on that basis.
I shall use the computational grid as an example here.
.NH 2
Node and job administration
.LP
Owen is administered through two simple `point and click' applications.
The Node Monitor, shown in Figure 1, shows the status of the computers available (or not) to do computation.
Nodes can disconnect after accepting work, and reconnect much later to submit results.
The node monitor distinguishes those cases.
It allows nodes to be added to and removed from the working set, suspended, and organised into named groups that are restricted to certain jobs, job classes or job owners.
.LP
The Job Monitor, shown in Figure 2, shows the status of jobs submitted by a user, including the progress made thus far.
Jobs can be created and deleted, paused and continued, and their priority adjusted.
Both graphical applications aim for a simple, uncluttered appearance, reflecting a simple model that hides the more complex mechanics of the underlying system.
.NH 2
Fundamental interface to the labour exchange
.LP
To make this more concrete, and perhaps help justify our choice of Inferno, let us look behind the graphical user interface.
The starting point for design was the definition of a name space representation for computational resources.
The Exchange's scheduler serves that name space.
Each worker node attaches it to its own local name space, in a conventional place.
We can look at it using the following commands:
.P1
mount $scheduler /mnt/sched
cd /mnt/sched; ls -l
.P2
The scheduler serves a name hierarchy, with the following names at its root, shown here as the result of the Unix-like
.CW "ls -l"
above:
.P1
d-r-xr-x--- M 8 admin admin 0 Apr 13 19:58 admin
--rw-rw-rw- M 8 worker worker 0 Apr 13 19:58 attrs
--rw-rw-rw- M 8 admin admin 0 Apr 13 19:58 nodename
--rw-rw-rw- M 8 worker worker 0 Apr 13 19:58 reconnect
--r--r--r-- M 8 worker worker 0 Apr 13 19:58 stoptask
--rw-rw-rw- M 8 worker worker 0 Apr 13 19:58 task
.P2
Note that although each worker sees the same names, from the scheduler's point of view, each worker can be distinguished and given data appropriate to it.
The files shown as owned by
.CW worker
are those the worker can access.
The worker can write to the
.CW attrs
file to describe acceptable work and the properties of its node.
A worker reads the
.CW task
file to obtain offers of work; the read blocks until work is available.
(In general, that is how publish/subscribe is done in a Styx environment: rather than having a special mechanism and terminology, an application simply opens a file to `subscribe' to that data source on a server, and subsequent reads return data representing the requested `events', as they become available, just as it does for keyboard, mouse or other devices.)
Subsequent reads and writes exchange task-specific data with the job's task generator running as a process in the scheduler.
That is processed by a task-specific component running in the worker.
Amongst other things, the client-side component might receive a set of task parameters, and small amounts of data or a list of resources elsewhere on the network for it to put in the task's local name space for the application's use.
The first task generators were specific to programs such as GOLD or CHARMM, so that customers could do productive work immediately.
We then abstracted away from them.
More recent generators support various common classes of computation (including those applications); the latest generator offers a simple job description language.
.LP
Job control and monitoring are provided by files in the
.CW admin
directory:
.P1
d-r-xr-x--- M 4 rog admin 0 Apr 14 16:31 3
d-r-xr-x--- M 4 rog admin 0 Apr 14 16:31 4
d-r-xr-x--- M 4 rog admin 0 Apr 14 16:31 7
--rw-rw---- M 4 admin admin 0 Apr 14 16:31 clone
---w--w---- M 4 admin admin 0 Apr 14 16:31 ctl
--r--r----- M 4 admin admin 0 Apr 14 16:31 formats
--r--r----- M 4 admin admin 0 Apr 14 16:31 group
--r--r----- M 4 admin admin 0 Apr 14 16:31 jobs
--r--r----- M 4 admin admin 0 Apr 14 16:31 nodes
d-rwxrwx--- M 4 admin admin 0 Apr 14 16:31 times
.P2
Access is restricted to the
.CW admin
group.
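.LP
The idiom can be exercised by hand from the Inferno shell on a worker node.
The fragment below is a sketch only: the attribute string is invented, and in production the worker software, not the shell, drives these files.
.P1
echo 'os windows mem 512' >/mnt/sched/attrs   # describe this node (syntax illustrative)
cat /mnt/sched/task                           # blocks until the exchange offers work
.P2
Once an offer arrives, further reads and writes on the same open file carry the task-specific conversation with the job's task generator described above.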
Each job has an associated subdirectory (\c
.CW 3 ,
.CW 4
and
.CW 7
in the listing above) containing job-specific data:
.P1
--rw-rw---- M 4 rog admin 0 Apr 14 17:08 ctl
--rw-rw---- M 4 rog admin 0 Apr 14 17:08 data
--rw-rw---- M 4 rog admin 0 Apr 14 17:08 description
--r--r----- M 4 rog admin 0 Apr 14 17:08 duration
--r--r----- M 4 rog admin 0 Apr 14 17:08 group
--r--r----- M 4 rog admin 0 Apr 14 17:08 id
--r--r----- M 4 rog admin 0 Apr 14 17:08 monitor
.P2
To create a new job, a program opens the
.CW admin/clone
file, which allocates a new directory labelled with a new job number, and yields a file descriptor open on its
.CW ctl
file.
Control requests are plain text.
For example,
.P1
load \fIjobtype\fP \fIarg ...\fP
.P2
selects the task generator and provides static parameters for the job,
.CW start
sets it going,
.CW stop
suspends it, and
.CW delete
destroys it.
Reading
.CW id
returns the job's ID.
Reading
.CW duration
gives a number representing the lifetime of the job.
Reading
.CW group
returns the name of the group of worker nodes that can run the job.
The first read of the
.CW monitor
file returns the job's current status, and each subsequent read blocks until the status changes, following the same idiom as for the
.CW task
file above.
Note that only the grid administrator and the job owner have permission to access these files.
Files in the
.CW admin
directory itself describe or control the whole grid.
.NH 2
Levels of interface
.LP
The Job and Node monitors are graphical applications that present a conventional graphical interface, and they are the ones used by most users.
The programs act, however, by opening, reading and writing files in the name space above.
Thus the node monitor reads a file to find the current system status, which it displays in a frame.
The job monitor kills a job by opening the appropriate control file (if permitted) and writing a
.CW delete
message into it.
This makes them easy to write and test.
The worker node's software also acts by opening, reading and writing files.
.LP
Given the description of the scheduler's name space, and the protocol for using each of its files, the three software components (monitors, worker and scheduler) can be written and tested independently of each other.
It is easy to produce either `real' files or simple synthetic file servers that mimic the content of the real scheduler's files for testing.
Conversely, when testing the scheduler, one uses the Inferno shell to operate on its name space and script test cases.
This encourages independent development.
.LP
At a level below the graphical interface, the system provides a set of shell functions to be used interactively or in scripts.
Few end-users currently work at this level, but it is discussed briefly here to suggest that it is not too demanding.
Here is a script to submit a simple job to the grid:
.P1
run /lib/sh/sched    # load the library
mountsched           # dial the exchange and put it in name space
start filesplit -l /tmp/mytasks test md5sum
.P2
The last command,
.CW start ,
is a shell function that starts a new job.
In this case it is a
.CW filesplit
job, which creates separate tasks for each line of the file
.CW /tmp/mytasks ,
each task computing an MD5 checksum of the corresponding data.
The implementation of
.CW start
itself is shown below:
.P1
fn start {
	id := ${job $*}
	ctl $id start
	echo $id
}
.P2
It calls two further functions,
.CW job
and
.CW ctl ,
that interact with the labour exchange's name space described earlier.
.CW Job
clones a job directory and loads a task generator.
Ignoring some error handling, it does the following:
.P1
subfn job {
	{
		id := `{cat}
		result=$id
		echo load $* >[1=0]
	} $* <> /mnt/sched/admin/clone    # open for read/write
}
.P2
and
.CW ctl
sends a control message to a given job:
.P1
fn ctl {
	(id args) := $*
	echo $args > /mnt/sched/admin/$id/ctl
}
.P2
.LP
Several of the task generators also invoke application-specific shell functions on labour exchange and client, allowing tailoring to be done easily.
For instance, on the client a function
.CW runtask
is invoked to start a task running on a client, and
.CW submit
yields the result data (if any) to return to the exchange.
The
.CW test
job used above is defined on the client as follows:
.P1
load std
fn runtask {
	$* > result    # run arguments as command
}
fn submit {
	cat result
}
.P2
The
.CW runtask
for a Windows application is usually more involved, constraining the code that runs, checking preconditions, and providing its data in the place and format it expects.
Here is a version for an image-processing task:
.P1
fn runtask {
	# get files
	check gettar -v    # unpack standard input as tar file
	# run
	cd /grid/slave/work
	check mkdir Output
	check {os -n $emuroot^/grid/slave/image/process.exe Output/image}
}
fn submit {
	check puttar Output
}
.P2
.LP
A corresponding job class specification sits on the exchange.
It too is a set of shell functions:
.CW mkjob
to prepare global parameters for a job;
.CW mktask
to prepare task-specific parameters;
.CW runtask
to provide a task's data and parameters to a worker;
.CW endtask
to check its results; and
.CW failedtask
to do error recovery.
Such specifications range from 70 to 140 lines of text, the largest function being 10 to 20 lines.
.LP
The task generators mentioned earlier are modules in the Limbo programming language.
They range between 200 and 700 lines of code, and require knowledge of interfaces shared with the exchange.
In practice, now that we have defined general-purpose generators that invoke shell scripts or interpret our job description notation, it is unlikely that anyone else will ever write one, but the interface is there to do it.
.NH 2
Job description
.LP
One task generator interprets a simple form of job description.
It specifies a job and its component tasks declaratively, in terms of its file and value inputs.
To avoid fussing with syntax, we use S-expressions.
For example, the following trivial job calculates the MD5 checksum of each line of the file
.CW myfile ,
using a separate task for each line:
.P1
(job
	(file (path myfile) (split lines))
	(task exec md5sum))
.P2
The little declarative language allows specification of a set of tasks as the cross-product of static parameters and input files or directories, and has directives to control disposition of the output.
It can optionally specify shell functions to invoke at various stages of a job or task.
.NH 1
Related work
.LP
The Newcastle Connection
.[
newcastle connection
.]
made a large collection of separate UNIX systems act as a single system (in 1981!), by distributing the UNIX system call layer using either modified kernels or libraries.
It was easy to learn and use because the existing UNIX naming conventions, and all the program and shell interfaces familiar from the local system, worked unchanged in the larger system, but still gave access to the expanded range of systems and devices in the network.
(Their paper mentions earlier systems that also extended UNIX in various ways to build distributed systems from smaller components.)
The deployment and use of their system was, however, complicated by limitations built into the 1970s UNIX design, notably that device control and user identification had been designed for a self-contained system, and perhaps more important, services provided outside the kernel had no common representation, and were not distributed by the Connection.
Plan 9 from Bell Labs
.[
plan 9 bell labs computing systems
.]
was, by contrast, designed for distribution from the start, and takes a radical approach to resolve the problems with UNIX: in particular, it represents all resources in a uniform way, regardless of origin, and can use a single mechanism to distribute everything.
By virtualising Plan 9, Inferno spreads those benefits throughout a heterogeneous network.
The underlying protocol itself can, however, be used independently.
.LP
Minnich
.[
xcpu minnich
.]
has developed an experimental system,
.B xcpu ,
for building clusters of compute nodes using the
.B v9fs
Linux kernel module.
As usual with 9P2000 or Styx, the lowest-level interface to the system is a hierarchy of names, representing nodes and processes on a set of heterogeneous cpu servers.
Instead of submitting a job to a scheduler, as in an old-style `batch' system,
.B xcpu 's
model is that of an interactive time-sharing system.
A user requests a set of nodes of particular type (eg, compute or IO) and imports a name space from each one, which he attaches to his local Linux name space.
Files in each node's name space represent attributes of the node (such as architecture and ability), the program to run, and its standard input and output.
Now the user can use any desired mixture of ordinary user-level Linux system calls, commands and scripts to interact directly with any or all of those nodes, seeing results as they appear.
For example, shell commands could set up a group of nodes to run an MPI application, copy the executable to each node in parallel, and then start them, along the following lines:
.P1
for i in $NODES; do
	cp my-mpiapp-binary /mnt/9/$i/exec &
done
wait    # for all copies to finish
for i in $NODES; do
	echo mpiapp $i $NODES >/mnt/9/$i/argv &&
	echo exec >/mnt/9/$i/ctl
done
.P2
Output collection, error detection, and conditionals use the command language and commands already familiar to the users from daily use off-grid.
Minnich observes that of course higher-level interfaces are possible and often desirable; note, however, that this is the lowest-level interface, and it can be used directly without much training.
.LP
Legion
.[
legion campus
.]
used a virtual operating system model to build a distributed object system, specifically for grid applications.
The naming hierarchy locates objects, which have object-specific operations, and objects communicate (below the name space) using messages carrying method call parameters and results.
The system provides support for object persistence, fault-tolerance, scheduling, and more, but the programming and user interface is unique to it, and non-trivial.
Grimshaw et al.
.[
grimshaw philosophical
.]
compare Legion and Globus GT2 at many levels, including design principles, architecture, and the implications of the passive object model then used by Globus as against Legion's active object model.
Differences are, indeed, legion.
Most important here, though, is the observation that ``it was very difficult [with Legion] to deliver only pieces of a grid solution to the [grid] community, so we could not provide immediate usefulness without the entire product'' but Globus (by providing short-term solutions to key problems) ``provided immediate usefulness to the grid community''. Architecturally, they contrast a bottom-up approach used by Globus to the top-down design of Legion. They describe the then emerging Open Grid Services Architecture as a collection of ``specialized service standards that, together, will realize a metaoperating system environment'', similar in aim therefore to Legion, with a different RPC mechanism based on SOAP, WSDL, etc. Given the relative complexity of Legion itself, though, and the authors' admission that it was most useful as an `entire product', that could be double-edged. .\"``Legion was intended as a virtual OS for distributed resources, Globus as a `sum of services' infrastructure.'' .LP There are several commercial suppliers of grids, sometimes offshoots of university experience. .[ legion campus .] .[ utopia .] Hume describes specialised software for fault-tolerant data processing clusters, .[ ningaui cluster business .] .[ billing system hume .] processing huge telephony billing data sets to deadlines. Google has produced grid systems that are impressive in scale and resiliency, .[ mapreduce .] based on a general underlying storage subsystem, .[ google file system .] with a specialised high-level language to ease programming. .[ sawzall .] These developers have enthusiastic end-users, and argue that their systems (particularly Google's) are objectively easier to use and more productive than ones they replaced, based on metrics such as size and complexity of programs, and comparative system performance. .NH 1 Discussion .LP Beckles .[ beckles .] lists some properties by which `usable' grid software might be assessed: ``engagement with the intended user community; APIs and other programmatic interfaces; user interfaces; security; documentation; and deployment issues''. .LP Our Owen computational grid software was developed in close consultation with its initial users, but has subsequently been provided to other customers. They have similar application profiles to the original users; all of them continue to do useful work with it. Some are contented users of newer facilities such as its job specification notation. We have also made the software available to a few universities and research organisations. Results there are mixed, limited mainly by documentation. If they wish to use it as originally designed, it works well. Others quite reasonably need much more flexibility, and although that is often supported by the underlying software, the extra documentation to allow them to use it easily is not yet available. .LP At a lower level, however, our systems are undoubtedly distinctive in having clients access and control the grid using file-oriented operations. Consequently, the specification of the programming interface — such as the manual page for the scheduler — defines the hierarchy of names it serves, and the contents of the files therein, including a list of control files, the textual messages that can be written to them, and their effect. What are the advantages of this representation? It is directly accessible through any programming language in which there are constructions (or libraries) providing explicit file access with open, read and write. 
Even better, in some environments it can be accessed directly by existing commands.
As discussed above, a small file of shell functions allows grid jobs to be created and controlled from the Inferno command line.
That is currently of little interest to many end-users, who largely wish the whole process to be invisible: to have the computational distribution happen without their explicit intervention.
Some, however, such as those who already write their own Perl scripts, are able, with only a little training, to provide their own interfaces to the grid.
Beckles argues for APIs that support `progressive disclosure' of functionality, and we find the underlying file-based metaphor supports that well.
The demonstrator grid mentioned above showed that the metaphor can be applied to a respectable range of resource types; there are many other examples to show it is not limiting.
.[
networks presotto
.]
.[
name spaces pike thompson
.]
.[
forsyth ubiquitous
.]
.LP
Another possible difference compared to a Globus
.[
globus kesselman
.]
or Web Services
.[
web services
.]
interface is that, instead of learning both protocol and mechanism, users of the file-oriented interface must learn only the names and the protocol for their use: the file-access mechanism is almost always quite familiar to them, and they can carry across their intuitions about it.
As it happens, the protocol for the use of a given name space is also usually simpler and more straightforward to use than language-specific APIs full of complex data structures or objects.
Indeed, the implementation of file servers has not been discussed here, but it is notable that although the Styx protocol has messages of a specific structure, there are only 13 operations, most of them obvious (eg, open, close, read, write) with obvious parameters, and every file server implements the same operations.
Compared (say) to plug-ins to a web server, there are neither programming interfaces nor data structures imposed on the servers by their surroundings; they can manage their own internals as they like, as self-contained programs.
For instance, some are concurrent programs, but many are simple programs that run in a single process.
Within Inferno, there is a system primitive
.CW file2chan
that makes it easy to write applications that serve only a handful of names.
Lacking (and a serious omission) are good tutorials on writing `full' file servers from scratch.
.LP
For wide-scale release, it is easy to underestimate the effort required to document both infrastructure and applications, especially to suit a range of end-users (ie, scientists and programmers).
In 1999-2000, for the first public release of Inferno itself, a team of five spent many months preparing its documentation and assembling the distribution (and Inferno is not a huge system).
That documentation was intended for programmers (perhaps even system programmers).
It can be harder still to prepare things for end users, even when the applications have been kept simple.
We do not yet consider our grid documentation to be up to our standards.
.LP
We originally distributed the client software from a CD using the same mechanism as Inferno, but that was clumsy.
The grid client software is now typically installed simply by copying a small hierarchy (a few megabytes), for instance from a USB memory stick or a shared file store.
.SH
Conclusions
.LP
Supposing that, within the constraints of our time and treasure, we have indeed managed to build grids that customers found usable, and with underlying technology and interfaces that we consider usable, are there any general lessons?
They are mainly traditional ones.
As implied in the company history in the Introduction, we have successfully applied them in the past, outside Inferno.
.LP
We had a clear and indeed limited design (``do one thing well'').
Rather than provide an abstract interface to a big family of possibilities, as is done by some of the Web Service specifications, we decided on one that was `fit for purpose'.
We used mature technology with which we had much design and implementation experience; its definition is compact and stable.
The system should help separate concerns and provide a focus for the design.
In our case, system-level mechanisms do distribution and station-to-station authentication.
The name space provides the focus of discussion during initial design meetings, and its description acts as a high-level contract between components.
Minimisation of mechanism is a good engineering principle; we were working with an infrastructure that provides a narrow interface.
On the other hand, if the system-provided mechanisms are too primitive (``message passing'' or ``remote procedure call'') then there is little structural support for application development, no common currency, and little chance of either users or programmers being able to re-use knowledge.
Using simple metaphors that make sense to them must surely help, especially if just one can be used throughout.
.LP
Most important, as Beckles suggests,
.[
beckles
.]
interaction with end users is essential.
Our various projects have been successful, and the resulting new systems worked well for their users, partly because the underlying technology worked well and was easy to use, but also because we dealt with them as normal commercial projects, working closely with customers to provide something that suited them.
As important as that interaction is an imposed discipline.
We have fortunately been able to deal directly with end users who really must get their work done; the amazing politics now associated with the e-science grid world is irrelevant to them and to us.
Furthermore, commercial requirements—certainly for a small company—mean that the result must meet deadlines, satisfy customer expectations, and keep to budget.
We find this helps to focus mind and energy wonderfully.
.SH
Acknowledgements
.LP
Roger Peppé and Chris Locke designed and implemented the labour exchange; Peppé subsequently added the job description language and other improvements, and wrote the documentation.
Danny Byrne wrote the graphical interface applications.
.SH
References
.LP
.[
$LIST$
.]
.FL
.de Bp
.br
.X CMD BP
..
.Bp
.SP 3i exactly
.FG images/image001.ps 2i
.ft I
Figure 1.\ The Node Monitor shows the status of computers forming the virtual computer
.ft P
.fg
.FG images/image013.ps 2i
.ft I
Figure 2.\ The Job Monitor shows job status and provides job control
.ft P
.fg