CGI Handler

LOCATION

$web/etc/handler

DESCRIPTION

The CGI handler was called the execution handler in older Pegasus manuals.
The file “$web/etc/handler” is a simple table that maps patterns of the requested URL path to actions.

This mechanism defines the form of CGI files, SSI (Server Side Include) and the auto-indexing service for specific directories.

Path in URI

The URL syntax of RFC 2616 is
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
According to this syntax, abs_path is
/path/to/document
if the URL is
http://host:port/path/to/document?query
Pegasus httpd runs in a confined name space. The httpd looks for the document at
/doc/path/to/document
in the httpd name space.
If the URL refers to a user (say alice), that is,
http://host:port/~alice/path/to/document?query
then according to the RFC the abs_path is
/~alice/path/to/document
For Pegasus, alice's documents are not a subset of the real host documents.
They live in a name space separate from that of the real host documents.
The httpd looks for the document at
/doc/path/to/document
in the httpd name space.

We call the path /path/to/document (the part below /doc) the request path, and denote it by $request.
If the request path ends with “/”, Pegasus internally appends index.html. We call the resulting path the effective request path.
This does not mean that the two URLs

http://host/path/to/foo/
and
http://host/path/to/foo/index.html
give the same result when /path/to/foo is not a directory (that is, when it is a file or does not exist).

Configuration

The following is the content of my configuration (http://plan9.aichi-u.ac.jp).
# path      mimetype    hctl    execpath arg ...
/netlib/*/index.html text/html 0 /bin/ftp2html
/printenv/*  text/plain   0       /bin/printenv $target
*.http         -         1       $target
*.cgi       text/html    +       $target
*.html      text/html    0       $target

Fig.1: CGI handler of Pegasus.

The first field is a path pattern, the second field is the default MIME type, the third field is the level of control the script has over the HTTP header, and the fourth field is the path to a script. The fourth field may be followed by arguments for the script.

Path patterns are compared with the effective request path.
The comparison proceeds from the top line down and stops at the first pattern that matches.

In a path pattern, the directory separator “/” is not special. (Therefore this pattern matching is not the same as that of the shell.) There is one exception: the pattern “/*/” also matches “/”. Therefore the pattern

/netlib/*/README
matches /netlib/README as well as, for example, /netlib/cmd/rit/README.
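
The matching rules above can be sketched in Python. This is only an illustration and not the Pegasus implementation: it assumes that “*” is the only wildcard, that a pattern must match the whole effective request path, and that the first matching line wins. The names RULES, pattern_to_regex, matches and lookup are hypothetical; the rules themselves are copied from Fig.1.

```python
import re

# Rules copied from Fig.1: (pattern, mimetype, hctl, execpath and arguments).
RULES = [
    ("/netlib/*/index.html", "text/html", "0", ["/bin/ftp2html"]),
    ("/printenv/*", "text/plain", "0", ["/bin/printenv", "$target"]),
    ("*.http", "-", "1", ["$target"]),
    ("*.cgi", "text/html", "+", ["$target"]),
    ("*.html", "text/html", "0", ["$target"]),
]


def pattern_to_regex(pattern):
    """Translate a handler path pattern into a compiled regular expression."""
    parts = []
    # "/*/" is listed first in the alternation so it wins over a bare "*".
    for token in re.split(r"(/\*/|\*)", pattern):
        if token == "/*/":
            # The one exception in the text: "/*/" also matches a single "/".
            parts.append(r"/(?:.*/)?")
        elif token == "*":
            # "/" is not special, so "*" may span directory separators.
            parts.append(r".*")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, erpath):
    """Return True if the whole effective request path matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(erpath) is not None


def lookup(erpath, rules=RULES):
    """Return the first rule whose pattern matches erpath, or None."""
    for rule in rules:
        if matches(rule[0], erpath):
            return rule
    return None
```

With these assumptions, lookup("/netlib/index.html") selects the /bin/ftp2html line, while a path such as /notes.txt matches no line at all.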

The second field gives the default value of the HTTP header “Content-Type”. If the field is “-”, the script must set the header.

The third field, named “hctl”, takes the values ‘1’, ‘+’ and ‘0’, which give the level of control the script has over the HTTP headers. The meanings are

1	full control by the script
+	partial control by the script
0	no control by the script
If ‘1’ is specified, the script is responsible for writing all HTTP headers; such a script is called non-parsed CGI in CGI/1.1. The HTTP headers must be separated from the HTML document by a single blank line: a line that contains only the “\n” code.
If ‘0’ is specified, the script must not write HTTP headers. The headers are provided by httpd. The output style should be
<!DOCTYPE html>
<html>
...
</html>

In Fig.1, the third line, starting with /printenv, is combined with the script below.

#!/bin/rc
rfork e
echo 'ARGUMENT'
for(x in $*)
  echo $x
echo
echo 'ENVIRONMENT'
for(x in `{ls -p /env}){
  if(test -r /env/$x)
    echo $x `{cat /env/$x}
}

Fig.2: /bin/printenv

This script may be useful when writing CGI scripts under Pegasus.

If ‘+’ is specified, the script may write HTTP headers in compliance with CGI/1.1. The typical output style is

Content-Type: text/html
Status: 200 OK

<!DOCTYPE html>
<html>
...
</html>

Fig.3: script example.

NB: Pegasus has a bug in the handling of the default mimetype for hctl="+".
That is, Fig.3 fails if mimetype="text/html", but works if mimetype="-".
The script should work even if mimetype="text/html".
This bug will be fixed in the next release (Pegasus 2.8a).

Another example is shown below.

Set-Cookie: cookie=something; expires=Sun, 6-Aug-2006 11:43:57 GMT; domain=ar.aichi-u.ac.jp; path=/test4; secure

<html>
<head>
<title>Cookie sample</title>
</head>
<body>
...
</body>
</html>
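
The output styles for the hctl values can be compared side by side with a small Python sketch. It is an illustration only: the function name render and its parameters are hypothetical, and the status line written for hctl ‘1’ (“HTTP/1.1 200 OK”) is an assumption based on general HTTP practice, not on the Pegasus source.

```python
def render(hctl, body, headers=(), status="200 OK"):
    """Return the text a script would write to stdout for a given hctl value.

    headers is a sequence of (name, value) pairs.
    """
    if hctl not in ("0", "1", "+", "*"):
        raise ValueError("unknown hctl value: %r" % (hctl,))
    if hctl == "0":
        # No control: httpd supplies every header, the script writes the body only.
        return body
    lines = ["%s: %s" % (name, value) for name, value in headers]
    if hctl == "1":
        # Full control (non-parsed CGI): the script writes the status line too.
        # The exact status line is an assumption of this sketch.
        lines.insert(0, "HTTP/1.1 " + status)
    # Headers, then a line holding only "\n", then the document.
    return "".join(line + "\n" for line in lines) + "\n" + body
```

For ‘+’ and ‘*’ the caller decides which CGI/1.1 headers to pass, for example a Set-Cookie header as in the cookie sample above.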

The reserved word $target in or after the fourth field denotes the absolute path (in the httpd name space) of the requested document. That is, $target is the effective request path prefixed with “/doc”.

The fourth field is the path to the executable program that handles the request. Note that $target in the fourth field means that the file at the effective request path is itself the executable program.

The second line in Fig.1, which begins with /netlib, is for

http://plan9.aichi-u.ac.jp/netlib/
to handle my FTP directories (see the note below).

Note: In the old days, the directory was used for FTP service.

Other servers such as Apache have an option to show a directory index if index.html is absent. ftp2html does this too, but does much more: if a README file is present its content is shown, and if an INDEX file is present its content is shown with appropriate action tags on the index labels.

Enhanced control “*”

To support WebDAV, the symbol “*” was introduced for the third field of handler, for scripts that must handle all methods (Pegasus 2.4).

Thus, the following configuration

/dav	-	*	/bin/foo
/dav/*	-	*	/bin/foo
will enable “/bin/foo” to handle all requests whose URI begins with:
http://host/dav
where “host” is your host domain name.

The meaning of the symbol “*” is the same as “+”, except that the script must handle all methods.

The meanings of the other symbols (“0”, “1” and “+”) are unchanged. Only requests with the HEAD, GET and POST methods go to these scripts. Other requests are handled by Pegasus and are rejected, except for OPTIONS. You need not handle the HEAD method in these scripts, because the request is handled by Pegasus.

In summary, the differences in meaning of the symbols in the third field are listed in the following table.

Table 1: hctl symbols in handler
method            limited method    all method
simple cgi        0
cgi/1.1           +                 *
non-parsed cgi    1
where “simple cgi” means that the CGI has no ability to control HTTP headers, and “limited method” means that only the GET, HEAD and POST methods are handled by the CGI.

Files that begin with “.”

Dot files (files whose names begin with “.”) have been specified as “accessible only via CGI”.
Now, this specification is valid only for the GET, HEAD and POST methods.
WebDAV must be able to handle all files, including dot files*. Therefore, “*” in the third field of handler also means that dot files are accepted.

* Otherwise the Finder of a Mac OS X client becomes somewhat unstable.

Access to dot files from a Mac OS X client is annoying and makes the client respond sluggishly. How can the access be prevented? You will find some tips on the topic at the following URI:
http://lists.apple.com/archives/Spotlight-dev/2006/Jun/msg00008.html

I don't know how to prevent access to resource forks (files whose names begin with “._”).

Ramfs

A ramdisk is always provided to the script, and it vanishes automatically as soon as the script finishes or is terminated.

A special file “...” is used internally to compute the Content-Length of the CGI output.
You need not compute Content-Length in your CGI program for HTTP/1.1.

X-CGI-Pass

An extended CGI header, “X-CGI-Pass”, has been added.
The header is really useful for scripts because it enables a script to pass the request to the host server.
Writing code to answer a GET request is bothersome. Why must we write that code? Servers already have the ability to answer the request!

example

The following rc fragment passes every request that does not end with “/” back to the server.
if(! ~ $request */){
	echo X-CGI-Pass: /doc$request
	echo
	exit
}

The specification

The CGI header
X-CGI-Pass: /baz
is a directive that lets the server send the file “/baz” in place of the CGI file, where “/baz” is an absolute path name in the httpd name space.

If “/baz” is equal to $target, you may omit the name:

X-CGI-Pass:
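
For scripts written in Python rather than rc, the same idea might look like the sketch below. The function name cgi_pass is hypothetical, and the default value of home follows the “home” variable (/doc) described under ENVIRONMENT VARIABLES. The sketch only builds the text to write; a real script would write the result to stdout and exit.

```python
def cgi_pass(request, home="/doc"):
    """Build the header block that hands a file request back to the server.

    Returns None when the request ends with "/", in which case the
    script has to produce the answer itself.
    """
    if request.endswith("/"):
        return None
    # One header line followed by the blank line that ends the CGI headers.
    return "X-CGI-Pass: %s%s\n\n" % (home, request)
```

A script would typically call cgi_pass(os.environ["request"]) and, if the result is not None, pass it to sys.stdout.write before exiting.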

Comparison with Apache CGI

If “text/html” is specified for mimetype and the hctl value is ‘0’, then the format of the CGI output is:
<!DOCTYPE html>
<html>
...
</html>
That is, do not start with “Content-Type:” as Apache requires:
Content-Type: text/html

<!DOCTYPE html>
<html>
...
</html>

Apache-style CGI is also supported. The line for the suffix “cgi” in Fig.1 configures CGI/1.1 for such files.

Error handling in CGI program

When “text/html” is specified for “mimetype”, Pegasus automatically sends the HTML headers to the client. The response header then follows this rule:

This rule seems to work well; however, we can also control the connection directly by specifying “keep” or “close” after “#” in the exit status:
exit '403 Forbidden # keep'

Both stdout and stderr are passed to the client.

ENVIRONMENT VARIABLES

Pegasus has many environment variables. However, most of them are only experimental. The stable variables are listed below:
AUTH_TYPE
CONTENT_LENGTH
CONTENT_TYPE
GATEWAY_INTERFACE
PATH_INFO
PATH_TRANSLATED
QS_name		# the name is name part in QUERY_STRING (see Note 1)
QUERY_STRING
REMOTE_ADDR
REMOTE_HOST
REMOTE_USER
REQUEST_METHOD
REQUEST_URI
REQUEST_USER
SCRIPT_NAME
SERVER_NAME
SERVER_PORT
SERVER_PROTOCOL
SERVER_SOFTWARE
and all the attributes in the HTTP header, such as
HTTP_URI
HTTP_SCHEME
HTTP_HOST
HTTP_REFERER
HTTP_USER_AGENT
together with the original header
HTTP_HEADER

Additionally we have

request		# requested path (see Note 2)
home		# /doc
query		# same as QUERY_STRING
target		# requested path from document root (see Note 3)
name		# basename of target
hpid		# pid of httpd that invoked the current script

Note 1: Query string is automatically decoded by the httpd. For example, a query
members&children&name=alice&age=16
produces environment variables:
QS_=(members children)
QS_name=alice
QS_age=16
The prefix “QS_” is added for safety.
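
The decoding described in Note 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the httpd's code: the function name decode_query is hypothetical, values are kept as lists to mirror rc list variables, repeated names are assumed to accumulate, and “+” is assumed to decode to a space.

```python
from urllib.parse import unquote_plus


def decode_query(query):
    """Map a raw QUERY_STRING to QS_ variables, each holding a list of values."""
    env = {}
    for item in query.split("&"):
        if not item:
            continue
        if "=" in item:
            name, value = item.split("=", 1)
            key = "QS_" + unquote_plus(name)
        else:
            # Items without "=" are collected under the bare prefix "QS_".
            key, value = "QS_", item
        env.setdefault(key, []).append(unquote_plus(value))
    return env
```

Applied to the query in Note 1, the sketch yields QS_ with the two values members and children, QS_name with alice, and QS_age with 16.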

Note 2: The request path might end with “/” if it is a directory. The target, on the other hand, is the file that is effectively requested. In rc notation, target is expressed as follows.

target = $request		# request to a file
target = $request/index.html	# request to a directory

Note 3: The name “target” as an environment variable is confusing, because the same name is used in handler with a different meaning. Therefore this name should become obsolete in the future.
Note 4: Environment variables starting with “HTTP_” are generated from the key:val pairs in the HTTP request header. The key is case insensitive. The current RFC states that the key may consist of any printable ASCII characters except “:”. However, allowing special characters carries a potential risk in handling incoming requests. Note that all keys currently registered with IANA consist only of alphanumeric characters and ‘-’. Therefore, in generating environment variables, Pegasus-2.9 allows only keys of the IANA form, converts them to uppercase and, in addition, converts ‘-’ to ‘_’. The latter translation makes keys easier to handle in shell scripts. This conversion rule may or may not change in the future.
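
The conversion rule of Note 4 is small enough to sketch in Python. The function name header_env_name is hypothetical, and returning None for a key that is not of the IANA form is an assumption of this sketch; the note only says that such keys are not allowed.

```python
import re

# Keys of the "IANA form": alphanumeric characters and "-" only.
_IANA_FORM = re.compile(r"[A-Za-z0-9-]+")


def header_env_name(key):
    """Return the HTTP_ variable name for a request header key, or None."""
    if not _IANA_FORM.fullmatch(key):
        return None
    # Uppercase the key and turn "-" into "_" so that shells can use the name.
    return "HTTP_" + key.upper().replace("-", "_")
```

For example, the key User-Agent maps to HTTP_USER_AGENT, while a key containing “_” or “:” is dropped.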


The current working directory of the invoked CGI program is the directory where the target is located.

Other environment variables might be discarded or renamed in the future.

INTERNAL FLOW

erpath=$request
if(test -d $erpath){
    if(! ~ $erpath */){
        redirect $erpath/
        # which means we begin from the first by substituting
        # request=$erpath/
    }
}
if(~ $erpath */)
    erpath=$erpath/index.html
access_check $erpath
handler $erpath
send /doc$erpath

The first field of handler is compared with $erpath.
$target in handler is /doc$erpath.
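
The path handling in this flow can also be written as a small Python function, which makes it easy to try out individual requests. This is a sketch only: resolve and its parameters are hypothetical names, the directory test is passed in as a callable instead of touching the file system, the access check and the handler lookup are left out, and $erpath is assumed to begin with “/”.

```python
def resolve(request, is_dir, home="/doc"):
    """Follow the flow above for one request path.

    Returns ("redirect", new_request) when a directory was requested
    without a trailing "/", otherwise ("serve", erpath, target).
    """
    if is_dir(request) and not request.endswith("/"):
        # The client is redirected and the flow restarts with request + "/".
        return ("redirect", request + "/")
    erpath = request
    if erpath.endswith("/"):
        erpath += "index.html"
    # target is the effective request path seen from the httpd name space.
    return ("serve", erpath, home + erpath)
```

Passing is_dir as a callable keeps the sketch independent of any real document tree; a lambda over a set of known directories is enough for experiments.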

CGI/1.1

If script name is in URI

Let “foo” be an executable file. The values of the related variables are shown for the following requests:
http://host/foo/?bar
and
http://host/~alice/foo/?bar
URI related environment variables are listed below with examples.

Table 2: URI related environment variables of Pegasus
                  request to host document  request to user's document   decoded?  specified by
HTTP URI          http://host/foo/?bar      http://host/~alice/foo/?bar            HTTP/1.1
$HTTP_SCHEME      http                      http                         NO        Pegasus
$HTTP_HOST        host                      host                         NO        Apache
$REQUEST_URI      /foo/?bar                 /~alice/foo/?bar             NO        Apache
$REQUEST_USER                               alice                        YES       Pegasus
$PATH_INFO        /                         /                            YES       CGI/1.1
$PATH_TRANSLATED  /doc/                     /doc/                        YES       CGI/1.1
$SCRIPT_NAME      /foo                      /~alice/foo                  YES       CGI/1.1
$QUERY_STRING     bar                       bar                          NO        Apache
$request          /foo/                     /foo/                        YES       Pegasus
NB:

If script name is not in URI

The CGI handler, or execution handler, of Pegasus is powerful. For example, we can configure it like this:
/foo/*	- + /bin/baz
This means that all requests to paths beginning with “/foo/” go to the script “/bin/baz”. Note that the CGI/1.1 specification assumes that the script name is in the URI. The environment variables $PATH_INFO, $PATH_TRANSLATED and $SCRIPT_NAME are defined on this assumption.
In contrast, a request to Pegasus
http://host/foo/?bar
does not mean that “foo” is a script, or even that “foo” exists. The role of “foo” is something like a tag. This characteristic is very useful in supporting WebDAV.

Then what should the values of these environment variables be? The answer is unclear.
The CGI/1.1 specification says that the concatenation $SCRIPT_NAME$PATH_INFO must be the decoded path part of the URI.
Therefore the values are assigned as shown below.

Table 3: some variables in case that script name is not in URI.
                  request to host document  request to user's document   decoded?  specified by
HTTP URI          http://host/foo/?bar      http://host/~alice/foo/?bar            HTTP/1.1
$PATH_INFO        /foo/                     /~alice/foo/                 YES       CGI/1.1
$PATH_TRANSLATED  /doc/foo/                 /doc/foo/                    YES       CGI/1.1
$SCRIPT_NAME                                                             YES       CGI/1.1
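
The relation between $SCRIPT_NAME and $PATH_INFO in Tables 2 and 3 can be checked with a short Python sketch. The function name path_vars is hypothetical, and the sketch covers only these two variables; it does not compute $PATH_TRANSLATED. An empty script name stands for the case where the script name is not in the URI.

```python
from urllib.parse import unquote, urlsplit


def path_vars(request_uri, script_name=""):
    """Split the decoded path of request_uri into (SCRIPT_NAME, PATH_INFO).

    The two parts always concatenate back to the decoded path, which is
    the property the CGI/1.1 specification asks for.
    """
    path = unquote(urlsplit(request_uri).path)
    if not path.startswith(script_name):
        raise ValueError("script name %r is not a prefix of %r" % (script_name, path))
    return script_name, path[len(script_name):]
```

Calling path_vars("/~alice/foo/?bar", "/~alice/foo") gives the Table 2 values, and calling it without a script name gives the Table 3 values.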

Handling of POST data

Once POSTed data has been received by the server from the client, the server checks it against Content-Length while receiving it. The server then passes the data to the CGI through stdin.
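
On the script side, reading the POSTed data might look like the Python sketch below. The function name read_post is hypothetical; the sketch assumes that CONTENT_LENGTH holds the number of bytes available on stdin, and it treats a missing or malformed value as zero.

```python
import io


def read_post(environ, stream):
    """Read exactly CONTENT_LENGTH bytes of POSTed data from stream."""
    try:
        length = int(environ.get("CONTENT_LENGTH", "0") or "0")
    except ValueError:
        length = 0
    if length <= 0:
        return b""
    return stream.read(length)


# Usage with an in-memory stream standing in for stdin:
demo = read_post({"CONTENT_LENGTH": "5"}, io.BytesIO(b"hello world"))
```

In a real script the call would be read_post(os.environ, sys.stdin.buffer), with os and sys imported.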

CGI TIMEOUT

Global Setting

A timeout is defined to prevent buggy programs from waiting for data for too long. The value can be specified in /sys/lib/httpd.conf. The default is 5 seconds. I think this value is sufficient because the data is already held by the server.

For Each CGI

Some CGIs take a long time to complete their task. The time is CGI dependent.
Therefore I enabled dynamic resetting of the CGI timeout for each CGI.

An environment variable “hpid” is introduced for this purpose.
It holds the pid of the Pegasus process in service.

example 1

At the start of the script, execute
echo -n timeout 180 > /proc/$hpid/note
which resets the CGI timeout to 180 seconds.

example 2

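The example below, written in Python, is extracted from a script on my server.
import os

def settimeout(n):
    # Ask the httpd that invoked this script to reset the CGI timeout to n seconds.
    note = "/proc/%d/note" % hpid
    try:
        f = open(note, "w")
    except OSError:
        print("unable to open %s" % note)
        print("timeout is not set")
        return
    f.write("timeout %d" % n)
    f.close()

# hpid is set by Pegasus; a missing value is treated as "no httpd to notify".
e = os.environ
hpid = 0
if "hpid" in e:
    hpid = int(e["hpid"])
if hpid:
    settimeout(180)
# continues heavy loaded tasks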