============================
seft (search engine for text)
============================

Usage: seft [OPTIONS] "query terms" text_files

Seft takes a set of query terms and a set of files as arguments and,
using a locality-based similarity heuristic, determines the word
locations within the files that are of interest with respect to the
query. The user is then presented with a sequence of windows of text:
the first window surrounds the most relevant location, the second
window surrounds the next most relevant location, and so on. Both the
number of windows presented and the size of each window can be
specified as parameters to seft. In addition, the user can specify
whether to apply case-folding and/or stemming to the query terms and
the text files.

[OPTIONS]

  -f query_file    A text file containing query terms
  -m max_windows   Specifies the maximum number of windows to display
                   (default = 5)
  -w window_size   Specifies the number of lines within a window
                   (default = 3)
  -x               Turns off highlighting of query term locations
  -n               Suppress output
  -s [0|1|2]       0 = case-folding off, stemming off
                   1 = case-folding on, stemming off
                   2 = case-folding on, stemming on
                   (default = 2)
  -p               Print a formfeed character after every window.
                   Useful when piping output through a pager such as
                   more.

Examples of usage:
------------------

If the text file Query has the contents "computer industry", then the
following seft commands are equivalent:

    seft -f Query ~oldk/News/*
    seft "computer industry" ~oldk/News/*

These commands search through a user's News folder for articles
relating to "computer" and "industry", and return windows of text
surrounding the most relevant locations.

Window merging:
---------------

If highly ranked query locations lie in close proximity, then it is
likely that seft would display either windows with the same contents
(the highly ranked query terms occur on the same line) or windows that
partially overlap.
To avoid this, the current version of seft does not display windows
whose centre line has already been displayed (anywhere) within a
previous window.

Further work:
-------------

- Piping from stdin: The current implementation of seft does not allow
  the text files (to be searched) to be piped into seft, as in:

      cat ~oldk/News/* | seft -q "computer industry"

- Document delimiters: Currently, the character "^B" is used as a
  document delimiter. Such a delimiter could be set as a command line
  argument or set in a resource file.

References:
-----------

A detailed discussion of seft can be found in:

@inproceedings{dm00:acsc,
  author = "O. de~Kretser and A. Moffat",
  title = "Needles and Haystacks: A Search Engine for Personal
           Information Collections",
  booktitle = "Proc. 23rd Australasian Computer Science Conference",
  year = 2000,
  note = "To appear",
}

The locality-based ranking heuristic used by seft is described in:

@inproceedings{dm99:adc,
  author = "O. de~Kretser and A. Moffat",
  title = "Effective Document Presentation with a Locality-Based
           Similarity Heuristic",
  booktitle = "Proceedings of the Twenty-Second Annual International
               ACM-SIGIR Conference on Research and Development in
               Information Retrieval",
  pages = "113-120",
  month = aug,
  year = 1999,
  address = "San Francisco, CA",
  editor = "M. Hearst and F. Gey and R. Tong",
}

Owen de Kretser, 1/2000
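
Window merging sketch:
----------------------

The window-merging rule described above (skip any window whose centre
line has already been displayed within a previous window) can be
illustrated with a short Python sketch. This is not seft's own code:
the function name select_windows, its parameters, and the assumption
that the ranked locations arrive as a list of line numbers (most
relevant first) are all hypothetical. The sketch also does not clip
windows at the end of a file.

```python
def select_windows(ranked_lines, window_size=3, max_windows=5):
    """Return (start, end) line ranges of the windows to display.

    ranked_lines: centre-line numbers, most relevant first
    (a hypothetical input format, not seft's internal representation).
    """
    shown = set()    # every line number already displayed in some window
    windows = []
    half = window_size // 2
    for centre in ranked_lines:
        if len(windows) == max_windows:
            break
        if centre in shown:
            continue  # centre line already visible in an earlier window
        start = max(1, centre - half)      # do not run past the first line
        end = start + window_size - 1
        windows.append((start, end))
        shown.update(range(start, end + 1))
    return windows


if __name__ == "__main__":
    # Lines 11 and 10 (second time) are skipped; line 12 still gets a window
    # because only the centre line is checked, so partial overlap can remain.
    print(select_windows([10, 11, 12, 10, 40]))
```

Because only the centre line is tested, two windows may still share an
edge line under this reading of the rule; a stricter variant could skip
any window that overlaps a previous one at all.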