.TH DOC2TXT 1
.SH NAME
doc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.I /mnt/doc/WordDocument
.br
.B aux/msexceltables
[
.B -n
] [
.B -t
] [
.B -a
] [
.BI -d delim
]
.I /mnt/doc/Workbook
.SH DESCRIPTION
.I Doc2txt
is a shell script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE
(Object Linking and Embedding) format,
which is a scaled down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an Office document as a file system on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting a text stream.
.PP
.I Msexceltables
may be given options to control the formatting of its output.
.TP
.B -n
Disable field padding to column width.
.TP
.B -t
Truncate fields to the column width.
.TP
.B -a
Attempt conversion of non-tabular sheets (charts) in the workbook.
.TP
.BI "-d " delim
Set the interfield delimiter to the string
.IR delim ;
the default is a single space.
.SH SOURCE
.B /sys/src/cmd/aux/mswordstrings.c
.br
.B /sys/src/cmd/aux/msexceltables.c
.br
.B /sys/src/cmd/aux/olefs.c
.br
.B /rc/bin/xls2txt
.br
.B /rc/bin/doc2txt
.SH BUGS
.I Msexceltables
cannot parse files containing rich text field descriptions
or Asian phonetic pronunciation hints,
due to a lack of documentation on these formats.
It has only been tested on BIFF8 files generated by MS Office 97;
caveat emptor.
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
available online at Microsoft's developer home page.
.br
``LAOLA Binary Structures'',
.I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.I http://sc.openoffice.org/excelfileformat.pdf
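.SH EXAMPLES
The following sketch does by hand roughly what
.I doc2txt
is described as doing: mount a Word document with
.I olefs
and extract its text with
.IR mswordstrings .
The file name
.B report.doc
is hypothetical, and the final unmount step (see
.IR bind (1))
is an assumption about cleaning up the mount point,
not something this page documents.
.IP
.EX
aux/olefs -m /mnt/doc report.doc
aux/mswordstrings /mnt/doc/WordDocument
unmount /mnt/doc
.EE
.PP
A similar sketch for a hypothetical workbook
.BR sales.xls ,
using the
.B -n
and
.B -d
options to produce unpadded, comma-separated fields:
.IP
.EX
aux/olefs -m /mnt/doc sales.xls
aux/msexceltables -n -d ',' /mnt/doc/Workbook
unmount /mnt/doc
.EE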