NAME
uhtml – convert foreign character set HTML file to Unicode
SYNOPSIS
uhtml [ -p ] [ -c charset ] [ file ]
DESCRIPTION
HTML comes in various character-set encodings and has special forms
to encode characters. To make HTML easier to process, uhtml
normalizes it to a Unicode-only form.
Uhtml detects the character set of the HTML input file and calls
tcs(1) to convert it to UTF, replacing HTML-entity forms with their
Unicode character representations, except for lt, gt, amp, quot,
and apos. The converted HTML is written to standard output. If
no file is given, input is read from standard input. If the -p
option is given, the detected character set is printed and the
program exits without converting. If character-set detection
fails, the default (UTF) is assumed; this default can be changed
with the -c option.
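EXAMPLES
The entity replacement described above can be illustrated with a
minimal Python sketch. This is not the code in uhtml.c; the function
name expand_entities is hypothetical, the entity table comes from
Python's html.entities module rather than from uhtml, and the handling
of numeric references (expanded, unless they denote one of the five
reserved characters) is an assumption that the description above does
not confirm.

```python
import re
from html.entities import html5

# Entity names the manual page says are left untouched.
KEEP_NAMES = {"lt", "gt", "amp", "quot", "apos"}
# Assumption: numeric references to the same characters are also kept,
# so the output remains valid markup.
KEEP_CHARS = {"<", ">", "&", '"', "'"}

# Matches &name;, &#123; and &#x7B; forms (terminating semicolon required).
ENTITY = re.compile(r"&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def expand_entities(text: str) -> str:
    """Replace HTML entities with Unicode characters, keeping the reserved five."""

    def repl(match: "re.Match[str]") -> str:
        body = match.group(1)
        if body.startswith("#"):
            digits = body[1:]
            try:
                code = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
                char = chr(code)
            except (ValueError, OverflowError):
                return match.group(0)  # out-of-range code point: leave as is
        else:
            if body in KEEP_NAMES:
                return match.group(0)
            char = html5.get(body + ";")
            if char is None:
                return match.group(0)  # unknown entity name: leave as is
        return match.group(0) if char in KEEP_CHARS else char

    return ENTITY.sub(repl, text)


if __name__ == "__main__":
    print(expand_entities("caf&eacute; &lt;b&gt; &amp; &#8364;"))
```

Unknown names and out-of-range code points are passed through
unchanged in this sketch, on the principle that a normalizer should
not discard input it cannot interpret.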
SOURCE
/sys/src/cmd/uhtml.c
SEE ALSO
tcs(1)