• Bug#820119: [www.debian.org] validation errors: cannot convert characte

    From Laura Arjona Reina@21:1/5 to All on Wed Sep 20 01:00:01 2017
    XPost: linux.debian.bugs.dist

    Hello all
    Now that we are using the more modern tool onsgmls instead of nsgmls in our "validate" script:

    https://anonscm.debian.org/cgit/debwww/cron.git/tree/scripts/validate

    I've returned to this bug.

    The output of the validate script for the files containing "emojis" didn't change much:

    **** Errors validating /srv/www.debian.org/www/international/l10n/po/en_GB.it.html: ***
    Line 122, character 357: cannot convert character reference to number 128513 because character not in internal character set
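
For reference, the number in the error message can be decoded independently of onsgmls. The following is a minimal Python sketch (not part of the "validate" script) showing which code point 128513 is, that it lies outside the 16-bit range, and that it can be encoded in UTF-8 but not in ISO-8859-1, the script's fallback encoding. Whether any of this is what makes onsgmls reject the reference in fixed mode is an assumption, not something confirmed here.

```python
import unicodedata

# Numeric character reference taken from the error message above.
code_point = 128513
char = chr(code_point)

# Hexadecimal form and Unicode name of the code point.
print(f"U+{code_point:04X}", unicodedata.name(char))

# True means the code point does not fit in 16 bits.
outside_16_bit = code_point > 0xFFFF
print(outside_16_bit)

# UTF-8 can encode the character ...
utf8_bytes = char.encode("utf-8")
print(len(utf8_bytes), "bytes in UTF-8")

# ... while ISO-8859-1 (the fallback when no charset is detected) cannot.
try:
    char.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print("encodable in ISO-8859-1:", latin1_ok)
```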

    I was a bit surprised that we are still getting these errors, because if I run the file through the online W3C validator (https://validator.w3.org/), or even run onsgmls manually on the machine that builds the website:

    onsgmls -E0 -s /path/to/dtd /path/to/file

    in both cases I don't get any error.
    So I've looked at the "validate" script and played a bit with the options set there, and I'd like to bring lines 363-376 to your attention:

    # Determine whether we're dealing with HTML or XHTML and set the SP
    # environment accordingly.
    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_ENCODING'} = $charset;
        } else {
            $ENV{'SP_ENCODING'} = "ISO-8859-1";
        }
    }
    $ENV{'SP_CHARSET_FIXED'} = 1;

    If I comment out this last line (thus letting onsgmls run in non-fixed mode), I get no errors when validating the file.
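
To make this comparison repeatable outside the script, here is a rough Python sketch that mirrors the environment set by the quoted block and lets SP_CHARSET_FIXED be toggled. The function names `build_sp_env` and `run_onsgmls` and the catalog paths are hypothetical placeholders, not taken from the "validate" script; the onsgmls options are the ones from the manual command shown earlier.

```python
import os
import subprocess


def build_sp_env(is_xhtml, charset=None, fixed=True,
                 html_catalog="/path/to/html.catalog",     # hypothetical path
                 xhtml_catalog="/path/to/xhtml.catalog"):  # hypothetical path
    """Return a copy of the environment with the SP variables set
    the same way as the quoted Perl block."""
    env = dict(os.environ)
    # Drop any inherited value so that fixed=False really means "unset".
    env.pop("SP_CHARSET_FIXED", None)
    if is_xhtml:
        env["SGML_CATALOG_FILES"] = xhtml_catalog
        env["SP_ENCODING"] = "xml"
    else:
        env["SGML_CATALOG_FILES"] = html_catalog
        env["SP_ENCODING"] = charset if charset is not None else "ISO-8859-1"
    if fixed:
        env["SP_CHARSET_FIXED"] = "1"
    return env


def run_onsgmls(dtd, path, env):
    """Run onsgmls with the given environment; return (exit code, stderr)."""
    result = subprocess.run(["onsgmls", "-E0", "-s", dtd, path],
                            env=env, capture_output=True, text=True)
    return result.returncode, result.stderr


# The two environments to compare for an HTML file detected as utf-8.
env_fixed = build_sp_env(False, "utf-8", fixed=True)
env_free = build_sp_env(False, "utf-8", fixed=False)
```

Calling `run_onsgmls("/path/to/dtd", "/path/to/file", env_fixed)` and then the same with `env_free` should show whether the error depends only on SP_CHARSET_FIXED.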

    I've read the documentation about these options:

    http://openjade.sourceforge.net/doc/charset.htm

    but frankly I don't understand it very well.

    I've done:

    larjona@wolkenstein:~$ sudo -u debwww env | grep SP_

    and it returns nothing, so I guess only the environment set in the "validate" script is taken into account; if we don't set the variables there, the defaults apply.

    I've modified and run a copy of the validate script, making it print some values when checking a file, and the document type is correctly detected (HTML 4.01 Strict), as is the charset (utf-8).

    I'm not sure whether I can safely comment out line 376

    $ENV{'SP_CHARSET_FIXED'} = 1;

    to avoid the errors, or even comment out the whole block, and trust onsgmls to do the right thing.

    Can anybody with more experience in this help?

    Thanks
    --
    Laura Arjona Reina
    https://wiki.debian.org/LauraArjona
